Disk read test failing - breaking RAID array

InquiringMind
Posts: 14
Joined: 2024.10.20. 22:03

Post by InquiringMind »

I tried running HDSentinel's Disk Surface (Read) test on a Samsung 470 series SSD which was part of a RAID-0 array controlled by an LSI (now Avago/Broadcom) MegaRAID 9260-8i. The read tests failed (even when Windows was started from a separate hard disk) and MegaRAID Storage Manager (MSM) reported that the drive under test had been taken offline, breaking the RAID array.

It was possible to bring the drive back online with MSM and no data appears to have been lost, but surely a read test shouldn't be so traumatic?

(Further debug details sent to info@hdsentinel.com - on reconsideration, this thread should perhaps have been opened in the Bugs forum so feel free to relocate it).
Attachments
Failed Read.jpg (315.87 KiB)
hdsentinel
Site Admin
Posts: 3106
Joined: 2008.07.27. 17:00
Location: Hungary

Re: Disk read test failing - breaking RAID array

Post by hdsentinel »

This seems really weird.
I have personally tested many RAID arrays on similar RAID controllers with all kinds of hard disks (both SATA and SAS) and SSDs and never encountered anything similar - and no other user has ever reported anything similar.

Generally yes, the Read test is the safest: it simply reads all sectors exported to the OS, from the Master Boot Record through all data sectors. It never directly causes any issue which could lead to a situation like this.
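
To illustrate what "simply reading all sectors" amounts to, here is a minimal sketch of a sequential read pass. It is not Hard Disk Sentinel's actual code: the function name `read_test`, the chunk size and the in-memory buffer standing in for a real device are all assumptions made for the example.

```python
import io

def read_test(dev, total_bytes, chunk_size=1024 * 1024):
    """Read `dev` sequentially from offset 0 and return the offsets of chunks that failed.

    `dev` is any seekable binary file-like object (hypothetical stand-in for a disk).
    Nothing is ever written, which is why a read pass is considered the safest test.
    """
    failed = []
    offset = 0
    while offset < total_bytes:
        want = min(chunk_size, total_bytes - offset)
        try:
            dev.seek(offset)
            data = dev.read(want)
            if len(data) != want:
                failed.append(offset)  # short read: fewer bytes returned than requested
        except OSError:
            failed.append(offset)      # I/O error reported by the OS
        offset += want
    return failed

# Demo on an in-memory buffer standing in for a healthy 4 KiB "device"
fake_device = io.BytesIO(bytes(4096))
print(read_test(fake_device, 4096, chunk_size=1024))
```

If the device disappears mid-test (as happened here when the controller took the drive offline), every later chunk would land in the failed list, but the loop itself only ever issues reads.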

According to the image, yes, I see the error message returned by the OS: "Error 2: The system cannot find the file specified", which means that the file (= the complete array in this case) was removed from the system by the RAID controller - it would be nice to know why, and what the "bug" in the RAID controller operation is.

I received the report file and am still checking it. I am not sure whether the driver of the RAID controller may be related: maybe it somehow does not "tolerate" that Hard Disk Sentinel attempted to lock the drive for exclusive use (to prevent other apps and Windows itself from accessing the partition on the RAID array). As this is the first step by default, it is the only cause I can imagine for this situation.

So maybe you can try to disable it, just for a quick test (if you still want to):
- open Disk menu -> Surface test and select the drive and test type
- before starting the test, select the Configuration tab in this window and uncheck the "Lock drive during test (unmount volumes)" option
- then proceed with the test. On the Configuration tab, you can also use the "Limit testing to specific data blocks" option and specify (for example) first block to be tested = 5000 to test only the 2nd half of the array, just to check if there is any difference.
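
As a rough illustration of the block limit in the last step: the example "first block = 5000 tests the 2nd half" implies the surface is split into 10000 equal blocks. Assuming that, a block range maps to byte offsets as sketched below. The function name and the 10000-block default are assumptions for the example, not the program's real internals.

```python
def block_range_to_bytes(first_block, last_block, total_bytes, total_blocks=10000):
    """Map an inclusive block range to (start_byte, end_byte_exclusive).

    Assumes the drive is divided into `total_blocks` equal blocks (hypothetical default,
    inferred from "first block = 5000" meaning the second half).
    """
    if not 0 <= first_block <= last_block < total_blocks:
        raise ValueError("block range out of bounds")
    start = first_block * total_bytes // total_blocks
    end = (last_block + 1) * total_bytes // total_blocks
    return start, end

# Second half of a 1,000,000-byte logical drive
print(block_range_to_bytes(5000, 9999, 1_000_000))  # (500000, 1000000)
```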

As I see from the report, you use Windows XP - and while I personally like and still actively use Windows XP, maybe the combination of the XP driver of the RAID controller and this particular SSD model may be related too, as (in my experience) not all controllers have proper XP drivers prepared for SSDs.

I'll surely try to reproduce this: build a similar array, inspect the results, and verify whether the combination of Windows XP + controller + driver + SSD may be related - and check if there is anything that can be done to avoid issues.
Thanks for your attention - and sorry for the possible troubles.
InquiringMind
Posts: 14
Joined: 2024.10.20. 22:03

Re: Disk read test failing - breaking RAID array

Post by InquiringMind »

I tried again, disabling the Lock Drive option and received similar results (new screenshot attached).

The driver used by the RAID card is version 4.32.0.32 of megasas.sys, dated 17/9/2010 - not a spring chicken but the latest version I could find with WinXP support. In terms of controller SSD support, the only things I'm aware of are two firmware add-ons - CacheCade (using SSDs as a cache for HDDs) and FastPath (for better performance with SSDs), neither of which I have.

Apart from one sudden SSD failure (where the drive ceased to be visible even in the controller BIOS screen) and the setup not handling low-power (S3) suspend-to-RAM, I've not had any issues with this RAID setup since starting it in June 2018.
Attachments
20241024-174726_R_SAMSUNG_470_Series_SSD_S0SWNEAB400617_AXM09B1Q-surface-full.jpg (650.4 KiB)
InquiringMind
Posts: 14
Joined: 2024.10.20. 22:03

Re: Disk read test failing - breaking RAID array

Post by InquiringMind »

Did a bit of experimenting (after taking another backup). With the 9260-8i, it isn't possible to "de-RAID" disks as such; any that aren't part of a RAID array just aren't available (they don't show up under Disk Management, etc). I did manage to create a RAID-0 array with just one disk though, and tried re-running the Read test. No change - same behaviour as above.
hdsentinel
Site Admin
Posts: 3106
Joined: 2008.07.27. 17:00
Location: Hungary

Re: Disk read test failing - breaking RAID array

Post by hdsentinel »

Yesterday I finally had time to check the situation.
Yes, I noticed what you wrote: even a standalone drive needs to be configured as a RAID-0 "array" with one disk as its only member.

The closest controller I could use for testing is an Intel-branded LSI 9260-4i. I installed this under Windows XP SP3 and installed exactly the same driver that you use: version 4.27.1.32 (date: 6-11-2010). According to the developer report, this is what you have installed too.
I downloaded the driver from
https://www.broadcom.com/support/knowledgebase/1211161491782/megaraid-sas-9260-4i---9260-8i---9260de-8i---9261-8i-downloads
"Signed driver v4.27.1.32 and v4.27.1.64 MID_1401090_all_Windows_Signed_v4.27.1.32_and_v4.27.1.64.zip"

I'm afraid I have no Samsung 470 Series SSD at all for testing, so I used fairly random SSDs for now (2x Kingston to make a RAID-0 array and also a Samsung SSD to make a RAID-0 "array" as a single drive).

Image

I tried to perform all tests you mentioned: first the Disk menu -> Extended self test on a RAID member. The test ran for 30+ minutes and then completed without error:

Image

Then I tried to perform Disk menu -> Surface test -> Read test on the real array (2x SSDs in RAID-0), and everything worked as expected:

Image

Image

and also then on the Samsung SSD too. This is generally a failing SSD with very low Health, so the disk surface map shows many slower blocks - but generally the test could complete:

Image

Personally I'm so sad that I could not reproduce the issue - as then I could immediately begin checking / improving in all possible ways.

To be honest, not really sure what can cause any issue on your system, why the controller puts the array offline, as generally there is no reason for that. I'm still trying with other possible SSDs. Not sure if the controller firmware and/or the SSD model itself would cause troubles or so. Would be nice to get some 470 SSDs for testing, but because of their age, I do not think I can get any working drives.

Ps. I used simple SFF-8087 -> 4xSATA cable to connect the drives to the controller. Do you use some backplane/enclosure or so? Should not be problem but maybe...
InquiringMind
Posts: 14
Joined: 2024.10.20. 22:03

Re: Disk read test failing - breaking RAID array

Post by InquiringMind »

Thanks for the follow-up.
hdsentinel wrote: 2024.10.29. 15:53 ...I installed this under Windows XP SP3 and installed exactly the same driver that you use: version 4.27.1.32 (date: 6-11-2010). According to the developer report, this is what you have installed too.
Sorry, but the driver version I have installed is, as noted above, 4.32.0.32 which can be downloaded from:

https://docs.broadcom.com/docs/12349696
hdsentinel wrote: 2024.10.29. 15:53 To be honest, not really sure what can cause any issue on your system, why the controller puts the array offline, as generally there is no reason for that. I'm still trying with other possible SSDs. Not sure if the controller firmware and/or the SSD model itself would cause troubles or so. Would be nice to get some 470 SSDs for testing, but because of their age, I do not think I can get any working drives.
Aside from the driver difference, SSD type may be a factor. I tried to force HDS to do a read test on the Crucial drives but it kept listing the Samsungs (presumably since they were first in the array?) so I tried re-ordering the RAID array so the Crucials were first - no go. Dismantled the array and created a new RAID-0 with just the two Crucials (leaving the Samsungs unassigned) but the Samsungs were still listed. So I then created a second RAID-0 with the five Samsungs, and this time I could select the Crucial array. So I ran a read test on that and it worked:
Result of testing array of Crucial C300s
SSD_Test_Success.png (31.09 KiB)
I then ran a read test on the Samsung array and it failed again:
Result of testing array of Samsung 470s
SSD_Test_Fail.png (18.99 KiB)
hdsentinel wrote: 2024.10.29. 15:53 Ps. I used simple SFF-8087 -> 4xSATA cable to connect the drives to the controller. Do you use some backplane/enclosure or so? Should not be problem but maybe...
Nope - I have the same cables as yourself.

So unless the driver version change makes a difference, it would seem there is an issue with the Samsung 470s.
hdsentinel
Site Admin
Posts: 3106
Joined: 2008.07.27. 17:00
Location: Hungary

Re: Disk read test failing - breaking RAID array

Post by hdsentinel »

> Sorry, but the driver version I have installed is, as noted above, 4.32.0.32 which can be downloaded from:
> https://docs.broadcom.com/docs/12349696

Hm... From the report file you sent I see (line 1229):

PCI\VEN_1000&DEV_0079&SUBSYS_92611000&REV_05\4&9784C61&0&0048
4.27.1.32 6-11-2010 LSI MegaRAID SAS 9260-8i

That's why I tried the 4.27.1.32 driver.

I'll examine with 4.32.0.32 too.
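
For anyone wanting to check the same thing in their own report, a small sketch of comparing the version in that report line against the version expected. The line layout (version, date, device name separated by whitespace) is assumed from the single line quoted above, and the helper names are made up for the example.

```python
import re

# The device line as quoted from the report above
REPORT_LINE = "4.27.1.32 6-11-2010 LSI MegaRAID SAS 9260-8i"

def parse_driver_line(line):
    """Split a 'version date name' line into ((version tuple), date, name), or None."""
    m = re.match(r"(\d+(?:\.\d+)+)\s+(\d{1,2}-\d{1,2}-\d{4})\s+(.+)", line.strip())
    if m is None:
        return None
    version, date, name = m.groups()
    return tuple(int(part) for part in version.split(".")), date, name

def same_driver(report_line, expected_version):
    """True if the version in the report line equals the expected dotted version."""
    parsed = parse_driver_line(report_line)
    expected = tuple(int(part) for part in expected_version.split("."))
    return parsed is not None and parsed[0] == expected

print(parse_driver_line(REPORT_LINE))
print(same_driver(REPORT_LINE, "4.32.0.32"))  # False: report and megasas.sys version differ
```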


> I tried to force HDS to do a read test on the Crucial drives but it kept listing the Samsungs (presumably since they were first in the array?)

Generally the Surface test function tests the whole array as configured, regardless of which drive is first in the array.
The purpose of RAID is precisely to prevent the drives from being accessed independently. Ideally (as you can see) we have some methods to access the drives one by one, at least to check their S.M.A.R.T. status (and, if possible, to run internal hardware short/extended self tests).
But for Surface tests, we can only test the complete array, as exported to the OS, just as (for example) Windows reads/writes the logical drive (the complete array) during any file operation or during format.
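
To show why a test of the logical drive inevitably touches every member, here is a sketch of the textbook RAID-0 striping layout. The 64 KiB stripe size and the function name are assumptions for the example; the real on-disk layout of this controller (metadata areas, configured stripe size) may differ.

```python
def raid0_locate(logical_offset, num_disks, stripe_size=64 * 1024):
    """Return (disk_index, offset_on_disk) for a byte offset on a RAID-0 logical drive.

    Textbook layout only: equal-sized members, stripes assigned round-robin,
    no controller metadata taken into account (all assumptions).
    """
    stripe_no, within = divmod(logical_offset, stripe_size)
    disk = stripe_no % num_disks
    disk_offset = (stripe_no // num_disks) * stripe_size + within
    return disk, disk_offset

# Consecutive stripes of a 2-disk array alternate between the members
for off in (0, 64 * 1024, 128 * 1024 + 5):
    print(off, raid0_locate(off, num_disks=2))
```

With this layout a sequential read of the logical drive alternates between members every stripe, so one misbehaving member is hit almost immediately, whichever position it has in the array.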


> created a new RAID-0 with just the two Crucials (leaving the Samsungs unassigned) but the Samsungs were still listed.

Hard Disk Sentinel automatically detects unassigned drives - and it may show them as if they were all part of an array, but of course the surface test would then not touch the unassigned drives (exactly because they are not exported to the OS in any way). The Information page of the main window displays these drives as "unassigned" or "hot spare" or similar, to indicate that these drives are not part of the configuration (even if detected/listed). Sorry for the possible confusion.

> So I then created a second RAID-0 with the five Samsungs, and this time I could select the Crucial array.
> So I ran a read test on that and it worked

Thanks, good to hear.


> I then ran a read test on the Samsung array and it failed again.

Yes, then I really worry that the SSD model is the important factor.

Not sure, but is it possible to check what happens if you configure only one such Samsung 470 as RAID-0 (so just one member)?
Just to see if the issue happens then too.
I'm still trying to check where I can order at least one Samsung 470 for testing - it would be nice to know if this would be "enough" (so no more drives would be required for a real RAID array). Do you offer one for sale? ;) Then I'd surely be able to check with such a drive.

Not sure if there is anything that can be done if there is a minor incompatibility between this specific model, the controller and the software, but I'd be happy to examine, reproduce and check for any possible solutions/workarounds (and also try other driver versions etc.) to investigate the situation.

Thanks so much for your attention and the time spent on the investigation!