3 of 4 new disk bad? Scsi parity errors on 3 SATA disks.

Gilbo · Jun 9, 2006

I was installing a new software RAID 5 (linux) in a fileserver of mine. I've done this twice before. I managed to partition the disks just fine. I successfully built the array once using mdadm --create --force (my third attempt), but every other time it gets built, 3 out of the five disks are considered failed immediately. The one time that it built properly I copied 364GB of my ripped DVD collection to it before it froze. When it rebooted the three disks were failed.

cat /proc/mdstat:

Code:

Personalities : [raid5] [raid4]
md0 : active raid5 sdd1[5](F) sdc1[3] sdb1[6](F) sda1[7](F) hda1[0]
      980446464 blocks level 5, 64k chunk, algorithm 0 [5/2] [U__U_]

unused devices: <none>

Obviously that doesn't make for a very useful RAID 5...

dmesg notes the problem as being SCSI parity errors on the three bad disks:

Code:

ata4: command 0x35 timeout, stat 0xd0 host_stat 0x21
ata4: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata4: status=0xd0 { Busy }
sd 3:0:0:0: SCSI error: return code = 0x8000002
sdd: Current: sense key: Aborted Command
    Additional sense: Scsi parity error
end_request: I/O error, dev sdd, sector 490234559
ATA: abnormal status 0xD0 on port 0x967
ATA: abnormal status 0xD0 on port 0x967
ATA: abnormal status 0xD0 on port 0x967
ata2: command 0x35 timeout, stat 0xd0 host_stat 0x21
ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata2: status=0xd0 { Busy }
sd 1:0:0:0: SCSI error: return code = 0x8000002
sdb: Current: sense key: Aborted Command
    Additional sense: Scsi parity error
end_request: I/O error, dev sdb, sector 490223295
ATA: abnormal status 0xD0 on port 0x977
ATA: abnormal status 0xD0 on port 0x977
ATA: abnormal status 0xD0 on port 0x977
ata1: command 0x35 timeout, stat 0xd0 host_stat 0x21
ata1: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata1: status=0xd0 { Busy }
sd 0:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key: Aborted Command
    Additional sense: Scsi parity error
end_request: I/O error, dev sda, sector 490234559
ATA: abnormal status 0xD0 on port 0x9F7
ATA: abnormal status 0xD0 on port 0x9F7
ATA: abnormal status 0xD0 on port 0x9F7
ata4: command 0xea timeout, stat 0xd0 host_stat 0x0
ata4: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata4: status=0xd0 { Busy }
raid5: Disk failure on sdd1, disabling device. Operation continuing on 4 devicesata2: command 0xea timeout, stat 0xd0 host_stat 0x0
ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata2: status=0xd0 { Busy }
raid5: Disk failure on sdb1, disabling device. Operation continuing on 3 devicesata1: command 0xea timeout, stat 0xd0 host_stat 0x0
ata1: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata1: status=0xd0 { Busy }
raid5: Disk failure on sda1, disabling device. Operation continuing on 2 devicesRAID5 conf printout:

I have difficulty accessing the disks. I can't erase the superblocks or build filesystems on the partitions. Now, dmesg makes me believe that it's a hardware error, however,
1. I consider 3/4 new disks being bad very poor luck.
2. I don't know enough about linux to discount the possibility of a software error (SCSI support is enabled. Low-level SATA driver for AMD/NVidia is too).

I'm leaning towards cables right now, or motherboard headers, but I wanted to get opinions while I continue trouble-shooting, and advice on what to test.

P.S. I haven't once managed to shut the system down without it freezing on "Remounting remaining filesystems readonly ..." I don't know if that's related, or useful, but hitting reset all the time is annoying.

Gilbo · Jun 9, 2006

Hmmm. 2 of my kernels no longer boot. Lucky I just put a third testing one on the boot partition. They freeze on Starting lo...

There's something about this... I can just tell it's going to be hell to figure out.

Gilbo · Jun 9, 2006

P.P.S. Yes, the RAM is okay. Used this system for a year now & Memtested it again when everything started going to hell.

Mercutio · Jun 9, 2006

I experienced something similar with a Promise SATA controller. After lots of troubleshooting, I replaced my controller and my problems went away.

Gilbo · Jun 9, 2006

It's definitely a thought that was on my mind Merc. But one disk seems to be working. This is also on the onboard SATA, which works correctly with the same kernel and kernel configuration on another identical box.

Someone on the Gentoo forums (where I posted when I thought it was a software issue) just pointed out that the errors occur at 250GB into the disk. Since the failed disks are all 250GB, the kernel appears to be trying to read off the end of the disk.

At the moment I'm operating on the premise that partitioning is the problem. There are problems with that theory. Particularly that the partitions on two of the failed disks are bigger than those on the two good disks, and that the third failed disk is partitioned to exactly the same size in sectors.

Since I'm now cross-posting, I'm going to ask you guys to ignore this until I've checked out this idea. I'll post back if this is a red herring. I'll come back and beg for your help when I'm sure its hardware --I don't want to waste anyone's time and cross-posting tends to do that.

Code:

Server01 gilbo # fdisk -lu /dev/hda

Disk /dev/hda: 300.0 GB, 300090728448 bytes
255 heads, 63 sectors/track, 36483 cylinders, total 586114704 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1              63   490223474   245111706   fd  Linux raid autodetect
/dev/hda2       490223475   492223475     1000000+  82  Linux swap / Solaris
/dev/hda3   *   492223476   492295859       36192   83  Linux
/dev/hda4       492295860   586099394    46901767+  83  Linux
Server01 gilbo # fdisk -lu /dev/sda

Disk /dev/sda: 251.0 GB, 251000193024 bytes
255 heads, 63 sectors/track, 30515 cylinders, total 490234752 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1              63   490234751   245117344+  fd  Linux raid autodetect
Server01 gilbo # fdisk -lu /dev/sdb

Disk /dev/sdb: 251.0 GB, 251000193024 bytes
255 heads, 63 sectors/track, 30515 cylinders, total 490234752 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1              63   490223474   245111706   fd  Linux raid autodetect
Server01 gilbo # fdisk -lu /dev/sdc

Disk /dev/sdc: 251.0 GB, 251000193024 bytes
255 heads, 63 sectors/track, 30515 cylinders, total 490234752 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1              63   490223474   245111706   fd  Linux raid autodetect
Server01 gilbo # fdisk -lu /dev/sdd

Disk /dev/sdd: 251.0 GB, 251000193024 bytes
255 heads, 63 sectors/track, 30515 cylinders, total 490234752 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1              63   490234751   245117344+  fd  Linux raid autodetect

Sol · Jun 13, 2006

If you don't much care what the problem is as long as it's fixed you could try implimenting your array using evms, it does much of the partitioning and such for you and might get arround your problem.

Gilbo · Jun 15, 2006

Small update. I have repartitioned the disks in the array so that all the partitions involved are exactly the same size.

The array was syncing for three days and then the power went out and it had to restart. It's been going for about 30 hours now, and is at 47.5%. Should be done in a couple days, and I'll tell you all how it went.

It turns out that either the disks were shutting themselves down due to whatever error was happening. This made it appear like they had failed. Here's the only way I found to fix it:
1. Powering down & unplugging the SATA cables was the only way to reset the disks.
2. I then had to boot with a kernel that didn't support RAID so the detection wouldn't make the drives puke again (and make me pull the cables again).
3. I zeroed the superblocks on all the disks originally involved (since the good disks would tell the kernel to look at and mess up the bad disks (requiring more cable resets).
4. I repartitioned the disks so the RAID partitions were exactly equal in sector count.
5. I rebooted with a RAID-capable kernel and created a new array.

It's annoying to test something that takes 3.5 days between tests to reset itself. Anyone know any tricks for faster syncing of a linux kernel RAID5 array?

time · Jun 15, 2006

Gilbo said:
It's annoying to test something that takes 3.5 days between tests to reset itself. Anyone know any tricks for faster syncing of a linux kernel RAID5 array?

Yep ... don't use RAID 5. Do you really feel secure with the recovery process you're going through? Consider RAID 10 or even RAID 0+1 instead.

Gilbo · Jun 15, 2006

It's certainly a bitch. It's worth mentioning that a single-disk rebuild from a failure, is quicker than syncing the whole array which you only have to when you create it.

In the future, I'm not even going to bother with anything involving striping. I'm just going to add disks as I need them --at the last minute-- and use LVM if I want them to appear as one volume. The biggest thing is the money wasted on buying multiple disks of the same capacity all together. You spend far less if you buy capacity as you need it, since the price/GB drops so rapidly.

ddrueding · Jun 15, 2006

Gilbo said:
It's certainly a bitch. It's worth mentioning that a single-disk rebuild from a failure, is quicker than syncing the whole array which you only have to when you create it.

In the future, I'm not even going to bother with anything involving striping. I'm just going to add disks as I need them --at the last minute-- and use LVM if I want them to appear as one volume. The biggest thing is the money wasted on buying multiple disks of the same capacity all together. You spend far less if you buy capacity as you need it, since the price/GB drops so rapidly.

Yep, I've been doing JBOD arrays for this reason. Add drives as you need them, they don't need to be the same size, and you get one big (and growing) volume.

P5-133XL · Jun 16, 2006

I don't believe in spanned volumes in Windows any more than raid-0: It's too risky because if one drive dies, the entire volume fails. One needs very good backups, just in case.

My solution has been to assign partions to paths, as opposed to drive letters. As long as no single directory needs to be larger than the partitioned assigned it works really well. If a drive dies (even the root drive), then all that is lost is what is on that specific drive/directory. Thus, you have the benefits of extensible spanned volumes, without the risk of losing everything.

ddrueding · Jun 16, 2006

I just build 2 servers and keep a manual copy on each. That protects from all possible failure points.

1. Drive/array failure
2. System Failure
3. Housefire (other system is in another building)
4. Accidental deletion (when I bother to maintain my backups.

Bozo · Jun 16, 2006

Stupid question: How can you have a SCSI error on a SATA drive?

Bozo :mrgrn:

P5-133XL · Jun 16, 2006

The drivers simulate the SCSI command set. The SCSI interface into the OS is well known so the programmers of the drivers simply use that API. So when something fails, the OS reports a SCSI error that will percolate back upto the user.

Gilbo · Jul 6, 2006

I finally had some spare time to do some hardware testing.

All the disks are good. So far it's either bad cables or bad motherboard headers. I hope it's the former. The latter would be irritating to say the least... I would not be enthused to be down a server for however many months it'll take me to get a replacement.

Gilbo · Jul 6, 2006

Bah. Cables.

All 3 are bad. I have 5 more of these I bought & I have a feeling they'll all be pieces of crap. I checked and NCIX doesn't sell "Generic SATA cables" anymore otherwise I would link to them and tell you all what pieces of crap they are. I can't even vent violently in the buyer's comments. Oh well...

Gilbo · Jul 7, 2006

I got new cables. Everything seems to be fine now.

The RAID5 array is syncing at over 65MB/s which is quite nice. Shouldn't take nearly as long at this rate. I guess it was going so slow because of the bad cables.

Happy times.

timwhit · Jul 7, 2006

I would have thought it would be pretty difficult to make that crappy of a cable. I wonder if they were ever tested. Maybe they went directly from design to production to shipping without ever being plugged in...

3 of 4 new disk bad? Scsi parity errors on 3 SATA disks.

Gilbo

Storage is cool

Gilbo

Storage is cool

Gilbo

Storage is cool

Mercutio

Fatwah on Western Digital

Gilbo

Storage is cool

Sol

Storage is cool

Gilbo

Storage is cool

time

Storage? I am Storage!

Gilbo

Storage is cool

ddrueding

Fixture

P5-133XL

Xmas '97

ddrueding

Fixture

Bozo

Storage? I am Storage!

P5-133XL

Xmas '97

Gilbo

Storage is cool

Gilbo

Storage is cool

Gilbo

Storage is cool

timwhit

Hairy Aussie