I'm close to giving up, but thought I'd at least try to see if anyone has had a similar problem. This is rather long, but the information is pertinent.
Five days ago, I rebooted my server and the BIOS for my RAID controller came up and reported that one of the two RAID5 arrays was in an "INOPERABLE" state. Here are the specifications for my setup:
RAID Controller: 3Ware 9500S-8 Escalade PCI-X card
Drives: 8x 750GB Seagate ST375033AS configured as follows:
-Two RAID5 arrays of 4 drives each (created and managed by the controller card)
-The two arrays are striped together as a RAID0 (managed in software by mdadm in Ubuntu Linux); a rough sketch of this layering follows just below
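For anyone picturing the layout, this is roughly what the software layer looks like. This is not my exact original create command, just a sketch; /dev/md0 is a placeholder for whatever name your RAID0 gets:

    # Each 4-drive RAID5 unit is exported by the 9500S-8 as a single block device
    # (they show up to Linux as /dev/hdb and /dev/hdc, each with one partition).
    # The software layer is just a plain two-device mdadm stripe over those partitions:
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/hdb1 /dev/hdc1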
In the INOPERABLE array, three drives were reported OK by the BIOS, and one didn't show up at all. I shut down, pulled the drive that wasn't showing up, and popped it into another machine to see if there was something wrong with it. The other machine's BIOS didn't even recognize it as a drive, and I could hear that it wasn't really spinning up.
When I built this machine and created the arrays, I had purchased an extra identical drive, knowing that someday I would probably need a replacement. Thinking at this point that the failed array was just degraded and needed to rebuild, I popped in the spare drive, rebooted, and went into the BIOS. The BIOS would not let me add the new drive to the array or start a rebuild (which I had had to do once before on this controller, so I was fairly sure I was following the right procedure).
A night of increasingly panicked Google searching made me realize that "INOPERABLE" is not the same as "DEGRADED": it means two drives had actually failed, even though the BIOS was still marking three of them as "OK". I immediately submitted a problem report through 3Ware's technical support website and got an email response within an hour, asking me to run their diagnostic tool from a bootable flash drive, gather the logs, and send them in for analysis, which I did as soon as possible.
Tech support emailed back and said someone would contact me on Monday (their regular hours are M-F, 7-5). I got some email on Monday asking a few more questions, then talked to a technician on the phone on Tuesday. His assessment was that one of the drives had gone offline at least long enough for the array to drop to a degraded state some time before the complete failure last Friday, and then the second drive went offline. He asked me to look through the Linux system logs, and indeed, on 11/21 the 3Ware driver logged a message saying the array in question had gone to "DEGRADED" state. So he said I was probably left with two options: try to get the completely dead drive working again, or use a program he could give me to force the other drive (the one that went offline but was still apparently functioning) back into the array. Its data would be from November 21st, but given how little I actually write to this array, it might still be good enough to recover some data.
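For anyone wanting to check their own logs for the same kind of event, this is roughly how I found that entry. The log paths and driver tag are whatever your distro uses; on my Ubuntu box the 9500S shows up under the 3w-9xxx driver, and older syslogs are rotated and gzipped:

    # Current and most recently rotated syslog
    grep -i "3w-9xxx" /var/log/syslog /var/log/syslog.1
    # Older rotated logs are compressed, so use zgrep
    zgrep -iE "3w-9xxx|degraded" /var/log/syslog.*.gz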
With nine identical drives on hand, I thought it was worth trying to swap the dead drive's PCB with one from the others just to see if that was the problem; if it was, I had found a site where I could order an identical PCB (or just use the one off my spare drive), bring the dead drive back to life, and get my data off. I tried all eight of the PCBs from the other drives, but none of them got the dead drive to power up (and when I tried the dead drive's PCB on one of the other drives, that drive did spin up but had a horrible time doing anything). So I think that option is out.
I went ahead and ran the program he gave me to force the 11/21 drive back into the array. Crossing my fingers, I booted back into Linux and ran several mdadm commands to see if there was any chance the RAID0 (of the two RAID5 arrays) would come back to life, even with some corrupted files. No immediate luck: the bad array (/dev/hdc) was showing up to the system as an empty drive, while the good array (/dev/hdb) still showed up as part of a RAID0 with all its headers and superblocks intact. I was able to see the specifications of the original RAID0 (chunk size, etc.) by running mdadm --examine on /dev/hdb1.
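In case it helps anyone suggest something, these are the sorts of commands I was poking around with (output omitted; /dev/md0 is just whatever name the RAID0 gets assembled under on your system):

    cat /proc/mdstat                 # what md arrays the kernel currently sees
    mdadm --examine /dev/hdb1        # dump the md superblock: level, chunk size, UUID, member count
    mdadm --examine /dev/hdc1        # fails for me, since hdc has no partition/superblock left
    mdadm --detail /dev/md0          # details of the assembled array, if it assembles at all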
I found this thread and attempted some of the commands, but I think /dev/hdc had lost not only its superblocks but also its partition table, because every recreate command I tried failed with /dev/hdc1 not found (/dev/hdb1 was obviously still available). I tried zeroing the superblocks as the post describes and recreating with /dev/hdb and /dev/hdc (and --assume-clean), but that didn't seem to take either.
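This is roughly what I tried, reconstructed from memory. The chunk size below is a placeholder; the real value is whatever --examine reported on /dev/hdb1. And obviously don't run a --create like this unless you've already accepted that the on-disk md metadata is being rewritten:

    # Wipe whatever stale md superblocks are left on the two controller-exported units
    mdadm --zero-superblock /dev/hdb /dev/hdc
    # Recreate the two-device stripe in the same order and with the same chunk size
    # as the original, without initializing anything (--assume-clean)
    mdadm --create /dev/md0 --level=0 --chunk=64 --raid-devices=2 \
        --assume-clean /dev/hdb /dev/hdc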
I realize the chances of any of this working are extremely slim... but I'd still like to keep trying. I am fortunate to have a full backup of all the data from 5/28; the most difficult loss will be seven months of pictures, and the rest can be restored with time and money. But I hate to give up without trying everything I can. RAID is not backup... RAID is not backup... RAID is not backup...
Any additional thoughts would be much appreciated.