Crucial M4 firmware bug

CougTek

Hairy Aussie
Joined
Jan 21, 2002
Messages
8,729
Location
Québec, Québec
Apparently, the Crucial M4 SSD has a firmware bug that causes it to crash after 5200 hours of operation. I read this on a French news site, but you can always read that thread in Crucial's forum in which they acknowledge the problem. Crucial's employees are supposed to be working on a solution.
 

LiamC

Storage Is My Life
Joined
Feb 7, 2002
Messages
2,016
Location
Canberra
I laughed at the Sandforce haters holding up Crucial & Intel as the bastions of reliability. They all have bugs. SSDs, by the nature of their controllers, will be buggier than mechanical drives for some time to come.
 

ddrueding

Fixture
Joined
Feb 4, 2002
Messages
19,742
Location
Horsens, Denmark
LiamC said: "I laughed at the Sandforce haters holding up Crucial & Intel as the bastions of reliability. They all have bugs. SSDs, by the nature of their controllers, will be buggier than mechanical drives for some time to come."

Indeed. I've had two SSDs fail so far, an OCZ Vertex after 18+ months and an Intel 5xx after 2 months.
 

DrunkenBastard

Storage is cool
Joined
Jan 21, 2002
Messages
775
Location
on the floor
Ok I'm running a little slow tonight.

Does this mean that after 5200 power on hours, the turdburger starts pooping the bed after 1 hour of power? I understand that SSDs are finite life devices unlike a rotating HDD, but this is taking the cake.

My 128GB M4 is still waiting for installation in a computer, prob a good thing I didn't hurry on that...
 

time

Storage? I am Storage!
Joined
Jan 18, 2002
Messages
4,932
Location
Brisbane, Oz
The official reps are saying it affects "some" customers. Any fool can see that only "some" customers will have been running their drives 24x7, so very likely it's only a matter of time before *all* customers are affected.

How many have they sold? A million? More?

I'm expecting Crucial (Micron) to go into denial mode - the lawsuits are going to kill the company. This isn't something beyond their control at all.
 

time

Storage? I am Storage!
Joined
Jan 18, 2002
Messages
4,932
Location
Brisbane, Oz
I have a site with 7 of these drives in continuous operation since June (plus others that get switched off). That means they may *all* blow up sometime in the next couple of weeks. Unfortunately, the site is in a different f**king country, so f**k knows how I'm going to deal with this. Crucial isn't planning on releasing a firmware fix for another couple of weeks yet, and officially, there's no information on whether it even *can* be fixed with a firmware update, or for that matter, on the nature of the problem, or even that there is a problem!

Even a firmware update is non-trivial; it doesn't work with AHCI and 4 of the drives will have to be sequentially removed from an array and connected to an IDE port in a suitable host. After lots of backups, of course.

F****************************************************CK!
 

CougTek

Hairy Aussie
Joined
Jan 21, 2002
Messages
8,729
Location
Québec, Québec
Time, I'm sorry if I ruined your day by posting this thread. I figured that you could at least be prepared knowing what's coming rather than waking up one morning and being utterly screwed.

You should contact your customer immediately to back up the drives and put in replacement drives before they fail.
 

Bozo

Storage? I am Storage!
Joined
Feb 12, 2002
Messages
4,396
Location
Twilight Zone
I have 10 of those drives to be installed in new computers. Looks like I'll wait a few weeks.
 

LiamC

Storage Is My Life
Joined
Feb 7, 2002
Messages
2,016
Location
Canberra
Given that Sandforce, Marvell and Intel controllers have all had issues lately, that leaves .... Samsung as being reliable. Funny that.
 

CougTek

Hairy Aussie
Joined
Jan 21, 2002
Messages
8,729
Location
Québec, Québec
LiamC said: "Given that Sandforce, Marvell and Intel controllers have all had issues lately, that leaves .... Samsung as being reliable. Funny that."
I remember a link to a reliability survey or study some months ago and Samsung wasn't very good on the SSD front. I would still buy Crucial or Intel drives over Samsung's.

The 5200 POH bug on the Crucial drives is odd, though, as it should have been easy to spot during the reliability check of the controller.
 

time

Storage? I am Storage!
Joined
Jan 18, 2002
Messages
4,932
Location
Brisbane, Oz
Spoke to Crucial. They officially confirm that:

  • It's 5200 hours and not 5000 hours
  • The firmware update will fix it even if the problem has started
  • The firmware update will be available Monday
I estimate I will need to allow 2 days onsite (1 day if everything goes perfectly, and an extra day to rebuild the server when things go tits up in the real world). Because of time differences, limited flights, etc, it will take 3.5 days of my time plus about $1500 in costs.

I should probably try to find someone local on the ground, but for ad hoc jobs, that usually ends in tears. No, we didn't want a retail copy of Windows Ultimate, thanks very much. No, we don't want a malware disinfection or a data recovery. No, the Kingston RAM is just fine, we don't need the overclocked OCZ with giant heatsinks. What's that you say? The PC wouldn't boot so you used your special repair disk, which reformatted the drive, so you need to take it away for a week and make magical passes over it?

I suppose I should test this out on a customer nearby first ...
 

CougTek

Hairy Aussie
Joined
Jan 21, 2002
Messages
8,729
Location
Québec, Québec
If it affected the C300, it would have been all over the news a while ago since most of those drives have had more than 5200 POH.
 

time

Storage? I am Storage!
Joined
Jan 18, 2002
Messages
4,932
Location
Brisbane, Oz
Based on feedback in the Crucial forums, the firmware update appears to install successfully except for a couple of oddities such as a MacBook.

It does not work with SAS expanders, and possibly not with any hardware RAID.

We found that it also didn't work with Marvell SATA (had to move the drives to Intel SATA ports).

AMD SATA (780G) also worked fine, with all 5 drives hanging off one controller identified and updated okay. It was just a case of switching from RAID to AHCI and back again after the update.

The updater was able to handle both IDE and AHCI modes.
 

time

Storage? I am Storage!
Joined
Jan 18, 2002
Messages
4,932
Location
Brisbane, Oz
"So no biggies then. Just 3 days and a half of your time and 1500$. Almost a non-event."

No, once I was satisfied that Crucial had actually admitted the cause of the problem as a software bug (in handling SMART values), and that the update worked for all normal people, I elected to get someone local to do it. Another plus was the support for both AHCI and IDE. Working around users, it only took about half a day (with me helping by running backups remotely and supervising over the phone), so pretty much best case scenario.

I estimated that we had 8 days to go before the first of the drives went belly up. So I really didn't want to put it off for another week.
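For anyone wanting to redo that kind of estimate, here is a rough sketch in Python. The 5200-hour threshold is the figure quoted in this thread; the function names, the 5008-hour reading, and the start date are made up for illustration. It assumes the drive accumulates power-on hours at a steady rate.

```python
from datetime import datetime, timedelta

THRESHOLD_HOURS = 5200  # figure quoted in this thread


def hours_left(power_on_hours, threshold=THRESHOLD_HOURS):
    """Hours remaining before the counter reaches the threshold (never negative)."""
    return max(threshold - power_on_hours, 0)


def projected_date(power_on_hours, now, hours_per_day=24.0):
    """Date at which the threshold is reached, assuming steady daily usage."""
    days = hours_left(power_on_hours) / hours_per_day
    return now + timedelta(days=days)


# Hypothetical example: a drive reading 5008 hours, checked on an arbitrary date.
start = datetime(2000, 1, 1)
print(hours_left(5008))                             # 192 hours, i.e. 8 days at 24x7
print(projected_date(5008, start))                  # running around the clock
print(projected_date(5008, start, hours_per_day=8)) # office-hours workstation
```

A machine that is switched off at night gets proportionally longer, which is why the always-on boxes are the ones to worry about first.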

With the benefit of hindsight, I'd consider taking a risk on multiple workstations without any new data and just skip the backup. So you could really rip through quite a few non-Marvell machines with that method: insert CD, reboot, choose drive, remove CD, reboot. Of course, explaining to the one person in 50 with a bricked drive that they'll have to wait for a rebuild ...

Having said all that, the stress took a couple of years off my life, and I don't have any cost recovery for all my time on this. So f*ck you, Crucial.
 

LunarMist

I can't believe I'm a Fixture
Joined
Feb 1, 2003
Messages
17,497
Location
USA
Is that 5200 hours continuously or cumulatively? How can one determine the hours used?
 

Bozo

Storage? I am Storage!
Joined
Feb 12, 2002
Messages
4,396
Location
Twilight Zone
I believe that hours 'on' is in the SMART data on the hard drive.
It's hours cumulatively.
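If you would rather script the check than read it off a GUI tool, something like the sketch below could work. It assumes you already have the attribute table as text (for example from smartmontools' `smartctl -A`), and that the drive reports power-on hours as attribute 9 in the usual ten-column layout. The sample table and its values are invented for illustration, not taken from a real M4.

```python
import re

THRESHOLD_HOURS = 5200  # figure quoted in this thread


def power_on_hours(smart_text):
    """Return the raw value of SMART attribute 9 (power-on hours), or None."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == "9":
            match = re.match(r"\d+", fields[9])  # tolerate forms like "5012h+30m"
            if match:
                return int(match.group())
    return None


# Invented sample in the style of an attribute table (values are illustrative).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       5012
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       41
"""

hours = power_on_hours(SAMPLE)
print(hours, "hours;", THRESHOLD_HOURS - hours, "to go")  # 5012 hours; 188 to go
```

If the function returns None, the drive either doesn't expose attribute 9 or formats the table differently, so check the raw output by eye.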
 

LunarMist

I can't believe I'm a Fixture
Joined
Feb 1, 2003
Messages
17,497
Location
USA
I probably don't need to worry until the summer in that case. :)
 

LunarMist

I can't believe I'm a Fixture
Joined
Feb 1, 2003
Messages
17,497
Location
USA
Is that cumulative to include the other bug fix? It has not happened yet to the M4 I bought, but time may be running out.
 