Follow-up on our nVidia RAID problems

I posted earlier about problems with the nVidia RAID on our SunFire servers. Well, I think I now have the root cause, and it is not the Sun hardware, the nVidia RAID, or the 64-bit Windows drivers.

All the problems occurred when we used mirrored pairs of Western Digital 500GB SATA drives that we had bought in a single batch of four. Identical drives bought on another day were fine, as were 500GB drives from Maxtor and Hitachi.

After testing these four drives we found that three of them kept developing low-level, unfixable bad blocks, irrespective of the PC they were used in, Sun or any other brand. It seems that when one of these bad blocks was hit:

  • the nVidia RAID caused the mirrored pair to lose sync and the server hung. When rebooted, the server had two separate drives and the mirror had to be recreated.
  • if Windows Software Mirroring was used we just lost the mirror, but at least the server did not hang. The mirror would try to recover with mixed success, usually failing at the same point each time but sometimes working, hence all our confusion in finding the root cause of the problem.

Given this experience we are staying with Windows Software Mirroring as at least the server does not hang.

Now, in twenty-odd years in this business I have never had three out of four drives fail in a single batch. In fact, I don’t think I have ever had a ‘dead on arrival’ hard disk from any of the big-name brands.

My guess is these four drives were dropped at some point after they left the factory QA department and before we bought them. The faulty three are off back to WD under warranty; I wonder if the fourth will survive. It is certainly not going into any system that is critical.
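
For anyone wanting to keep an eye on a suspect drive like that fourth one, a rough sketch of one approach is below. It is not the test we ran. It assumes the attribute table layout printed by smartmontools' `smartctl -A`, and the helper names and sample text are made up for illustration. It simply flags non-zero raw values for the SMART attributes usually associated with bad or remapped sectors.

```python
# SMART attribute IDs commonly associated with bad or remapped sectors.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} from a `smartctl -A` style table.

    Assumes the usual 10-column layout, with RAW_VALUE in the last column.
    """
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the header row does not.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[int(fields[0])] = int(fields[9])
    return values


def suspect_attributes(values):
    """Return the watched attributes whose raw value is above zero."""
    return {WATCHED[i]: v for i, v in values.items() if i in WATCHED and v > 0}


# Hypothetical sample output, not taken from any of our drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   198   198   140    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1043
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(suspect_attributes(parse_smart_attributes(SAMPLE)))
```

Run on a schedule against real `smartctl -A` output, any growth in these counts would be a hint that the drive is going the same way as the other three.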