Why the FAA NOTAM system crashed.
As we all know, the FAA NOTAM system went down on May 22. There is an interesting interview/article at Computerworld.com about why it happened. According to the article, it was a hard disk failure. Here’s a question: why was the entire FAA NOTAM system dependent on one hard drive? I know it was a server with multiple hard drives, but hasn’t the FAA ever heard of RAID? Isn’t the whole point of having multiple hard drives in a server box that if one of them fails, the system isn’t compromised?
Fred Coral on Jun 26, 2008
You’re absolutely right that RAID handles a lot of disk problems without data loss. However, the report I read about the disk problem is vague: “What happened was the drive in an end-of-life Sun box failed in the middle of updating the information on the hard drive, so it screwed up the database.” I’m not sure which part of the storage system actually triggered the problem, even though they did say “the drive”. I’d love more detail.
The RAID controller itself could fail. Also, a disk could fail in such a way that the RAID controller cannot tell it is failing. A malfunctioning piece of hardware can effectively lie and say “I’m fine” when it isn’t. It’s like judging whether a person is sane by observing them briefly and listening to what they tell you: often you can tell, but there’s no guarantee. RAID reduces storage problems, and when done well reduces them to nearly none, but it cannot eliminate them entirely.
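To make the “lying disk” point concrete, here is a toy sketch in Python. It is purely illustrative and says nothing about how the FAA’s system or any real RAID controller works: the `Disk` class, the `lies` flag, and the helper functions are all made up for this example. It shows a mirrored pair where one disk stores garbage but still reports success, so a naive mirror read returns bad data, while a read checked against a separately kept checksum catches it.

```python
import zlib


class Disk:
    """Toy disk: maps block number -> bytes. Hypothetical, for illustration only."""

    def __init__(self, lies=False):
        self.blocks = {}
        self.lies = lies  # a lying disk corrupts data but still reports success

    def write(self, n, data):
        if self.lies:
            data = b"\x00" * len(data)  # silently store garbage
        self.blocks[n] = data
        return True  # always claims "I'm fine"

    def read(self, n):
        return self.blocks[n]


def mirror_write(disks, n, data):
    """RAID-1 style write: send the same block to every disk."""
    return all(d.write(n, data) for d in disks)


def mirror_read(disks, n):
    """Naive mirror read: trust the first disk in the set."""
    return disks[0].read(n)


def checked_read(disks, n, expected_crc):
    """Read with an end-to-end checksum: skip any copy that does not match."""
    for d in disks:
        data = d.read(n)
        if zlib.crc32(data) == expected_crc:
            return data
    raise IOError("no copy matches the checksum")


disks = [Disk(lies=True), Disk()]
payload = b"NOTAM record"
ok = mirror_write(disks, 0, payload)   # both disks report success
crc = zlib.crc32(payload)              # checksum kept outside the disks
naive = mirror_read(disks, 0)          # comes from the lying disk
safe = checked_read(disks, 0, crc)     # falls through to the good copy
```

Both writes report success, so the mirror alone has no reason to distrust the first disk. Only the independent checksum exposes the bad copy, and even that only helps if the checksum itself was computed and stored correctly.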
I expect both the FAA and Sun are staying as mum as they can manage regarding details, and would rather focus on how well they are prepared now and in the future.