Why is mdadm unable to deal with an “almost failed” disk?

In general, a RAID provides, depending on the chosen RAID level, a different balance among the key goals of data redundancy, availability, performance and capacity.

Based on the specific requirements, it is the responsibility of the storage owner to decide which balance of these factors is the right one for the given purpose, in order to build a reliable solution.

The job of the chosen RAID solution (in this case, the software RAID tool mdadm) is to ensure data protection first and foremost. With that in mind, it becomes clear that it is not the job of the RAID solution to weigh business continuity higher than data integrity.

To put it in other words: the job of mdadm is to avoid a failed RAID. As long as a “weirdly behaving disk” is not completely broken, it still contributes to the RAID.

So why not just knock a weirdly behaving disk out of the array, drop a message in the log and keep going? Because doing so would increase the risk of losing data.

I mean, you are right: for the given moment, from a business perspective, simply continuing seems to be the better solution. In reality, however, the message dropped to the log may go unnoticed, so the degraded state of the RAID goes unnoticed as well. Some time later, another disk in the same RAID fails, and as a result the data stored on the failed RAID is gone.
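To make the "unnoticed degraded state" risk a bit more tangible, here is a minimal sketch of a check a storage owner could run on their own. It assumes the usual `/proc/mdstat` layout, in which each array's status line carries a field such as `[2/1]` (expected devices / active devices). The function name `degraded_arrays` and the sample text are made up for illustration; this is not something mdadm itself provides.

```python
import re

# Status field such as "[2/1]": expected device count / active device count.
_STATUS = re.compile(r"\[(\d+)/(\d+)\]")
# Array header line such as "md0 : active raid1 sdb1[1] sda1[0]".
_ARRAY = re.compile(r"^(md\S*)\s*:")


def degraded_arrays(mdstat_text):
    """Return the names of arrays with fewer active devices than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = _ARRAY.match(line)
        if header:
            current = header.group(1)
            continue
        status = _STATUS.search(line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded.append(current)
            current = None  # ignore the remaining lines of this array
    return degraded


# Illustrative sample only; on a real system the text would be read
# from /proc/mdstat.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdc1[0]
      976630336 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

A check like this only helps if somebody actually looks at the result, which is exactly the point made above.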

In addition to that, it is hard to define exactly what a “weirdly behaving disk” is. Put the other way around: what is still acceptable operating behavior for a single disk operated within a disk array?

Some of us may answer: “The disk shows some errors”. Others may answer: “As long as the errors can be corrected, all is fine”. Others may answer: “As long as the disk answers all commands within a given time, all is fine”. Others say it is no longer acceptable “in case the disk temperature differs by more than 5°C from the average temperature of all disks within the same array”. Another answer could be “as long as a scrub reveals no errors”, or “as long as SMART does not show errors”.

This list is neither long nor complete.

The point is that the definition of “acceptable behavior of a disk” is a matter of interpretation, and therefore also the responsibility of the storage owner, and not something that mdadm is supposed to decide on its own.
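As a rough illustration of what "the storage owner decides" could look like, here is a small sketch in which the owner writes the policy down explicitly. Everything in it is hypothetical: the `DiskReport` fields, the `OwnerPolicy` thresholds and the function `policy_violations` are assumptions made for this example, and only the 5°C temperature deviation is taken from the answers listed above. How the numbers are collected (SMART, scrub results, timing measurements) is left open on purpose.

```python
from dataclasses import dataclass


@dataclass
class DiskReport:
    # Hypothetical health snapshot of one disk, gathered by the owner's tooling.
    name: str
    uncorrected_errors: int
    max_command_latency_ms: float
    temperature_c: float


@dataclass
class OwnerPolicy:
    # Thresholds chosen by the storage owner; the defaults are examples only.
    max_uncorrected_errors: int = 0
    max_command_latency_ms: float = 5000.0
    max_temp_deviation_c: float = 5.0


def policy_violations(disk, array_avg_temp_c, policy):
    """Return the reasons why this disk violates the owner's policy."""
    reasons = []
    if disk.uncorrected_errors > policy.max_uncorrected_errors:
        reasons.append("uncorrected errors")
    if disk.max_command_latency_ms > policy.max_command_latency_ms:
        reasons.append("slow command responses")
    if abs(disk.temperature_c - array_avg_temp_c) > policy.max_temp_deviation_c:
        reasons.append("temperature deviates from array average")
    return reasons


policy = OwnerPolicy()
warm_disk = DiskReport("sdb", uncorrected_errors=0,
                       max_command_latency_ms=120.0, temperature_c=48.0)
print(policy_violations(warm_disk, array_avg_temp_c=40.0, policy=policy))
```

What happens with a disk that violates the policy (replace it, mark it as failed by hand, or just keep watching it) remains the owner's call as well.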
