Red Hat Bugzilla – Bug 474436
FutureFeature Include a script to do Data Scrubbing on Software RAID
Last modified: 2009-06-26 12:15:12 EDT
Description of problem:
I reported this as an enhancement to RHEL5 but it probably should go into Fedora first in the hope it eventually makes it into the Enterprise Release.
Disks that are run as a Software RAID can develop bad blocks on unaccessed
sectors of the disk. When a disk fails in the array and you replace the drive,
it can fail to rebuild due to previously hidden bad blocks on the remaining
disks (we've recently been bitten by this). As disks get larger this problem
becomes more likely. On suitably up-to-date kernels it can be mitigated by
so-called "Data Scrubbing". This is a very serious issue: without scrubbing,
a RAID 5 can be less reliable than a RAID 0 with 2 drives (this statistic comes
from one of the links below).
Debian has a script, checkarray, that (I'm told) runs weekly from cron and simply calls
echo check > /sys/block/mdX/md/sync_action
for each of the software RAID arrays in use on the system.
A similar script should probably be added to Fedora.
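For illustration, a minimal sketch of what such a script might look like, assuming the sysfs layout quoted above. The function name scrub_all and its optional root argument are made up for this example and are not taken from Debian's checkarray or from mdadm. It only starts a check on arrays whose sync_action currently reads "idle".

```shell
#!/bin/sh
# Sketch only: request a "check" pass on every md array under a sysfs root.
# scrub_all and its argument are illustrative names (assumptions).
scrub_all() {
    root="${1:-/sys/block}"
    for action in "$root"/md*/md/sync_action; do
        # If no arrays exist the glob stays unexpanded; skip it.
        [ -e "$action" ] || continue
        # Leave arrays alone while a resync, recovery or check is running.
        if [ "$(cat "$action")" = "idle" ]; then
            echo check > "$action"
        fi
    done
}
```

A weekly cron job could then simply call scrub_all with no argument, as root.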
Originally logged at bug #233972 on RHEL5.
check doesn't actually fix anything: unless you go looking in /sys/block/md* for the mismatch count, any errors still exist and are not repaired. I added a cron job to the cron.weekly directory that will run a repair operation on all active md raid arrays at the time the cron job is run. This made it into mdadm-3.0-0.devel3.1.fc11.
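As a rough sketch of the "go looking" step, the following reads the per-array mismatch_cnt file under /sys/block/mdX/md/ and prints only the arrays with a non-zero count. The function name report_mismatches and its optional root argument are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: list md arrays whose last check found mismatches.
# report_mismatches and its argument are illustrative names (assumptions).
report_mismatches() {
    root="${1:-/sys/block}"
    for cnt in "$root"/md*/md/mismatch_cnt; do
        # If no arrays exist the glob stays unexpanded; skip it.
        [ -e "$cnt" ] || continue
        dev=${cnt#"$root"/}   # drop the sysfs root prefix
        dev=${dev%%/*}        # keep only the device name, e.g. md0
        n=$(cat "$cnt")
        if [ "$n" -ne 0 ]; then
            echo "$dev: mismatch_cnt=$n"
        fi
    done
}
```

Run from cron after a check has finished, any output would typically end up in the cron mail, so nothing is sent when all counts are zero.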
Is there any danger in doing a "repair" by default rather than a "check", especially on a RAID 1, rather than just letting it report via "mdadm --monitor"? I'm assuming most failures will be a bad block on one disk, so the "repair" will try to rewrite the bad block on the bad disk only (and will get a block remap on the disk that failed), and that the random pick of a disk with valid data caused by an inconsistent array will be a very rare case. Or haven't I explained this well?
From reading comments from the RAID developers, I don't think repair should be done automatically. The problem is that there is no way for the RAID subsystem to know which of the blocks is the correct one, so it may overwrite the good data with bad. That sort of recovery, if necessary, should be a manual process. In some cases it might be better to restore the affected files from backup.
However, even "check" will trigger the normal RAID bad block handling when the read fails (bad block handling meaning recover the data from the other drives and write it back to the drive that failed read). So even the safer "check" has useful scrubbing behavior.
Before adding a script to do an automatic "repair", I would talk to the RAID developers.
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.
More information and reason for this action is here:
It now looks like this, i.e. the script, is in Fedora 11.
I'd imagine this bug can be closed.
Thanks for putting this in.
It is, and it's a check instead of a repair, as you requested. So, yeah, I'll close this out.