Bug 474436 - FutureFeature Include a script to do Data Scrubbing on Software RAID
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 11
Hardware: All
OS: Linux
Priority: low
Severity: medium
Assigned To: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
Reported: 2008-12-03 15:27 EST by Colin Simpson
Modified: 2009-06-26 12:15 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-06-26 12:15:12 EDT
Description Colin Simpson 2008-12-03 15:27:34 EST
Description of problem:
I reported this as an enhancement against RHEL5, but it probably should go into Fedora first in the hope that it eventually makes it into the Enterprise release.

Disks that are run as a Software RAID can develop bad blocks on sectors that
are never accessed. When a disk in the array fails and you replace the drive,
the rebuild can fail because of previously hidden bad blocks on the remaining
disks (we've recently been bitten by this). As disks get larger, this problem
becomes more likely. On suitably up-to-date kernels it can be mitigated by
so-called "Data Scrubbing". This is a very serious issue: without scrubbing,
a RAID 5 can be less reliable than a RAID 0 with 2 drives (this statistic is
from one of the links below).

Debian has a script, checkarray, that (I'm told) they run weekly from cron. It simply calls

echo check > /sys/block/mdX/md/sync_action

for each of the Software RAIDs in use on the system.
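
A minimal sketch of what such a per-array loop could look like, assuming the arrays show up as /sys/block/md*/md/sync_action. This is not Debian's checkarray; the function name start_md_checks and the SYSFS_ROOT override are hypothetical, added only so the loop can be tried against a scratch directory instead of the real /sys.

```shell
#!/bin/sh
# Sketch: start a "check" on every md array found under sysfs.
# SYSFS_ROOT is a hypothetical override for dry runs; the default is /sys.
start_md_checks() {
    root="${SYSFS_ROOT:-/sys}"
    for action in "$root"/block/md*/md/sync_action; do
        # If the glob matched nothing, the literal pattern comes back; skip it.
        [ -e "$action" ] || continue
        # Only start a check on an array that is not already syncing.
        if [ "$(cat "$action")" = "idle" ]; then
            echo check > "$action"
            echo "started check: $action"
        else
            echo "skipped (busy): $action"
        fi
    done
}
```

On a real system this would be called as root (for example from a weekly cron job) with SYSFS_ROOT unset, so that it writes to the real sysfs files.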


See:
http://www.gentoo-wiki.com/HOWTO_Install_on_Software_RAID#Data_Scrubbing
http://www.ashtech.net/~syntax/blog/archives/53-Data-Scrub-with-Linux-RAID-or-Die.html
http://linux-raid.osdl.org/index.php/RAID_Administration


A similar script should probably be added to Fedora. 

Originally logged at bug #233972 on RHEL5.
Comment 1 Doug Ledford 2009-03-18 14:32:55 EDT
check doesn't actually fix anything, and unless you go looking in /sys/block/md* for the mismatch count, errors still exist and are not repaired.  I added a cron job to the cron.weekly directory that will run a repair operation on all active md raid arrays at the time the cron job is run.  This made it into mdadm-3.0-0.devel3.1.fc11.
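
For reference, the mismatch count mentioned above is exposed per array as md/mismatch_cnt under /sys/block/mdX. A rough sketch of reading it for every array follows; the function name report_md_mismatches and the SYSFS_ROOT override are hypothetical, and this is not the cron job shipped in mdadm.

```shell
#!/bin/sh
# Sketch: print the mismatch count recorded for each md array.
# SYSFS_ROOT is a hypothetical override for dry runs; the default is /sys.
report_md_mismatches() {
    root="${SYSFS_ROOT:-/sys}"
    status=0
    for cnt in "$root"/block/md*/md/mismatch_cnt; do
        # Skip the unexpanded pattern when no arrays exist.
        [ -r "$cnt" ] || continue
        # Derive the array name (e.g. md0) from .../block/md0/md/mismatch_cnt.
        dev=$(basename "$(dirname "$(dirname "$cnt")")")
        n=$(cat "$cnt")
        echo "$dev: mismatch_cnt=$n"
        # Remember that at least one array reported a non-zero count.
        [ "$n" -eq 0 ] || status=1
    done
    return "$status"
}
```

The function returns non-zero when any array reports a non-zero count, so a caller could use the exit status to decide whether to send a notification.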
Comment 2 Colin Simpson 2009-03-18 16:36:07 EDT
Is there any danger in doing a "repair" by default rather than a "check", especially on a RAID 1, instead of just letting it report via "mdadm --monitor"? I'm assuming most failures will be a bad block on one disk, so the "repair" will try to rewrite the bad block on the bad disk only (and will get a block remap on the disk that failed). I'm also assuming that the random pick of a disk with valid data caused by an inconsistent array will be a very rare case. Or have I not explained this well?
Comment 3 Jeffrey Siegal 2009-03-27 00:17:25 EDT
From reading comments from the raid developers, I don't think repair should be done automatically.  The problem is that there is no way for the RAID subsystem to know which of the blocks is the correct one, so it may overwrite the good data with bad.  That sort of recovery, if necessary, should be a manual process.  In some cases it might be better to restore the affected files from backup.

However, even "check" will trigger the normal RAID bad block handling when a read fails (bad block handling meaning the data is recovered from the other drives and written back to the drive whose read failed).  So even the safer "check" has useful scrubbing behavior.

Before adding a script to do an automatic "repair" I would talk to the raid developers.
Comment 4 Bug Zapper 2009-06-09 06:06:17 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 5 Colin Simpson 2009-06-26 12:12:57 EDT
It now looks like this is in Fedora 11, i.e. the script

/etc/cron.weekly/raid-check

I'd imagine this bug can close.

Thanks for putting this in.
Comment 6 Doug Ledford 2009-06-26 12:15:12 EDT
It is, and it's a check instead of a repair, as you requested.  So, yeah, I'll close this out.
