Bug 474436 - FutureFeature Include a script to do Data Scrubbing on Software RAID
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 11
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-12-03 20:27 UTC by Colin.Simpson
Modified: 2009-06-26 16:15 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-06-26 16:15:12 UTC
Type: ---
Embargoed:



Description Colin.Simpson 2008-12-03 20:27:34 UTC
Description of problem:
I reported this as an enhancement to RHEL5 but it probably should go into Fedora first in the hope it eventually makes it into the Enterprise Release.

Disks that are run as a Software RAID can develop bad blocks on unaccessed 
sectors of the disk. When a disk fails in the array and you replace the drive, 
it can fail to rebuild due to previously hidden bad blocks on the remaining 
disks (we've recently been bitten by this). As disks get larger this problem
becomes more likely. This can be mitigated on suitably up-to-date kernels by
so-called "Data Scrubbing". This is a very serious issue: without scrubbing,
a RAID 5 can be less reliable than a RAID 0 with 2 drives (this statistic comes
from one of the links below).

Debian has a script, checkarray, that they run weekly from cron (I'm told). It simply calls

echo check > /sys/block/mdX/md/sync_action

for each of the software RAID arrays in use on the system.
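As a rough illustration of the approach described above, the per-array write could be wrapped in a loop like the one below. This is only a sketch under stated assumptions: the function name start_checks and its root-directory argument are hypothetical, and it is not Debian's actual checkarray script. It only starts a check on arrays whose sync_action currently reads "idle", so a running resync or recovery is left alone.

```shell
#!/bin/sh
# Sketch only: start a "check" pass on every md array under a sysfs root.
# start_checks and its argument are hypothetical names for illustration;
# this is not Debian's checkarray.

start_checks() {
    root="$1"
    for action in "$root"/md*/md/sync_action; do
        # Skip the unexpanded glob when no md arrays exist.
        [ -e "$action" ] || continue
        # Only start a check when the array is idle, so a running
        # resync or recovery is not disturbed.
        if [ "$(cat "$action")" = "idle" ]; then
            echo check > "$action"
            echo "started check via $action"
        fi
    done
}
```

Calling start_checks /sys/block as root (for example from a weekly cron job) would then request a check on each idle array. Taking the root directory as an argument is a design choice that lets the loop be tried against a scratch directory first.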


See:
http://www.gentoo-wiki.com/HOWTO_Install_on_Software_RAID#Data_Scrubbing
http://www.ashtech.net/~syntax/blog/archives/53-Data-Scrub-with-Linux-RAID-or-Die.html
http://linux-raid.osdl.org/index.php/RAID_Administration


A similar script should probably be added to Fedora. 

Originally logged at bug #233972 on RHEL5.

Comment 1 Doug Ledford 2009-03-18 18:32:55 UTC
A check doesn't actually fix anything, and unless you go looking in /sys/block/md* for the mismatch count, the errors it finds go unnoticed and are not repaired.  I added a cron job to the cron.weekly directory that runs a repair operation on all md RAID arrays that are active at the time the cron job runs.  This made it into mdadm-3.0-0.devel3.1.fc11.
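For reference, the mismatch count mentioned above is exposed per array as the mismatch_cnt file in the array's md sysfs directory. A minimal sketch of reading it after a check has finished might look like this; the function name report_mismatches and its root-directory argument are hypothetical, and this is not the cron job that was added to mdadm.

```shell
#!/bin/sh
# Sketch only: list md arrays whose mismatch_cnt is non-zero.
# report_mismatches and its argument are hypothetical names for illustration.

report_mismatches() {
    root="$1"
    for cnt_file in "$root"/md*/md/mismatch_cnt; do
        # Skip the unexpanded glob when no md arrays exist.
        [ -e "$cnt_file" ] || continue
        count=$(cat "$cnt_file")
        if [ "$count" -ne 0 ]; then
            # Path is <root>/mdX/md/mismatch_cnt; strip two components to get mdX.
            array=$(basename "$(dirname "$(dirname "$cnt_file")")")
            echo "$array: mismatch_cnt=$count"
        fi
    done
}
```

Running report_mismatches /sys/block prints one line per array with a non-zero count and nothing when every array is clean.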

Comment 2 Colin.Simpson 2009-03-18 20:36:07 UTC
Is there any danger in doing a "repair" by default rather than a "check", especially on a RAID 1, rather than just letting it report via "mdadm --monitor"? I'm assuming most failures will be a bad block on one disk, so the "repair" will try to rewrite the bad block on the bad disk only (and the failing disk will remap the block). And the random pick of a disk with valid data, caused by an inconsistent array, will be a very rare case. Or haven't I explained that well?

Comment 3 Jeffrey Siegal 2009-03-27 04:17:25 UTC
From reading comments from the raid developers, I don't think repair should be done automatically.  The problem is that there is no way for the RAID subsystem to know which of the blocks is the correct one, so it may overwrite the good data with bad.  That sort of recovery, if necessary, should be a manual process.  In some cases it might be better to restore the affected files from backup.

However, even "check" will trigger the normal RAID bad block handling when a read fails (bad block handling meaning: recover the data from the other drives and write it back to the drive that failed the read).  So even the safer "check" has useful scrubbing behavior.

Before adding a script to do an automatic "repair", I would talk to the raid developers.

Comment 4 Bug Zapper 2009-06-09 10:06:17 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 5 Colin.Simpson 2009-06-26 16:12:57 UTC
It now looks like this is in Fedora 11, i.e. the script

/etc/cron.weekly/raid-check

I'd imagine this bug can close.

Thanks for putting this in

Comment 6 Doug Ledford 2009-06-26 16:15:12 UTC
It is, and it's a check instead of a repair, as you requested.  So, yeah, I'll close this out.

