Bug 474436

Summary:	FutureFeature Include a script to do Data Scrubbing on Software RAID
Product:	[Fedora] Fedora	Reporter:	Colin.Simpson
Component:	mdadm	Assignee:	Doug Ledford <dledford>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	low
Version:	11	CC:	dledford, jbs
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-06-26 16:15:12 UTC	Type:	---

Description Colin.Simpson 2008-12-03 20:27:34 UTC

Description of problem:
I reported this as an enhancement to RHEL5 but it probably should go into Fedora first in the hope it eventually makes it into the Enterprise Release.

Disks in a software RAID can develop bad blocks on sectors that are rarely or
never accessed. When a disk in the array fails and you replace it, the rebuild
can fail because of previously hidden bad blocks on the remaining disks (we've
recently been bitten by this). As disks get larger, this problem becomes more
likely. It can be mitigated on suitably up-to-date kernels by so-called "data
scrubbing". This is a very serious issue: without scrubbing, a RAID 5 can be
less reliable than a RAID 0 with 2 drives (this statistic comes from one of the
links below).

Debian has a script, checkarray, that (I'm told) is run weekly from cron and simply calls

echo check > /sys/block/mdX/md/sync_action

for each of the software RAID arrays in use on the system.
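
As a rough illustration of that per-array loop (this is not Debian's actual checkarray script), a minimal sketch might look like the following. The function name scrub_all and the SYSFS_ROOT override are hypothetical, added only so the loop can be tried against a scratch directory. The "idle" test assumes the kernel reports an idle array that way in sync_action.

```shell
#!/bin/sh
# Sketch only: start a "check" pass on every md array found under a sysfs root.
# SYSFS_ROOT is a hypothetical override (default: /sys) for dry runs; it is
# not part of mdadm or of Debian's checkarray.

scrub_all() {
    root="${SYSFS_ROOT:-/sys}"
    for action in "$root"/block/md*/md/sync_action; do
        # Skip the unexpanded glob (no arrays) and files we cannot write.
        [ -w "$action" ] || continue
        # Leave arrays alone if they are already resyncing, recovering, etc.
        state=$(cat "$action")
        if [ "$state" = "idle" ]; then
            echo check > "$action"
        fi
    done
}
```

Calling scrub_all as root from a weekly cron job would approximate the behaviour described above. Skipping arrays that are not idle is a design choice of this sketch, not something the report specifies.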


See:
http://www.gentoo-wiki.com/HOWTO_Install_on_Software_RAID#Data_Scrubbing
http://www.ashtech.net/~syntax/blog/archives/53-Data-Scrub-with-Linux-RAID-or-Die.html
http://linux-raid.osdl.org/index.php/RAID_Administration


A similar script should probably be added to Fedora. 

Originally logged at bug #233972 on RHEL5.

Comment 1 Doug Ledford 2009-03-18 18:32:55 UTC

A "check" doesn't actually fix anything: unless you go looking in /sys/block/md* for the mismatch count, any errors still exist and are not repaired.  I added a cron job to the cron.weekly directory that runs a repair operation on all md RAID arrays that are active when the job runs.  This made it into mdadm-3.0-0.devel3.1.fc11.
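
For reference, the mismatch count is exposed per array as /sys/block/mdX/md/mismatch_cnt. A minimal sketch of "going looking" for it could be the following. The function name report_mismatches and the SYSFS_ROOT override are hypothetical and exist only for illustration; this is not the cron job that shipped in mdadm.

```shell
#!/bin/sh
# Sketch only: print every md array whose mismatch_cnt is non-zero.
# SYSFS_ROOT is a hypothetical override (default: /sys) for dry runs.

report_mismatches() {
    root="${SYSFS_ROOT:-/sys}"
    for cnt in "$root"/block/md*/md/mismatch_cnt; do
        # Skip the unexpanded glob when no arrays exist.
        [ -r "$cnt" ] || continue
        # Derive the array name (e.g. md0) from the path.
        md_dir=${cnt%/md/mismatch_cnt}
        name=${md_dir##*/}
        count=$(cat "$cnt")
        if [ "$count" -ne 0 ]; then
            echo "$name: mismatch_cnt=$count"
        fi
    done
}
```

Run after a "check" has finished, this would list only the arrays that need a closer look. What to do about a non-zero count is left to the administrator.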

Comment 2 Colin.Simpson 2009-03-18 20:36:07 UTC

Is there any danger in doing a "repair" by default rather than a "check", especially on a RAID 1, rather than just letting it report via "mdadm --monitor"? I'm assuming most failures will be a bad block on one disk, so the "repair" will try to rewrite the bad block on the bad disk only (and will trigger a block remap on the disk that failed), and that the random pick of a disk with valid data caused by an inconsistent array will be a very rare case. Or haven't I explained this well?

Comment 3 Jeffrey Siegal 2009-03-27 04:17:25 UTC

From reading comments from the RAID developers, I don't think repair should be done automatically.  The problem is that there is no way for the RAID subsystem to know which of the blocks is the correct one, so it may overwrite the good data with bad.  That sort of recovery, if necessary, should be a manual process.  In some cases it might be better to restore the affected files from backup.

However, even "check" will trigger the normal RAID bad block handling when a read fails (that is, recovering the data from the other drives and writing it back to the drive whose read failed).  So even the safer "check" has useful scrubbing behavior.

Before adding a script to do an automatic "repair", I would talk to the RAID developers.

Comment 4 Bug Zapper 2009-06-09 10:06:17 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 5 Colin.Simpson 2009-06-26 16:12:57 UTC

It looks like this is now in Fedora 11, i.e. the script

/etc/cron.weekly/raid-check

I'd imagine this bug can close.

Thanks for putting this in.

Comment 6 Doug Ledford 2009-06-26 16:15:12 UTC

It is, and it's a check instead of a repair, as you requested.  So, yeah, I'll close this out.