Description of problem:
Disks that are run as a software RAID can develop bad blocks on sectors that
are never accessed. When a disk in the array fails and you replace the drive,
the rebuild can fail because of previously hidden bad blocks on the remaining
disks (we've recently been bitten by this). As disks get larger, this problem
becomes more likely. On suitably up-to-date kernels it can be mitigated by
so-called "data scrubbing". This is a very serious issue: without scrubbing,
a RAID 5 can be less reliable than a RAID 0 with 2 drives (this statistic is
from one of the links below).
Debian has a script, checkarray, that they run weekly from cron (I'm told). It simply calls
echo check > /sys/block/mdX/md/sync_action
for each of the software RAIDs.
A similar script should probably be added to RHEL and Fedora.
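To make the request concrete, here is a minimal sketch of what such a script could look like. It is not Debian's checkarray. The function name and the optional directory argument are my own assumptions, added so it can be dry-run against a scratch directory instead of the real /sys/block.

```shell
#!/bin/sh
# Sketch only: request a "check" on every md array that is currently idle.
# scrub_arrays and its optional argument are hypothetical names; the
# argument defaults to /sys/block on a real system.
scrub_arrays() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/sync_action; do
        [ -e "$f" ] || continue              # glob matched nothing
        dev="${f%/md/sync_action}"
        state=$(cat "$f")
        if [ "$state" = "idle" ]; then
            echo check > "$f"                # same write as the one-liner above
            echo "requested check on ${dev##*/}"
        else
            # Do not interrupt a resync, recovery, or check already running.
            echo "skipping ${dev##*/}: currently $state"
        fi
    done
}
```

Calling `scrub_arrays` with no argument as root would cover all arrays on the machine. Skipping arrays that are not idle is a design choice of this sketch, not something the kernel requires.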
Any thoughts on this ticket?
The check capability is present in RHEL 5 already, but we don't automatically
initiate check events as those can have negative impacts on both performance and
power consumption. It is left to the user to initiate an event if they choose.
I would highly recommend initiating an event prior to any planned modifications
of the array.
However, I can certainly see shipping a cron.weekly script that simply defaults
to off, but can be enabled by the user for exactly this purpose.
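A rough sketch of the "defaults to off" idea follows. The config file path /etc/sysconfig/md-scrub, the ENABLED variable, and the function name are all hypothetical and only illustrate the opt-in behaviour. This is not what any shipped package actually does.

```shell
#!/bin/sh
# Sketch of an opt-in weekly scrub. The config path, the ENABLED variable,
# and the weekly_scrub name are hypothetical, chosen only for illustration.
weekly_scrub() {
    conf="${1:-/etc/sysconfig/md-scrub}"
    root="${2:-/sys/block}"
    ENABLED=no                               # default: do nothing
    [ -r "$conf" ] && . "$conf"              # the admin may set ENABLED=yes
    [ "$ENABLED" = "yes" ] || return 0
    for f in "$root"/md*/md/sync_action; do
        [ -e "$f" ] || continue              # no md arrays present
        echo check > "$f"
    done
}
```

A file in /etc/cron.weekly would then only need to define and call `weekly_scrub`. With no config file, or with ENABLED left at "no", nothing is written to sysfs, so performance and power consumption are unaffected unless the user opts in.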
This request was evaluated by Red Hat Product Management for
inclusion, but this component is not scheduled to be updated in
the current Red Hat Enterprise Linux release. If you would like
this request to be reviewed for the next minor release, ask your
support representative to set the next rhel-x.y flag to "?".
I'm not so bothered about it making it into a RH minor release, but I think it
should be on your radar for a future major release.
Should I (or can you, as I'm not sure exactly how) put this as a suggestion to
the Fedora team so it may make it into a RH release down the line?
Release note added. If any revisions are required, please set the
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.
The Linux software RAID stack supports data scrubbing: reading the disks in a RAID array to look for bad sectors and, when bad sectors are found, using information from the other disks or from parity to rewrite them with good data. However, the mdadm package did not make use of this functionality. This package adds a cron job to /etc/cron.weekly that checks disks for bad sectors and repairs them when found.
Small note to relnotes:
- change sectors to blocks
- the current version of the script just runs "check", which means the array will be checked for consistency, but nothing will be repaired
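For context on "check" versus "repair": as far as I understand the md sysfs interface, a "check" pass records inconsistencies it finds in md/mismatch_cnt without rewriting them, while writing "repair" to sync_action also rewrites them. A small sketch for listing arrays with a non-zero count after a check (the function name and the directory argument are hypothetical):

```shell
#!/bin/sh
# Sketch only: print "<array>: <count>" for every md array whose
# mismatch_cnt is non-zero. report_mismatches is a hypothetical name;
# the argument defaults to /sys/block on a real system.
report_mismatches() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/mismatch_cnt; do
        [ -e "$f" ] || continue              # no md arrays present
        count=$(cat "$f")
        dev="${f%/md/mismatch_cnt}"
        if [ "$count" -gt 0 ]; then
            echo "${dev##*/}: $count"
        fi
    done
}
```

Whether a non-zero count deserves a follow-up "repair" is left to the admin. The sketch only reports.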
/me slaps his face, to read better next time, please ignore comment #9
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.