Bug 1325654

Summary: raid-6 bit-rot detection & repair
Product: Red Hat Enterprise Linux 7 Reporter: Frank Ch. Eigler <fche>
Component: kernel Assignee: Nigel Croxon <ncroxon>
kernel sub component: Multiple Devices (MD) QA Contact: guazhang <guazhang>
Status: CLOSED WONTFIX Docs Contact:
Severity: unspecified    
Priority: unspecified CC: fweimer, mbenitez, ncroxon, xni, yizhan
Version: 7.2 Keywords: FutureFeature
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-02 15:51:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Frank Ch. Eigler 2016-04-10 12:08:48 UTC
md-raid6's "repair" scan mode is documented as possibly repairing raid6 mismatches amongst the component drives.  But it is implemented by assuming that, if a verification scan finds a mismatch, both parity drives must be wrong, and so the parity drives are rewritten.  If the actual bit-rot was on one of the data drives, this then ***propagates that data corruption*** irreversibly.

md-raid6 has enough redundancy to correct any one drive's worth of bit-rot.  "repair" mode should be changed to exploit that redundancy: it should attempt to rewrite exactly the bad areas, maybe even on a byte-by-byte basis, rather than always rewriting the parity drives.

md-raid6 probably has enough redundancy to detect two drives' worth of overlapping bit-rot errors, which it could signal, and then refuse to propagate or make worse.  More than two drives' worth of overlapping errors are probably not reliably diagnosable.
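
To make the two paragraphs above concrete, here is a minimal user-space sketch of how one byte column of a stripe could be diagnosed from its P and Q parity. It is not the md kernel code: the names `gf_mul`, `syndromes` and `diagnose` are hypothetical, and it assumes the usual RAID-6 arithmetic over GF(2^8) with polynomial 0x11d and generator 2. If exactly one drive is wrong, the two parity differences identify it; if they point at a drive that does not exist, the column is reported as uncorrectable and left alone.

```python
# Log/antilog tables for GF(2^8), polynomial 0x11d, generator 2 (assumed).
GF_EXP = [0] * 510
GF_LOG = [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 510):
    GF_EXP[i] = GF_EXP[i - 255]  # doubled table avoids a modulo in gf_mul


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]


def syndromes(data):
    """Return (P, Q) for one byte taken from each data drive."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d                      # P = D0 ^ D1 ^ ...
        q ^= gf_mul(GF_EXP[i], d)   # Q = g^0*D0 ^ g^1*D1 ^ ...
    return p, q


def diagnose(data, p_stored, q_stored):
    """Return (verdict, data_drive_index, corrected_byte)."""
    p_calc, q_calc = syndromes(data)
    dp = p_calc ^ p_stored
    dq = q_calc ^ q_stored
    if dp == 0 and dq == 0:
        return ("ok", None, None)
    if dq == 0:
        return ("P", None, p_calc)   # only P disagrees: rewrite P
    if dp == 0:
        return ("Q", None, q_calc)   # only Q disagrees: rewrite Q
    # One bad data drive z with error E gives dp = E and dq = g^z * E,
    # so z is the difference of the discrete logs.
    z = (GF_LOG[dq] - GF_LOG[dp]) % 255
    if z < len(data):
        return ("data", z, data[z] ^ dp)
    return ("uncorrectable", None, None)  # more than one drive is wrong


stripe = [0x11, 0x22, 0x33, 0x44]
p, q = syndromes(stripe)
stripe[2] ^= 0xAA                  # simulate bit-rot on data drive 2
print(diagnose(stripe, p, q))      # ('data', 2, 51), i.e. restore 0x33
```

Within a single byte, errors on two drives can happen to alias to a valid drive index, so a real implementation would presumably want every mismatching byte in a stripe to agree on the same drive before rewriting anything, and signal an error otherwise.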

This change would make md-raid6 a reasonable defence against bit-rot, even with overlying filesystems that have no data checksumming features, and with normal applications that cannot do error detection/correction on their files.

Comment 2 Jes Sorensen 2016-04-11 12:52:05 UTC
If you want to see something like this happening, you need to report it against
upstream where feature development is actually going on - not against RHEL.

Jes

Comment 6 Nigel Croxon 2017-05-02 15:51:42 UTC
Moving to closed. 

-Nigel