Bug 456575 - Mirror corruption after one of three legs fail simultaneously on more than 1 mirror
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.2
Hardware: All Linux
Priority: high  Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Depends On: 359341
Blocks: 533192 483701 525215
Reported: 2008-07-24 14:52 EDT by Jonathan Earl Brassow
Modified: 2011-01-13 17:48 EST

See Also:
Fixed In Version: cmirror-1.1.39-9.el5
Doc Type: Bug Fix
Doc Text:
Data corruption could occur when using 3 or more mirrors. With this update, the underlying cluster code has been modified to address this issue, and the data corruption no longer occurs.
Last Closed: 2011-01-13 17:48:56 EST

Comment 1 Jonathan Earl Brassow 2008-10-01 09:46:57 EDT
Mirror corruption issues were found in the cluster logging code and fixed in 5.3.  During the investigation, other issues were identified, so there is still a problem in the kernel.  It does not need fixing until device-mapper mirror failures are handled differently (which is planned for the future).  Currently, when a mirror device fails, it is removed.  Later releases will only remove the failed device if the failure is persistent.

Description of what will cause the failure:
In drivers/md/dm-raid1.c, after a leg fails and a write returns, '__bio_mark_nosync' is used to mark the region out-of-sync.  This state is stored in a region structure that remains in the region hash.  Because the region never goes on the clean_regions list, it is not removed from the region hash until the mirror is destroyed.

Right now, this is not a problem, because when a device fails, the mirror is destroyed and a new mirror is created without the failed device.  In the future, when we wish to handle transient failures, we would simply suspend and resume to restart recovery.  In that case, some machines in the cluster would write only to the primary for regions that are still cached as not-in-sync due to the '__bio_mark_nosync'.  The fix is to simply clear out the region hash when a mirror is suspended.
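
The failure mode described above can be illustrated with a small toy model. This is only a sketch and not the dm-raid1 code: the class ToyMirror, its methods, and the clear_on_suspend flag are hypothetical names invented for illustration, and the model assumes the region has already been recovered by the time the next write arrives after resume.

```python
from dataclasses import dataclass, field
from enum import Enum


class RegionState(Enum):
    CLEAN = "clean"
    NOSYNC = "nosync"


@dataclass
class ToyMirror:
    """Toy model of one node's cached view of a mirror (hypothetical, not kernel code)."""
    legs: list
    clear_on_suspend: bool = True
    region_hash: dict = field(default_factory=dict)  # region number -> RegionState

    def mark_nosync(self, region):
        # Stands in for '__bio_mark_nosync' after a write to a failed leg returns.
        self.region_hash[region] = RegionState.NOSYNC

    def suspend(self):
        # The proposed fix: forget every cached region state on suspend.
        if self.clear_on_suspend:
            self.region_hash.clear()

    def write_targets(self, region):
        # A region cached as not-in-sync is written to the primary leg only.
        if self.region_hash.get(region) is RegionState.NOSYNC:
            return self.legs[:1]
        return list(self.legs)


legs = ["leg0", "leg1", "leg2"]

# Without the fix, the stale NOSYNC entry survives suspend/resume.
stale = ToyMirror(legs, clear_on_suspend=False)
stale.mark_nosync(7)
stale.suspend()
print(stale.write_targets(7))

# With the fix, the cached state is dropped and all legs are written again.
fixed = ToyMirror(legs)
fixed.mark_nosync(7)
fixed.suspend()
print(fixed.write_targets(7))
```

In this model, the node that kept the stale entry keeps writing region 7 to the primary only, while a node whose cache was cleared on suspend writes to every leg, which is the divergence between cluster members that the comment warns about.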
Comment 2 RHEL Product and Program Management 2009-01-27 15:44:08 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 3 RHEL Product and Program Management 2009-02-16 10:25:43 EST
Updating PM score.
Comment 4 Jonathan Earl Brassow 2009-04-21 15:36:57 EDT
Handling of device-mapper mirror failures has not changed, and therefore, no change is required for kernel code at this time.  Pushing out.  (See comment #1 for more detail.)
Comment 6 Jonathan Earl Brassow 2009-10-14 11:27:28 EDT
I'll try to figure this out.  These bugs take a long time to decipher, so it will have to be 'conditional nack - capacity' vs. devel_ack.  Note the configuration when setting severity/priority scores.
Comment 7 Jonathan Earl Brassow 2010-01-26 16:16:47 EST
Please verify that this bug still exists with the latest RHEL 5.5 kernel and userspace packages.  Many things have changed that would have a direct impact on this bug:
1) kernel handles write failures differently now
2) userspace cleans up LVs on an individual basis now vs on a VG scale
Comment 9 RHEL Product and Program Management 2010-08-25 12:09:54 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 12 Corey Marthaler 2010-10-15 18:12:06 EDT
This bug is no longer reproducible with the latest rpms. Marking verified.

2.6.18-225.el5

lvm2-2.02.74-1.el5    BUILT: Fri Oct 15 10:26:21 CDT 2010
lvm2-cluster-2.02.74-1.el5    BUILT: Fri Oct 15 10:27:02 CDT 2010
device-mapper-1.02.55-1.el5    BUILT: Fri Oct 15 06:15:55 CDT 2010
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009
Comment 13 Jaromir Hradilek 2010-11-17 09:29:21 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Data corruption could occur when using 3 or more mirrors. With this update, the underlying cluster code has been modified to address this issue, and the data corruption no longer occurs.
Comment 15 errata-xmlrpc 2011-01-13 17:48:56 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0057.html
