Bug 621301 - Data corruption on primary device failure in a cluster mirror (cmirror)
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.0
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Corey Marthaler
Depends On:
Blocks:
Reported: 2010-08-04 13:51 EDT by Jonathan Earl Brassow
Modified: 2010-11-10 16:08 EST

See Also:
Fixed In Version: lvm2-2.02.72-4.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-11-10 16:08:38 EST


Description Jonathan Earl Brassow 2010-08-04 13:51:14 EDT
Running the test helter_skelter/kill_primary_synced_2_legs on cluster mirrors elicits an easily reproducible data corruption bug.

When the primary device is removed during the repair operation, the linear device that remains does not contain a valid file system: both meta-data and data are corrupted in many places.
Comment 2 Jonathan Earl Brassow 2010-08-04 14:13:27 EDT
After a few days of debugging, it has boiled down to a misunderstanding of the return value of 'dm_bit'.  'dm_bit' is only ever used as a boolean operation within LVM, but it can return a range of values.  If the bit is set, a power of 2 is returned.  If the bit is unset, 0 is returned.
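
The return-value behaviour described above can be illustrated with a minimal sketch. 'raw_test_bit' below is a hypothetical stand-in for a mask-style bit test over a single 32-bit word. It is not the libdevmapper definition of 'dm_bit', only an assumption about how such a test typically ends up yielding a power of 2.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a mask-style bit test (NOT the real dm_bit).
 * Returns the masked word: 0 if the bit is clear, otherwise 1u << bit.
 */
static uint32_t raw_test_bit(uint32_t word, unsigned bit)
{
    assert(bit < 32);   /* a single 32-bit word holds bits 0..31 */
    return word & (UINT32_C(1) << bit);
}
```

With word 0x28 (bits 3 and 5 set), raw_test_bit(0x28, 3) yields 8 and raw_test_bit(0x28, 5) yields 32. Both are true in a boolean context, but neither compares equal to 1; only bit 0 can ever produce exactly 1.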

'log_test_bit' (a function in the cluster mirror log daemon code) has switched to using the dm bit operations in rhel6.  There are two places in the daemon code where 'log_test_bit' is not used merely as a boolean, but rather the return value is used as the return value for the log functions 'is_clean' and 'in_sync' - having assumed that 'dm_bit' was returning 0 or 1 only.

One place the 'in_sync' function is used is 'dm_rh_get_state', a function that tells the mirroring code how to treat I/O and which devices to read from and write to.  'dm_rh_get_state' checked whether the return value of 'in_sync' was exactly 1 to decide that a region was DM_RH_CLEAN.  Since 'dm_bit' (and by extension 'log_test_bit' and 'in_sync') was returning powers of 2, DM_RH_CLEAN was rarely reported when it should have been.  Believing the region to be out-of-sync, the mirroring code would write only to the primary device.  When the primary device failed, all of those writes were lost, leaving the entire mirror corrupted.
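
A sketch of how an "equals 1" comparison misclassifies regions. The names 'sketch_in_sync', 'sketch_get_state', 'count_clean' and the REGION_* values are hypothetical and only mimic the shape of the logic described above; this is not the kernel's 'dm_rh_get_state', and the one-word sync bitmap is an assumption made to keep the example small.

```c
#include <assert.h>
#include <stdint.h>

enum region_state { REGION_CLEAN, REGION_NOSYNC };  /* hypothetical states */

/* Buggy shape: forwards the raw masked value (0 or a power of 2). */
static int sketch_in_sync(uint32_t sync_bits, unsigned region)
{
    assert(region < 31);    /* keep the masked value representable as int */
    return (int)(sync_bits & (UINT32_C(1) << region));
}

/* Caller that treats only the exact value 1 as "in sync". */
static enum region_state sketch_get_state(uint32_t sync_bits, unsigned region)
{
    return sketch_in_sync(sync_bits, region) == 1 ? REGION_CLEAN
                                                  : REGION_NOSYNC;
}

/* Count how many of the first n regions are reported clean. */
static unsigned count_clean(uint32_t sync_bits, unsigned n)
{
    unsigned clean = 0;
    for (unsigned r = 0; r < n; r++)
        if (sketch_get_state(sync_bits, r) == REGION_CLEAN)
            clean++;
    return clean;
}
```

With all eight regions marked in sync (sync_bits 0xFF), count_clean(0xFF, 8) reports a single clean region: region 0, the only one whose mask happens to equal 1.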

After much debugging, the patch is simple (and in userspace) :(
 static int log_test_bit(dm_bitset_t bs, int bit)
 {
-       return dm_bit(bs, bit);
+       return dm_bit(bs, bit) ? 1 : 0;
 }
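
The effect of the normalization in the patch can be checked in isolation. The sketch below assumes a hypothetical one-word bitset instead of 'dm_bitset_t', and both function names are illustrative only. It shows the '? 1 : 0' form used in the patch alongside the equivalent double-negation spelling.

```c
#include <assert.h>
#include <stdint.h>

/* Normalized test on a hypothetical one-word bitset: always 0 or 1. */
static int test_bit_normalized(uint32_t word, unsigned bit)
{
    assert(bit < 32);
    return (word & (UINT32_C(1) << bit)) ? 1 : 0;
}

/* Equivalent spelling using double negation. */
static int test_bit_bangbang(uint32_t word, unsigned bit)
{
    assert(bit < 32);
    return !!(word & (UINT32_C(1) << bit));
}
```

Either form makes the result safe to compare against 1, so a caller written as "== 1" and a caller written as a plain truth test behave the same.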
Comment 4 Corey Marthaler 2010-08-13 17:40:44 EDT
The helter_skelter test case kill_primary_synced_2_legs runs without any corruption issues. Marking this bug verified in the latest build.

2.6.32-59.1.el6.x86_64

lvm2-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-libs-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-cluster-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
cmirror-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
Comment 5 releng-rhel@redhat.com 2010-11-10 16:08:38 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
