Running the test helter_skelter/kill_primary_synced_2_legs on cluster mirrors elicits an easily reproducible data corruption bug.
When the primary device is removed during the repair operation, the linear device that remains does not contain a valid file system - many points of meta-data and data corruption.
After a few days of debugging, it has boiled down to a misunderstanding of the return value of 'dm_bit'. 'dm_bit' is only ever used as a boolean operation within LVM, but it can return a range of values. If the bit is set, a power of 2 is returned. If the bit is unset, 0 is returned.
'log_test_bit' (a function in the cluster mirror log daemon code) has switched to using the dm bit operations in rhel6. There are two places in the daemon code where 'log_test_bit' is not used merely as a boolean, but rather the return value is used as the return value for the log functions 'is_clean' and 'in_sync' - having assumed that 'dm_bit' was returning 0 or 1 only.
One place the 'in_sync' function is utilized is in 'dm_rh_get_state' - a function that informs the mirroring code how to treat I/O and which devices to read/write from. 'dm_rh_get_state' was checking if the return value of 'in_sync' was 1 to determine if the region was DM_RH_CLEAN. Since 'dm_bit' (and by extension 'log_test_bit' and 'in_sync') was returning powers of 2, DM_RH_CLEAN was rarely being reported as it should have been. Thinking the region was out-of-sync, the mirroring code would write only to the primary device. When the primary device was failed, all of those writes were lost - leaving the entire mirror corrupted.
After much debugging, the patch is simple (and in userspace) :(
static int log_test_bit(dm_bitset_t bs, int bit)
- return dm_bit(bs, bit);
+ return dm_bit(bs, bit) ? 1 : 0;
The helter_skelter test case kill_primary_synced_2_legs runs without any corruption issues. Marking this bug verified in the latest build.
lvm2-2.02.72-7.el6 BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-libs-2.02.72-7.el6 BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-cluster-2.02.72-7.el6 BUILT: Wed Aug 11 17:12:24 CDT 2010
udev-147-2.22.el6 BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-7.el6 BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-libs-1.02.53-7.el6 BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-1.02.53-7.el6 BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-libs-1.02.53-7.el6 BUILT: Wed Aug 11 17:12:24 CDT 2010
cmirror-2.02.72-7.el6 BUILT: Wed Aug 11 17:12:24 CDT 2010
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.