621301 – Data corruption on primary device failure in a cluster mirror (cmirror)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	lvm2
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Jonathan Earl Brassow
QA Contact:	Corey Marthaler
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-08-04 17:51 UTC by Jonathan Earl Brassow
Modified:	2010-11-10 21:08 UTC
CC List:	9 users
Fixed In Version:	lvm2-2.02.72-4.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-11-10 21:08:38 UTC
Target Upstream Version:
Embargoed:

Description Jonathan Earl Brassow 2010-08-04 17:51:14 UTC

Running the test helter_skelter/kill_primary_synced_2_legs on cluster mirrors elicits an easily reproducible data corruption bug.

When the primary device is removed during the repair operation, the linear device that remains does not contain a valid file system: both metadata and data are corrupted in many places.

Comment 2 Jonathan Earl Brassow 2010-08-04 18:13:27 UTC

After a few days of debugging, the problem boils down to a misunderstanding of the return value of 'dm_bit'.  Within LVM, 'dm_bit' is only ever used as a boolean test, but it can return a range of values.  If the bit is set, a power of 2 is returned.  If the bit is unset, 0 is returned.
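
To make the return-value behavior concrete, here is a minimal sketch of a mask-style bit test. It is not the libdevmapper code: the 'sketch_*' names and the plain word-array layout are assumptions made for illustration only. It shows that such a test returns the mask itself when the bit is set, so only the lowest bit of a word yields exactly 1.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BITS_PER_WORD 32u

/* Hypothetical helper: set bit `bit` in a plain array of 32-bit words. */
static void sketch_set_bit(uint32_t *words, unsigned bit)
{
    words[bit / SKETCH_BITS_PER_WORD] |= UINT32_C(1) << (bit % SKETCH_BITS_PER_WORD);
}

/* Mask-style test: returns 0 if the bit is clear, otherwise the mask (a power of 2). */
static uint32_t sketch_test_bit_raw(const uint32_t *words, unsigned bit)
{
    return words[bit / SKETCH_BITS_PER_WORD] & (UINT32_C(1) << (bit % SKETCH_BITS_PER_WORD));
}

/* Walks through the values a caller actually sees. */
static void sketch_show_return_values(void)
{
    uint32_t words[2] = { 0, 0 };

    sketch_set_bit(words, 0);
    sketch_set_bit(words, 3);
    sketch_set_bit(words, 37);   /* bit 5 of the second word */

    assert(sketch_test_bit_raw(words, 0) == 1);    /* lowest bit of a word: exactly 1 */
    assert(sketch_test_bit_raw(words, 3) == 8);    /* set, but not 1 */
    assert(sketch_test_bit_raw(words, 37) == 32);  /* set, but not 1 */
    assert(sketch_test_bit_raw(words, 4) == 0);    /* unset */
}
```

Any caller that treats the result as a truth value is unaffected; only a caller that compares the result against 1 misreads most set bits.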

'log_test_bit' (a function in the cluster mirror log daemon code) was switched to use the dm bit operations in RHEL 6.  In two places in the daemon code, 'log_test_bit' is not used merely as a boolean; instead, its return value is passed through as the return value of the log functions 'is_clean' and 'in_sync', on the assumption that 'dm_bit' returns only 0 or 1.

One place the 'in_sync' function is used is 'dm_rh_get_state', a function that tells the mirroring code how to treat I/O and which devices to read from and write to.  'dm_rh_get_state' checked whether the return value of 'in_sync' was 1 to determine whether the region was DM_RH_CLEAN.  Since 'dm_bit' (and by extension 'log_test_bit' and 'in_sync') was returning powers of 2, DM_RH_CLEAN was rarely reported when it should have been.  Treating the region as out-of-sync, the mirroring code would write only to the primary device.  When the primary device then failed, all of those writes were lost, leaving the entire mirror corrupted.

After much debugging, the patch is simple (and in userspace) :(
 static int log_test_bit(dm_bitset_t bs, int bit)
 {
-       return dm_bit(bs, bit);
+       return dm_bit(bs, bit) ? 1 : 0;
 }
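
The effect of the one-line change on a caller that compares against 1 can be sketched as follows. This is an illustrative model only, not the kernel or daemon code: every 'sketch_*' and 'SKETCH_*' name is hypothetical, and the sync log is reduced to a single 32-bit word. With every region marked in sync, the raw return value makes only region 0 look clean, while the normalized value makes all of them look clean.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_region_state { SKETCH_NOSYNC, SKETCH_CLEAN };

/* Raw variant: passes the masked value through, as the unpatched code is described to do. */
static int sketch_in_sync_raw(uint32_t sync_bits, unsigned region)
{
    assert(region < 31);   /* keep the masked value representable as a positive int */
    return (int)(sync_bits & (UINT32_C(1) << region));
}

/* Patched variant: normalizes the result to 0 or 1. */
static int sketch_in_sync_fixed(uint32_t sync_bits, unsigned region)
{
    return sketch_in_sync_raw(sync_bits, region) ? 1 : 0;
}

/* Caller that, like the check described above, compares the result against 1. */
static enum sketch_region_state sketch_get_state(int in_sync_result)
{
    return in_sync_result == 1 ? SKETCH_CLEAN : SKETCH_NOSYNC;
}

/* Counts how many of the first `nregions` regions the caller classifies as clean. */
static unsigned sketch_count_clean(uint32_t sync_bits, unsigned nregions, int normalize)
{
    unsigned clean = 0;

    for (unsigned r = 0; r < nregions; r++) {
        int res = normalize ? sketch_in_sync_fixed(sync_bits, r)
                            : sketch_in_sync_raw(sync_bits, r);
        if (sketch_get_state(res) == SKETCH_CLEAN)
            clean++;
    }
    return clean;
}
```

For example, sketch_count_clean(0xFF, 8, 0) reports 1 clean region and sketch_count_clean(0xFF, 8, 1) reports 8. Normalizing at the source is one way to protect every caller at once; changing the caller to test for a nonzero value would be an alternative design.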

Comment 4 Corey Marthaler 2010-08-13 21:40:44 UTC

The helter_skelter test case kill_primary_synced_2_legs runs without any corruption issues. Marking this bug verified in the latest build.

2.6.32-59.1.el6.x86_64

lvm2-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-libs-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-cluster-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
cmirror-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010

Comment 5 releng-rhel@redhat.com 2010-11-10 21:08:38 UTC

Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
