Red Hat Bugzilla – Bug 217581
device-mapper mirror: Bad sync status change
Last modified: 2007-11-30 17:07:27 EST
A previous patch designed to update the sync status of a mirror was too hasty.
The original idea was to make a machine immediately aware of a change from sync
-> out-of-sync when dealing with cluster mirrors. However, the machines need to
discover this on their own, otherwise they cannot tell if they can switch a
primary device when it fails... From the patch header:
We must only allow do_recovery to mark ms->in_sync as 1. It is
the job of the fault handling code (like __bio_mark_nosync) to
mark ms->in_sync as 0, if necessary.
If do_recovery handles this, it is possible for us not to be able
to switch primary devices in the case of cluster mirroring. The
sequence of events is:
0) Mirror is in-sync
1) Node1 writes to disk, but write fails to the primary device
2) Node1 increments the error count for that device
3) Node1 checks ms->in_sync to see if it is safe to switch the
primary. (We cannot switch the primary if other devices are
not in-sync. This would lead to bad data being read.)
4) Node1 switches the primary because the mirror is in-sync, then
marks the region out-of-sync and ms->in_sync = 0.
5) Node2 writes and fails to the primary device
6) Node2 increments the error count for that device
7) Node2 checks ms->in_sync to see if it is safe to switch the
primary. It isn't, because do_recovery has stepped in and set
ms->in_sync to 0 when it shouldn't have.
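The rule from the patch header can be illustrated with a small user-space
sketch. This is not the dm-raid1 code: struct mirror_set here is a reduced,
hypothetical stand-in with one copy per node, and mark_nosync,
handle_write_failure and the log_reports_in_sync parameter are names made up
for the illustration. Only the rule itself (recovery may set in_sync to 1,
fault handling may set it to 0) comes from the patch header.

```c
#include <stdbool.h>

#define NR_MIRRORS 2

/* Hypothetical, simplified per-node view of a mirror set. */
struct mirror_set {
	int in_sync;                  /* 1 = mirror is in sync, as seen by this node */
	int primary;                  /* index of the current primary device */
	int error_count[NR_MIRRORS];  /* per-device error counters */
};

/* Fault-handling path: the only place allowed to clear in_sync. */
static void mark_nosync(struct mirror_set *ms)
{
	ms->in_sync = 0;
}

/*
 * Recovery path: may raise in_sync to 1, but never clears it.
 * log_reports_in_sync is an assumed input standing in for whatever
 * the (cluster) log says about the mirror.
 */
static void do_recovery(struct mirror_set *ms, bool log_reports_in_sync)
{
	if (log_reports_in_sync)
		ms->in_sync = 1;
	/* Deliberately no else branch: out-of-sync is discovered by the
	 * node's own fault handling, not pushed in by recovery. */
}

/*
 * Steps 1-4 of the scenario for one node: count the error, switch the
 * primary only if this node still believes the mirror is in sync, then
 * mark it out-of-sync. Returns true if the primary was switched.
 */
static bool handle_write_failure(struct mirror_set *ms)
{
	bool switched = false;

	ms->error_count[ms->primary]++;
	if (ms->in_sync) {
		ms->primary = (ms->primary + 1) % NR_MIRRORS;
		switched = true;
	}
	mark_nosync(ms);
	return switched;
}
```

With this shape, a second node whose recovery thread runs after the first
node's failure still sees in_sync == 1 when its own write fails, so it can
switch the primary as well.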
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
QE ack for RHEL4.5.
Committed in stream U5 build 42.38. A test kernel with this patch is available
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.