Bug 217581 - device-mapper mirror: Bad sync status change
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Jonathan Earl Brassow
QA Contact: Brian Brock
Depends On:
Blocks: 217582
Reported: 2006-11-28 15:07 EST by Jonathan Earl Brassow
Modified: 2007-11-30 17:07 EST (History)
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-08 00:17:33 EDT

Description Jonathan Earl Brassow 2006-11-28 15:07:42 EST
A previous patch designed to update the sync status of a mirror was too hasty.
The original idea was to make a machine immediately aware of a change from
in-sync to out-of-sync when dealing with cluster mirrors.  However, each machine
needs to discover this on its own; otherwise it cannot tell whether it can
switch the primary device when that device fails.  From the patch header:

We must only allow do_recovery to mark ms->in_sync as 1.  It is
the job of the fault handling code (like __bio_mark_nosync) to
mark ms->in_sync as 0, if necessary.

If do_recovery handles this, it is possible for us not to be able
to switch primary devices in the case of cluster mirroring.  The
scenario is:

0) Mirror is in-sync
1) Node1 writes to disk, but write fails to the primary device
2) Node1 increments the error count for that device
3) Node1 checks ms->in_sync to see if it is safe to switch the
   primary.  (We cannot switch the primary if other devices are
   not in-sync.  This would lead to bad data being read.)
4) Node1 switches the primary because the mirror is in-sync, then
   marks the region out-of-sync and ms->in_sync = 0.
5) Node2 writes and fails to the primary device
6) Node2 increments the error count for that device
7) Node2 checks ms->in_sync to see if it is safe to switch the
   primary.  It isn't because do_recovery has stepped in and changed
   ms->in_sync when it shouldn't have.
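
The scenario above can be sketched as a toy model. This is Python, not kernel C, and it is only an illustration under stated assumptions: NodeView, run_scenario, handle_write_failure, and the buggy flag are hypothetical names, and the way a node learns the cluster-wide state is reduced to a single mirror_in_sync argument. Only ms->in_sync, do_recovery, and the division of work between recovery and fault handling come from the patch header.

```python
from dataclasses import dataclass, field


@dataclass
class NodeView:
    """One node's local view of the mirror set (toy stand-in for ms)."""
    name: str
    in_sync: bool = True   # models ms->in_sync
    primary: int = 0       # index of the device currently used as primary
    error_count: list = field(default_factory=lambda: [0, 0])

    def do_recovery(self, mirror_in_sync: bool, buggy: bool) -> None:
        # Fixed behaviour: recovery may only raise in_sync to True.
        # Buggy behaviour (assumed model of the earlier patch): recovery
        # also pushes an out-of-sync state into the local view.
        if mirror_in_sync:
            self.in_sync = True
        elif buggy:
            self.in_sync = False

    def handle_write_failure(self, dev: int) -> bool:
        """Fault-handling path; returns True if the primary was switched."""
        self.error_count[dev] += 1
        if dev != self.primary or not self.in_sync:
            return False       # switching is not needed or not safe
        self.primary = 1 - dev
        self.in_sync = False   # only the fault path lowers in_sync
        return True


def run_scenario(buggy: bool) -> tuple:
    node1, node2 = NodeView("node1"), NodeView("node2")
    # Steps 1-4: node1's write to the primary fails and it switches primary.
    switched1 = node1.handle_write_failure(0)
    # Recovery runs on node2 before node2 has seen the failure itself.
    node2.do_recovery(mirror_in_sync=False, buggy=buggy)
    # Steps 5-7: node2's write to the old primary fails.
    switched2 = node2.handle_write_failure(0)
    return switched1, switched2


if __name__ == "__main__":
    print(run_scenario(buggy=False))  # (True, True)
    print(run_scenario(buggy=True))   # (True, False)
```

With the buggy recovery behaviour, node2 sees in_sync already cleared when it handles its own write failure and so refuses to switch primaries; with recovery restricted to setting in_sync to 1, both nodes switch.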
Comment 1 RHEL Product and Program Management 2006-11-28 15:39:03 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 2 Jay Turner 2007-01-02 08:48:15 EST
QE ack for RHEL4.5.
Comment 3 Jason Baron 2007-01-05 11:25:57 EST
Committed in stream U5 build 42.38. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 6 Red Hat Bugzilla 2007-05-08 00:17:34 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

