Bug 217581 - device-mapper mirror: Bad sync status change
device-mapper mirror: Bad sync status change
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel (Show other bugs)
4.0
All Linux
medium Severity medium
: ---
: ---
Assigned To: Jonathan Earl Brassow
Brian Brock
:
Depends On:
Blocks: 217582
  Show dependency treegraph
 
Reported: 2006-11-28 15:07 EST by Jonathan Earl Brassow
Modified: 2007-11-30 17:07 EST (History)
3 users (show)

See Also:
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-05-08 00:17:33 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Jonathan Earl Brassow 2006-11-28 15:07:42 EST
A previous patch designed to update the sync status of a mirror was to hasty. 
The original idea was to make a machine immediately aware of a change from sync
-> out-of-sync when dealing with cluster mirrors.  However, the machines need to
discover this on their own, otherwise they cannot tell if they can switch a
primary device when it fails...   From the patch header:

We must only allow do_recovery to mark ms->in_sync as 1.  It is
the job of the fault handling code (like __bio_mark_nosync) to
mark ms->in_sync as 0, if necessary.

If do_recovery handles this, it is possible for us not to be able
to switch primary devices in the case of cluster mirroring.  The
scenario is:

0) Mirror is in-sync
1) Node1 writes to disk, but write fails to the primary device
2) Node1 increments the error count for that device
3) Node1 checks ms->in_sync to see if it is safe to switch the
   primary.  (We cannot switch the primary if other devices are
   not in-sync.  This would lead to bad data being read.)
4) Node1 switches the primary because the mirror is in-sync, then
   marks the region out-of-sync and ms->in_sync = 0.
5) Node2 writes and fails to the primary device
6) Node2 increments the error count for that device
7) Node2 checks ms->in_sync to see if it is safe to switch the
   primary.  It isn't because do_recovery has stepped in and changed
   ms->in_sync when it shouldn't have.
Comment 1 RHEL Product and Program Management 2006-11-28 15:39:03 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 2 Jay Turner 2007-01-02 08:48:15 EST
QE ack for RHEL4.5.
Comment 3 Jason Baron 2007-01-05 11:25:57 EST
committed in stream U5 build 42.38. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 6 Red Hat Bugzilla 2007-05-08 00:17:34 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html

Note You need to log in before you can comment on or make changes to this bug.