A previous patch designed to update the sync status of a mirror was too hasty. The original idea was to make a machine immediately aware of a change from in-sync to out-of-sync when dealing with cluster mirrors. However, the machines need to discover this on their own; otherwise they cannot tell whether they can switch the primary device when it fails.

From the patch header:

We must only allow do_recovery to mark ms->in_sync as 1. It is the job of the fault handling code (like __bio_mark_nosync) to mark ms->in_sync as 0, if necessary. If do_recovery handles this, it is possible for us not to be able to switch primary devices in the case of cluster mirroring.

The scenario is:
0) Mirror is in-sync
1) Node1 writes to disk, but the write to the primary device fails
2) Node1 increments the error count for that device
3) Node1 checks ms->in_sync to see if it is safe to switch the primary. (We cannot switch the primary if other devices are not in-sync. This would lead to bad data being read.)
4) Node1 switches the primary because the mirror is in-sync, then marks the region out-of-sync and sets ms->in_sync = 0.
5) Node2 writes, and the write to the primary device fails
6) Node2 increments the error count for that device
7) Node2 checks ms->in_sync to see if it is safe to switch the primary. It isn't, because do_recovery has stepped in and changed ms->in_sync when it shouldn't have.
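The rule from the patch header can be illustrated with a small user-space sketch. This is not the kernel code: the struct layout, handle_write_failure() and recovery_complete() are hypothetical names invented for illustration, and only the in_sync field and the split of responsibilities between the fault path and the recovery path come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_MIRRORS 2

/* Hypothetical, simplified stand-in for one node's view of a mirror set. */
struct mirror_set {
    int in_sync;                 /* 1 = this node believes all legs are in sync */
    int primary;                 /* index of the current primary leg */
    int error_count[NR_MIRRORS]; /* per-leg error counters */
};

/*
 * Fault path (the role __bio_mark_nosync plays in the description):
 * the only place allowed to clear in_sync.
 * Returns true if the primary was switched.
 */
static bool handle_write_failure(struct mirror_set *ms, int failed)
{
    bool switched = false;

    assert(failed >= 0 && failed < NR_MIRRORS);
    ms->error_count[failed]++;

    /* Switching is only safe while the other leg holds good data. */
    if (failed == ms->primary && ms->in_sync) {
        ms->primary = (failed + 1) % NR_MIRRORS;
        switched = true;
    }

    /* The failed write leaves a region out of sync. */
    ms->in_sync = 0;
    return switched;
}

/*
 * Recovery path (the role do_recovery plays in the description):
 * may only raise in_sync to 1, never lower it.
 */
static void recovery_complete(struct mirror_set *ms, bool all_regions_clean)
{
    if (all_regions_clean)
        ms->in_sync = 1;
    /* Deliberately no else branch: clearing in_sync here is the bug. */
}
```

With this split, a second node whose recovery pass notices the out-of-sync region keeps in_sync at 1 until its own write fails, so it can still switch away from the failed primary.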
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
QE ack for RHEL4.5.
Committed in stream U5, build 42.38. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html