Bug 112300 - services do not failover after a member is disconnected from shared storage and reboots
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: clumanager   
Version: 3
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: David Lawrence
Depends On:
Reported: 2003-12-17 13:48 UTC by Lon Hohberger
Modified: 2009-04-16 20:14 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2004-01-23 17:59:11 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
Patch to fix failover (7.04 KB, patch)
2003-12-17 13:52 UTC, Lon Hohberger

Description Lon Hohberger 2003-12-17 13:48:27 UTC
Description of problem: Disconnecting the shared storage from a member
when it is running services causes the member to reboot (expected).
However, services do not fail over to the other member.

Version-Release number: 1.2.6-1

How reproducible: 30%

Steps to Reproduce:
1. Start cluster on two members.
2. Move all services to member A.
3. Disconnect shared storage from member A.
Actual results: Member A still "runs" all services, even after it is
marked 'Inactive'.

Expected results: Member B should take over services.

Additional info: Tested on shared SCSI RAID array; fibre channel may
have different behavior given that the timeouts are different.  For
instance, SCSI cable disconnects are more adequately handled in the
drivers, whereas some FC drivers continue to retry for several minutes
before giving up.

Comment 1 Lon Hohberger 2003-12-17 13:52:52 UTC
Created attachment 96584 [details]
Patch to fix failover

This was due to a strange condition which may exist outside of the
shared-storage disconnect, but is certainly exacerbated by it.

Comment 3 Lon Hohberger 2003-12-17 14:07:29 UTC
This fix breaks rolling upgrade in real-world testing. Fix it.

Comment 4 Lon Hohberger 2003-12-17 17:21:41 UTC
Ok, rolling upgrade still works - but only if it is done in a special
way.  The lowest-ordered member must be upgraded *last*.  

This is because the new code checks for a cluster quorum incarnation
number, while the old code does not.  Since the lowest-ordered member
is the lock keeper, the check must be disabled until all other members
have restarted their services.

This workaround should be sufficient, and I believe this fix is
necessary for long-term support.

Comment 5 Lon Hohberger 2003-12-18 21:09:25 UTC
(Official) Fix for this should appear in Update 1

Comment 8 Lon Hohberger 2004-01-15 21:21:38 UTC
There was a second issue causing services not to fail over when the
high node was unplugged: the lock daemon was waiting for a message
from the quorum daemon, which in turn was waiting for the STONITH
operations to complete.  The lock client would give up prematurely,
when it should have retried the locking operation during a failover.

Fixed in 1.2.9-1

Comment 9 Suzanne Hillman 2004-01-15 21:41:33 UTC
Verified, closed as errata.

Comment 10 Lon Hohberger 2004-01-15 22:47:28 UTC
The latest CVS build fixes this:


Comment 12 Gary Lerhaupt 2004-01-23 17:06:39 UTC
We are still experiencing this issue with 1.2.9 on top of RHEL3 U1.

Comment 13 Lon Hohberger 2004-01-23 17:59:11 UTC
The appropriate bug for your specific issue is:


Comment 14 Lon Hohberger 2007-12-21 15:10:21 UTC
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3.
