Bug 112300
Summary: | services do not failover after a member is disconnected from shared storage and reboots
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Lon Hohberger <lhh>
Component: | clumanager | Assignee: | Lon Hohberger <lhh>
Status: | CLOSED ERRATA | QA Contact: | David Lawrence <dkl>
Severity: | high | Docs Contact: |
Priority: | medium
Version: | 3 | CC: | cluster-maint, tao, us_linux_engineering
Target Milestone: | ---
Target Release: | ---
Hardware: | All
OS: | Linux
Whiteboard: |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2004-01-23 17:59:11 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Attachments: |
Description
Lon Hohberger
2003-12-17 13:48:27 UTC
Created attachment 96584
Patch to fix failover
This was due to a strange condition which may exist outside of the
shared-storage disconnect, but is certainly exacerbated by it.
This fix breaks rolling upgrade in real-world testing. Fix it.

Ok, rolling upgrade still works, but only if it is done in a special way. The lowest-ordered member must be upgraded *last*. This is because the new code checks for a cluster quorum incarnation number, while the old code does not. Since the lowest-ordered member is the lock keeper, the check must be disabled until all other members have restarted their services. This workaround should be sufficient, as I believe this fix is necessary for long-term support.

(Official) Fix for this should appear in Update 1.

There was a second issue causing services not to fail over when the high node was unplugged: the lock daemon was waiting for a message from the quorum daemon, which was waiting for the STONITH operations to complete. The lock client would give up prematurely, where it should have retried the locking operation during a failover.

Fixed in 1.2.9-1.

Verified, closed as errata.

The latest CVS build fixes this: http://people.redhat.com/lhh/packages.html

We are still experiencing this issue with 1.2.9 on top of RHEL3 U1.

Appropriate bug for your specific issue is: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=113226

Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.
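
For readers following the second issue described in the comments above (the lock client giving up while the quorum daemon was still waiting on STONITH), the intended behaviour can be sketched roughly as below. This is only an illustration in Python, not the clumanager code, which is not shown in this report. The names `LockUnavailable`, `acquire_with_retry`, `try_lock` and `failover_in_progress`, and the retry count and delay, are all hypothetical.

```python
import time


class LockUnavailable(Exception):
    """Hypothetical error: the lock daemon cannot answer yet."""


def acquire_with_retry(try_lock, failover_in_progress,
                       retries=5, delay=0.1, sleep=time.sleep):
    """Call try_lock() until it succeeds, retrying only during a failover.

    try_lock: callable returning a lock handle or raising LockUnavailable.
    failover_in_progress: callable returning True while membership is
        still being reconfigured (for example, STONITH is pending).
    """
    for attempt in range(retries + 1):
        try:
            return try_lock()
        except LockUnavailable:
            # Re-raise if no failover explains the failure,
            # or if the retry budget is used up.
            if attempt == retries or not failover_in_progress():
                raise
            sleep(delay)
```

The point of the sketch is that a lock failure during a failover is treated as temporary and retried a bounded number of times, while a failure outside a failover is still reported immediately.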