Bug 112300 - services do not failover after a member is disconnected from shared storage and reboots
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: clumanager   
Version: 3
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: David Lawrence
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
Reported: 2003-12-17 13:48 UTC by Lon Hohberger
Modified: 2009-04-16 20:14 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2004-01-23 17:59:11 UTC


Attachments
Patch to fix failover (7.04 KB, patch)
2003-12-17 13:52 UTC, Lon Hohberger

Description Lon Hohberger 2003-12-17 13:48:27 UTC
Description of problem: Disconnecting the shared storage from a member
when it is running services causes the member to reboot (expected).
However, services do not fail-over to the other member.

Version-Release number: 1.2.6-1

How reproducible: 30%

Steps to Reproduce:
1. Start cluster on two members.
2. Move all services to member A
3. Disconnect shared storage from member A.
  
Actual results: Member A still "runs" all services, even after it is
marked 'Inactive'.

Expected results: Member B should take over services.

Additional info: Tested on shared SCSI RAID array; fibre channel may
have different behavior given that the timeouts are different.  For
instance, SCSI cable disconnects are more adequately handled in the
drivers, whereas some FC drivers continue to retry for several minutes
before giving up.

Comment 1 Lon Hohberger 2003-12-17 13:52:52 UTC
Created attachment 96584
Patch to fix failover

This was due to a strange condition that may also occur outside of the
shared-storage disconnect case, but is certainly exacerbated by it.

Comment 3 Lon Hohberger 2003-12-17 14:07:29 UTC
This fix breaks rolling upgrade in real-world testing. Fix it.

Comment 4 Lon Hohberger 2003-12-17 17:21:41 UTC
Ok, rolling upgrade still works - but only if it is done in a special
way.  The lowest-ordered member must be upgraded *last*.  

This is because the new code checks for a cluster quorum incarnation
number, while the old code does not.  Since the lowest-ordered member
is the lock keeper, the check must be disabled until all other members
have restarted their services.
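
As a rough illustration of why the upgrade order matters, here is a
minimal sketch of an incarnation check on the lock keeper.  This is not
the clumanager code (which is C); LockRequest, accept_lock_request and
enforce_check are hypothetical names, and an old-version member is
modelled as one that sends no incarnation number at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LockRequest:
    """Hypothetical lock request as seen by the lock keeper."""
    member_id: int
    # None models a member still running the old code, which does not
    # send a quorum incarnation number (an assumption of this sketch).
    incarnation: Optional[int]


def accept_lock_request(req: LockRequest,
                        current_incarnation: int,
                        enforce_check: bool = True) -> bool:
    """Return True if the lock keeper should honor the request.

    With the check enforced, only requests carrying the current quorum
    incarnation are accepted.  With the check disabled, every request
    is accepted, which is how the old code behaves.
    """
    if not enforce_check:
        return True
    return req.incarnation == current_incarnation
```

In this sketch, a keeper that enforces the check rejects every request
from a member that has not been upgraded yet, which is why the
lowest-ordered member has to go last.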

This workaround should be sufficient, as I believe this fix is
necessary for long-term support.

Comment 5 Lon Hohberger 2003-12-18 21:09:25 UTC
(Official) Fix for this should appear in Update 1

Comment 8 Lon Hohberger 2004-01-15 21:21:38 UTC
There was a second issue causing services not to fail over when the
high node was unplugged: the lock daemon was waiting for a message
from the quorum daemon, which was in turn waiting for the STONITH
operations to complete.  The lock client would give up prematurely,
where it should have retried the locking operation during a failover.
Fixed in 1.2.9-1.
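
The retry behavior can be sketched roughly as follows.  This is only an
illustration, not the actual change in 1.2.9-1: acquire_with_retry,
try_lock and failover_in_progress are hypothetical names, and the
attempt count and delay are arbitrary.

```python
import time
from typing import Callable


def acquire_with_retry(try_lock: Callable[[], bool],
                       failover_in_progress: Callable[[], bool],
                       attempts: int = 5,
                       delay: float = 0.0) -> bool:
    """Try to take a lock, retrying while a failover is still pending.

    try_lock returns True once the lock is granted.  If it fails and no
    failover is in progress, give up immediately.  If a failover is in
    progress (for example, STONITH has not completed yet), wait and try
    again, up to `attempts` times in total.
    """
    for _ in range(attempts):
        if try_lock():
            return True
        if not failover_in_progress():
            return False
        time.sleep(delay)
    return False
```

The point of the sketch is only that a failed lock attempt during a
failover is treated as temporary rather than final.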


Comment 9 Suzanne Hillman 2004-01-15 21:41:33 UTC
Verified, closed as errata.

Comment 10 Lon Hohberger 2004-01-15 22:47:28 UTC
The latest CVS build fixes this:

http://people.redhat.com/lhh/packages.html

Comment 12 Gary Lerhaupt 2004-01-23 17:06:39 UTC
We are still experiencing this issue with 1.2.9 on top of RHEL3 U1.

Comment 13 Lon Hohberger 2004-01-23 17:59:11 UTC
Appropriate bug for your specific issue is:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=113226



Comment 14 Lon Hohberger 2007-12-21 15:10:21 UTC
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3.

