Bug 112261 - No failover with IP tiebreaker, cluquorumd does not exit
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: clumanager
Version: 3
Hardware: All Linux
Priority: medium, Severity: medium
Assigned To: Lon Hohberger
Reported: 2003-12-16 14:37 EST by WhidbeyNet
Modified: 2009-04-16 16:14 EDT (History)
Fixed In Version: 1.2.6-1
Doc Type: Bug Fix
Last Closed: 2003-12-16 17:37:15 EST

Attachments
Includes before and after logs, and small code patch (7.54 KB, text/plain)
2003-12-16 14:58 EST, WhidbeyNet

Description WhidbeyNet 2003-12-16 14:37:12 EST
From Bugzilla Helper:
User-Agent: Mozilla/4.0

Description of problem:
In a 2-member cluster using an IP tiebreaker, when network communication 
between the two members is interrupted, both members ping the tiebreaker.  The 
"clusvcmgrd" daemon on the member that cannot reach the tiebreaker 
successfully stops services and exits.  However, "cluquorumd" believes 
"clusvcmgrd" has crashed, and starts it back up.

The continued operation of services on the "down" member, and continued 
updates to the shared storage, result in a permanent PANIC state.

The "up" member, who won the tiebreaker, will not actually shoot a member 
that is in PANIC mode. The "up" member also will not take over services, 
because the "down" member is still reporting itself as "UP" on shared storage.

Version-Release number of selected component (if applicable):
clumanager-1.2.3-1

How reproducible:
Always

Steps to Reproduce:
1. Configure a 2-member cluster with NFS and IP tiebreaker.
2. Unplug the single network cable from the member running NFS.
    

Actual Results:  The member who cannot ping the tiebreaker does not remove 
itself from the cluster. The member who can ping the tiebreaker cannot take 
over services. No failover.

Expected Results:  The member who cannot ping the tiebreaker should stop 
all services, and report itself as DOWN on the disk, or shut down clustering 
entirely. The member who is up should then take over services.
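
The expected behavior can be summarized as a small decision table. The sketch below is illustrative only and is not clumanager code: the function name `expected_action` and the action strings are hypothetical, and it models only the 2-member, IP-tiebreaker case described in this report. It also assumes that no tiebreak is needed while the two members can still reach each other.

```python
def expected_action(peer_reachable: bool, tiebreaker_reachable: bool) -> str:
    """Expected reaction of one member of a 2-member cluster (hypothetical model)."""
    if peer_reachable:
        # Assumption: with the peer reachable, no tiebreak is needed.
        return "continue"
    if tiebreaker_reachable:
        # This member wins the tiebreak and should take over services.
        return "take_over_services"
    # This member lost the tiebreak: stop all services and report DOWN on disk
    # (or shut down clustering entirely).
    return "stop_services_and_report_down"


if __name__ == "__main__":
    # The unplugged member from "Steps to Reproduce": no peer, no tiebreaker.
    print(expected_action(peer_reachable=False, tiebreaker_reachable=False))
    # The surviving member: no peer, but the tiebreaker still answers.
    print(expected_action(peer_reachable=False, tiebreaker_reachable=True))
```

In the reported failure, the unplugged member never reaches the last branch in a lasting way, because "cluquorumd" restarts "clusvcmgrd" after it exits.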

Additional info:

We can cause a failover by changing the behavior of "cluquorumd". This 
involved patching it to continue exiting even if "clusvcmgrd" exits with a 
non-zero status. However, this breaks the ability of "cluquorumd" to restart the 
service manager in the event of a real software crash. 
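
To illustrate the trade-off in that workaround, here is a minimal sketch of a supervisor policy that tells a deliberate exit apart from a crash. It is neither the attached patch nor the fix shipped in 1.2.6; `should_restart` and `run_child` are hypothetical names, and the policy rests on the assumption that a real crash kills the child with a signal while a deliberate shutdown ends with a normal exit, zero or non-zero. A crash that ends in a plain non-zero exit would go undetected under this assumption.

```python
import subprocess
import sys
from typing import Sequence


def should_restart(returncode: int) -> bool:
    """Hypothetical policy: restart the child only if a signal killed it.

    On POSIX, subprocess reports death by signal N as returncode -N,
    so any negative value is treated as a crash.
    """
    return returncode < 0


def run_child(argv: Sequence[str]) -> int:
    """Run the child to completion and return its return code."""
    return subprocess.run(list(argv)).returncode


if __name__ == "__main__":
    # A child that exits deliberately with a non-zero status.
    deliberate = run_child([sys.executable, "-c", "import sys; sys.exit(1)"])
    print("restart after deliberate exit:", should_restart(deliberate))
```

A supervisor built this way would keep exiting after a deliberate non-zero exit while still restarting a child that died from, for example, a segmentation fault.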

Please see attachment for detailed log entries.
Comment 1 Suzanne Hillman 2003-12-16 14:50:56 EST
Um. There is no attachment...
Comment 2 Lon Hohberger 2003-12-16 14:53:10 EST
This is because of a bug which causes the quorum daemon to start and
check the disk tie breaker information even when the IP-based one is
enabled.

Try using the Update 1 beta code from RHN (1.2.6-1).

If it is not available to you via RHN, you can try this one (which is
basically the same, save for version and the fact that the following
is unofficial):

http://people.redhat.com/lhh/clumanager-1.2.6-0.1.89.2.13.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.6-0.1.89.2.13.src.rpm

This should solve the problem.  Note - if you are not using power
switches, members in the minority set will reboot immediately (this is
what it should have done) to try and preserve data integrity.

The "PANIC" state should only occur in two scenarios:
(1) Disk tie-breaker in use.  Disk reports member as 'up' even though
the network membership reports as down.  This is a broken cluster, but
no STONITH action takes place unless the member stops updating its
timestamp on shared storage.

(2) Failure to power-cycle a member after it is seen to be out of the
majority.  This only happens on members which have power controllers.
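
The two scenarios above can be restated as a predicate. This is a hedged sketch for illustration, not the quorum daemon's logic: the `MemberView` fields and the `panic_scenarios` function are hypothetical names, and the booleans stand in for state that the real daemons read from shared storage, network membership, and the power controller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemberView:
    """What the rest of the cluster observes about one member (hypothetical model)."""
    tiebreaker: str               # "disk" or "ip"
    disk_reports_up: bool         # member still marked 'up' on shared storage
    network_reports_up: bool      # member present in network membership
    out_of_majority: bool         # member seen to be outside the majority
    has_power_controller: bool    # member is attached to a power switch
    power_cycle_succeeded: bool   # result of the attempted power cycle


def panic_scenarios(view: MemberView) -> List[int]:
    """Return which of the two listed scenarios (1, 2) apply to this member."""
    scenarios = []
    # (1) Disk tiebreaker in use; disk says 'up' while the network says down.
    if (view.tiebreaker == "disk" and view.disk_reports_up
            and not view.network_reports_up):
        scenarios.append(1)
    # (2) A member out of the majority could not be power-cycled.
    if (view.out_of_majority and view.has_power_controller
            and not view.power_cycle_succeeded):
        scenarios.append(2)
    return scenarios


if __name__ == "__main__":
    # The reported setup: IP tiebreaker, assumed here to have no power switches.
    reported = MemberView("ip", True, False, True, False, False)
    print(panic_scenarios(reported))
```

Under this model, the reported configuration matches neither scenario, which is consistent with the PANIC state in this report being treated as a bug.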
Comment 3 WhidbeyNet 2003-12-16 14:58:19 EST
Created attachment 96567
Includes before and after logs, and small code patch
Comment 4 WhidbeyNet 2003-12-16 15:01:18 EST
Thank you for the incredibly fast response (even before the attachment was 
posted)! We will try patching to the updated code.
Comment 5 Lon Hohberger 2007-12-21 10:09:57 EST
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3.
