Bug 173462

Summary: quorum partition can not be found
Product: [Retired] Red Hat Cluster Suite
Component: clumanager
Version: 3
Hardware: i686
OS: Linux
Reporter: dhe
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, eguan, tao
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2005-12-06 16:18:17 UTC
Bug Blocks: 174689

Description dhe 2005-11-17 08:45:41 UTC

Description of problem:
Link to Issue 83460.

Version-Release number of selected component (if applicable):
original RHEL 3 ES and cluster suite

How reproducible:
Couldn't Reproduce


Additional info:

Comment 1 Lon Hohberger 2005-11-17 14:25:18 UTC
Issue:

After running for a year and rebooting a node, the cluster displayed abnormal
messages:

[root@gserver4 root]# clustat
Cluster Status - servers                                              23:38:09
Cluster Quorum Status Unknown (Connection refused)
 
 Member             Status     
 ------------------ ----------
 server             Unknown               
 server_2           Unknown               
 
No Quorum - Service States Unknown
----------------------------------

[root@server root]# service clumanager status
clumembd is stopped
cluquorumd is stopped
clulockd is stopped
clusvcmgrd is stopped
---------------------

Then I started clumanager manually:
-----------------------------------
[root@server root]# service clumanager start
Starting Red Hat Cluster Manager...
Starting Quorum Daemon:                                    [  OK  ] 


[root@server root]# service clumanager status
clumembd (pid 13043) is running...
cluquorumd (pid 13041) is running...
clulockd (pid 13049) is running...
clusvcmgrd is stopped
Note: Service manager is not running because this member
     is not participating in the cluster quorum.

Then, the really weird stuff:

[root@server root]# clustat
Cluster Status - server                                               23:39:51
Incarnation #0
(This member is not part of the cluster quorum)
 
 Member             Status     
 ------------------ ----------
 server             Inactive              
 server_2           Inactive   <-- You are here
 
No Quorum - Service States Unknown 

The quorum partitions and participation in the cluster quorum are different
things.  If the quorum partitions were not accessible, cluquorumd would not
successfully start, and clulockd could not run.
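
For example, assuming the usual raw-device bindings for the quorum partitions (the device names below are placeholders; use whatever "raw -qa" shows on this cluster), reading a sector from each underlying partition is a quick sanity check:

# list the raw device bindings used for the quorum partitions
raw -qa
# read one sector from each partition backing them (placeholder device names)
dd if=/dev/sdb1 of=/dev/null bs=512 count=1
dd if=/dev/sdb2 of=/dev/null bs=512 count=1

If both reads succeed, the partitions themselves are fine and the problem is elsewhere.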

If it has been running for a year, chances are good that it's time to upgrade
clumanager.  It looks like you have 1.2.22, which is actually a fine version,
but I'd upgrade to fix the IP tiebreaker bug.  I don't think this is your
problem, but it's best to eliminate all possible causes.

Bottom line, it looks like the nodes are not communicating properly with
one another.  Check things like network cables, switch ports, iptables rules,
etc.  Oddly, it looks like a /quorum/ convergence problem, not a membership
convergence problem.
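
As a rough sketch of what to look at (the interface name and peer address here are illustrative; substitute your actual cluster interface and the other member's IP):

# list the firewall rules and look for anything that could drop cluster traffic
iptables -L -n -v
# confirm basic reachability of the other member over the cluster interface
ping -c 3 -I eth1 10.53.80.37
# confirm the multicast heartbeat (225.0.0.11) is actually arriving
tcpdump -i eth1 host 225.0.0.11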

For example, try running "clumembd -fd" on both nodes with the cluster stopped and see
whether they converge.  You will see something like "Membership View #2:0x00000003"
if they converge, and 0x00000001 or 0x00000002 if not.  I'd like to see that output.
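
The view value reads as a bitmask of member IDs -- 0x00000001 means only member #0 is in the view, 0x00000002 only member #1, and 0x00000003 both.  A quick way to decode a value by hand:

# bit 0 -> member #0 present, bit 1 -> member #1 present
view=0x00000003
echo $(( view & 1 ))          # prints 1 if member #0 is in the view
echo $(( (view >> 1) & 1 ))   # prints 1 if member #1 is in the view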

Comment 2 dhe 2005-11-22 09:53:18 UTC
[root@gserver4 root]# clumembd -fd 
[3161] debug: Starting up 
[3161] debug: Setting configuration parameters. 
[3161] debug: Overriding interval to be 750000 
[3161] debug: Transmit thread set to ON 
[3161] debug: Overriding TKO count to be 20 
[3161] debug: Broadcast hearbeating set to OFF 
[3161] debug: Multicast hearbeat ON 
[3161] debug: Multicast address is 225.0.0.11 
[3161] debug: I am member #1 
[3161] debug: Interface IP is 127.0.0.1 
[3161] debug: Interface IP is 10.0.0.2 
[3161] debug: Interface IP is 10.53.80.38 
[3161] debug: Setting up multicast 225.0.0.11 on 10.53.80.38 
[3161] debug: Interface IP is 172.21.80.38 
[3161] debug: Cluster I/F: eth1 [10.53.80.38] 
[3161] debug: clumembd_start_watchdog: set duration to 14. 
[3161] debug: Waiting for requests. 
[3161] debug: Transmit thread: pulsar 
[3161] notice: Member gserver3 UP 
[3161] debug: Connect: Member #0 (10.53.80.37) [IPv4] 
[3161] debug: MB: New connect: fd8 
[3161] debug: MB: Received VF_MESSAGE, fd8 
[3161] debug: VF_JOIN_VIEW from member #0! Key: 0x27456381 #2 
[3161] debug: VF: Voting YES 
[3161] debug: MB: Received VF_MESSAGE, fd8 
[3161] debug: VF: Received VF_VIEW_FORMED, fd8 
[3161] debug: VF: Commit Key 0x27456381 #2 from member #0 
[3161] info: Membership View #2:0x00000003 
########################################################### 
# Output stopped here; I pressed CTRL-C to exit manually. #
########################################################### 
[3161] debug: clumembd_sw_watchdog_stop: successfully stopped wat

 
[root@gserver4 /]# clumembd -fd 
[19281] debug: Starting up 
[19281] debug: Setting configuration parameters. 
[19281] debug: Overriding interval to be 750000 
[19281] debug: Transmit thread set to ON 
[19281] debug: Overriding TKO count to be 20 
[19281] debug: Broadcast hearbeating set to OFF 
[19281] debug: Multicast hearbeat ON 
[19281] debug: Multicast address is 225.0.0.11 
[19281] debug: I am member #1 
[19281] debug: Interface IP is 127.0.0.1 
[19281] debug: Interface IP is 10.0.0.2 
[19281] debug: Interface IP is 10.53.80.38 
[19281] debug: Setting up multicast 225.0.0.11 on 10.53.80.38 
[19281] debug: Interface IP is 172.21.80.38 
[19281] debug: Cluster I/F: eth1 [10.53.80.38] 
[19281] debug: clumembd_start_watchdog: set duration to 14. 
[19281] debug: Waiting for requests. 
[19281] debug: Transmit thread: pulsar 
[19281] notice: Member gserver4 UP 
[1

Comment 3 Lon Hohberger 2005-11-22 18:20:25 UTC
Excellent, so the nodes can see each other (and are communicating properly).
If they could not, you would never see 0x00000003.

Starting clumanager and immediately running clustat will show the cluster as
inquorate with both members inactive; that is expected behavior until quorum
forms.  Could that have been the problem?

Or did this persist for a long time afterward (requiring a restart)?



Comment 4 dhe 2005-11-28 08:30:34 UTC
Yes, it persisted for a long time.



Comment 5 Eric Paris 2005-12-06 16:18:17 UTC
Lon, I'm not sure why this BZ got opened out of process.  It appears that they
had iptables rules blocking the cluster traffic, which meant the node couldn't
finish joining.  I'm closing the BZ.
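
For anyone hitting the same thing: the fix is simply to allow the cluster traffic through before any generic REJECT/DROP rules.  Roughly (these rules are illustrative only; the interface, peer address, and any port restrictions depend on the actual clumanager configuration):

# allow the multicast heartbeat and all traffic from the peer on the cluster interface
iptables -I INPUT -i eth1 -d 225.0.0.11 -j ACCEPT
iptables -I INPUT -i eth1 -s 10.53.80.37 -j ACCEPT
service iptables save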