Bug 222995 - 3nodes cluster - problem if all nodes lost quorum
Summary: 3nodes cluster - problem if all nodes lost quorum
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
Reported: 2007-01-17 13:41 UTC by Tomasz Jaszowski
Modified: 2009-04-16 20:01 UTC (History)
Clone Of:
Last Closed: 2007-01-17 13:49:44 UTC



Description Tomasz Jaszowski 2007-01-17 13:41:05 UTC
Description of problem:
3-node cluster using bonding and redundant switches. Only one of the nodes has
GFS mounted. After restarting both switches at the same moment, all nodes lose
their connection and the cluster is no longer quorate, so the nodes refuse
connections.

Version-Release number of selected component (if applicable):


How reproducible:
every time

Steps to Reproduce:
1. Set up a 3-node cluster
2. Shut down/restart both switches

  
Actual results:
As a result I have 3 separate nodes:

Jan 17 12:09:46 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.
Jan 17 12:09:46 tedse-wls3 ccsd[2548]: Error while processing connect:
Connection refused
Jan 17 12:09:50 tedse-wls1 ccsd[2979]: Cluster is not quorate.  Refusing connection.
Jan 17 12:09:50 tedse-wls1 ccsd[2979]: Error while processing connect:
Connection refused
Jan 17 12:09:53 tedse-wls2 ccsd[2682]: Cluster is not quorate.  Refusing connection.
Jan 17 12:09:54 tedse-wls2 ccsd[2682]: Error while processing connect:
Connection refused
Jan 17 12:09:56 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.
Jan 17 12:09:56 tedse-wls3 ccsd[2548]: Error while processing connect:
Connection refused
Jan 17 12:10:00 tedse-wls1 ccsd[2979]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:00 tedse-wls1 ccsd[2979]: Error while processing connect:
Connection refused
Jan 17 12:10:03 tedse-wls2 ccsd[2682]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:03 tedse-wls2 ccsd[2682]: Error while processing connect:
Connection refused
Jan 17 12:10:06 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:06 tedse-wls3 ccsd[2548]: Error while processing connect:
Connection refused
Jan 17 12:10:10 tedse-wls1 ccsd[2979]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:10 tedse-wls1 ccsd[2979]: Error while processing connect:
Connection refused
Jan 17 12:10:13 tedse-wls2 ccsd[2682]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:13 tedse-wls2 ccsd[2682]: Error while processing connect:
Connection refused
Jan 17 12:10:16 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:16 tedse-wls3 ccsd[2548]: Error while processing connect:
Connection refused
Jan 17 12:10:20 tedse-wls1 ccsd[2979]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:20 tedse-wls1 ccsd[2979]: Error while processing connect:
Connection refused
Jan 17 12:10:23 tedse-wls2 ccsd[2682]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:24 tedse-wls2 ccsd[2682]: Error while processing connect:
Connection refused
Jan 17 12:10:26 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.

After fencing one of them (reboot), it rejoins another node, and then those two
restart the last node.

Expected results:
Manual (or automatic?) rejoin of a node to a specified node without a restart.

Additional info:

Comment 1 Christine Caulfield 2007-01-17 13:49:44 UTC
This is working as expected. 

Think of it this way: if all of the nodes lose connection to each other, then
none of them can know whether the other two can still form a valid cluster. If
they could, those two would carry on working.

So when the remaining node comes back online it must be kicked out of the
cluster because it cannot reconcile its state with the other two.

Now in this case you don't have two working nodes, but none of the nodes knows
that for sure because they have all been disconnected from each other. Nor
does any node have quorum, so none of them can be allowed to fence another node.

So what happens is that they just sit and stare at each other forever. This is
why you need a properly redundant network switch.
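
To make the vote counting behind this explicit, here is a minimal sketch in
Python. It is not cman code. It assumes one vote per node and a simple-majority
quorum rule (floor(expected_votes / 2) + 1); the function names are
hypothetical, and only the host names come from the log above.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum, assuming a simple-majority rule."""
    return expected_votes // 2 + 1


def quorate_partitions(partitions, votes_per_node=1):
    """Map each network partition to whether it holds enough votes for quorum."""
    expected_votes = sum(len(p) for p in partitions) * votes_per_node
    needed = quorum_threshold(expected_votes)
    return {tuple(p): len(p) * votes_per_node >= needed for p in partitions}


nodes = ["tedse-wls1", "tedse-wls2", "tedse-wls3"]

# Both switches down: every node is alone in its own partition.
all_isolated = quorate_partitions([[n] for n in nodes])

# Only one node cut off: the other two can still see each other.
one_isolated = quorate_partitions([nodes[:2], nodes[2:]])

print("votes needed:", quorum_threshold(len(nodes)))
print("all isolated:", all_isolated)
print("one isolated:", one_isolated)
```

Under these assumptions, a lone node holds 1 of the 2 votes required, so when
all three are isolated no partition is quorate and nobody may fence. With only
one node cut off, the remaining pair keeps quorum and can fence the third.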

