Bug 222995 - 3nodes cluster - problem if all nodes lost quorum
Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2007-01-17 08:41 EST by Tomasz Jaszowski
Modified: 2009-04-16 16:01 EDT
CC List: 1 user

Doc Type: Bug Fix
Last Closed: 2007-01-17 08:49:44 EST
Attachments: None

Description Tomasz Jaszowski 2007-01-17 08:41:05 EST
Description of problem:
Three-node cluster using bonding and a redundant pair of switches. Only one of the nodes has GFS mounted. After restarting both switches at the same moment, all nodes lose connection to each other; the cluster is no longer quorate, so the nodes refuse connections.
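
The "not quorate" refusals follow from simple-majority quorum arithmetic. A minimal sketch in Python, assuming the CMAN defaults of one vote per node and expected_votes equal to the node count (names are illustrative, not the actual implementation):

    # Simple-majority quorum: a partition is quorate when its votes
    # reach a strict majority of expected_votes.
    def quorum_threshold(expected_votes):
        return expected_votes // 2 + 1

    expected = 3  # three nodes, one vote each
    # After both switches restart, each node is alone in its own
    # partition and sees only its own vote:
    for node in ("tedse-wls1", "tedse-wls2", "tedse-wls3"):
        votes_seen = 1
        if votes_seen >= quorum_threshold(expected):
            print(node, "quorate")
        else:
            print(node, "not quorate -> ccsd refuses connections")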

Version-Release number of selected component (if applicable):


How reproducible:
every time

Steps to Reproduce:
1. Three-node cluster
2. Shut down/restart both switches at the same time

Actual results:
As a result I have three separate nodes:

Jan 17 12:09:46 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.
Jan 17 12:09:46 tedse-wls3 ccsd[2548]: Error while processing connect: Connection refused
Jan 17 12:09:50 tedse-wls1 ccsd[2979]: Cluster is not quorate.  Refusing connection.
Jan 17 12:09:50 tedse-wls1 ccsd[2979]: Error while processing connect: Connection refused
Jan 17 12:09:53 tedse-wls2 ccsd[2682]: Cluster is not quorate.  Refusing connection.
Jan 17 12:09:54 tedse-wls2 ccsd[2682]: Error while processing connect: Connection refused
Jan 17 12:09:56 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.
Jan 17 12:09:56 tedse-wls3 ccsd[2548]: Error while processing connect: Connection refused
Jan 17 12:10:00 tedse-wls1 ccsd[2979]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:00 tedse-wls1 ccsd[2979]: Error while processing connect: Connection refused
Jan 17 12:10:03 tedse-wls2 ccsd[2682]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:03 tedse-wls2 ccsd[2682]: Error while processing connect: Connection refused
Jan 17 12:10:06 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:06 tedse-wls3 ccsd[2548]: Error while processing connect: Connection refused
Jan 17 12:10:10 tedse-wls1 ccsd[2979]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:10 tedse-wls1 ccsd[2979]: Error while processing connect: Connection refused
Jan 17 12:10:13 tedse-wls2 ccsd[2682]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:13 tedse-wls2 ccsd[2682]: Error while processing connect: Connection refused
Jan 17 12:10:16 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:16 tedse-wls3 ccsd[2548]: Error while processing connect: Connection refused
Jan 17 12:10:20 tedse-wls1 ccsd[2979]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:20 tedse-wls1 ccsd[2979]: Error while processing connect: Connection refused
Jan 17 12:10:23 tedse-wls2 ccsd[2682]: Cluster is not quorate.  Refusing connection.
Jan 17 12:10:24 tedse-wls2 ccsd[2682]: Error while processing connect: Connection refused
Jan 17 12:10:26 tedse-wls3 ccsd[2548]: Cluster is not quorate.  Refusing connection.

After fencing one of them (reboot), it rejoins one of the other nodes, and then those two restart the last node.
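
The rejoin-then-fence sequence follows from the same arithmetic: a two-node partition holds 2 of the 3 expected votes, so it is quorate and is allowed to fence the remaining isolated node. Continuing the hypothetical Python sketch from the description:

    threshold = 3 // 2 + 1  # 2 votes needed out of expected_votes=3
    pair_votes = 2          # the rebooted node plus the peer it rejoined
    lone_votes = 1          # the last node, still on its own
    print(pair_votes >= threshold)  # True: the pair is quorate and may fence
    print(lone_votes >= threshold)  # False: the lone node gets restarted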

Expected results:
Manual (or automatic?) rejoin of a node to the cluster without a restart.

Additional info:
Comment 1 Christine Caulfield 2007-01-17 08:49:44 EST
This is working as expected. 

Think of it this way: if all of the nodes lose connection to each other, then no node can know whether the other two have formed a valid cluster between themselves. If they had, those two would carry on working.

So when the remaining node comes back online it must be kicked out of the
cluster because it cannot reconcile its state with the other two.

Now in this case you don't have two working nodes, but none of the nodes knows that for sure, because they have all been disconnected from each other. Nor does any node have quorum, so no node can be allowed to fence another.

So what happens is that they just sit and stare at each other forever. This is
why you need a properly redundant network switch.
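
In other words, the standoff is a fencing-eligibility rule at work: only a quorate partition may fence, and after the simultaneous switch restart there are three single-vote partitions, none of which qualifies. An illustrative check in Python, under the same one-vote-per-node assumption as the sketches above:

    def may_fence(partition_votes, expected_votes=3):
        # Fencing requires quorum; otherwise an isolated minority
        # could shoot down a healthy majority (split brain).
        return partition_votes >= expected_votes // 2 + 1

    partitions = {"tedse-wls1": 1, "tedse-wls2": 1, "tedse-wls3": 1}
    print(any(may_fence(v) for v in partitions.values()))  # False: all sit and wait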
