Bug 733736

Summary: After reboot of one node in a cluster, if the version of cluster.conf has changed the node cannot rejoin the cluster
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.1
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: Jose Castillo <jcastillo>
Assignee: Fabio Massimo Di Nitto <fdinitto>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint, lhh, rpeterso, teigland
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-08-26 17:48:42 UTC

Description Jose Castillo 2011-08-26 16:24:42 UTC
Description of problem:
When a node in a cluster is rebooted and the running version of cluster.conf has changed in the meantime, the newer version is not transferred to the rebooted node, so it cannot rejoin the cluster and sometimes forms its own one-node cluster.
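
The version in question is the config_version attribute of /etc/cluster/cluster.conf. A minimal sketch of the relevant part, with the cluster name, node names, and values taken from the cman_tool output below:

<cluster name="cluster" config_version="26">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="jc_rhcs6_B" nodeid="1" votes="1"/>
    <clusternode name="jc_rhcs6_A" nodeid="2" votes="1"/>
  </clusternodes>
</cluster>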

Version-Release number of selected component (if applicable):
RHEL 6.1
cman-3.0.12-41.el6_1.1
clusterlib-3.0.12-41.el6_1.1

How reproducible:
Easily

Steps to Reproduce:
1. In a two-node cluster, shut down node 2.
2. On node 1, increase the config_version of cluster.conf and propagate the change with "cman_tool version -r".
3. Boot node 2.
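
The same steps as shell commands, assuming node 2 is the host jc_rhcs6_B as in the logs below, and that the config_version bump is 20 -> 26 as reported by cman_tool status:

# on node 2 (jc_rhcs6_B): take it down
shutdown -h now

# on node 1 (jc_rhcs6_A): edit /etc/cluster/cluster.conf and raise
# config_version (here 20 -> 26), validate, then push the new version
ccs_config_validate
cman_tool version -r

# power node 2 back on and watch /var/log/messages there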
  
Actual results:
On node 2 we see the following messages:

Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [CMAN  ] CMAN 3.0.12 (built Jul 11 2011 04:18:42) started
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [SERV  ] Service engine loaded: corosync configuration service
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [SERV  ] Service engine loaded: corosync profile loading service
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [QUORUM] Using quorum provider quorum_cman
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [CMAN  ] quorum regained, resuming activity
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [QUORUM] This node is within the primary component and will provide service.
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [QUORUM] Members[1]: 1
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [QUORUM] Members[1]: 1
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [CPG   ] downlist received left_list: 0
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [CPG   ] chosen downlist from node r(0) ip(192.168.122.120) 
Aug 26 15:38:28 jc_rhcs6_B corosync[1217]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [CMAN  ] Can't get updated config version 26: New configuration version has to be newer than current running configuration#012.
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [CMAN  ] Activity suspended on this node
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [CMAN  ] Error reloading the configuration, will retry every second
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [CMAN  ] Node 2 conflict, remote config version id=26, local=20
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [CPG   ] downlist received left_list: 0
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [CPG   ] downlist received left_list: 0
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [CPG   ] chosen downlist from node r(0) ip(192.168.122.120) 
Aug 26 15:38:29 jc_rhcs6_B corosync[1217]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 26 15:38:30 jc_rhcs6_B corosync[1217]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Aug 26 15:38:30 jc_rhcs6_B corosync[1217]:   [CMAN  ] Can't get updated config version 26: New configuration version has to be newer than current running configuration#012.
Aug 26 15:38:30 jc_rhcs6_B corosync[1217]:   [CMAN  ] Activity suspended on this node
Aug 26 15:38:30 jc_rhcs6_B corosync[1217]:   [CMAN  ] Error reloading the configuration, will retry every second
Aug 26 15:38:31 jc_rhcs6_B corosync[1217]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Aug 26 15:38:31 jc_rhcs6_B corosync[1217]:   [CMAN  ] Can't get updated config version 26: New configuration version has to be newer than current running configuration#012.
Aug 26 15:38:31 jc_rhcs6_B corosync[1217]:   [CMAN  ] Activity suspended on this node
Aug 26 15:38:31 jc_rhcs6_B corosync[1217]:   [CMAN  ] Error reloading the configuration, will retry every second
Aug 26 15:38:32 jc_rhcs6_B corosync[1217]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Aug 26 15:38:32 jc_rhcs6_B corosync[1217]:   [CMAN  ] Can't get updated config version 26: New configuration version has to be newer than current running configuration#012.
Aug 26 15:38:32 jc_rhcs6_B corosync[1217]:   [CMAN  ] Activity suspended on this node
Aug 26 15:38:32 jc_rhcs6_B corosync[1217]:   [CMAN  ] Error reloading the configuration, will retry every second
Aug 26 15:38:33 jc_rhcs6_B corosync[1217]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Aug 26 15:38:33 jc_rhcs6_B corosync[1217]:   [CMAN  ] Can't get updated config version 26: New configuration version has to be newer than current running configuration#012.
Aug 26 15:38:33 jc_rhcs6_B corosync[1217]:   [CMAN  ] Activity suspended on this node
Aug 26 15:38:33 jc_rhcs6_B corosync[1217]:   [CMAN  ] Error reloading the configuration, will retry every second


Node 2 creates a 1-node cluster:
# cman_tool status
Version: 6.2.0
Config Version: 20
Cluster Name: cluster
Cluster Id: 63628
Cluster Member: Yes
Cluster Generation: 376
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1  
Active subsystems: 7
Flags: 2node Error 
Ports Bound: 0  
Node name: jc_rhcs6_B
Node ID: 1
Multicast addresses: 239.192.248.133 
Node addresses: 192.168.122.120 

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    372   2011-08-26 15:38:28  jc_rhcs6_B
   2   X      0                        jc_rhcs6_A

Meanwhile, node 1 sees a two-node cluster:
# cman_tool status
Version: 6.2.0
Config Version: 26
Cluster Name: cluster
Cluster Id: 63628
Cluster Member: Yes
Cluster Generation: 376
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1  
Active subsystems: 7
Flags: 2node 
Ports Bound: 0  
Node name: jc_rhcs6_A
Node ID: 2
Multicast addresses: 239.192.248.133 
Node addresses: 192.168.122.117 

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    376   2011-08-26 15:38:28  jc_rhcs6_B
   2   M    356   2011-08-26 15:10:32  jc_rhcs6_A


Expected results:

I expected that if the version in use by the running cluster is newer than the one on the rebooted node, it would be copied over, or at least that the node would not form its own cluster. Are my expectations wrong?

Additional info:

A similar bug was already reported as bz 680155, but I don't know whether the two are fully related. That one was resolved by erratum http://rhn.redhat.com/errata/RHBA-2011-0537.html, which involved upgrading to:

cman-3.0.12-41.el6
clusterlib-3.0.12-41.el6

But as you can see above, I am running newer versions in this test. Let me know if you need any other data.

Comment 2 Fabio Massimo Di Nitto 2011-08-26 17:48:42 UTC
(In reply to comment #0)
> Description of problem:
> When a node in a cluster is rebooted and the running version of cluster.conf
> has changed in the meantime, the newer version is not transferred to the
> rebooted node, so it cannot rejoin the cluster and sometimes forms its own
> one-node cluster.

In RHEL6 this behaviour is by design. If a configuration is wrong, it is wrong and needs to be fixed. This is no different from expecting any other daemon on the system to fix its configuration automatically when it is wrong.

In short, after a lengthy discussion with Chrissie, we concluded that synchronizing the configuration at startup is a very complex operation with so many paths to failure that it is not worth even considering.
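
For reference, the manual fix implied here would be something along these lines, run on the out-of-date node (a sketch, not taken from this bug; using jc_rhcs6_A as the source host is an assumption based on the report above):

# stop the cluster stack on the out-of-date node
service cman stop
# copy the current cluster.conf from a node running the newer version
scp jc_rhcs6_A:/etc/cluster/cluster.conf /etc/cluster/cluster.conf
# start the stack again; the node now joins with the matching config_version
service cman start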