Bug 806002

Summary: Failed cluster rejoin after reboot might lead to later rejoin without being in fence domain
Product: Red Hat Enterprise Linux 6
Reporter: Jaroslav Kortus <jkortus>
Component: cluster
Assignee: Fabio Massimo Di Nitto <fdinitto>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: low
Priority: medium
Version: 6.3
CC: ccaulfie, cluster-maint, jpayne, lhh, rpeterso, teigland
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: cluster-3.0.12.1-30.el6
Doc Type: Bug Fix
Doc Text:
Cause: cman init script did not roll back changes in case of errors during startup.
Consequence: some daemons could be erroneously left running on a node.
Fix: cman init script now performs a full roll back when errors are encountered.
Result: no daemons are left running in case of errors.
Last Closed: 2012-06-20 13:58:47 UTC

Description Jaroslav Kortus 2012-03-22 16:19:22 UTC
Description of problem:
When a node fails and gets fenced, it is usually rebooted and rejoins the cluster with a fresh state. In this case everything works as expected: the node joins, waits for quorum, and starts all configured services (clvmd, the dlm daemons, etc.).

However, when something blocks the rejoin (a switch port failure, for example), the node cannot rejoin the cluster. The attempt fails during boot at "Waiting for quorum...", later accompanied by "[FAILED]".

In this state cman is still running, and once network connectivity is restored the node will happily rejoin. But it is just plain cman that is running, without fenced, clvmd, the dlm daemons, or the other services started later in the init script.

So while the node appears to be Online and fully operational, the opposite is true, and a manual 'service cman start' is still required to reach an operational state. This can be confusing and IMHO it is not a correct state to be in.
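
The mismatch is easy to spot by comparing the two membership views on any cluster node (a hedged sketch; both commands exist in RHEL 6, the comments describe what to look for):

cman_tool nodes    # the rejoined node is listed with status "M" (member)
fence_tool ls      # ...but its node id is missing from the "members" line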


Version-Release number of selected component (if applicable):
cman-3.0.12.1-28.el6.x86_64


How reproducible:
always

Steps to Reproduce:
1. block a node at the switch (for example, iptables -m physdev drops when using virts; see the sketch after these steps)
2. wait for the node to get fenced/rebooted
3. see the failing attempt to rejoin
4. remove the block from 1.
5. wait for the node to rejoin
6. check fence_tool ls output on all nodes (previously blocked node still not present, although it is Online)
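
For step 1 on a virtualization host, the switch-port failure can be simulated by dropping the victim's bridged traffic with iptables' physdev match (a sketch only; vnet0 is a hypothetical tap device name, substitute the victim node's interface):

# on the virtualization host: block all bridged traffic for the victim node
iptables -I FORWARD -m physdev --physdev-in vnet0 -j DROP
iptables -I FORWARD -m physdev --physdev-out vnet0 -j DROP
# step 4: lift the block again
iptables -D FORWARD -m physdev --physdev-in vnet0 -j DROP
iptables -D FORWARD -m physdev --physdev-out vnet0 -j DROP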
  
Actual results:
Node rejoins the cluster without any usable services and without being in fence domain

Expected results:
one of:
1. roll back all previously taken actions (stopping qdiskd, killing cman, basically calling the stop function) if quorum is not regained within the specified timeout
2. start all daemons as if there was quorum and let them handle it properly. 

I guess variant 1 should be preferred, as it more or less preserves the current behavior without the described negative consequences.
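
To illustrate variant 1, the start path of the init script would unwind on a quorum timeout by calling its own stop logic (a minimal sketch under assumed helper names, not the actual cman init script):

start_cluster() {
    start_qdiskd || return 1
    start_cman || { stop_qdiskd; return 1; }
    # if quorum does not arrive in time, roll back everything started so far
    wait_for_quorum "$QUORUM_TIMEOUT" || { stop_cluster; return 1; }
    start_daemons || { stop_cluster; return 1; }  # fenced, dlm_controld, gfs_controld, ...
}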

Comment 2 Fabio Massimo Di Nitto 2012-03-26 08:13:25 UTC
(In reply to comment #0)

good catch.

> Expected results:
> one of:
> 1. roll back all previously taken actions (stopping qdiskd, killing cman,
> basically calling the stop function) if quorum is not regained within the
> specified timeout
> 2. start all daemons as if there was quorum and let them handle it
> properly. 
> 
> I guess variant 1 should be preferred, as it more or less preserves the
> current behavior without the described negative consequences.

Yes we should go for 1).

#2 is doable, but it involves some major surgery in the init script and adds a major change in behavior.

Comment 6 Fabio Massimo Di Nitto 2012-05-02 07:55:33 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: cman init script did not roll back changes in case of errors during startup.
Consequence: some daemons could be erroneously left running on a node. 
Fix: cman init script now performs a full roll back when errors are encountered.
Result: no daemons are left running in case of errors.

Comment 7 Justin Payne 2012-05-07 16:38:47 UTC
Verified in cman-3.0.12.1-31.el6 (first reproducing the old behavior with the -24 package, then retesting after the update):

[root@dash-03 ~]# rpm -q cman
cman-3.0.12.1-24.el6.x86_64
[root@dash-03 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman... corosync died: Could not read cluster configuration Check cluster logs for details
                                                           [FAILED]
[root@dash-03 ~]# /etc/init.d/cman status
corosync is stopped
[root@dash-03 ~]# ls new/
clusterlib-3.0.12.1-31.el6.x86_64.rpm  cman-3.0.12.1-31.el6.x86_64.rpm
[root@dash-03 ~]# yum localupdate new/c*
[root@dash-03 ~]# rpm -q cman
cman-3.0.12.1-31.el6.x86_64
[root@dash-03 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman... corosync died: Could not read cluster configuration Check cluster logs for details
                                                           [FAILED]
Stopping cluster: 
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]

Comment 9 errata-xmlrpc 2012-06-20 13:58:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0861.html