Bug 806002 - Failed cluster rejoin after reboot might lead to later rejoin without being in fence domain
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: rc
Target Release: ---
Assigned To: Fabio Massimo Di Nitto
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2012-03-22 12:19 EDT by Jaroslav Kortus
Modified: 2012-06-20 09:58 EDT
CC List: 6 users

See Also:
Fixed In Version: cluster-3.0.12.1-30.el6
Doc Type: Bug Fix
Doc Text:
Cause: cman init script did not roll back changes in case of errors during startup. Consequence: some daemons could be erroneously left running on a node. Fix: cman init script now performs a full roll back when errors are encountered. Result: no daemons are left running in case of errors.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-20 09:58:47 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Jaroslav Kortus 2012-03-22 12:19:22 EDT
Description of problem:
When a node fails and gets fenced, it is usually rebooted and rejoins the cluster with a fresh state. In this case everything works as expected (the node joins, waits for quorum, and starts all configured services: clvmd, the dlm daemons, etc.)

However, when the rejoin is blocked (think of a switch port failure, for example), the node cannot rejoin the cluster. The attempt fails during boot at "Waiting for quorum...", later accompanied by "[FAILED]".

In this state cman is still running, and after network connectivity is restored it will happily rejoin. At that point it is just plain cman that is running, without fenced, clvmd, the dlm daemons, or the other services started later in the init script.

So while the node appears Online and fully operational, the opposite is true, and a manual 'service cman start' is still required to reach an operational state. This is confusing, and IMHO it is not a correct state to be in.


Version-Release number of selected component (if applicable):
cman-3.0.12.1-28.el6.x86_64


How reproducible:
always

Steps to Reproduce:
1. block a node on switch (iptables -m physdev drops in virts, for example)
2. wait for the node to get fenced/rebooted
3. see the failing attempt to rejoin
4. remove the block from 1.
5. wait for the node to rejoin
6. check fence_tool ls output on all nodes (previously blocked node still not present, although it is Online)
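For step 1, the block can be reproduced on the virtualization host with iptables physdev matches, roughly as follows (a sketch only; vnet0 is an assumed tap device name for the blocked guest, run as root):

```shell
# Block all bridged traffic to/from the guest's tap device (step 1).
iptables -I FORWARD -m physdev --physdev-in  vnet0 -j DROP
iptables -I FORWARD -m physdev --physdev-out vnet0 -j DROP

# Later, remove the block again (step 4).
iptables -D FORWARD -m physdev --physdev-in  vnet0 -j DROP
iptables -D FORWARD -m physdev --physdev-out vnet0 -j DROP
```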
  
Actual results:
Node rejoins the cluster without any usable services and without being in fence domain

Expected results:
one of:
1. rollback of all previously taken actions (stopping qdiskd, killing cman, basically calling the stop function) if the quorum is not regained during the specified timeout
2. start all daemons as if there was quorum and let them handle it properly.

I guess variant 1 should be preferred as it more or less copies the existing state without described negative consequences.
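To illustrate variant 1, here is a minimal sketch of the roll-back pattern. This is NOT the real cman init script; the step names and the simulated quorum timeout are stand-ins. The idea is simply that each completed step is recorded, and any failure unwinds the recorded steps in reverse order instead of leaving daemons behind:

```shell
#!/bin/sh
# Illustrative roll-back-on-failure pattern (variant 1 above).
# Step names (qdiskd, cman, quorum, fenced) mirror the init script's
# stages but the commands behind them are dummies here.

STARTED=""   # steps completed so far, most recent first

rollback() {
    # Undo completed steps in reverse start order.
    for step in $STARTED; do
        echo "Stopping $step"
    done
    STARTED=""
}

run_step() {
    name=$1; shift
    if "$@"; then
        STARTED="$name $STARTED"
        echo "Starting $name   [  OK  ]"
        return 0
    fi
    echo "Starting $name   [FAILED]"
    rollback
    return 1
}

start_cluster() {
    run_step qdiskd true  || return 1
    run_step cman   true  || return 1
    run_step quorum false || return 1  # stands in for "Waiting for quorum..." timing out
    run_step fenced true  || return 1  # never reached when quorum fails
}

start_cluster || echo "startup failed; all started daemons were rolled back"
```

With the simulated quorum failure, qdiskd and cman are stopped again in reverse order and nothing is left running, which is exactly the state a fenced node should be in after a failed rejoin.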
Comment 2 Fabio Massimo Di Nitto 2012-03-26 04:13:25 EDT
(In reply to comment #0)

good catch.

> Expected results:
> one of:
> 1. rollback of all previously taken actions (stopping qdiskd, killing cman,
> basically calling the stop function) if the quorum is not regained during the
> specified timeout
> 2. start all daemons as if there was quorum and let them handle it
> properly. 
> 
> I guess variant 1 should be preferred as it more or less copies the existing
> state without described negative consequences.

Yes we should go for 1).

#2 is doable, but it involves major surgery in the init script and adds a major change in behavior.
Comment 6 Fabio Massimo Di Nitto 2012-05-02 03:55:33 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: cman init script did not roll back changes in case of errors during startup.
Consequence: some daemons could be erroneously left running on a node. 
Fix: cman init script now performs a full roll back when errors are encountered.
Result: no daemons are left running in case of errors.
Comment 7 Justin Payne 2012-05-07 12:38:47 EDT
Verified in cman-3.0.12.1-31.el6:

[root@dash-03 ~]# rpm -q cman
cman-3.0.12.1-24.el6.x86_64
[root@dash-03 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman... corosync died: Could not read cluster configuration Check cluster logs for details
                                                           [FAILED]
[root@dash-03 ~]# /etc/init.d/cman status
corosync is stopped
[root@dash-03 ~]# ls new/
clusterlib-3.0.12.1-31.el6.x86_64.rpm  cman-3.0.12.1-31.el6.x86_64.rpm
[root@dash-03 ~]# yum localupdate new/c*
[root@dash-03 ~]# rpm -q cman
cman-3.0.12.1-31.el6.x86_64
[root@dash-03 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman... corosync died: Could not read cluster configuration Check cluster logs for details
                                                           [FAILED]
Stopping cluster: 
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]
Comment 9 errata-xmlrpc 2012-06-20 09:58:47 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0861.html
