Bug 495707

Summary: CMAN: too many transition restarts - will die
Product: [Retired] Red Hat Cluster Suite
Reporter: Eduardo Damato <edamato>
Component: cman-kernel
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 4
CC: bkahn, cfeist, cluster-maint, djansa, djuran, edamato, grimme, hlawatschek, jwilleford, lpleiman, michael.hagmann, mwhitehe, rpeterso, sbradley, tao
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: cman-kernel-2.6.9-55.13.el4_7
Doc Type: Bug Fix
Last Closed: 2009-08-31 07:52:26 UTC

Description Eduardo Damato 2009-04-14 13:50:19 UTC
Description of problem:

When adding a new node to an existing 4-node cluster on RHEL4:

2.6.9-78.0.1.ELsmp #1 SMP Tue Jul 22 18:01:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

The addition of the node triggered CMAN transitions that did not stabilize, so the joining node was expelled from the cluster. The 4 existing cluster nodes, however, suffered 'Inconsistent cluster view' errors: 3 nodes exited CMAN, each followed by dlm_emergency_shutdown, and the remaining node lost quorum before also doing a dlm_emergency_shutdown.
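
Because the sequence above spans several nodes, it can help to interleave the per-node syslog excerpts into one timeline. The following is only a minimal sketch: the helper names (parse_syslog_time, merged_timeline) are hypothetical, and it assumes every line starts with the 15-character "Mon DD HH:MM:SS" stamp seen in the logs below and that all lines fall within the same year.

```python
from datetime import datetime


def parse_syslog_time(line):
    """Parse the leading 'Mon DD HH:MM:SS' stamp (syslog omits the year)."""
    return datetime.strptime(line[:15], "%b %d %H:%M:%S")


def merged_timeline(lines, needle="CMAN:"):
    """Keep only lines containing `needle` and order them by timestamp.

    sorted() is stable, so lines sharing a timestamp keep their input order.
    """
    picked = [line.rstrip() for line in lines if needle in line]
    return sorted(picked, key=parse_syslog_time)


# Three lines copied from the excerpts in this report, deliberately out of order.
TIMELINE_SAMPLE = [
    "Mar 23 14:41:41 node04 kernel: CMAN: node node05.example.com has been "
    "removed from the cluster : No response to messages",
    "Mar 23 14:41:35 node05 kernel: CMAN: too many transition restarts - will die",
    "Mar 23 14:38:03 node05 clurgmgrd[10272]: <notice> Reconfiguring ",
]

if __name__ == "__main__":
    for entry in merged_timeline(TIMELINE_SAMPLE):
        print(entry)
```

Node clocks are assumed to be in sync; if they are not, the merged order is only approximate.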

This behaviour is similar to:

 https://bugzilla.redhat.com/show_bug.cgi?id=444751
 https://bugzilla.redhat.com/show_bug.cgi?id=435491#c22

(Sanitized data)

NODE08:

(NEW NODE - trying to join)

Mar 23 14:38:49 node08 ccsd[10436]: cluster.conf (cluster name = Cluster, version = 83) found. 
Mar 23 14:38:50 node08 kernel: CMAN: Waiting to join or form a Linux-cluster
Mar 23 14:38:50 node08 kernel: CMAN: sending membership request
Mar 23 14:38:51 node08 ccsd[10436]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.5 
Mar 23 14:38:51 node08 ccsd[10436]: Initial status:: Inquorate 
Mar 23 14:38:51 node08 ccsd[10436]: Cluster is not quorate.  Refusing connection. 
Mar 23 14:38:51 node08 ccsd[10436]: Error while processing connect: Connection refused 
Mar 23 14:38:56 node08 ccsd[10436]: Cluster is not quorate.  Refusing connection. 
Mar 23 14:38:56 node08 ccsd[10436]: Error while processing connect: Connection refused 
Mar 23 14:39:01 node08 ccsd[10436]: Cluster is not quorate.  Refusing connection. 
Mar 23 14:39:01 node08 ccsd[10436]: Error while processing connect: Connection refused 
Mar 23 14:39:07 node08 ccsd[10436]: Cluster is not quorate.  Refusing connection. 
Mar 23 14:39:07 node08 ccsd[10436]: Error while processing connect: Connection refused 
...
Mar 23 14:44:24 node08 kernel: CMAN: sending membership request
Mar 23 14:44:24 node08 kernel: CMAN: sending membership request
Mar 23 14:44:26 node08 ccsd[10752]: Stopping ccsd, SIGTERM received. 
Mar 23 14:44:27 node08 ccsd: shutdown succeeded
Mar 23 14:44:29 node08 kernel: CMAN: sending membership request
Mar 23 14:44:34 node08 last message repeated 3 times
Mar 23 14:44:36 node08 cman: failed to stop cman failed
Mar 23 14:44:39 node08 kernel: CMAN: sending membership request
Mar 23 14:44:39 node08 kernel: CMAN: sending membership request
...
Mar 23 14:44:39 node08 kernel: CMAN: got node node04.example.com
Mar 23 14:44:39 node08 kernel: CMAN: got node node07.example.com
Mar 23 14:44:39 node08 kernel: CMAN: Finished transition, generation 1
Mar 23 14:44:51 node08 kernel: CMAN: quorum regained, resuming activity
Mar 23 14:44:51 node08 kernel: CMAN: Initiating transition, generation 35
Mar 23 14:44:51 node08 kernel: CMAN: Being told to leave the cluster by node 4
Mar 23 14:44:51 node08 kernel: CMAN: we are leaving the cluster. 
Mar 23 14:45:46 node08 shutdown: shutting down for system reboot



NODE05:

Mar 23 14:38:03 node05 ccsd[7155]: Update of cluster.conf complete (version 82 -> 83). 
Mar 23 14:38:03 node05 clurgmgrd[10272]: <notice> Reconfiguring 
Mar 23 14:38:50 node05 kernel: CMAN: Initiating transition, generation 20
Mar 23 14:39:05 node05 kernel: CMAN: Initiating transition, generation 21
Mar 23 14:39:20 node05 kernel: CMAN: Initiating transition, generation 22
Mar 23 14:39:35 node05 kernel: CMAN: Initiating transition, generation 23
Mar 23 14:39:50 node05 kernel: CMAN: Initiating transition, generation 24
Mar 23 14:40:05 node05 kernel: CMAN: Initiating transition, generation 25
Mar 23 14:40:20 node05 kernel: CMAN: Initiating transition, generation 26
Mar 23 14:40:35 node05 kernel: CMAN: Initiating transition, generation 27
Mar 23 14:40:50 node05 kernel: CMAN: Initiating transition, generation 28
Mar 23 14:41:05 node05 kernel: CMAN: Initiating transition, generation 29
Mar 23 14:41:20 node05 kernel: CMAN: Initiating transition, generation 30
Mar 23 14:41:35 node05 kernel: CMAN: too many transition restarts - will die
Mar 23 14:41:35 node05 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
Mar 23 14:41:35 node05 kernel: WARNING: dlm_emergency_shutdown
Mar 23 14:41:35 node05 kernel: WARNING: dlm_emergency_shutdown
Mar 23 14:41:35 node05 kernel: SM: 00000002 sm_stop: SG still joined
Mar 23 14:41:35 node05 kernel: SM: 01000003 sm_stop: SG still joined
Mar 23 14:41:35 node05 kernel: SM: 03000006 sm_stop: SG still joined



NODE04:

Mar 23 14:38:02 node04 ccsd[7479]: Update of cluster.conf complete (version 82 -> 83). 
Mar 23 14:38:02 node04 clurgmgrd[11049]: <notice> Reconfiguring 
...
Mar 23 14:38:50 node04 kernel: CMAN: Started transition, generation 20
Mar 23 14:39:05 node04 kernel: CMAN: Started transition, generation 21
Mar 23 14:39:20 node04 kernel: CMAN: Started transition, generation 22
Mar 23 14:39:35 node04 kernel: CMAN: Started transition, generation 23
Mar 23 14:39:50 node04 kernel: CMAN: Started transition, generation 24
Mar 23 14:40:05 node04 kernel: CMAN: Started transition, generation 25
Mar 23 14:40:20 node04 kernel: CMAN: Started transition, generation 26
Mar 23 14:40:35 node04 kernel: CMAN: Started transition, generation 27
Mar 23 14:40:50 node04 kernel: CMAN: Started transition, generation 28
Mar 23 14:41:05 node04 kernel: CMAN: Started transition, generation 29
Mar 23 14:41:20 node04 kernel: CMAN: Started transition, generation 30
Mar 23 14:41:41 node04 kernel: CMAN: node node05.example.com has been removed from the cluster : No response to messages
Mar 23 14:44:20 node04 kernel: CMAN: node node06.example.com has been removed from the cluster : Inconsistent cluster view
Mar 23 14:44:20 node04 kernel: CMAN: Started transition, generation 31
Mar 23 14:44:21 node04 kernel: CMAN: Finished transition, generation 31
Mar 23 14:44:36 node04 kernel: CMAN: Finished transition, generation 32
Mar 23 14:44:51 node04 kernel: CMAN: Finished transition, generation 33
Mar 23 14:44:51 node04 fenced[7577]: fencing deferred to node07.example.com
Mar 23 14:44:51 node04 kernel: CMAN: Started transition, generation 35
Mar 23 14:44:51 node04 kernel: CMAN: Started transition, generation 36
Mar 23 14:45:14 node04 kernel: CMAN: Being told to leave the cluster by node 3
Mar 23 14:45:14 node04 kernel: CMAN: we are leaving the cluster. 
Mar 23 14:45:14 node04 kernel: WARNING: dlm_emergency_shutdown
Mar 23 14:45:14 node04 kernel: WARNING: dlm_emergency_shutdown
Mar 23 14:45:14 node04 kernel: SM: 00000002 sm_stop: SG still joined
Mar 23 14:45:14 node04 kernel: SM: 01000003 sm_stop: SG still joined
Mar 23 14:45:14 node04 kernel: SM: 03000006 sm_stop: SG still joined
Mar 23 14:45:14 node04 ccsd[7479]: Cluster manager shutdown.  Attemping to reconnect... 



NODE07:

Mar 23 14:38:01 node07 ccsd[7135]: Update of cluster.conf complete (version 82 -> 83). 
Mar 23 14:38:08 node07 clurgmgrd[10260]: <notice> Reconfiguring 
Mar 23 14:38:50 node07 kernel: CMAN: Started transition, generation 20
Mar 23 14:39:05 node07 kernel: CMAN: Started transition, generation 21
Mar 23 14:39:20 node07 kernel: CMAN: Started transition, generation 22
Mar 23 14:39:35 node07 kernel: CMAN: Started transition, generation 23
Mar 23 14:39:50 node07 kernel: CMAN: Started transition, generation 24
Mar 23 14:40:05 node07 kernel: CMAN: Started transition, generation 25
Mar 23 14:40:20 node07 kernel: CMAN: Started transition, generation 26
Mar 23 14:40:35 node07 kernel: CMAN: Started transition, generation 27
Mar 23 14:40:50 node07 kernel: CMAN: Started transition, generation 28
Mar 23 14:41:05 node07 kernel: CMAN: Started transition, generation 29
Mar 23 14:41:20 node07 kernel: CMAN: Started transition, generation 30
Mar 23 14:41:41 node07 kernel: CMAN: removing node node05.example.com from the cluster : No response to messages
Mar 23 14:44:20 node07 kernel: CMAN: removing node node06.example.com from the cluster : Inconsistent cluster view
Mar 23 14:44:20 node07 kernel: CMAN: Initiating transition, generation 31
Mar 23 14:44:21 node07 kernel: CMAN: Completed transition, generation 31
Mar 23 14:44:27 node07 kernel: CMAN: removing node node08.example.com from the cluster : No response to messages
Mar 23 14:44:27 node07 kernel: CMAN: quorum lost, blocking activity
Mar 23 14:44:27 node07 kernel: CMAN: Completed transition, generation 32
Mar 23 14:44:39 node07 kernel: CMAN: Completed transition, generation 33
Mar 23 14:44:51 node07 kernel: CMAN: quorum regained, resuming activity
Mar 23 14:44:51 node07 kernel: CMAN: Initiating transition, generation 35
Mar 23 14:44:51 node07 kernel: CMAN: Started transition, generation 35
Mar 23 14:44:51 node07 kernel: CMAN: Initiating transition, generation 36
Mar 23 14:44:51 node07 fenced[7204]: node05.example.com not a cluster member after 0 sec post_fail_delay
Mar 23 14:44:51 node07 fenced[7204]: node06.example.com not a cluster member after 0 sec post_fail_delay
Mar 23 14:44:51 node07 fenced[7204]: fencing node "node05.example.com"
Mar 23 14:44:51 node07 kernel: CMAN: removing node node08.example.com from the cluster : Shutdown
Mar 23 14:44:52 node07 kernel: CMAN: quorum lost, blocking activity
Mar 23 14:45:14 node07 kernel: CMAN: removing node node04.example.com from the cluster : Missed too many heartbeats
Mar 23 14:45:14 node07 kernel: CMAN: Initiating transition, generation 37
Mar 23 14:45:29 node07 kernel: CMAN: Initiating transition, generation 38
Mar 23 14:45:44 node07 kernel: CMAN: Initiating transition, generation 39
Mar 23 14:45:59 node07 kernel: CMAN: Initiating transition, generation 40
Mar 23 14:46:14 node07 kernel: CMAN: Initiating transition, generation 41
Mar 23 14:46:29 node07 kernel: CMAN: Initiating transition, generation 42
Mar 23 14:46:44 node07 kernel: CMAN: Initiating transition, generation 43
Mar 23 14:46:59 node07 kernel: CMAN: Initiating transition, generation 44
Mar 23 14:47:14 node07 kernel: CMAN: Initiating transition, generation 45
Mar 23 14:47:29 node07 kernel: CMAN: Initiating transition, generation 46
Mar 23 14:47:44 node07 kernel: CMAN: Initiating transition, generation 47
Mar 23 14:47:59 node07 kernel: CMAN: too many transition restarts - will die
Mar 23 14:47:59 node07 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
Mar 23 14:47:59 node07 kernel: WARNING: dlm_emergency_shutdown
Mar 23 14:47:59 node07 kernel: WARNING: dlm_emergency_shutdown
Mar 23 14:47:59 node07 kernel: SM: 00000002 sm_stop: SG still joined
Mar 23 14:47:59 node07 kernel: SM: 01000003 sm_stop: SG still joined
Mar 23 14:47:59 node07 kernel: SM: 03000006 sm_stop: SG still joined



NODE06:

Mar 23 14:38:03 node06 ccsd[7206]: Update of cluster.conf complete (version 82 -> 83). 
Mar 23 14:38:50 node06 kernel: CMAN: Started transition, generation 20
Mar 23 14:39:05 node06 kernel: CMAN: Started transition, generation 21
Mar 23 14:39:20 node06 kernel: CMAN: Started transition, generation 22
Mar 23 14:39:35 node06 kernel: CMAN: Started transition, generation 23
Mar 23 14:39:50 node06 kernel: CMAN: Started transition, generation 24
Mar 23 14:40:05 node06 kernel: CMAN: Started transition, generation 25
Mar 23 14:40:20 node06 kernel: CMAN: Started transition, generation 26
Mar 23 14:40:35 node06 kernel: CMAN: Started transition, generation 27
Mar 23 14:40:50 node06 kernel: CMAN: Started transition, generation 28
Mar 23 14:41:05 node06 kernel: CMAN: Started transition, generation 29
Mar 23 14:41:20 node06 kernel: CMAN: Started transition, generation 30
Mar 23 14:41:35 node06 kernel: CMAN: removing node node05.example.com from the cluster : Inconsistent cluster view
Mar 23 14:41:35 node06 kernel: CMAN: Initiating transition, generation 31
Mar 23 14:41:50 node06 kernel: CMAN: Initiating transition, generation 32
Mar 23 14:42:05 node06 kernel: CMAN: Initiating transition, generation 33
Mar 23 14:42:20 node06 kernel: CMAN: Initiating transition, generation 34
Mar 23 14:42:35 node06 kernel: CMAN: Initiating transition, generation 35
Mar 23 14:42:50 node06 kernel: CMAN: Initiating transition, generation 36
Mar 23 14:43:05 node06 kernel: CMAN: Initiating transition, generation 37
Mar 23 14:43:20 node06 kernel: CMAN: Initiating transition, generation 38
Mar 23 14:43:35 node06 kernel: CMAN: Initiating transition, generation 39
Mar 23 14:43:50 node06 kernel: CMAN: Initiating transition, generation 40
Mar 23 14:44:05 node06 kernel: CMAN: Initiating transition, generation 41
Mar 23 14:44:20 node06 kernel: CMAN: too many transition restarts - will die
Mar 23 14:44:20 node06 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
Mar 23 14:44:20 node06 kernel: WARNING: dlm_emergency_shutdown
Mar 23 14:44:20 node06 kernel: WARNING: dlm_emergency_shutdown
Mar 23 14:44:20 node06 kernel: SM: 00000002 sm_stop: SG still joined
Mar 23 14:44:20 node06 kernel: SM: 01000003 sm_stop: SG still joined
Mar 23 14:44:20 node06 kernel: SM: 03000006 sm_stop: SG still joined
Mar 23 14:44:20 node06 ccsd[7206]: Cluster manager shutdown.  Attemping to reconnect... 
Mar 23 14:44:48 node06 ccsd[7206]: Unable to connect to cluster infrastructure after 30 seconds. 


Version-Release number of selected component (if applicable):

2.6.9-78.0.1.ELsmp #1 SMP Tue Jul 22 18:01:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Not reproducible on demand. It only happens on clusters with 3 or more nodes, and normally during membership transitions, i.e. when a node joins or is removed from the cluster, but it does not happen on every transition. One point that seems relevant is that the bug appears to occur when the cluster has been running for a very long time.

Steps to Reproduce:

N/A
  
Actual results:

CMAN transitions do not converge to a consistent cluster view, and cause all nodes apart from the master to shut down CMAN, usually accompanied by dlm_emergency_shutdown.
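
To check other syslog files for the same pattern, a small scan like the one below may help. It is a rough sketch, not a supported tool: the function names are hypothetical, and the regular expressions assume the exact kernel message wording shown in the logs above. In those logs, each node that gave up logged 11 transition starts at 15-second intervals without a matching "Finished"/"Completed" line before printing "too many transition restarts"; that count is an observation from this report, not a documented CMAN constant.

```python
import re
from collections import defaultdict

# Matches kernel CMAN lines such as:
# "Mar 23 14:38:50 node05 kernel: CMAN: Initiating transition, generation 20"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<node>\S+)\s+kernel:\s+CMAN:\s+(?P<msg>.*)$"
)
START_RE = re.compile(r"^(Initiating|Started) transition, generation \d+")
END_RE = re.compile(r"^(Finished|Completed) transition, generation \d+")


def longest_unfinished_runs(lines):
    """Per node, the longest run of transition starts with no finish between."""
    current = defaultdict(int)
    longest = defaultdict(int)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        node, msg = match.group("node"), match.group("msg")
        if START_RE.match(msg):
            current[node] += 1
            longest[node] = max(longest[node], current[node])
        elif END_RE.match(msg):
            current[node] = 0
    return dict(longest)


def nodes_that_gave_up(lines):
    """Nodes that logged 'too many transition restarts', in log order."""
    nodes = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("msg").startswith("too many transition restarts"):
            nodes.append(match.group("node"))
    return nodes


# Lines copied from the node05 and node04 excerpts in this report.
SAMPLE = [
    "Mar 23 14:41:05 node05 kernel: CMAN: Initiating transition, generation 29",
    "Mar 23 14:41:20 node05 kernel: CMAN: Initiating transition, generation 30",
    "Mar 23 14:41:35 node05 kernel: CMAN: too many transition restarts - will die",
    "Mar 23 14:44:20 node04 kernel: CMAN: Started transition, generation 31",
    "Mar 23 14:44:21 node04 kernel: CMAN: Finished transition, generation 31",
]

if __name__ == "__main__":
    print(longest_unfinished_runs(SAMPLE))
    print(nodes_that_gave_up(SAMPLE))
```

A long run of starts on a node that never reports a finished transition is the symptom to look for; a healthy transition shows a start followed shortly by a finish for the same generation.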

Expected results:

CMAN transitions finish correctly and the cluster continues to operate stably.

Additional info:

When nodes are running GFS, the first call to lock_dlm:do_dlm_lock() right after the dlm_emergency_shutdown causes the machine to panic, because all lockspaces have been disabled during dlm_emergency_shutdown.

Comment 2 Eduardo Damato 2009-04-14 13:58:32 UTC
To debug the problem further, it is necessary to provide packages built with
cman-kernel debugging, enabling DEBUG_MEMB and DEBUG_COMMS in cnxman-private.h.
The following patch has already been provided.

# diff -Nurp cnxman-private.h.orig cnxman-private.h
--- cnxman-private.h.orig       2009-04-14 14:55:51.000000000 +0100
+++ cnxman-private.h    2009-04-14 14:56:02.000000000 +0100
@@ -412,8 +412,8 @@ extern inline char *print_addr(unsigned 
 /* Debug enabling macros. Sorry about the C++ comments but they're easier to
  * get rid of than C ones... */

-// #define DEBUG_MEMB
-// #define DEBUG_COMMS
+#define DEBUG_MEMB
+#define DEBUG_COMMS
 // #define DEBUG_BARRIER

 /* Debug macros */
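
For anyone rebuilding the package without the patch file at hand, the same change can be sketched as a small script. This is only an illustration under stated assumptions: enable_debug_macros is a hypothetical helper, not part of the cman-kernel sources or packaging, and it assumes the macros appear in the header exactly as "// #define NAME" at the start of a line, as in the diff above. DEBUG_BARRIER is left commented out, matching the patch.

```python
import re

# Macro names taken from the patch above; DEBUG_BARRIER is deliberately excluded.
MACROS_TO_ENABLE = ("DEBUG_MEMB", "DEBUG_COMMS")


def enable_debug_macros(header_text, macros=MACROS_TO_ENABLE):
    """Turn '// #define NAME' lines into '#define NAME' for the given names."""
    out = []
    for line in header_text.splitlines(keepends=True):
        for name in macros:
            pattern = r"^//\s*#define\s+" + re.escape(name) + r"\s*$"
            if re.match(pattern, line):
                # Preserve the original line ending (\n or \r\n).
                ending = line[len(line.rstrip("\r\n")):]
                line = "#define " + name + ending
                break
        out.append(line)
    return "".join(out)


# The three relevant lines as they appear before the patch is applied.
HEADER_SAMPLE = (
    "// #define DEBUG_MEMB\n"
    "// #define DEBUG_COMMS\n"
    "// #define DEBUG_BARRIER\n"
)

if __name__ == "__main__":
    print(enable_debug_macros(HEADER_SAMPLE), end="")
```

Running the function twice gives the same result as running it once, so it is safe to apply to a header that has already been modified.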

Comment 5 Christine Caulfield 2009-04-21 07:00:01 UTC
*** Bug 495967 has been marked as a duplicate of this bug. ***

Comment 47 Shane Bradley 2009-08-28 16:06:41 UTC
This is a public note for customers hitting this issue: it is resolved in Red Hat Enterprise Linux 4 Update 8.

--sbradley

Comment 49 errata-xmlrpc 2009-08-31 07:52:26 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1237.html