Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to the Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 457107

Summary: Killing node X because it has rejoined the cluster with existing state
Product: Red Hat Enterprise Linux 5
Reporter: Nate Straz <nstraz>
Component: cman
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Docs Contact:
Priority: urgent
Version: 5.3
CC: bstevens, ccaulfie, cluster-maint, edamato, jplans, mkarg, riek, sdake, tao, teigland, zxvdr.au
Target Milestone: rc
Keywords: Regression, TestBlocker
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 21:50:02 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  log from hayes-02 (flags: none)
  log from hayes-03 (flags: none)
  cman killed by node2 because we rejoined the cluster without a full restart (flags: none)

Description Nate Straz 2008-07-29 17:03:10 UTC
Description of problem:

While running a load on multiple GFS2 file systems, one of four nodes hangs hard
and is removed from the cluster.  After that happens, two more nodes seem to be
forced out by openais and the cluster cannot be restarted.

Version-Release number of selected component (if applicable):
openais-0.80.3-15.el5
kernel-2.6.18-98.el5
kmod-gfs2-1.98-1.1.el5.abhi.11

How reproducible:
Unknown, although I've hit it a few times now.

Steps to Reproduce:
1. Start a load on the cluster
2. cut power to one node
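The two steps above can be sketched roughly in shell. This is only an illustration: the node names, the I/O load command, and driving the fence agent by hand are all assumptions, not taken from the report.

```shell
# Hedged sketch of the repro steps. NODES, the dd-based load, and the
# fence_node target are placeholders for whatever this cluster uses.
NODES="tank-01 tank-02 tank-03 tank-04"

start_load() {
    # Step 1: start I/O load on the mounted GFS2 file systems on every node.
    for n in $NODES; do
        ssh "$n" 'dd if=/dev/zero of=/mnt/gfs2/load.$(hostname) bs=1M count=4096 &'
    done
}

cut_power() {
    # Step 2: hard power-cycle one node so it drops out with no cleanup.
    fence_node "$1"
}
```

With a healthy cman, `cut_power tank-02` should remove only tank-02 from the membership; in this bug the surviving nodes then kill each other with "rejoined the cluster with existing state".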
  
Actual results:

Logs from tank-01:

Jul 29 11:20:25 tank-01 openais[4989]: [TOTEM] entering GATHER state from 12. 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] entering GATHER state from 0. 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] Creating commit token because I
am the rep. 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] Saving state aru 194aa high seq
received 194aa 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] Storing new sequence id for ring cc 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] entering COMMIT state. 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] entering RECOVERY state. 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] position [0] member 10.15.89.91: 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] previous ring seq 200 rep
10.15.89.91 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] aru 194aa high delivered 194aa
received flag 1 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] position [1] member 10.15.89.93: 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] previous ring seq 200 rep
10.15.89.91 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] aru 194aa high delivered 194aa
received flag 1 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] position [2] member 10.15.89.94: 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] previous ring seq 200 rep
10.15.89.91 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] aru 194aa high delivered 194aa
received flag 1 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] Did not need to originate any
messages in recovery. 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] Sending initial ORF token 
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ] CLM CONFIGURATION CHANGE 
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ] New Configuration: 
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ]  r(0) ip(10.15.89.91)  
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ]  r(0) ip(10.15.89.93)  
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ]  r(0) ip(10.15.89.94)  
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ] Members Left: 
Jul 29 11:20:29 tank-01 kernel: dlm: closing connection to node 2
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ]  r(0) ip(10.15.89.92)  
Jul 29 11:20:29 tank-01 kernel: dlm: closing connection to node 3
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ] Members Joined: 
Jul 29 11:20:29 tank-01 kernel: dlm: closing connection to node 4
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ] CLM CONFIGURATION CHANGE 
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ] New Configuration: 
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ]  r(0) ip(10.15.89.91)  
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ]  r(0) ip(10.15.89.93)  
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ]  r(0) ip(10.15.89.94)  
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ] Members Left: 
Jul 29 11:20:29 tank-01 openais[4989]: [CLM  ] Members Joined: 
Jul 29 11:20:29 tank-01 openais[4989]: [SYNC ] This node is within the primary
component and will provide service. 
Jul 29 11:20:29 tank-01 openais[4989]: [TOTEM] entering OPERATIONAL state. 
Jul 29 11:20:29 tank-01 openais[4989]: [MAIN ] Killing node tank-03 because it
has rejoined the cluster with existing state 
Jul 29 11:20:29 tank-01 openais[4989]: [MAIN ] Killing node tank-04 because it
has rejoined the cluster with existing state 
Jul 29 11:20:29 tank-01 openais[4989]: [CMAN ] quorum lost, blocking activity 


Logs from tank-03:

Jul 29 11:20:25 tank-03 openais[4628]: [TOTEM] The token was lost in the
OPERATIONAL state. 
Jul 29 11:20:25 tank-03 openais[4628]: [TOTEM] Receive multicast socket recv
buffer size (288000 bytes). 
Jul 29 11:20:25 tank-03 openais[4628]: [TOTEM] Transmit multicast socket send
buffer size (262142 bytes). 
Jul 29 11:20:25 tank-03 openais[4628]: [TOTEM] entering GATHER state from 2. 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] entering GATHER state from 0. 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] Saving state aru 194aa high seq
received 194aa 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] Storing new sequence id for ring cc 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] entering COMMIT state. 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] entering RECOVERY state. 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] position [0] member 10.15.89.91: 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] previous ring seq 200 rep
10.15.89.91 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] aru 194aa high delivered 194aa
received flag 1 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] position [1] member 10.15.89.93: 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] previous ring seq 200 rep
10.15.89.91 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] aru 194aa high delivered 194aa
received flag 1 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] position [2] member 10.15.89.94: 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] previous ring seq 200 rep
10.15.89.91 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] aru 194aa high delivered 194aa
received flag 1 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] Did not need to originate any
messages in recovery. 
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ] CLM CONFIGURATION CHANGE 
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ] New Configuration: 
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ]  r(0) ip(10.15.89.91)  
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ]  r(0) ip(10.15.89.93)  
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ]  r(0) ip(10.15.89.94)  
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ] Members Left: 
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ]  r(0) ip(10.15.89.92)  
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ] Members Joined: 
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ] CLM CONFIGURATION CHANGE 
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ] New Configuration: 
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ]  r(0) ip(10.15.89.91)  
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ]  r(0) ip(10.15.89.93)  
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ]  r(0) ip(10.15.89.94)  
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ] Members Left: 
Jul 29 11:20:30 tank-03 openais[4628]: [CLM  ] Members Joined: 
Jul 29 11:20:30 tank-03 openais[4628]: [SYNC ] This node is within the primary
component and will provide service. 
Jul 29 11:20:30 tank-03 openais[4628]: [TOTEM] entering OPERATIONAL state. 
Jul 29 11:20:30 tank-03 openais[4628]: [MAIN ] Killing node tank-04 because it
has rejoined the cluster with existing state 
Jul 29 11:20:30 tank-03 openais[4628]: [CMAN ] cman killed by node 4 because we
rejoined the cluster without a full restart 


Logs from tank-04:

Jul 29 11:20:24 tank-04 openais[4618]: [TOTEM] entering GATHER state from 12. 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] entering GATHER state from 11. 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] Saving state aru 194aa high seq
received 194aa 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] Storing new sequence id for ring cc 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] entering COMMIT state. 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] entering RECOVERY state. 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] position [0] member 10.15.89.91: 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] previous ring seq 200 rep
10.15.89.91 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] aru 194aa high delivered 194aa
received flag 1 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] position [1] member 10.15.89.93: 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] previous ring seq 200 rep
10.15.89.91 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] aru 194aa high delivered 194aa
received flag 1 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] position [2] member 10.15.89.94: 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] previous ring seq 200 rep
10.15.89.91 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] aru 194aa high delivered 194aa
received flag 1 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] Did not need to originate any
messages in recovery. 
Jul 29 11:20:29 tank-04 kernel: dlm: closing connection to node 2
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ] CLM CONFIGURATION CHANGE 
Jul 29 11:20:29 tank-04 kernel: dlm: closing connection to node 3
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ] New Configuration: 
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ]  r(0) ip(10.15.89.91)  
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ]  r(0) ip(10.15.89.93)  
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ]  r(0) ip(10.15.89.94)  
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ] Members Left: 
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ]  r(0) ip(10.15.89.92)  
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ] Members Joined: 
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ] CLM CONFIGURATION CHANGE 
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ] New Configuration: 
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ]  r(0) ip(10.15.89.91)  
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ]  r(0) ip(10.15.89.93)  
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ]  r(0) ip(10.15.89.94)  
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ] Members Left: 
Jul 29 11:20:29 tank-04 openais[4618]: [CLM  ] Members Joined: 
Jul 29 11:20:29 tank-04 openais[4618]: [SYNC ] This node is within the primary
component and will provide service. 
Jul 29 11:20:29 tank-04 openais[4618]: [TOTEM] entering OPERATIONAL state. 
Jul 29 11:20:29 tank-04 openais[4618]: [MAIN ] Killing node tank-03 because it
has rejoined the cluster with existing state 
Jul 29 11:20:29 tank-04 openais[4618]: [CMAN ] quorum lost, blocking activity 
Jul 29 11:20:29 tank-04 openais[4618]: [MAIN ] Node tank-01 not joined to cman
because it has existing state 
Jul 29 11:20:29 tank-04 openais[4618]: [CMAN ] cman killed by node 1 because we
rejoined the cluster without a full restart 


Expected results:

When one node fails, only that node should be removed from the cluster.

Additional info:

Comment 1 Corey Marthaler 2008-07-29 19:02:44 UTC
I reproduced this today with a simple recovery attempt. I started cman on all 3
of my hayes machines (hayes-0[123]) and then rebooted hayes-01. This caused
hayes-03 to get kicked out of the cluster.

[root@hayes-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X    668                        hayes-01
   2   M    664   2008-07-29 13:37:20  hayes-02
   3   X    672                        hayes-03
[root@hayes-02 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010001 FAIL_ALL_STOPPED
[1 2 3]
dlm              1     clvmd    00020002 FAIL_ALL_STOPPED
[1 2 3]

[root@hayes-03 ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?
[root@hayes-03 ~]# cman_tool services
Unable to connect to groupd.  Is it running?


Jul 29 13:44:17 hayes-02 openais[3135]: [MAIN ] Killing node hayes-03 because it
has rejoined the cluster with existing state

Jul 29 13:45:29 hayes-03 openais[3132]: [CMAN ] cman killed by node 2 because we
rejoined the cluster without a full restart
Jul 29 13:45:29 hayes-03 groupd[3144]: cpg_mcast_joined error 2 handle
6b8b456700000000
Jul 29 13:45:29 hayes-03 gfs_controld[3164]: groupd_dispatch error -1 errno 11
Jul 29 13:45:29 hayes-03 fenced[3152]: groupd is down, exiting
Jul 29 13:45:29 hayes-03 gfs_controld[3164]: groupd connection died
Jul 29 13:45:29 hayes-03 gfs_controld[3164]: cluster is down, exiting
Jul 29 13:45:29 hayes-03 clogd[3257]: cpg_dispatch failed: SA_AIS_ERR_LIBRARY
Jul 29 13:45:29 hayes-03 last message repeated 2 times
Jul 29 13:45:29 hayes-03 clogd[3257]: Bad callback on cluster/10
Jul 29 13:45:29 hayes-03 dlm_controld[3158]: cluster is down, exiting
Jul 29 13:45:29 hayes-03 clogd[3257]: cpg_dispatch failed: SA_AIS_ERR_LIBRARY


I'll post the full logs from this.

I was running openais-0.80.3-17.el5 (Built: Mon 19 May 2008 02:38:39 PM CDT)



Comment 2 Corey Marthaler 2008-07-29 19:07:34 UTC
Created attachment 312922 [details]
log from hayes-02

Comment 3 Corey Marthaler 2008-07-29 19:08:50 UTC
Created attachment 312923 [details]
log from hayes-03

Comment 4 Corey Marthaler 2008-07-29 19:27:39 UTC
I'm able to easily reproduce this on my taft cluster as well. It's running
openais-0.80.3-15.el5, but both clusters have the 2.6.18-98.el5 kernel.

Comment 5 Nate Straz 2008-07-29 20:13:06 UTC
I was able to reproduce this on -92.1.10.el5 with the following steps:

1. Mount a gfs file system
2. start d_io across the cluster
3. tail -f /var/log/messages on all nodes
4. power off node 2


Comment 6 Dean Jansa 2008-07-29 20:25:12 UTC
I was able to reproduce this with 2.6.18-98.el5 + openais-0.80.3-15.el5 on ia64.

Following the steps in comment #5 left the cluster in this state:

(I rebooted link-16, leaving link-13 and link-14 running; link-14 was disallowed)

[root@link-13 ~]# cman_tool nodes
NOTE: There are 1 disallowed nodes,
      members list may seem inconsistent across the cluster
Node  Sts   Inc   Joined               Name
   1   M     92   2008-07-29 15:16:42  link-13
   2   d     96   2008-07-29 15:16:42  link-14
   4   X     96                        link-16

[root@link-14 ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?
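The disallowed state shows up as status `d` in the `cman_tool nodes` output above. A small helper (our own function name, not part of cman) can pull those nodes out mechanically:

```shell
# Print the names of nodes that `cman_tool nodes` reports as disallowed
# (status column 'd'). Reads the command's output on stdin; the header
# line is skipped automatically because its second field is "Sts",
# not a status code.
check_disallowed() {
    awk '$2 == "d" { print $NF }'
}

# Typical use on a cluster node:
#   cman_tool nodes | check_disallowed
```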



Comment 7 Nate Straz 2008-07-29 21:12:08 UTC
Changing the summary to reflect the key log message for this bug.

Comment 8 Kiersten (Kerri) Anderson 2008-07-30 15:14:16 UTC
Chrissie, can you look into this? It might be related to the latest cman changes
for 5.3.

Comment 9 Nate Straz 2008-07-30 16:29:33 UTC
I started attacking this problem from a clean RHEL 5.2 install going forward:

RHEL 5.2 - PASS
  kernel-2.6.18-92.el5 + openais-0.80.3-15.el5 + cman-2.0.84-2.el5
RHEL 5.2.Z - PASS
  kernel-2.6.18-92.1.10.el5 + openais-0.80.3-15.el5 + cman-2.0.84-2.el5
RHEL 5.2.Z w/ new cman - FAIL
  kernel-2.6.18-92.1.10.el5 + openais-0.80.3-15.el5 + cman-2.0.86-1.el5.test.plock.1
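The matrix above brackets the regression between cman 2.0.84 (pass) and the 2.0.86 test build (fail); comment #15 later reports it gone in 2.0.87. A hedged helper for flagging an installed base version, assuming GNU `sort -V` is available and taking the version boundaries from this report:

```shell
# Returns success if a cman base version falls in the range observed bad
# in this report: >= 2.0.86 (first observed failure) and < 2.0.87 (where
# the offending patch was reverted). The exact introduction point between
# 2.0.84 and 2.0.86 is not established here.
cman_affected() {
    ver="$1"
    # ver >= 2.0.86: the smaller of the two, under version sort, is 2.0.86
    printf '%s\n2.0.86\n' "$ver" | sort -V | head -n1 | grep -qx '2.0.86' &&
    # ver < 2.0.87: the smaller of the two is NOT 2.0.87
    ! { printf '%s\n2.0.87\n' "$ver" | sort -V | head -n1 | grep -qx '2.0.87'; }
}

# e.g.: cman_affected "$(rpm -q --qf '%{VERSION}' cman)" && echo "affected build"
```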

Comment 10 Nate Straz 2008-07-30 16:35:02 UTC
Since I've proven that the new cman package introduces the bug, I'm adding the
Regression keyword.  Also, since the new cman is required for testing newer
kernels, I'm adding the TestBlocker keyword.

Comment 11 David Teigland 2008-07-31 15:52:12 UTC
The fact that cluster mirrors and aoe are being used here, both of which
put a bunch of additional traffic on the network, makes me wonder if
excessive network load could be an issue.


Comment 12 David Teigland 2008-07-31 16:01:04 UTC
I also wonder about openais ipc problems, given some of the errors from
aisexec, and the fact that both plocks and cluster mirrors have been known
to create ipc problems in the past, and both are being used together here.


Comment 13 Corey Marthaler 2008-07-31 16:03:40 UTC
Neither cmirrors nor aoe are required to reproduce this issue. All it takes is
having a quorate cman cluster, then rebooting one of the nodes in that cluster.

Comment 14 Christine Caulfield 2008-07-31 16:12:12 UTC
Looking at the syslogs on the two machines makes me pretty convinced it's the
patch contributed in bz#443358 that's at fault.

Comment 15 Corey Marthaler 2008-07-31 20:14:48 UTC
This bug isn't in cman-2.0.87-1.el5.test.plock.3.

(Build Date: Thu 31 Jul 2008 01:37:30 PM CDT)

Comment 16 Christine Caulfield 2008-08-01 09:41:39 UTC
Yes, it's that patch. It's totally broken and based on at least two
misunderstandings.

If the DLM is starting regardless of the fencing state then that might need to
be checked separately, but it might be that the lockspace is starting up with no
fencing (i.e. before fencing has started at all), which I believe is correct.

commit 4bb7b3ff842ac40a3705d82e74f266b64f8e7609
Author: Christine Caulfield <ccaulfie>
Date:   Fri Aug 1 10:38:00 2008 +0100

    cman: Revert dirty patch


Comment 28 errata-xmlrpc 2009-01-20 21:50:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html

Comment 30 yangyue 2012-11-02 05:47:52 UTC
Created attachment 636951 [details]
cman killed by node2 because we rejoined the cluster without a full restart

cman killed by node2 because we rejoined the cluster without a full restart