Bug 222919 - cman killed by node 2 for reason 3
Summary: cman killed by node 2 for reason 3
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-01-16 21:14 UTC by Corey Marthaler
Modified: 2009-04-16 22:29 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-12-19 17:39:01 UTC
Target Upstream Version:
Embargoed:


Attachments
All /var/log/messages output after boot (62.54 KB, text/plain)
2007-01-23 18:47 UTC, Ryan McCabe

Description Corey Marthaler 2007-01-16 21:14:35 UTC
Description of problem:
I was running revolver on the x86_64 taft cluster, and after about 5 iterations
of shooting (with an elevated multicast load) taft-01 was forced to leave the
cluster. This was right after taft-04 had been shot and then brought back up.
I then attempted to 'service cman start' again by hand, and the node ended up
killed again.

Revolver output:
================================================================================
Senario iteration 1.2 started at Tue Jan 16 14:14:02 CST 2007
Sleeping 1 minute(s) to let the I/O get its lock count up...
Senario: DLM kill Quorum minus one

Those picked to face the revolver... taft-04
Feeling lucky taft-04? Well do ya? Go'head make my day...

Verify that taft-04 has been removed from cluster on remaining nodes
JT1=08:12:02 JT2=08:12:02
JT1=08:12:02 JT2=08:12:02
JT1=08:12:02 JT2=08:12:02
Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
Didn't receive heartbeat for 120 seconds
child taft-04_1 (19478) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-04_3 (19481) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-04_0 (19476) exited with status 127
still not all alive, sleeping another 10 seconds
Didn't receive heartbeat for 120 seconds
child taft-04_2 (19480) exited with status 127
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
All killed nodes are back up (able to be pinged), making sure they're qarshable...
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
All killed nodes are now qarshable

Verifying that recovery properly took place (on the nodes that stayed in the
cluster)
checking that all of the cluster nodes are now/still cman members...
cman_tool: Cannot open connection to cman, is it running ?
taft-01 is not a member on taft-01
cman_tool: Cannot open connection to cman, is it running ?
taft-02 is not a member on taft-01
cman_tool: Cannot open connection to cman, is it running ?
taft-03 is not a member on taft-01
taft-01 is not a member on taft-02
taft-01 is not a member on taft-03
[...]


Console messages on taft-01:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] The token was lost in the COMMIT
state.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] entering GATHER state from 4.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] Creating commit token because I
am the rep.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] entering COMMIT state.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] entering RECOVERY state.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] position [0] member 10.15.89.67:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] previous ring seq 76 rep 10.15.89.67
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] aru 5434 high delivered 5434
received flag 0
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] position [1] member 10.15.89.68:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] previous ring seq 88 rep 10.15.89.68
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] aru 31c high delivered 31c
received flag 0
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] position [2] member 10.15.89.69:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] previous ring seq 88 rep 10.15.89.68
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] aru 31c high delivered 31c
received flag 0
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] position [3] member 10.15.89.70:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] previous ring seq 88 rep 10.15.89.68
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] aru 31c high delivered 31c
received flag 0
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] Did not need to originate any
messages in recovery.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] Storing new sequence id for ring 60
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] Sending initial ORF token
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ] New Configuration:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ]  r(0) ip(10.15.89.67)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ]  r(0) ip(10.15.89.68)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ]  r(0) ip(10.15.89.69)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ] Members Left:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ] Members Joined:
Jan 16 08:17:48 taft-01 openais[2556]: [SYNC ] This node is within the primary
component and will provide service.
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ] New Configuration:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ]  r(0) ip(10.15.89.67)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ]  r(0) ip(10.15.89.68)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ]  r(0) ip(10.15.89.69)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ]  r(0) ip(10.15.89.70)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ] Members Left:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ] Members Joined:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM  ]  r(0) ip(10.15.89.70)
Jan 16 08:17:48 taft-01 openais[2556]: [SYNC ] This node is within the primary
component and will provide service.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] entering OPERATIONAL state.
Jan 16 08:17:48 taft-01 openais[2556]: [CMAN ] cman killed by node 2 for reason 3
Jan 16 08:17:49 taft-01 fenced[2574]: cluster is down, exiting
Jan 16 08:17:49 taft-01 dlm_controld[2580]: cluster is down, exiting
Jan 16 08:17:49 taft-01 kernel: dlm: closing connection to node 4
Jan 16 08:17:49 taft-01 kernel: dlm: closing connection to node 2
Jan 16 08:17:49 taft-01 gfs_controld[2586]: cpg_mcast_joined error 2 handle
6b8b456700000000 MSG_PLOCK
Jan 16 08:17:49 taft-01 gfs_controld[2586]: send plock error -1
Jan 16 08:17:49 taft-01 gfs_controld[2586]: cluster is down, exiting
Jan 16 08:17:49 taft-01 kernel: dlm: closing connection to node 3
Jan 16 08:17:49 taft-01 kernel: dlm: closing connection to node 1
Jan 16 08:18:10 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:18:11 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:18:11 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 26 seconds.
Jan 16 08:18:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure
after 30 seconds.
Jan 16 08:18:37 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:18:37 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:18:37 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 27 seconds.
Jan 16 08:18:44 taft-01 xinetd[2818]: START: qarsh pid=5323 from=10.15.80.47
Jan 16 08:18:44 taft-01 qarshd[5323]: Talking to peer 10.15.80.47:49245
Jan 16 08:18:44 taft-01 qarshd[5323]: Running cmdline: cman_tool nodes
Jan 16 08:18:44 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5323 duration=0(sec)
Jan 16 08:18:44 taft-01 xinetd[2818]: START: qarsh pid=5325 from=10.15.80.47
Jan 16 08:18:44 taft-01 qarshd[5325]: Talking to peer 10.15.80.47:49246
Jan 16 08:18:44 taft-01 qarshd[5325]: Running cmdline: cman_tool nodes
Jan 16 08:18:44 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5325 duration=0(sec)
Jan 16 08:18:44 taft-01 xinetd[2818]: START: qarsh pid=5327 from=10.15.80.47
Jan 16 08:18:44 taft-01 qarshd[5327]: Talking to peer 10.15.80.47:49247
Jan 16 08:18:44 taft-01 qarshd[5327]: Running cmdline: cman_tool nodes
Jan 16 08:18:44 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5327 duration=0(sec)
Jan 16 08:18:47 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure
after 60 seconds.
Jan 16 08:18:50 taft-01 xinetd[2818]: START: qarsh pid=5329 from=10.15.80.47
Jan 16 08:18:50 taft-01 qarshd[5329]: Talking to peer 10.15.80.47:49254
Jan 16 08:18:50 taft-01 qarshd[5329]: Running cmdline: cman_tool nodes
Jan 16 08:18:50 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5329 duration=0(sec)
Jan 16 08:18:50 taft-01 xinetd[2818]: START: qarsh pid=5331 from=10.15.80.47
Jan 16 08:18:50 taft-01 qarshd[5331]: Talking to peer 10.15.80.47:49255
Jan 16 08:18:50 taft-01 qarshd[5331]: Running cmdline: cman_tool nodes
Jan 16 08:18:50 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5331 duration=0(sec)
Jan 16 08:18:50 taft-01 xinetd[2818]: START: qarsh pid=5333 from=10.15.80.47
Jan 16 08:18:50 taft-01 qarshd[5333]: Talking to peer 10.15.80.47:49256
Jan 16 08:18:50 taft-01 qarshd[5333]: Running cmdline: cman_tool nodes
Jan 16 08:18:50 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5333 duration=0(sec)
Jan 16 08:18:55 taft-01 xinetd[2818]: START: qarsh pid=5335 from=10.15.80.47
Jan 16 08:18:55 taft-01 qarshd[5335]: Talking to peer 10.15.80.47:49263
Jan 16 08:18:55 taft-01 qarshd[5335]: Running cmdline: cman_tool nodes
Jan 16 08:18:55 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5335 duration=0(sec)
Jan 16 08:18:55 taft-01 xinetd[2818]: START: qarsh pid=5337 from=10.15.80.47
Jan 16 08:18:55 taft-01 qarshd[5337]: Talking to peer 10.15.80.47:49264
Jan 16 08:18:55 taft-01 qarshd[5337]: Running cmdline: cman_tool nodes
Jan 16 08:18:55 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5337 duration=0(sec)
Jan 16 08:18:55 taft-01 xinetd[2818]: START: qarsh pid=5339 from=10.15.80.47
Jan 16 08:18:55 taft-01 qarshd[5339]: Talking to peer 10.15.80.47:49265
Jan 16 08:18:55 taft-01 qarshd[5339]: Running cmdline: cman_tool nodes
Jan 16 08:18:55 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5339 duration=0(sec)
Jan 16 08:19:00 taft-01 xinetd[2818]: START: qarsh pid=5342 from=10.15.80.47
Jan 16 08:19:00 taft-01 qarshd[5342]: Talking to peer 10.15.80.47:49272
Jan 16 08:19:00 taft-01 qarshd[5342]: Running cmdline: cman_tool nodes
Jan 16 08:19:00 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5342 duration=0(sec)
Jan 16 08:19:00 taft-01 xinetd[2818]: START: qarsh pid=5344 from=10.15.80.47
Jan 16 08:19:00 taft-01 qarshd[5344]: Talking to peer 10.15.80.47:49273
Jan 16 08:19:00 taft-01 qarshd[5344]: Running cmdline: cman_tool nodes
Jan 16 08:19:00 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5344 duration=0(sec)
Jan 16 08:19:00 taft-01 xinetd[2818]: START: qarsh pid=5346 from=10.15.80.47
Jan 16 08:19:00 taft-01 qarshd[5346]: Talking to peer 10.15.80.47:49274
Jan 16 08:19:00 taft-01 qarshd[5346]: Running cmdline: cman_tool nodes
Jan 16 08:19:00 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5346 duration=0(sec)
Jan 16 08:19:04 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:19:04 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:19:04 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 30 seconds.
Jan 16 08:19:05 taft-01 xinetd[2818]: START: qarsh pid=5393 from=10.15.80.47
Jan 16 08:19:05 taft-01 qarshd[5393]: Talking to peer 10.15.80.47:49281
Jan 16 08:19:05 taft-01 qarshd[5393]: Running cmdline: cman_tool nodes
Jan 16 08:19:05 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5393 duration=0(sec)
Jan 16 08:19:06 taft-01 xinetd[2818]: START: qarsh pid=5395 from=10.15.80.47
Jan 16 08:19:06 taft-01 qarshd[5395]: Talking to peer 10.15.80.47:49282
Jan 16 08:19:06 taft-01 qarshd[5395]: Running cmdline: cman_tool nodes
Jan 16 08:19:06 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5395 duration=0(sec)
Jan 16 08:19:06 taft-01 xinetd[2818]: START: qarsh pid=5397 from=10.15.80.47
Jan 16 08:19:06 taft-01 qarshd[5397]: Talking to peer 10.15.80.47:49283
Jan 16 08:19:06 taft-01 qarshd[5397]: Running cmdline: cman_tool nodes
Jan 16 08:19:06 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5397 duration=0(sec)
Jan 16 08:19:11 taft-01 xinetd[2818]: START: qarsh pid=5399 from=10.15.80.47
Jan 16 08:19:11 taft-01 qarshd[5399]: Talking to peer 10.15.80.47:49290
Jan 16 08:19:11 taft-01 qarshd[5399]: Running cmdline: cman_tool nodes
Jan 16 08:19:11 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5399 duration=0(sec)
Jan 16 08:19:11 taft-01 xinetd[2818]: START: qarsh pid=5401 from=10.15.80.47
Jan 16 08:19:11 taft-01 qarshd[5401]: Talking to peer 10.15.80.47:49291
Jan 16 08:19:11 taft-01 qarshd[5401]: Running cmdline: cman_tool nodes
Jan 16 08:19:11 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5401 duration=0(sec)
Jan 16 08:19:11 taft-01 xinetd[2818]: START: qarsh pid=5403 from=10.15.80.47
Jan 16 08:19:11 taft-01 qarshd[5403]: Talking to peer 10.15.80.47:49292
Jan 16 08:19:11 taft-01 qarshd[5403]: Running cmdline: cman_tool nodes
Jan 16 08:19:11 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5403 duration=0(sec)
Jan 16 08:19:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure
after 90 seconds.
Jan 16 08:19:17 taft-01 qarshd[3372]: Sending child 3373 signal 2
Jan 16 08:19:17 taft-01 qarshd[3319]: Sending child 3320 signal 2
Jan 16 08:19:17 taft-01 qarshd[3335]: Sending child 3336 signal 2
Jan 16 08:19:17 taft-01 qarshd[3339]: Sending child 3340 signal 2
Jan 16 08:19:17 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=3372
duration=1026(sec)
Jan 16 08:19:17 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=3335
duration=1030(sec)
Jan 16 08:19:17 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=3319
duration=1031(sec)
Jan 16 08:19:17 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=3339
duration=1028(sec)
Jan 16 08:19:34 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:19:38 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:19:38 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:19:38 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 27 seconds.
Jan 16 08:19:47 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure
after 120 seconds.
Jan 16 08:19:47 taft-01 xinetd[2818]: EXIT: btimed signal=14 pid=2898
duration=1105(sec)
Jan 16 08:20:05 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:20:05 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:20:05 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 30 seconds.
Jan 16 08:20:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure
after 150 seconds.
Jan 16 08:20:35 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:20:35 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:20:35 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 30 seconds.
Jan 16 08:20:47 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure
after 180 seconds.
Jan 16 08:21:05 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:21:05 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:21:05 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 31 seconds.
Jan 16 08:21:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure
after 210 seconds.
Jan 16 08:21:36 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:21:36 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:21:36 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 29 seconds.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] AIS Executive Service RELEASE
'subrev 1324 version 0.80.2'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Copyright (C) 2002-2006
MontaVista Software, Inc and contributors.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] AIS Executive Service: started
and ready to provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Using default multicast address
of 239.192.215.172
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_cpg loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais cluster closed process group service v1.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_cfg loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais configuration service'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_msg loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais message service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_lck loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais distributed locking service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_evt loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais event service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_ckpt
loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais checkpoint service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_amf loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais availability management framework B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_clm loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais cluster membership service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_evs loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais extended virtual synchrony service'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_cman
loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler
'openais CMAN membership service 2.01'
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Token Timeout (10000 ms)
retransmit timeout (495 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] token hold (386 ms) retransmits
before loss (20 retrans)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] join (60 ms) send_join (0 ms)
consensus (4800 ms) merge (200 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] downcheck (1000 ms) fail to recv
const (50 msgs)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] seqno unchanged const (30
rotations) Maximum network MTU 1500
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] window size per rotation (50
messages) maximum messages per rotation (17 messages)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] send threads (0 threads)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] RRP token expired timeout (495 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] RRP token problem counter (2000 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] RRP threshold (10 problem count)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] RRP mode set to none.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] heartbeat_failures_allowed (0)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] max_network_delay (50 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] HeartBeat is Disabled. To enable
set heartbeat_failures_allowed > 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Receive multicast socket recv
buffer size (262142 bytes).
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Transmit multicast socket send
buffer size (262142 bytes).
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] The network interface
[10.15.89.67] is now up.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Created or loaded sequence id
0.10.15.89.67 for this ring.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering GATHER state from 15.
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais extended virtual synchrony service'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais cluster membership service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais availability management framework B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais checkpoint service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais event service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais distributed locking service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais message service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais configuration service'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais cluster closed process group service v1.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler
'openais CMAN membership service 2.01'
Jan 16 08:21:41 taft-01 openais[5688]: [CMAN ] CMAN 2.0.57 (built Dec 21 2006
12:28:50) started
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] Not using a virtual synchrony filter.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Creating commit token because I
am the rep.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Saving state aru 0 high seq
received 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering COMMIT state.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering RECOVERY state.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [0] member 10.15.89.67:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 0 rep 10.15.89.67
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 0 high delivered 0 received
flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Did not need to originate any
messages in recovery.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Storing new sequence id for ring 4
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Sending initial ORF token
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] New Configuration:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] Members Left:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] Members Joined:
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] This node is within the primary
component and will provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] New Configuration:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.67)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] Members Left:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] Members Joined:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.67)
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] This node is within the primary
component and will provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering OPERATIONAL state.
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] got nodejoin message 10.15.89.67
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering GATHER state from 11.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Creating commit token because I
am the rep.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Saving state aru 9 high seq
received 9
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering COMMIT state.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering RECOVERY state.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [0] member 10.15.89.67:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 4 rep 10.15.89.67
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 9 high delivered 9 received
flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [1] member 10.15.89.68:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 100 rep 10.15.89.68
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 2b34 high delivered 2b34
received flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [2] member 10.15.89.69:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 100 rep 10.15.89.68
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 2b34 high delivered 2b34
received flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [3] member 10.15.89.70:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 100 rep 10.15.89.68
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 2b34 high delivered 2b34
received flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Did not need to originate any
messages in recovery.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Storing new sequence id for ring 68
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Sending initial ORF token
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] New Configuration:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.67)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] Members Left:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] Members Joined:
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] This node is within the primary
component and will provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] New Configuration:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.67)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.68)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.69)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.70)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] Members Left:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] Members Joined:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.68)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.69)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ]  r(0) ip(10.15.89.70)
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] This node is within the primary
component and will provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering OPERATIONAL state.
Jan 16 08:21:41 taft-01 openais[5688]: [CMAN ] quorum regained, resuming activity
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] got nodejoin message 10.15.89.67
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] got nodejoin message 10.15.89.68
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] got nodejoin message 10.15.89.69
Jan 16 08:21:41 taft-01 openais[5688]: [CLM  ] got nodejoin message 10.15.89.70
Jan 16 08:21:41 taft-01 openais[5688]: [CPG  ] got joinlist message from node 4
Jan 16 08:21:41 taft-01 openais[5688]: [CPG  ] got joinlist message from node 2
Jan 16 08:21:41 taft-01 openais[5688]: [CPG  ] got joinlist message from node 3
Jan 16 08:21:42 taft-01 ccsd[2549]: Initial status:: Quorate
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object taft3 in
/sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object taft2 in
/sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object taft1 in
/sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object taft0 in
/sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object clvmd in
/sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object
TAFT_CLUSTER:taft3 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object
TAFT_CLUSTER:taft2 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object
TAFT_CLUSTER:taft1 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object
TAFT_CLUSTER:taft0 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: local node must be reset to clear 9
uncontrolled instances of gfs and/or dlm
Jan 16 08:21:42 taft-01 openais[5688]: [CMAN ] cman killed by node 1 for reason 2
Jan 16 08:21:42 taft-01 fenced[5703]: cman_init error 0 111
Jan 16 08:21:42 taft-01 gfs_controld[5715]: cman_init error 111
Jan 16 08:21:47 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure
after 240 seconds.
Jan 16 08:21:52 taft-01 dlm_controld[5709]: group_init error 0 111
Jan 16 08:22:05 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:22:05 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:22:05 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 25 seconds.
Jan 16 08:22:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure
after 270 seconds.




[root@taft-01 ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?


[root@taft-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X    104                        taft-01
   2   M      4   2007-01-16 08:06:25  taft-02
   3   M     64   2007-01-16 08:06:25  taft-03
   4   M     88   2007-01-16 08:17:43  taft-04


[root@taft-03 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X    104                        taft-01
   2   M     64   2007-01-16 08:06:25  taft-02
   3   M      4   2007-01-16 08:05:50  taft-03
   4   M     88   2007-01-16 08:17:43  taft-04


[root@taft-04 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X    104                        taft-01
   2   M     88   2007-01-16 08:17:43  taft-02
   3   M     88   2007-01-16 08:17:43  taft-03
   4   M      4   2007-01-16 08:17:18  taft-04


[root@taft-01 ~]# group_tool -v
Unable to connect to groupd.  Is it running?


[root@taft-02 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 none
[2 3 4]
dlm              1     clvmd    00030002 none
[2 3 4]
dlm              1     taft0    00050002 none
[2 3]
dlm              1     taft1    00070002 none
[2 3]
dlm              1     taft2    00090002 none
[2 3]
dlm              1     taft3    000b0002 none
[2 3]
gfs              2     taft0    00040002 none
[2 3]
gfs              2     taft1    00060002 none
[2 3]
gfs              2     taft2    00080002 none
[2 3]
gfs              2     taft3    000a0002 none
[2 3]


[root@taft-03 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 none
[2 3 4]
dlm              1     clvmd    00030002 none
[2 3 4]
dlm              1     taft0    00050002 none
[2 3]
dlm              1     taft1    00070002 none
[2 3]
dlm              1     taft2    00090002 none
[2 3]
dlm              1     taft3    000b0002 none
[2 3]
gfs              2     taft0    00040002 none
[2 3]
gfs              2     taft1    00060002 none
[2 3]
gfs              2     taft2    00080002 none
[2 3]
gfs              2     taft3    000a0002 none
[2 3]


[root@taft-04 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 none
[2 3 4]
dlm              1     clvmd    00030002 none
[2 3 4]


Version-Release number of selected component (if applicable):
2.6.18-1.3002.el5
cman-2.0.57-1.el5

Comment 1 Christine Caulfield 2007-01-17 10:05:40 UTC
The first 'killed' message "by node 2 for reason 3" is very likely a result of
the openAIS rejoin - see the first syslog line you posted:

Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] The token was lost in the COMMIT
state.

So cman has shut down openais, and all the subsidiary components should tidy up
too (as best they can).
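
For anyone triaging a similar failure, the relevant openais lines can be pulled
out of the syslog in one pass. A small sketch (log prefixes and path taken from
the messages quoted in this report):

    # Pull the TOTEM and CMAN lines so the token loss, the membership change
    # and the kill message can be read in sequence; narrow the time window
    # as needed.
    grep -E 'openais\[[0-9]+\]: \[(TOTEM|CMAN) *\]' /var/log/messages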

The second 'killed' message is very different:
Jan 16 08:21:42 taft-01 openais[5688]: [CMAN ] cman killed by node 1 for reason 2

This is (I think) groupd detecting already-running subsystems, deciding it
can't continue with them in that state, and forcing a leave of the cluster:

Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object
TAFT_CLUSTER:taft0 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: local node must be reset to clear 9
uncontrolled instances of gfs and/or dlm

So, I reckon the initial cause of this is probably the elevated multicast load
causing the openais rejoin to occur.

It might be worth opening another bug to cover the fact that GFS doesn't shut
down correctly when cman forces an emergency shutdown (though how fixable that
is I'm not sure!)

It also bothers me that the node was not fenced after being ejected from the
cluster, but I'll pass on that for the moment because I'm not sure what (if
any) fencing system you're using with revolver.

Comment 2 David Teigland 2007-01-17 15:47:03 UTC
I'll basically repeat what Patrick explained, adding a couple of points
here and there.

1. openais shuts down taft-01, presumably due to elevated multicast load
   (whether that's a big issue or not depends, I'd guess, on how excessive
   the load was; Steven may be able to comment on that.)

2. the other daemons shut down, abandoning local instances of gfs/dlm.
   There's no way to umount/leave these instances because the cluster
   has already gone away.

3. if there are enough other nodes remaining for quorum, they should be
   fencing taft-01 as a result of step 1.  The group_tool output from
   taft-04 appears to show that taft-01 has been fenced.  If it wasn't,
   then there may be a problem in fenced; we'd want to see the fenced
   logs from the other nodes.  What kind of fencing is configured?

4. 'service cman start' was run again.  openais joins the cluster, but
   groupd recognizes the abandoned instances of gfs/dlm from above.  From
   this it knows what happened in steps 1 and 2, and knows that it needs to
   be reset/fenced.  groupd kills cman on the local node (which we see above).
   At this stage groupd could do one of two different things; it doesn't
   really matter which:
   
   i) It could just sit and wait to be fenced which it knows someone will
   eventually do.
   
   ii) It could attempt to fence _itself_.  There's one situation where this
   can be beneficial.  That's when power fencing is being used and there are
   not enough other nodes to make quorum.  The other nodes won't fence/reboot
   it because they don't have quorum.  Fencing itself means it reboots itself.
   When it comes back up, it'll properly rejoin the cluster and may result
   in the cluster having quorum again.  (groupd is using the fencing system
   in this case for its side effects of rebooting the node, not in the normal
   sense of doing a real fencing operation).

   groupd is currently doing (ii) -- it forks/execs "fence_node" with its
   own name.

So, it appears that multiple things may have gone wrong here.  First, openais
was shut down.  Second, the initial fencing operation didn't reboot the node.
Third, when the cluster was restarted on this defunct node, it should have
fenced itself, but that didn't reboot the node either.
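
To make the groupd behaviour in step 4 concrete, here is a hedged shell sketch
of roughly what it checks and does. This is only an illustration, not groupd's
actual implementation (which is C code inside groupd); the sysfs paths come
from the log lines above, and uname -n is used as an approximation of the
cluster node name that fence_node expects:

    # Count dlm lockspaces / gfs mounts left behind by a previous incarnation
    # of the cluster stack; any such object means the node can't safely rejoin.
    leftover=0
    for obj in /sys/kernel/dlm/* /sys/fs/gfs/*; do
        [ -e "$obj" ] && leftover=$((leftover + 1))
    done

    if [ "$leftover" -gt 0 ]; then
        echo "found $leftover uncontrolled dlm/gfs instances; node must be reset"
        # Case (ii) above: use the fencing system to reboot *this* node so it
        # comes back clean (groupd forks/execs fence_node with its own name).
        fence_node "$(uname -n)"
    fi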


Comment 3 Ryan McCabe 2007-01-23 18:47:28 UTC
Created attachment 146332 [details]
All /var/log/messages output after boot

Comment 4 Ryan McCabe 2007-01-23 18:48:16 UTC
I'm seeing the same thing on one of our clusters here. When fencing is
configured properly, the node that is being killed by cman is fenced endlessly.
When I disable fencing, I see (basically) the same output as above. After
openais and groupd die, ccsd and clurgmgrd are left running. One of the
rgmanager threads is blocked trying to connect to DLM. If I wake it up by
stracing it, the thread exits, and almost always, the kernel oopses.

All cluster nodes are running the x86_64 20070112.3 tree with the
2.6.18-1.3002.el5xen kernel.

Here's the kernel output

dlm: closing connection to node 5
dlm: closing connection to node 4
dlm: closing connection to node 3
dlm: closing connection to node 2
dlm: closing connection to node 1
Unable to handle kernel NULL pointer dereference at 0000000000000002 RIP: 
 [<ffffffff8032ade1>] list_del+0x8/0x71
PGD 698b9067 PUD 69f99067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file:
/devices/pci0000:00/0000:00:03.0/0000:01:00.0/0000:02:08.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
CPU 1 
Modules linked in: bridge netloop netbk blktap blkbk nfs lockd fscache nfs_acl
lock_dlm gfs2 dlm configfs sunrpc ipv6 video sbs i2c_ec i2c_core button battery
asus_acpi ac parport_pc lp parport joydev sr_mod sg pcspkr bnx2 shpchp ide_cd
serial_core cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage
qla2xxx scsi_transport_fc mptsas mptscsih mptbase scsi_transport_sas sd_mod
scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 3970, comm: clurgmgrd Not tainted 2.6.18-1.3002.el5xen #1
RIP: e030:[<ffffffff8032ade1>]  [<ffffffff8032ade1>] list_del+0x8/0x71
RSP: e02b:ffff880069929dd8  EFLAGS: 00010246
RAX: 0000000000000002 RBX: ffff88006a7d8000 RCX: ffff88006a468000
RDX: ffffffffff578040 RSI: ffff880075b407a0 RDI: ffff88006a7d8000
RBP: ffff88006a7d8000 R08: ffff880069928000 R09: 0000000000000001
R10: ffff880075a4cac0 R11: ffff880075b407a0 R12: 00000000fffffe00
R13: ffff88006a7d8030 R14: 0000000000000020 R15: ffff880073f660d8
FS:  00002aaaaaabba00(0000) GS:ffffffff8058d080(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process clurgmgrd (pid: 3970, threadinfo ffff880069928000, task ffff880075b407a0)
Stack:  ffff880069929e08  ffffffff8841c853  0000000000000071  00000000f14c7dd0 
 ffff880069929eb0  ffff88006a7d84c4  6567616e616d6772  0000000000000072 
 0000000000000000  0000000000000000 
Call Trace:
 [<ffffffff8841c853>] :dlm:dlm_new_lockspace+0x6b7/0x76a
 [<ffffffff8841f8b3>] :dlm:device_write+0x3b7/0x5a7
 [<ffffffff80216292>] vfs_write+0xce/0x174
 [<ffffffff80216a8e>] sys_write+0x45/0x6e
 [<ffffffff8025c5dd>] tracesys+0xa7/0xb2


Code: 48 8b 10 48 39 fa 74 1b 48 89 fe 31 c0 48 c7 c7 a8 4b 47 80 
RIP  [<ffffffff8032ade1>] list_del+0x8/0x71
 RSP <ffff880069929dd8>
CR2: 0000000000000002
 <1>Unable to handle kernel paging request at 0000000100000001 RIP: 
 [<ffffffff8841ca7a>] :dlm:dlm_scand+0x17/0x45
PGD 699e9067 PUD 0 
Oops: 0000 [2] SMP 
last sysfs file:
/devices/pci0000:00/0000:00:03.0/0000:01:00.0/0000:02:08.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
CPU 0 
Modules linked in: bridge netloop netbk blktap blkbk nfs lockd fscache nfs_acl
lock_dlm gfs2 dlm configfs sunrpc ipv6 video sbs i2c_ec i2c_core button battery
asus_acpi ac parport_pc lp parport joydev sr_mod sg pcspkr bnx2 shpchp ide_cd
serial_core cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage
qla2xxx scsi_transport_fc mptsas mptscsih mptbase scsi_transport_sas sd_mod
scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 3972, comm: dlm_scand Not tainted 2.6.18-1.3002.el5xen #1
RIP: e030:[<ffffffff8841ca7a>]  [<ffffffff8841ca7a>] :dlm:dlm_scand+0x17/0x45
RSP: e02b:ffff88006985fee0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000100000001 RCX: ffff88006b1a6080
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88006a7d8000
RBP: ffffffff8841ca63 R08: ffff88006985e000 R09: 0000000000000000
R10: ffffffff804b9ce0 R11: ffff88006b1a6080 R12: ffff880069929c98
R13: 0000000000000000 R14: ffff880069929c88 R15: ffffffff8029442b
FS:  00002aaaaaaba2a0(0000) GS:ffffffff8058d000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process dlm_scand (pid: 3972, threadinfo ffff88006985e000, task ffff88006b1a6080)
Stack:  0000000000000000  ffffffff80232807  0000000000000000  0000000000000000 
 ffffffffffffffff  ffffffffffffffff  ffffffffffffffff  7fffffffffffffff 
 ffffffffffffffff  ffff880001004460 
Call Trace:
 [<ffffffff80232807>] kthread+0xfe/0x132
 [<ffffffff8025ce10>] child_rip+0xa/0x12
 [<ffffffff8029442b>] keventd_create_kthread+0x0/0x61
 [<ffffffff80232709>] kthread+0x0/0x132
 [<ffffffff8025ce06>] child_rip+0x0/0x12


Code: 48 8b 03 0f 18 08 48 81 fb 10 3c 43 88 75 e6 69 3d a9 e3 00 
RIP  [<ffffffff8841ca7a>] :dlm:dlm_scand+0x17/0x45
 RSP <ffff88006985fee0>
CR2: 0000000100000001


I'll create an attachment for everything that is logged to /var/log/messages.

I can reproduce this at will, and anyone who wants to take a look into it can
use the machine.

Comment 5 Christine Caulfield 2007-01-24 16:01:02 UTC
"When fencing is configured properly, the node that is being killed by cman is
fenced endlessly."

That sounds wrong.

Comment 6 Ryan McCabe 2007-01-24 16:03:49 UTC
I just checked the logs on the node that did the fencing, and it looks like
you're right. The fence operation succeeded but reported failure. Looks like a
bug in the new fence APC agent.

Comment 7 Ryan McCabe 2007-01-24 16:17:28 UTC
Argh. Sorry, I spoke too quickly. I stopped the fencing by just removing exec
perms on the fence agent script, and the failed fence operations I was seeing
were a result of that. Looking back farther into the logs, the node really was
being fenced as soon as it joined:


Jan 23 13:16:03 marge fenced[2947]: homer.lab.boston.redhat.com not a cluster
member after 2 sec post_fail_delay
Jan 23 13:16:03 marge fenced[2947]: fencing node "homer.lab.boston.redhat.com"
Jan 23 13:16:06 marge openais[2931]: [MAIN ] Killing node
homer.lab.boston.redhat.com because it has rejoined the cluster without
cman_tool join
Jan 23 13:16:36 marge fenced[2947]: fence "homer.lab.boston.redhat.com" success

ad nauseam


Comment 8 Christine Caulfield 2007-01-24 16:36:25 UTC
That looks pretty much like a consequence of the first error. The cluster is
unstable because a node has rejoined without going through cman. I suspect the
node that gets repeatedly fenced is the victim of "CMAN: Joined a cluster with
disallowed nodes. must die".

The way around this is to increase the token timeout in cluster.conf, e.g.:
<totem token="15000"/>
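
A hedged sketch of where that element lives and how the change might be rolled
out; the placement (directly under <cluster>) and the config_version bump
follow the usual cluster.conf conventions rather than anything stated in this
bug:

    # Verify the element is present on a node; it sits directly under
    # <cluster>, alongside <clusternodes>, and must match on every node.
    grep -n '<totem' /etc/cluster/cluster.conf

    # Remember to bump config_version when editing cluster.conf, then get the
    # new file onto every node.  The blunt way to pick up the change is to
    # restart cman on each node in turn (assumes downtime is acceptable).
    service cman restart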


Comment 9 Steven Dake 2007-01-24 19:48:10 UTC
Patrick,

if this error occurs, i.e. cman is killed by another node, shouldn't the
remaining cluster be stable?  In the case of the first error, I believe we
should fence the node whether or not there is quorum by running reboot -fin from
within cman instead of exiting.

Increasing the token timeout simply causes this case not to happen now; in
the end it eventually will.
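
A minimal sketch of the behaviour being proposed here; this is not what cman
currently does (it exits instead), just an illustration of the suggestion:

    # On being killed by another node, reboot immediately rather than exiting,
    # so the node always comes back clean regardless of quorum elsewhere.
    #   -f  force, bypass init;  -i  shut interfaces down first;  -n  skip sync
    reboot -fin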

Comment 10 Christine Caulfield 2007-01-25 09:21:29 UTC
It depends.

If the node is killed because the cluster is already unstable without it (which
I think is what's happening here) then removing a node that hasn't really joined
isn't going to make any difference.

The reason the 'rejoined' node isn't killed is that the cluster is not
quorate and so doesn't know whether it is part of the stable portion (or even
whether there is one).
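
For reference, the quorum arithmetic behind this, assuming the cman default of
one vote per node and no quorum disk (an assumption; the cluster.conf is not
shown in this bug):

    # quorum = floor(expected_votes / 2) + 1; with four one-vote nodes:
    echo $(( 4 / 2 + 1 ))   # -> 3
    # A partition holding 3 or 4 votes is quorate and is allowed to fence/kill;
    # a partition holding 1 or 2 votes is inquorate and must not act, which is
    # why an inquorate side cannot safely kill a node that "rejoined" it.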

Comment 11 RHEL Program Management 2007-04-25 22:17:30 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 12 Christine Caulfield 2007-05-08 15:45:45 UTC
I think for sites with an "elevated multicast load" we should probably recommend
upping the timeouts (I'm sure that happens with RHEL4 anyway).

The message is improved in the head of CVS & the RHEL5 branch. 

Comment 13 Christine Caulfield 2007-05-14 11:30:34 UTC
I'll mark this as NEEDINFO - let me know if it still fails with increased timeouts.

Comment 14 Corey Marthaler 2007-06-05 16:08:40 UTC
Can't currently reproduce, moving off the release criteria.

Comment 15 Corey Marthaler 2007-12-19 17:39:01 UTC
This bug hasn't been seen in almost a year, closing...

