Description of problem:
I was running revolver on the x86_64 taft cluster, and after about 5 iterations of shooting (with an elevated multicast load) taft-01 was forced to leave the cluster. This was right after taft-04 had been shot and then brought back up. I then attempted to 'service cman start' again by hand and the node ended up killed again.

Revolver output:
================================================================================
Senario iteration 1.2 started at Tue Jan 16 14:14:02 CST 2007
Sleeping 1 minute(s) to let the I/O get its lock count up...
Senario: DLM kill Quorum minus one
Those picked to face the revolver... taft-04
Feeling lucky taft-04? Well do ya? Go'head make my day...
Verify that taft-04 has been removed from cluster on remaining nodes
JT1=08:12:02 JT2=08:12:02
JT1=08:12:02 JT2=08:12:02
JT1=08:12:02 JT2=08:12:02
Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
Didn't receive heartbeat for 120 seconds
child taft-04_1 (19478) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-04_3 (19481) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-04_0 (19476) exited with status 127
still not all alive, sleeping another 10 seconds
Didn't receive heartbeat for 120 seconds
child taft-04_2 (19480) exited with status 127
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
All killed nodes are back up (able to be pinged), making sure they're qarshable...
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
All killed nodes are now qarshable
Verifying that recovery properly took place (on the nodes that stayed in the cluster)
checking that all of the cluster nodes are now/still cman members...
cman_tool: Cannot open connection to cman, is it running ?
taft-01 is not a member on taft-01
cman_tool: Cannot open connection to cman, is it running ?
taft-02 is not a member on taft-01
cman_tool: Cannot open connection to cman, is it running ?
taft-03 is not a member on taft-01
taft-01 is not a member on taft-02
taft-01 is not a member on taft-03
[...]

Console messages on taft-01:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] The token was lost in the COMMIT state.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] entering GATHER state from 4.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] Creating commit token because I am the rep.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] entering COMMIT state.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] entering RECOVERY state.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] position [0] member 10.15.89.67:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] previous ring seq 76 rep 10.15.89.67
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] aru 5434 high delivered 5434 received flag 0
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] position [1] member 10.15.89.68:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] previous ring seq 88 rep 10.15.89.68
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] aru 31c high delivered 31c received flag 0
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] position [2] member 10.15.89.69:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] previous ring seq 88 rep 10.15.89.68
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] aru 31c high delivered 31c received flag 0
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] position [3] member 10.15.89.70:
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] previous ring seq 88 rep 10.15.89.68
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] aru 31c high delivered 31c received flag 0
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] Did not need to originate any messages in recovery.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] Storing new sequence id for ring 60
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] Sending initial ORF token
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] CLM CONFIGURATION CHANGE
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] New Configuration:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] r(0) ip(10.15.89.67)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] r(0) ip(10.15.89.68)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] r(0) ip(10.15.89.69)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] Members Left:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] Members Joined:
Jan 16 08:17:48 taft-01 openais[2556]: [SYNC ] This node is within the primary component and will provide service.
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] CLM CONFIGURATION CHANGE
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] New Configuration:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] r(0) ip(10.15.89.67)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] r(0) ip(10.15.89.68)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] r(0) ip(10.15.89.69)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] r(0) ip(10.15.89.70)
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] Members Left:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] Members Joined:
Jan 16 08:17:48 taft-01 openais[2556]: [CLM ] r(0) ip(10.15.89.70)
Jan 16 08:17:48 taft-01 openais[2556]: [SYNC ] This node is within the primary component and will provide service.
Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] entering OPERATIONAL state.
Jan 16 08:17:48 taft-01 openais[2556]: [CMAN ] cman killed by node 2 for reason 3
Jan 16 08:17:49 taft-01 fenced[2574]: cluster is down, exiting
Jan 16 08:17:49 taft-01 dlm_controld[2580]: cluster is down, exiting
Jan 16 08:17:49 taft-01 kernel: dlm: closing connection to node 4
Jan 16 08:17:49 taft-01 kernel: dlm: closing connection to node 2
Jan 16 08:17:49 taft-01 gfs_controld[2586]: cpg_mcast_joined error 2 handle 6b8b456700000000 MSG_PLOCK
Jan 16 08:17:49 taft-01 gfs_controld[2586]: send plock error -1
Jan 16 08:17:49 taft-01 gfs_controld[2586]: cluster is down, exiting
Jan 16 08:17:49 taft-01 kernel: dlm: closing connection to node 3
Jan 16 08:17:49 taft-01 kernel: dlm: closing connection to node 1
Jan 16 08:18:10 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:18:11 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:18:11 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 26 seconds.
Jan 16 08:18:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure after 30 seconds.
Jan 16 08:18:37 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:18:37 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:18:37 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 27 seconds.
Jan 16 08:18:44 taft-01 xinetd[2818]: START: qarsh pid=5323 from=10.15.80.47
Jan 16 08:18:44 taft-01 qarshd[5323]: Talking to peer 10.15.80.47:49245
Jan 16 08:18:44 taft-01 qarshd[5323]: Running cmdline: cman_tool nodes
Jan 16 08:18:44 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5323 duration=0(sec)
Jan 16 08:18:44 taft-01 xinetd[2818]: START: qarsh pid=5325 from=10.15.80.47
Jan 16 08:18:44 taft-01 qarshd[5325]: Talking to peer 10.15.80.47:49246
Jan 16 08:18:44 taft-01 qarshd[5325]: Running cmdline: cman_tool nodes
Jan 16 08:18:44 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5325 duration=0(sec)
Jan 16 08:18:44 taft-01 xinetd[2818]: START: qarsh pid=5327 from=10.15.80.47
Jan 16 08:18:44 taft-01 qarshd[5327]: Talking to peer 10.15.80.47:49247
Jan 16 08:18:44 taft-01 qarshd[5327]: Running cmdline: cman_tool nodes
Jan 16 08:18:44 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5327 duration=0(sec)
Jan 16 08:18:47 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure after 60 seconds.
Jan 16 08:18:50 taft-01 xinetd[2818]: START: qarsh pid=5329 from=10.15.80.47
Jan 16 08:18:50 taft-01 qarshd[5329]: Talking to peer 10.15.80.47:49254
Jan 16 08:18:50 taft-01 qarshd[5329]: Running cmdline: cman_tool nodes
Jan 16 08:18:50 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5329 duration=0(sec)
Jan 16 08:18:50 taft-01 xinetd[2818]: START: qarsh pid=5331 from=10.15.80.47
Jan 16 08:18:50 taft-01 qarshd[5331]: Talking to peer 10.15.80.47:49255
Jan 16 08:18:50 taft-01 qarshd[5331]: Running cmdline: cman_tool nodes
Jan 16 08:18:50 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5331 duration=0(sec)
Jan 16 08:18:50 taft-01 xinetd[2818]: START: qarsh pid=5333 from=10.15.80.47
Jan 16 08:18:50 taft-01 qarshd[5333]: Talking to peer 10.15.80.47:49256
Jan 16 08:18:50 taft-01 qarshd[5333]: Running cmdline: cman_tool nodes
Jan 16 08:18:50 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5333 duration=0(sec)
Jan 16 08:18:55 taft-01 xinetd[2818]: START: qarsh pid=5335 from=10.15.80.47
Jan 16 08:18:55 taft-01 qarshd[5335]: Talking to peer 10.15.80.47:49263
Jan 16 08:18:55 taft-01 qarshd[5335]: Running cmdline: cman_tool nodes
Jan 16 08:18:55 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5335 duration=0(sec)
Jan 16 08:18:55 taft-01 xinetd[2818]: START: qarsh pid=5337 from=10.15.80.47
Jan 16 08:18:55 taft-01 qarshd[5337]: Talking to peer 10.15.80.47:49264
Jan 16 08:18:55 taft-01 qarshd[5337]: Running cmdline: cman_tool nodes
Jan 16 08:18:55 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5337 duration=0(sec)
Jan 16 08:18:55 taft-01 xinetd[2818]: START: qarsh pid=5339 from=10.15.80.47
Jan 16 08:18:55 taft-01 qarshd[5339]: Talking to peer 10.15.80.47:49265
Jan 16 08:18:55 taft-01 qarshd[5339]: Running cmdline: cman_tool nodes
Jan 16 08:18:55 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5339 duration=0(sec)
Jan 16 08:19:00 taft-01 xinetd[2818]: START: qarsh pid=5342 from=10.15.80.47
Jan 16 08:19:00 taft-01 qarshd[5342]: Talking to peer 10.15.80.47:49272
Jan 16 08:19:00 taft-01 qarshd[5342]: Running cmdline: cman_tool nodes
Jan 16 08:19:00 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5342 duration=0(sec)
Jan 16 08:19:00 taft-01 xinetd[2818]: START: qarsh pid=5344 from=10.15.80.47
Jan 16 08:19:00 taft-01 qarshd[5344]: Talking to peer 10.15.80.47:49273
Jan 16 08:19:00 taft-01 qarshd[5344]: Running cmdline: cman_tool nodes
Jan 16 08:19:00 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5344 duration=0(sec)
Jan 16 08:19:00 taft-01 xinetd[2818]: START: qarsh pid=5346 from=10.15.80.47
Jan 16 08:19:00 taft-01 qarshd[5346]: Talking to peer 10.15.80.47:49274
Jan 16 08:19:00 taft-01 qarshd[5346]: Running cmdline: cman_tool nodes
Jan 16 08:19:00 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5346 duration=0(sec)
Jan 16 08:19:04 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:19:04 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:19:04 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 30 seconds.
Jan 16 08:19:05 taft-01 xinetd[2818]: START: qarsh pid=5393 from=10.15.80.47
Jan 16 08:19:05 taft-01 qarshd[5393]: Talking to peer 10.15.80.47:49281
Jan 16 08:19:05 taft-01 qarshd[5393]: Running cmdline: cman_tool nodes
Jan 16 08:19:05 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5393 duration=0(sec)
Jan 16 08:19:06 taft-01 xinetd[2818]: START: qarsh pid=5395 from=10.15.80.47
Jan 16 08:19:06 taft-01 qarshd[5395]: Talking to peer 10.15.80.47:49282
Jan 16 08:19:06 taft-01 qarshd[5395]: Running cmdline: cman_tool nodes
Jan 16 08:19:06 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5395 duration=0(sec)
Jan 16 08:19:06 taft-01 xinetd[2818]: START: qarsh pid=5397 from=10.15.80.47
Jan 16 08:19:06 taft-01 qarshd[5397]: Talking to peer 10.15.80.47:49283
Jan 16 08:19:06 taft-01 qarshd[5397]: Running cmdline: cman_tool nodes
Jan 16 08:19:06 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5397 duration=0(sec)
Jan 16 08:19:11 taft-01 xinetd[2818]: START: qarsh pid=5399 from=10.15.80.47
Jan 16 08:19:11 taft-01 qarshd[5399]: Talking to peer 10.15.80.47:49290
Jan 16 08:19:11 taft-01 qarshd[5399]: Running cmdline: cman_tool nodes
Jan 16 08:19:11 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5399 duration=0(sec)
Jan 16 08:19:11 taft-01 xinetd[2818]: START: qarsh pid=5401 from=10.15.80.47
Jan 16 08:19:11 taft-01 qarshd[5401]: Talking to peer 10.15.80.47:49291
Jan 16 08:19:11 taft-01 qarshd[5401]: Running cmdline: cman_tool nodes
Jan 16 08:19:11 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5401 duration=0(sec)
Jan 16 08:19:11 taft-01 xinetd[2818]: START: qarsh pid=5403 from=10.15.80.47
Jan 16 08:19:11 taft-01 qarshd[5403]: Talking to peer 10.15.80.47:49292
Jan 16 08:19:11 taft-01 qarshd[5403]: Running cmdline: cman_tool nodes
Jan 16 08:19:11 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=5403 duration=0(sec)
Jan 16 08:19:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure after 90 seconds.
Jan 16 08:19:17 taft-01 qarshd[3372]: Sending child 3373 signal 2
Jan 16 08:19:17 taft-01 qarshd[3319]: Sending child 3320 signal 2
Jan 16 08:19:17 taft-01 qarshd[3335]: Sending child 3336 signal 2
Jan 16 08:19:17 taft-01 qarshd[3339]: Sending child 3340 signal 2
Jan 16 08:19:17 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=3372 duration=1026(sec)
Jan 16 08:19:17 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=3335 duration=1030(sec)
Jan 16 08:19:17 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=3319 duration=1031(sec)
Jan 16 08:19:17 taft-01 xinetd[2818]: EXIT: qarsh status=0 pid=3339 duration=1028(sec)
Jan 16 08:19:34 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:19:38 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:19:38 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:19:38 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 27 seconds.
Jan 16 08:19:47 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure after 120 seconds.
Jan 16 08:19:47 taft-01 xinetd[2818]: EXIT: btimed signal=14 pid=2898 duration=1105(sec)
Jan 16 08:20:05 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:20:05 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:20:05 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 30 seconds.
Jan 16 08:20:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure after 150 seconds.
Jan 16 08:20:35 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:20:35 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:20:35 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 30 seconds.
Jan 16 08:20:47 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure after 180 seconds.
Jan 16 08:21:05 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:21:05 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:21:05 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 31 seconds.
Jan 16 08:21:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure after 210 seconds.
Jan 16 08:21:36 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:21:36 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:21:36 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 29 seconds.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] AIS Executive Service RELEASE 'subrev 1324 version 0.80.2'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] AIS Executive Service: started and ready to provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Using default multicast address of 239.192.215.172
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_cpg loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais cluster closed process group service v1.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_cfg loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais configuration service'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_msg loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais message service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_lck loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais distributed locking service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_evt loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais event service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_ckpt loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais checkpoint service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_amf loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais availability management framework B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_clm loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais cluster membership service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_evs loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais extended virtual synchrony service'
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] openais component openais_cman loaded.
Jan 16 08:21:41 taft-01 openais[5688]: [MAIN ] Registering service handler 'openais CMAN membership service 2.01'
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] send threads (0 threads)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] RRP token expired timeout (495 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] RRP token problem counter (2000 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] RRP threshold (10 problem count)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] RRP mode set to none.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] heartbeat_failures_allowed (0)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] max_network_delay (50 ms)
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] The network interface [10.15.89.67] is now up.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Created or loaded sequence id 0.10.15.89.67 for this ring.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering GATHER state from 15.
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais extended virtual synchrony service'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais cluster membership service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais availability management framework B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais checkpoint service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais event service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais distributed locking service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais message service B.01.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais configuration service'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais cluster closed process group service v1.01'
Jan 16 08:21:41 taft-01 openais[5688]: [SERV ] Initialising service handler 'openais CMAN membership service 2.01'
Jan 16 08:21:41 taft-01 openais[5688]: [CMAN ] CMAN 2.0.57 (built Dec 21 2006 12:28:50) started
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] Not using a virtual synchrony filter.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Creating commit token because I am the rep.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Saving state aru 0 high seq received 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering COMMIT state.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering RECOVERY state.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [0] member 10.15.89.67:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 0 rep 10.15.89.67
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 0 high delivered 0 received flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Did not need to originate any messages in recovery.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Storing new sequence id for ring 4
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Sending initial ORF token
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] CLM CONFIGURATION CHANGE
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] New Configuration:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] Members Left:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] Members Joined:
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] This node is within the primary component and will provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] CLM CONFIGURATION CHANGE
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] New Configuration:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.67)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] Members Left:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] Members Joined:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.67)
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] This node is within the primary component and will provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering OPERATIONAL state.
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] got nodejoin message 10.15.89.67
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering GATHER state from 11.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Creating commit token because I am the rep.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Saving state aru 9 high seq received 9
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering COMMIT state.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering RECOVERY state.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [0] member 10.15.89.67:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 4 rep 10.15.89.67
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 9 high delivered 9 received flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [1] member 10.15.89.68:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 100 rep 10.15.89.68
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 2b34 high delivered 2b34 received flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [2] member 10.15.89.69:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 100 rep 10.15.89.68
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 2b34 high delivered 2b34 received flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] position [3] member 10.15.89.70:
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] previous ring seq 100 rep 10.15.89.68
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] aru 2b34 high delivered 2b34 received flag 0
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Did not need to originate any messages in recovery.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Storing new sequence id for ring 68
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] Sending initial ORF token
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] CLM CONFIGURATION CHANGE
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] New Configuration:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.67)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] Members Left:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] Members Joined:
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] This node is within the primary component and will provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] CLM CONFIGURATION CHANGE
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] New Configuration:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.67)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.68)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.69)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.70)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] Members Left:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] Members Joined:
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.68)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.69)
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] r(0) ip(10.15.89.70)
Jan 16 08:21:41 taft-01 openais[5688]: [SYNC ] This node is within the primary component and will provide service.
Jan 16 08:21:41 taft-01 openais[5688]: [TOTEM] entering OPERATIONAL state.
Jan 16 08:21:41 taft-01 openais[5688]: [CMAN ] quorum regained, resuming activity
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] got nodejoin message 10.15.89.67
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] got nodejoin message 10.15.89.68
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] got nodejoin message 10.15.89.69
Jan 16 08:21:41 taft-01 openais[5688]: [CLM ] got nodejoin message 10.15.89.70
Jan 16 08:21:41 taft-01 openais[5688]: [CPG ] got joinlist message from node 4
Jan 16 08:21:41 taft-01 openais[5688]: [CPG ] got joinlist message from node 2
Jan 16 08:21:41 taft-01 openais[5688]: [CPG ] got joinlist message from node 3
Jan 16 08:21:42 taft-01 ccsd[2549]: Initial status:: Quorate
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object taft3 in /sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object taft2 in /sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object taft1 in /sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object taft0 in /sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object clvmd in /sys/kernel/dlm
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object TAFT_CLUSTER:taft3 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object TAFT_CLUSTER:taft2 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object TAFT_CLUSTER:taft1 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object TAFT_CLUSTER:taft0 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: local node must be reset to clear 9 uncontrolled instances of gfs and/or dlm
Jan 16 08:21:42 taft-01 openais[5688]: [CMAN ] cman killed by node 1 for reason 2
Jan 16 08:21:42 taft-01 fenced[5703]: cman_init error 0 111
Jan 16 08:21:42 taft-01 gfs_controld[5715]: cman_init error 111
Jan 16 08:21:47 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure after 240 seconds.
Jan 16 08:21:52 taft-01 dlm_controld[5709]: group_init error 0 111
Jan 16 08:22:05 taft-01 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67
Jan 16 08:22:05 taft-01 dhclient: DHCPACK from 10.15.89.100
Jan 16 08:22:05 taft-01 dhclient: bound to 10.15.89.67 -- renewal in 25 seconds.
Jan 16 08:22:17 taft-01 ccsd[2549]: Unable to connect to cluster infrastructure after 270 seconds.

[root@taft-01 ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?

[root@taft-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X    104                        taft-01
   2   M      4   2007-01-16 08:06:25  taft-02
   3   M     64   2007-01-16 08:06:25  taft-03
   4   M     88   2007-01-16 08:17:43  taft-04

[root@taft-03 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X    104                        taft-01
   2   M     64   2007-01-16 08:06:25  taft-02
   3   M      4   2007-01-16 08:05:50  taft-03
   4   M     88   2007-01-16 08:17:43  taft-04

[root@taft-04 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X    104                        taft-01
   2   M     88   2007-01-16 08:17:43  taft-02
   3   M     88   2007-01-16 08:17:43  taft-03
   4   M      4   2007-01-16 08:17:18  taft-04

[root@taft-01 ~]# group_tool -v
Unable to connect to groupd. Is it running?

[root@taft-02 ~]# group_tool -v
type   level  name     id        state  node id local_done
fence  0      default  00010002  none   [2 3 4]
dlm    1      clvmd    00030002  none   [2 3 4]
dlm    1      taft0    00050002  none   [2 3]
dlm    1      taft1    00070002  none   [2 3]
dlm    1      taft2    00090002  none   [2 3]
dlm    1      taft3    000b0002  none   [2 3]
gfs    2      taft0    00040002  none   [2 3]
gfs    2      taft1    00060002  none   [2 3]
gfs    2      taft2    00080002  none   [2 3]
gfs    2      taft3    000a0002  none   [2 3]

[root@taft-03 ~]# group_tool -v
type   level  name     id        state  node id local_done
fence  0      default  00010002  none   [2 3 4]
dlm    1      clvmd    00030002  none   [2 3 4]
dlm    1      taft0    00050002  none   [2 3]
dlm    1      taft1    00070002  none   [2 3]
dlm    1      taft2    00090002  none   [2 3]
dlm    1      taft3    000b0002  none   [2 3]
gfs    2      taft0    00040002  none   [2 3]
gfs    2      taft1    00060002  none   [2 3]
gfs    2      taft2    00080002  none   [2 3]
gfs    2      taft3    000a0002  none   [2 3]

[root@taft-04 ~]# group_tool -v
type   level  name     id        state  node id local_done
fence  0      default  00010002  none   [2 3 4]
dlm    1      clvmd    00030002  none   [2 3 4]

Version-Release number of selected component (if applicable):
2.6.18-1.3002.el5
cman-2.0.57-1.el5
The first 'killed' message ("by node 2 for reason 3") is very likely a result of the openAIS rejoin - see the first syslog line you posted:

Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] The token was lost in the COMMIT state.

So cman has shut down openais, and all the subsidiary components should tidy up too (as best they can).

The second 'killed' message is very different:

Jan 16 08:21:42 taft-01 openais[5688]: [CMAN ] cman killed by node 1 for reason 2

This is (I think) groupd detecting already-running subsystems, deciding it can't continue with them in that state, and forcing a leave of the cluster:

Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object TAFT_CLUSTER:taft0 in /sys/fs/gfs
Jan 16 08:21:42 taft-01 groupd[5696]: local node must be reset to clear 9 uncontrolled instances of gfs and/or dlm

So I reckon the initial cause of this is probably the elevated multicast load causing the openais rejoin.

It might be worth opening another bug to cover the fact that GFS doesn't shut down correctly when cman forces an emergency shutdown (though how fixable that is I'm not sure!).

It also bothers me that the node was not fenced after being ejected from the cluster, but I'll pass on that for the moment because I'm not sure what (if any) fencing system you're using with revolver.
I'll basically repeat what Patrick explained, adding a couple of points here and there.

1. openais shuts down taft-01, presumably due to elevated multicast load (whether that's a big issue or not depends, I'd guess, on how excessive the load was; Steven may be able to comment on that.)

2. The other daemons shut down, abandoning local instances of gfs/dlm. There's no way to umount/leave these instances because the cluster has already gone away.

3. If there are enough other nodes remaining for quorum, they should be fencing taft-01 as a result of step 1. The group_tool output from taft-04 appears to show that taft-01 has been fenced. If it wasn't, then there may be a problem in fenced; we'd want to see the fenced logs from the other nodes. What kind of fencing is configured?

4. 'service cman start' was run again. openais joins the cluster, but groupd recognizes the abandoned instances of gfs/dlm from above. From this it knows what happened in steps 1 and 2, and knows that the node needs to be reset/fenced. groupd kills cman on the local node (which we see above). At this stage groupd could do one of two things; it doesn't really matter which:

i) It could just sit and wait to be fenced, which it knows someone will eventually do.

ii) It could attempt to fence _itself_. There's one situation where this can be beneficial: when power fencing is being used and there are not enough other nodes to make quorum. The other nodes won't fence/reboot it because they don't have quorum. Fencing itself means it reboots itself. When it comes back up, it'll properly rejoin the cluster, which may give the cluster quorum again. (groupd is using the fencing system in this case for its side effect of rebooting the node, not in the normal sense of doing a real fencing operation.)

groupd is currently doing (ii) -- it forks/execs "fence_node" with its own name (see the sketch below).

So it appears that multiple things may have gone wrong here. First, openais was shut down. Second, the initial fencing operation didn't reboot the node. Third, when the cluster was restarted on this defunct node, it should have fenced itself, which didn't reboot the node either.
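For anyone poking at a node in this state, a rough sketch under stated assumptions: the sysfs paths come straight from the groupd messages above, but treating every entry under them as "uncontrolled" is a simplification (groupd also knows which objects it is managing), and the last command assumes the cluster node name matches the hostname, which it need not.

#!/bin/sh
# List leftover dlm lockspaces and gfs filesystems that a restarted
# groupd would refuse to adopt (paths taken from the groupd log lines).
for dir in /sys/kernel/dlm /sys/fs/gfs; do
    [ -d "$dir" ] && ls "$dir" | sed "s|^|leftover in $dir: |"
done

# Roughly what groupd's option (ii) amounts to: fork/exec fence_node
# against this very node, using fencing purely for its reboot side
# effect rather than as a normal fencing operation.
fence_node "$(hostname)"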
Created attachment 146332: All /var/log/messages output after boot
I'm seeing the same thing on one of our clusters here. When fencing is configured properly, the node that is being killed by cman is fenced endlessly. When I disable fencing, I see (basically) the same output as above. After openais and groupd die, ccsd and clurgmgrd are left running. One of the rgmanager threads is blocked trying to connect to the DLM. If I wake it up by stracing it, the thread exits and, almost always, the kernel oopses. All cluster nodes are running the x86_64 20070112.3 tree with the 2.6.18-1.3002.el5xen kernel.

Here's the kernel output:

dlm: closing connection to node 5
dlm: closing connection to node 4
dlm: closing connection to node 3
dlm: closing connection to node 2
dlm: closing connection to node 1
Unable to handle kernel NULL pointer dereference at 0000000000000002
RIP: [<ffffffff8032ade1>] list_del+0x8/0x71
PGD 698b9067 PUD 69f99067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:01:00.0/0000:02:08.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
CPU 1
Modules linked in: bridge netloop netbk blktap blkbk nfs lockd fscache nfs_acl lock_dlm gfs2 dlm configfs sunrpc ipv6 video sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport joydev sr_mod sg pcspkr bnx2 shpchp ide_cd serial_core cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage qla2xxx scsi_transport_fc mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 3970, comm: clurgmgrd Not tainted 2.6.18-1.3002.el5xen #1
RIP: e030:[<ffffffff8032ade1>] [<ffffffff8032ade1>] list_del+0x8/0x71
RSP: e02b:ffff880069929dd8 EFLAGS: 00010246
RAX: 0000000000000002 RBX: ffff88006a7d8000 RCX: ffff88006a468000
RDX: ffffffffff578040 RSI: ffff880075b407a0 RDI: ffff88006a7d8000
RBP: ffff88006a7d8000 R08: ffff880069928000 R09: 0000000000000001
R10: ffff880075a4cac0 R11: ffff880075b407a0 R12: 00000000fffffe00
R13: ffff88006a7d8030 R14: 0000000000000020 R15: ffff880073f660d8
FS: 00002aaaaaabba00(0000) GS:ffffffff8058d080(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process clurgmgrd (pid: 3970, threadinfo ffff880069928000, task ffff880075b407a0)
Stack: ffff880069929e08 ffffffff8841c853 0000000000000071 00000000f14c7dd0
 ffff880069929eb0 ffff88006a7d84c4 6567616e616d6772 0000000000000072
 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8841c853>] :dlm:dlm_new_lockspace+0x6b7/0x76a
 [<ffffffff8841f8b3>] :dlm:device_write+0x3b7/0x5a7
 [<ffffffff80216292>] vfs_write+0xce/0x174
 [<ffffffff80216a8e>] sys_write+0x45/0x6e
 [<ffffffff8025c5dd>] tracesys+0xa7/0xb2
Code: 48 8b 10 48 39 fa 74 1b 48 89 fe 31 c0 48 c7 c7 a8 4b 47 80
RIP [<ffffffff8032ade1>] list_del+0x8/0x71
 RSP <ffff880069929dd8>
CR2: 0000000000000002
<1>Unable to handle kernel paging request at 0000000100000001
RIP: [<ffffffff8841ca7a>] :dlm:dlm_scand+0x17/0x45
PGD 699e9067 PUD 0
Oops: 0000 [2] SMP
last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:01:00.0/0000:02:08.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
CPU 0
Modules linked in: bridge netloop netbk blktap blkbk nfs lockd fscache nfs_acl lock_dlm gfs2 dlm configfs sunrpc ipv6 video sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport joydev sr_mod sg pcspkr bnx2 shpchp ide_cd serial_core cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage qla2xxx scsi_transport_fc mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 3972, comm: dlm_scand Not tainted 2.6.18-1.3002.el5xen #1
RIP: e030:[<ffffffff8841ca7a>] [<ffffffff8841ca7a>] :dlm:dlm_scand+0x17/0x45
RSP: e02b:ffff88006985fee0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000100000001 RCX: ffff88006b1a6080
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88006a7d8000
RBP: ffffffff8841ca63 R08: ffff88006985e000 R09: 0000000000000000
R10: ffffffff804b9ce0 R11: ffff88006b1a6080 R12: ffff880069929c98
R13: 0000000000000000 R14: ffff880069929c88 R15: ffffffff8029442b
FS: 00002aaaaaaba2a0(0000) GS:ffffffff8058d000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process dlm_scand (pid: 3972, threadinfo ffff88006985e000, task ffff88006b1a6080)
Stack: 0000000000000000 ffffffff80232807 0000000000000000 0000000000000000
 ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
 ffffffffffffffff ffff880001004460
Call Trace:
 [<ffffffff80232807>] kthread+0xfe/0x132
 [<ffffffff8025ce10>] child_rip+0xa/0x12
 [<ffffffff8029442b>] keventd_create_kthread+0x0/0x61
 [<ffffffff80232709>] kthread+0x0/0x132
 [<ffffffff8025ce06>] child_rip+0x0/0x12
Code: 48 8b 03 0f 18 08 48 81 fb 10 3c 43 88 75 e6 69 3d a9 e3 00
RIP [<ffffffff8841ca7a>] :dlm:dlm_scand+0x17/0x45
 RSP <ffff88006985fee0>
CR2: 0000000100000001

I'll create an attachment for everything that is logged to /var/log/messages. I can reproduce this at will, and anyone who wants to take a look into it can use the machine.
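For the record, "wake it up by stracing it" is nothing more exotic than attaching and detaching a tracer. A sketch, assuming the blocked clurgmgrd thread has the pid 3970 shown in the oops above (it will differ on every boot):

# Attach to the blocked clurgmgrd thread, then detach again (Ctrl-C).
# The interrupted connect to the dlm returns, the thread exits, and
# the list_del oops above usually follows.
strace -p 3970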
"When fencing is configured properly, the node that is being killed by cman is fenced endlessly." That sounds wrong.
I just checked the logs on the node that did the fencing, and it looks like you're right. The fence operation succeeded but reported failure. Looks like a bug in the new fence APC agent.
Argh. Sorry, I spoke too quickly. I stopped the fencing by just removing exec perms on the fence agent script, and the failed fence operations I was seeing were a result of that. Looking back farther into the logs, the node really was being fenced as soon as it joined:

Jan 23 13:16:03 marge fenced[2947]: homer.lab.boston.redhat.com not a cluster member after 2 sec post_fail_delay
Jan 23 13:16:03 marge fenced[2947]: fencing node "homer.lab.boston.redhat.com"
Jan 23 13:16:06 marge openais[2931]: [MAIN ] Killing node homer.lab.boston.redhat.com because it has rejoined the cluster without cman_tool join
Jan 23 13:16:36 marge fenced[2947]: fence "homer.lab.boston.redhat.com" success

ad nauseam
That looks pretty much like a consequence of the first error. The cluster is unstable because a node has rejoined without going through cman. I suspect the node that gets repeatedly fenced is the victim of "CMAN: Joined a cluster with disallowed nodes. must die". The way around this is to increase the token timeout in cluster.conf, e.g.:

<totem token="15000"/>
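For anyone trying this, a minimal sketch of where that element sits in /etc/cluster/cluster.conf; everything other than the totem line is illustrative boilerplate, and the 10000 ms default is the value openais logged at startup above. Remember to bump config_version and propagate the edited file to all nodes.

<?xml version="1.0"?>
<cluster name="TAFT_CLUSTER" config_version="2">
  <!-- default token timeout is 10000 ms (see the openais startup log);
       raise it so bursts of multicast load don't trigger a rejoin -->
  <totem token="15000"/>
  <clusternodes>
    <!-- existing clusternode entries unchanged -->
  </clusternodes>
  <fencedevices>
    <!-- existing fencedevice entries unchanged -->
  </fencedevices>
</cluster>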
Patrick, if this error occurs, i.e. cman is killed by another node, shouldn't the remaining cluster be stable? In the case of the first error, I believe we should fence the node whether or not there is quorum, by running reboot -fin from within cman instead of exiting. Increasing the token timeout simply makes this case less likely; in the end it will eventually happen anyway.
It depends. If the node is killed because the cluster is already unstable without it (which I think is what's happening here), then removing a node that hasn't really joined isn't going to make any difference. The reason the 'rejoined' node isn't killed is that the cluster is not quorate, and so it doesn't know whether it is part of the stable portion (or even whether there is one).
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I think for sites with an "elevated multicast load" we should probably recommend upping the timeouts (I'm sure that happens with RHEL4 anyway). The message is improved in the head of CVS & the RHEL5 branch.
I'll mark this as NEEDINFO - let me know if it still fails with increased timeouts.
Can't currently reproduce, moving off the release criteria.
This bug hasn't been seen in almost a year, closing...