Description of problem:
I hit this while testing a large token timeout. In a two-node cluster, when one node is rebooted and then successfully fenced, it attempts to rejoin the cluster. It rejoins successfully, but right after that it is removed from the cluster by the other node and fenced again. This ends in repeated fencing of the first node.

Version-Release number of selected component (if applicable):
cman-3.0.12.1-20.el6.x86_64
corosync-1.4.1-3.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. set up a two-node cluster
2. run the revolver test
3. node is fenced repeatedly

Actual results:
Node is removed from the cluster and repeatedly fenced again.

Expected results:
Node successfully rejoins the cluster after fencing and reboot.

Additional info:

cluster.conf:

<?xml version="1.0"?>
<cluster name="BZ_Cluster" config_version="1">
  <cman two_node="1" expected_votes="1">
  </cman>
  <fence_daemon post_join_delay="20" clean_start="0"/>
  <clusternodes>
    <clusternode name="bz1" votes="1" nodeid="1">
      <fence>
        <method name="IPMI">
          <device name="imm_ipmi" ipaddr="10.34.43.28"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="bz2" votes="1" nodeid="2">
      <fence>
        <method name="IPMI">
          <device name="imm_ipmi" ipaddr="10.34.43.30"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="imm_ipmi" agent="fence_ipmilan" auth="PASSWORD" login="xxx" passwd="xxx" action="reboot"/>
  </fencedevices>
  <totem token="300000"/>
  <logging debug="on"/>
  <rm>
    <resources>
      <ip address="10.34.43.88" monitor_link="1"/>
      <clusterfs name="gfsfs" device="/dev/vg1/lv1" fstype="gfs2" mountpoint="/tmp/test" force_unmount="1" options="rw"/>
      <nfsexport name="test_nfs_export"/>
      <nfsclient name="nfs_client1" allow_recover="1" options="rw" target="10.34.2.112"/>
    </resources>
    <service name="test" autostart="1" recovery="relocate">
      <ip ref="10.34.43.88"/>
      <clusterfs ref="gfsfs">
        <nfsexport ref="test_nfs_export">
          <nfsclient ref="nfs_client1"/>
        </nfsexport>
      </clusterfs>
    </service>
  </rm>
</cluster>
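Not part of the original report, but as a quick sanity check the running cluster's view of the two-node settings can be confirmed with cman_tool; the grep fields are my assumption about how the status output labels them:

cman_tool status | grep -Ei 'expected votes|flags'   # a two_node cluster should report Expected votes: 1 and a 2node flag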
Log from the fencing node:

Sep 20 07:16:17 qe-x3650-02 corosync[2783]: [QUORUM] Members[1]: 2
Sep 20 07:16:17 qe-x3650-02 corosync[2783]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:16:17 qe-x3650-02 rgmanager[3126]: State change: bz1 DOWN
Sep 20 07:16:17 qe-x3650-02 kernel: dlm: closing connection to node 1
Sep 20 07:16:17 qe-x3650-02 fenced[2841]: fencing node bz1
Sep 20 07:16:17 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Trying to acquire journal lock...
Sep 20 07:16:17 qe-x3650-02 corosync[2783]: [CPG ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:2 left:1)
Sep 20 07:16:17 qe-x3650-02 corosync[2783]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 07:16:19 qe-x3650-02 xinetd[1944]: START: qarsh pid=5862 from=::ffff:10.34.43.27
Sep 20 07:16:19 qe-x3650-02 qarshd[5862]: Talking to peer 10.34.43.27:44248
Sep 20 07:16:19 qe-x3650-02 qarshd[5862]: Running cmdline: cman_tool nodes -F name,type
Sep 20 07:16:19 qe-x3650-02 xinetd[1944]: EXIT: qarsh status=0 pid=5862 duration=0(sec)
Sep 20 07:16:21 qe-x3650-02 corosync[2783]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:16:21 qe-x3650-02 corosync[2783]: [CPG ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:1 left:0)
Sep 20 07:16:21 qe-x3650-02 corosync[2783]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 07:16:25 qe-x3650-02 fenced[2841]: fence bz1 success
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Looking at journal...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Acquiring the transaction lock...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Replaying journal...
Sep 20 07:16:25 qe-x3650-02 rgmanager[3126]: Taking over service service:test from down member bz1
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Replayed 12 of 167 blocks
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Found 3 revoke tags
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Journal replayed in 1s
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Done
Sep 20 07:16:25 qe-x3650-02 rgmanager[5994]: [clusterfs] mounting /dev/dm-3 on /tmp/test
Sep 20 07:16:25 qe-x3650-02 rgmanager[6023]: [clusterfs] mount -t gfs2 -o rw /dev/dm-3 /tmp/test
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "BZ_Cluster:test"
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: Joined cluster. Now mounting FS...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=0, already locked for use
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=0: Looking at journal...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=0: Done
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=1: Trying to acquire journal lock...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=1: Looking at journal...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=1: Done
Sep 20 07:16:26 qe-x3650-02 rgmanager[6087]: [nfsclient] Adding export: 10.34.2.112:/tmp/test (rw)
Sep 20 07:16:26 qe-x3650-02 rgmanager[6156]: [ip] Adding IPv4 address 10.34.43.88/24 to eth0
Sep 20 07:16:29 qe-x3650-02 rgmanager[3126]: Service service:test started
Sep 20 07:16:53 qe-x3650-02 kernel: rport-5:0-3: blocked FC remote port time out: removing rport
Sep 20 07:19:39 qe-x3650-02 corosync[2783]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:19:39 qe-x3650-02 corosync[2783]: [CPG ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:1 left:0)
Sep 20 07:19:39 qe-x3650-02 corosync[2783]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 07:19:43 qe-x3650-02 corosync[2783]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:19:43 qe-x3650-02 corosync[2783]: [CPG ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:1 left:0)
Sep 20 07:19:43 qe-x3650-02 corosync[2783]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 07:20:30 qe-x3650-02 corosync[2783]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:20:30 qe-x3650-02 corosync[2783]: [QUORUM] Members[2]: 1 2
Sep 20 07:20:30 qe-x3650-02 corosync[2783]: [QUORUM] Members[2]: 1 2
Sep 20 07:20:30 qe-x3650-02 corosync[2783]: [CPG ] chosen downlist: sender r(0) ip(10.34.43.29) ; members(old:1 left:0)
Sep 20 07:20:30 qe-x3650-02 corosync[2783]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 07:20:30 qe-x3650-02 fenced[2841]: telling cman to remove nodeid 1 from cluster
Sep 20 07:21:02 qe-x3650-02 xinetd[1944]: START: qarsh pid=9035 from=::ffff:10.34.43.27
Sep 20 07:21:02 qe-x3650-02 qarshd[9035]: Talking to peer 10.34.43.27:52603
Sep 20 07:21:02 qe-x3650-02 qarshd[9035]: Running cmdline: cman_tool nodes
Sep 20 07:21:02 qe-x3650-02 xinetd[1944]: EXIT: qarsh status=0 pid=9035 duration=0(sec)
Sep 20 07:21:02 qe-x3650-02 xinetd[1944]: START: qarsh pid=9037 from=::ffff:10.34.43.27
Sep 20 07:21:02 qe-x3650-02 qarshd[9037]: Talking to peer 10.34.43.27:52604
Sep 20 07:21:02 qe-x3650-02 qarshd[9037]: Running cmdline: runcon -t unconfined_t cman_tool services
Sep 20 07:21:02 qe-x3650-02 xinetd[1944]: EXIT: qarsh status=0 pid=9037 duration=0(sec)
Sep 20 07:25:30 qe-x3650-02 corosync[2783]: [TOTEM ] A processor failed, forming new configuration.
Sep 20 07:25:32 qe-x3650-02 corosync[2783]: [QUORUM] Members[1]: 2
Sep 20 07:25:32 qe-x3650-02 corosync[2783]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:25:32 qe-x3650-02 kernel: dlm: closing connection to node 1
Sep 20 07:25:32 qe-x3650-02 corosync[2783]: [CPG ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:2 left:1)
Sep 20 07:25:32 qe-x3650-02 corosync[2783]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 07:25:52 qe-x3650-02 fenced[2841]: fencing node bz1
Sep 20 07:26:03 qe-x3650-02 fenced[2841]: fence bz1 success
When using timeouts that large, the killed node rejoins before the other node has noticed that it failed. During development we spent quite a bit of time making that case work properly (it hadn't in the past). The behavior is going to be slightly unusual, though, and I'm guessing that revolver may not be able to cope with it. It looks to me like the behavior above is probably correct (fence_tool dump from both nodes could confirm; see the capture sketch after the list). What happens is:

1. members 1,2
2. revolver reboots 1
3. 2 does not notice 1 leaving (timeout is large)
4. 1 rejoins before 2 notices 1 has failed/left
5. fenced on 2 tells 1 to die
> Sep 20 07:20:30 qe-x3650-02 fenced[2841]: telling cman to remove nodeid 1 from cluster
6. cman/corosync on 1 exit due to step 5
7. fenced on 2 sees 1 fail
8. fenced on 2 fences 1
> Sep 20 07:25:52 qe-x3650-02 fenced[2841]: fencing node bz1
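A minimal way to capture that confirmation, assuming only that fence_tool is in the path (the output file name is purely illustrative), is to run the following on each node right after the event:

fence_tool dump > /tmp/fenced-debug-$(hostname -s).txt   # dump fenced's internal debug buffer to a per-node file
fence_tool ls                                            # also record the current fence domain membership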
I hadn't noticed the fencing at the start of the logs above, which I can't really explain, so I shouldn't say they look correct. We'll need to see fence_tool dump from the nodes to explain what's happening. I still suspect that revolver is getting out of sync with what the cluster is doing.
Created attachment 524048 [details]
fence_tool dump
This does look like a bug, and it's complicated.

The immediate problem we see is that when node1 restarts after a full reboot, node2's fenced kills it because it has restarted "with state". That is, when node2 sees node1 rejoin the cpg, node1 reports that it is already a member of the fence domain, which it shouldn't be if it's just joining. node2 takes the correct action in killing it here. This is very similar to what happens when two nodes are partitioned and then merge together again. But this isn't a traditional partition merge...

To say for certain what's happening I'd also need to see the fence_tool dump (or logfile) from the other node. But what I'm fairly sure is happening is the following. When node1 restarts after the reboot, it:

1. forms a cluster by itself (very annoying corosync feature)
2. joins the fence domain by itself (that is, becomes stateful)

Step 1 is not a problem for us as long as it is resolved (i.e. the joining node merges with the existing members) before step 2, where state is created. We usually insert some waits between 1 and 2 to try to ensure that 1 is resolved before 2.

The first method to work around 1 is waiting for quorum. We need quorum to really do much anyway, so we might as well wait for it prior to 2, since it also solves the problem of waiting for corosync to merge with existing nodes. In the two-node case, waiting for quorum does nothing, so there's a second wait, fence_tool join -m 45, which waits up to 45 seconds for all nodes to be members before doing the fence_tool join. We need to find out what's happening with this wait in these tests. This wait should in theory prevent a node from becoming stateful (fence_tool join) by itself. A rough sketch of these two waits follows below.
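For reference, a sketch of those two startup waits as I understand them; this is not copied from the actual init scripts, so the cman_tool flags and the 60-second quorum timeout are assumptions, while the -m 45 value comes from the description above:

cman_tool wait -q -t 60   # assumed syntax: block until the cluster is quorate, for up to 60 seconds
fence_tool join -m 45     # wait up to 45 seconds for all cluster nodes to be members before joining the fence domain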
The logs do show things working correctly once at the end:

1316524044 cluster node 1 added seq 2812
1316524044 fenced:default ring 1:2812 2 memb 1 2
1316524044 fenced:daemon ring 1:2812 2 memb 1 2
1316524046 fenced:daemon conf 2 1 0 memb 1 2 join 1 left
1316524046 receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
1316524047 fenced:default conf 2 1 0 memb 1 2 join 1 left

(the "run 1.1.1.0" from 1 has a final .0, which means not-stateful)

Here's where it didn't work:

(not sure what these are from; they don't seem to cause any problem, but they only appear when things don't work)
1316523470 fenced:daemon ring 2:2796 1 memb 2
1316523470 fenced:default ring 2:2796 1 memb 2
1316523474 fenced:daemon ring 2:2800 1 memb 2
1316523474 fenced:default ring 2:2800 1 memb 2

1316523519 cluster node 1 added seq 2804
1316523519 fenced:daemon conf 2 1 0 memb 1 2 join 1 left
1316523519 fenced:daemon ring 1:2804 2 memb 1 2
1316523519 fenced:default conf 2 1 0 memb 1 2 join 1 left
1316523519 fenced:default ring 1:2804 2 memb 1 2
1316523519 receive_protocol from 1 max 1.1.1.0 run 1.1.1.1

(result of receiving "run 1.1.1.1" from 1; the final .1 is STATEFUL...)
1316523519 daemon node 1 max 0.0.0.0 run 0.0.0.0
1316523519 daemon node 1 join 1316523519 left 1316523249 local quorum 1316520586
1316523519 daemon node 1 stateful merge
1316523519 daemon node 1 kill due to stateful merge
1316523519 telling cman to remove nodeid 1 from cluster
I see you have <logging debug="on"/>, so you should already have the fenced logs from both nodes in /var/log/cluster/.
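In case the debug output needs to be narrowed down later, cluster.conf also allows per-daemon logging settings; the following is a sketch from memory of the cluster.conf(5) logging syntax rather than something taken from the reporter's config, so the element and attribute names should be double-checked:

<logging debug="on">
  <!-- assumed element: keep full debug for fenced and write it to its own file -->
  <logging_daemon name="fenced" debug="on" logfile="/var/log/cluster/fenced.log"/>
</logging>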
Created attachment 524205 [details]
cluster logs, token timeout 300 s

Attaching logs from the new test.

1. killed corosync on node2
2. after the token timeout, node2 is fenced by node1
3. node2 tries to rejoin the cluster after the reboot, but node1 removes node2 from the cluster
4. after the token timeout, node2 is fenced by node1 again
From the node that's reset:

Sep 21 05:19:17 fenced fenced 3.0.12.1 started
...
Sep 21 05:20:04 fenced cpg_join fenced:default ...
Sep 21 05:20:04 fenced fenced:default conf 1 1 0 memb 2 join 2 left
Sep 21 05:20:04 fenced add_change cg 1 joined nodeid 2
Sep 21 05:20:04 fenced add_change cg 1 m 1 j 1 r 0 f 0
Sep 21 05:20:04 fenced add_victims_init nodeid 1
Sep 21 05:20:04 fenced check_ringid cluster 3308 cpg 0:0
Sep 21 05:20:04 fenced fenced:default ring 2:3308 1 memb 2
Sep 21 05:20:04 fenced check_ringid done cluster 3308 cpg 2:3308
Sep 21 05:20:04 fenced check_quorum done
Sep 21 05:20:04 fenced send_start 2:1 flags 1 started 0 m 1 j 1 r 0 f 0
Sep 21 05:20:04 fenced receive_start 2:1 len 152
Sep 21 05:20:04 fenced match_change 2:1 matches cg 1
Sep 21 05:20:04 fenced wait_messages cg 1 got all 1
Sep 21 05:20:04 fenced set_master from 0 to low node 2
Sep 21 05:20:04 fenced delay post_join_delay 20 quorate_from_last_update 1
Sep 21 05:20:05 fenced cluster node 1 added seq 3320

19:17 fenced is started
20:04 fence_tool join is run (47 sec later)
20:05 node 1 is added

So the fence_tool join -m 45 is timing out while waiting for all members, proceeds to join the fence domain (creating state), the other node is finally seen, and we get the expected results.

Does it really take corosync that long to see the other node? If so: is that due to corosync? due to the big timeout? due to a bad network?

> Sep 21 05:19:14 qe-x3650-02 corosync[1880]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started
> Sep 21 05:19:14 qe-x3650-02 corosync[1880]: [QUORUM] Members[1]: 2
> Sep 21 05:19:17 qe-x3650-02 fenced[1937]: fenced 3.0.12.1 started
> Sep 21 05:20:04 qe-x3650-02 corosync[1880]: [QUORUM] Members[2]: 1 2

corosync is taking 50 seconds to merge with the other node after starting.
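One way to see which totem values the reset node is actually running with (this is an assumption about corosync-objctl on corosync 1.x: that it dumps the object database, including any totem keys cman loaded from cluster.conf, when run without arguments):

corosync-objctl | grep -i totem   # expected to show totem.token=300000 and any other totem.* keys that were set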
> Sep 21 05:19:16 qe-x3650-01 corosync[6872]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 21 05:19:16 qe-x3650-01 corosync[6872]: [CPG ] chosen downlist: sender r(0) ip(10.34.43.29) ; members(old:1 left:0)
> Sep 21 05:19:16 qe-x3650-01 corosync[6872]: [MAIN ] Completed service synchronization, ready to provide service.
> Sep 21 05:19:19 qe-x3650-01 corosync[6872]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 21 05:19:19 qe-x3650-01 corosync[6872]: [CPG ] chosen downlist: sender r(0) ip(10.34.43.29) ; members(old:1 left:0)
> Sep 21 05:19:19 qe-x3650-01 corosync[6872]: [MAIN ] Completed service synchronization, ready to provide service.
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]: [QUORUM] Members[2]: 1 2
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]: [QUORUM] Members[2]: 1 2
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]: [CPG ] chosen downlist: sender r(0) ip(10.34.43.29) ; members(old:1 left:0)
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]: [MAIN ] Completed service synchronization, ready to provide service.

From the other node. Here it looks like corosync saw the other node at 19:16, but didn't finish adding it until 20:04 (45 seconds later). This looks more like a corosync problem than a bad network.
Run "corosync-blackbox > bb" after the situation in comment #10 occurs, and attach the bb file from each node to the bugzilla.
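For example, with a per-node output name (the path is just an illustration to keep the two files distinct when attaching them):

corosync-blackbox > /tmp/bb-$(hostname -s).txt   # dump the corosync flight-recorder data on each node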
Was this the problem where the customer switched to broadcast and the system worked properly afterward?
Created attachment 524244 [details]
cluster logs, token timeout 240 s, corosync-blackbox

Cluster logs with token timeout 240 seconds.

1. killed corosync on node2
2. after the token timeout, node2 is fenced by node1
3. node2 tries to rejoin the cluster after the reboot, but node1 removes node2 from the cluster
4. ran corosync-blackbox > bb on both nodes, but it failed on node2
Please answer Comment #12. Thanks -steve
To eliminate network problems, please use broadcast or udpu mode and reproduce. Thanks -steve
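For reference, a sketch of the two config variants as I understand them; the attribute names are from memory of the RHEL 6 cman documentation and should be double-checked, and only one of the two should be used:

<!-- broadcast instead of multicast -->
<cman two_node="1" expected_votes="1" broadcast="yes"/>

<!-- or unicast UDP (udpu), available in newer cman/corosync builds -->
<cman two_node="1" expected_votes="1" transport="udpu"/>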
(In reply to comment #12)
> Was this the problem where the customer switched to broadcast and the system worked properly afterward?

No, this was not a customer problem; I hit it while testing bz663397.
Created attachment 524352 [details]
cluster logs, token timeout 300 s, broadcast enabled

With broadcast enabled there is no problem and the fenced node successfully rejoins the cluster.
Martin,

Thanks for testing broadcast. I am likely going to close this bug as not a defect, which implies there is a problem with your network environment. There are a couple of possible explanations for why your network environment isn't working. What infrastructure are you in?
Steven,

Thank you, I think you are right; this seems to be a multicast problem somewhere in the network. I don't have access to the network infrastructure, so I cannot provide detailed information about the network switches.
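If someone with access to the machines wants to verify the multicast path later, one common check (assuming the omping utility is installed; the node names are the ones from cluster.conf) is to run the same command on both nodes at the same time and compare the unicast and multicast results:

omping -c 20 bz1 bz2   # run simultaneously on bz1 and bz2; multicast loss with clean unicast points at the switch/IGMP configuration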
This looks like it was a configuration issue.