Bug 739930 - fenced node is rejected from two-node cluster after fencing
Summary: fenced node is rejected from two-node cluster after fencing
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 663397 752866
 
Reported: 2011-09-20 13:24 UTC by Martin Juricek
Modified: 2016-04-26 13:55 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-18 14:14:05 UTC


Attachments
fence_tool dump (9.90 KB, text/plain)
2011-09-20 14:31 UTC, Martin Juricek
cluster logs, token timeout 300 s (207.66 KB, application/x-compressed-tar)
2011-09-21 14:13 UTC, Martin Juricek
cluster logs, token timeout 240 s, corosync-blackbox (203.85 KB, application/x-compressed-tar)
2011-09-21 16:33 UTC, Martin Juricek
cluster logs, token timeout 300 s, broadcast enabled (106.81 KB, application/x-compressed-tar)
2011-09-22 08:47 UTC, Martin Juricek

Description Martin Juricek 2011-09-20 13:24:41 UTC
Description of problem:

I hit this while testing a big token timeout.
In a two-node cluster, when one node is rebooted and then successfully fenced, it tries to rejoin the cluster. It rejoins successfully, but right after that it is removed from the cluster by the other node and fenced again. It ends in repetitive fencing of the first node.


Version-Release number of selected component (if applicable):

cman-3.0.12.1-20.el6.x86_64
corosync-1.4.1-3.el6.x86_64


How reproducible: 100%

Steps to Reproduce:
1. setup two-node cluster
2. run revolver test
3. node is fenced repeatedly
  
Actual results:
Node is removed from cluster and repeatedly fenced again

Expected results:
Node successfully joins the cluster after fence and reboot

Additional info:

cluster.conf:

<?xml version="1.0"?>
<cluster name="BZ_Cluster" config_version="1">
  <cman two_node="1" expected_votes="1">
  </cman>
  <fence_daemon post_join_delay="20" clean_start="0"/>
  <clusternodes>
    <clusternode name="bz1" votes="1" nodeid="1">
      <fence>
        <method name="IPMI">
          <device name="imm_ipmi" ipaddr="10.34.43.28"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="bz2" votes="1" nodeid="2">
      <fence>
        <method name="IPMI">
          <device name="imm_ipmi" ipaddr="10.34.43.30"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="imm_ipmi" agent="fence_ipmilan" auth="PASSWORD" login="xxx" passwd="xxx" action="reboot"/>
  </fencedevices>
  <totem token="300000"/>
  <logging debug="on"/>
  <rm>
    <resources>
      <ip address="10.34.43.88" monitor_link="1"/>
      <clusterfs name="gfsfs" device="/dev/vg1/lv1" fstype="gfs2" mountpoint="/tmp/test" force_unmount="1" options="rw"/>
      <nfsexport name="test_nfs_export"/>
      <nfsclient name="nfs_client1" allow_recover="1" options="rw" target="10.34.2.112"/>
    </resources>
    <service name="test" autostart="1" recovery="relocate">
      <ip ref="10.34.43.88"/>
      <clusterfs ref="gfsfs">
        <nfsexport ref="test_nfs_export">
          <nfsclient ref="nfs_client1"/>
        </nfsexport>
      </clusterfs>
    </service>
  </rm>
</cluster>

Comment 1 Martin Juricek 2011-09-20 13:29:53 UTC
Log from the fencing node:


Sep 20 07:16:17 qe-x3650-02 corosync[2783]:   [QUORUM] Members[1]: 2
Sep 20 07:16:17 qe-x3650-02 corosync[2783]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:16:17 qe-x3650-02 rgmanager[3126]: State change: bz1 DOWN
Sep 20 07:16:17 qe-x3650-02 kernel: dlm: closing connection to node 1
Sep 20 07:16:17 qe-x3650-02 fenced[2841]: fencing node bz1
Sep 20 07:16:17 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Trying to acquire journal lock...
Sep 20 07:16:17 qe-x3650-02 corosync[2783]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:2 left:1)
Sep 20 07:16:17 qe-x3650-02 corosync[2783]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 07:16:19 qe-x3650-02 xinetd[1944]: START: qarsh pid=5862 from=::ffff:10.34.43.27
Sep 20 07:16:19 qe-x3650-02 qarshd[5862]: Talking to peer 10.34.43.27:44248
Sep 20 07:16:19 qe-x3650-02 qarshd[5862]: Running cmdline: cman_tool nodes -F name,type
Sep 20 07:16:19 qe-x3650-02 xinetd[1944]: EXIT: qarsh status=0 pid=5862 duration=0(sec)
Sep 20 07:16:21 qe-x3650-02 corosync[2783]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:16:21 qe-x3650-02 corosync[2783]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:1 left:0)
Sep 20 07:16:21 qe-x3650-02 corosync[2783]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 07:16:25 qe-x3650-02 fenced[2841]: fence bz1 success
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Looking at journal...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Acquiring the transaction lock...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Replaying journal...
Sep 20 07:16:25 qe-x3650-02 rgmanager[3126]: Taking over service service:test from down member bz1
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Replayed 12 of 167 blocks
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Found 3 revoke tags
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Journal replayed in 1s
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test2.1: jid=0: Done
Sep 20 07:16:25 qe-x3650-02 rgmanager[5994]: [clusterfs] mounting /dev/dm-3 on /tmp/test
Sep 20 07:16:25 qe-x3650-02 rgmanager[6023]: [clusterfs] mount -t gfs2 -o rw /dev/dm-3 /tmp/test
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "BZ_Cluster:test"
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: Joined cluster. Now mounting FS...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=0, already locked for use
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=0: Looking at journal...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=0: Done
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=1: Trying to acquire journal lock...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=1: Looking at journal...
Sep 20 07:16:25 qe-x3650-02 kernel: GFS2: fsid=BZ_Cluster:test.0: jid=1: Done
Sep 20 07:16:26 qe-x3650-02 rgmanager[6087]: [nfsclient] Adding export: 10.34.2.112:/tmp/test (rw)
Sep 20 07:16:26 qe-x3650-02 rgmanager[6156]: [ip] Adding IPv4 address 10.34.43.88/24 to eth0
Sep 20 07:16:29 qe-x3650-02 rgmanager[3126]: Service service:test started
Sep 20 07:16:53 qe-x3650-02 kernel: rport-5:0-3: blocked FC remote port time out: removing rport
Sep 20 07:19:39 qe-x3650-02 corosync[2783]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:19:39 qe-x3650-02 corosync[2783]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:1 left:0)
Sep 20 07:19:39 qe-x3650-02 corosync[2783]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 07:19:43 qe-x3650-02 corosync[2783]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:19:43 qe-x3650-02 corosync[2783]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:1 left:0)
Sep 20 07:19:43 qe-x3650-02 corosync[2783]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 07:20:30 qe-x3650-02 corosync[2783]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:20:30 qe-x3650-02 corosync[2783]:   [QUORUM] Members[2]: 1 2
Sep 20 07:20:30 qe-x3650-02 corosync[2783]:   [QUORUM] Members[2]: 1 2
Sep 20 07:20:30 qe-x3650-02 corosync[2783]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.43.29) ; members(old:1 left:0)
Sep 20 07:20:30 qe-x3650-02 corosync[2783]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 07:20:30 qe-x3650-02 fenced[2841]: telling cman to remove nodeid 1 from cluster
Sep 20 07:21:02 qe-x3650-02 xinetd[1944]: START: qarsh pid=9035 from=::ffff:10.34.43.27
Sep 20 07:21:02 qe-x3650-02 qarshd[9035]: Talking to peer 10.34.43.27:52603
Sep 20 07:21:02 qe-x3650-02 qarshd[9035]: Running cmdline: cman_tool nodes
Sep 20 07:21:02 qe-x3650-02 xinetd[1944]: EXIT: qarsh status=0 pid=9035 duration=0(sec)
Sep 20 07:21:02 qe-x3650-02 xinetd[1944]: START: qarsh pid=9037 from=::ffff:10.34.43.27
Sep 20 07:21:02 qe-x3650-02 qarshd[9037]: Talking to peer 10.34.43.27:52604
Sep 20 07:21:02 qe-x3650-02 qarshd[9037]: Running cmdline: runcon -t unconfined_t cman_tool services
Sep 20 07:21:02 qe-x3650-02 xinetd[1944]: EXIT: qarsh status=0 pid=9037 duration=0(sec)
Sep 20 07:25:30 qe-x3650-02 corosync[2783]:   [TOTEM ] A processor failed, forming new configuration.
Sep 20 07:25:32 qe-x3650-02 corosync[2783]:   [QUORUM] Members[1]: 2
Sep 20 07:25:32 qe-x3650-02 corosync[2783]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 20 07:25:32 qe-x3650-02 kernel: dlm: closing connection to node 1
Sep 20 07:25:32 qe-x3650-02 corosync[2783]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.43.31) ; members(old:2 left:1)
Sep 20 07:25:32 qe-x3650-02 corosync[2783]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 07:25:52 qe-x3650-02 fenced[2841]: fencing node bz1
Sep 20 07:26:03 qe-x3650-02 fenced[2841]: fence bz1 success

Comment 2 David Teigland 2011-09-20 13:48:53 UTC
When using big timeouts like that, the killed node rejoins before the other node has noticed that it failed.  During development we spent quite a bit of time making that case work properly (it hasn't in the past).  The behavior is going to be slightly unusual, though, and I'm guessing that revolver may not be able to cope with it.

It looks to me like the behavior above is probably correct (fence_tool dump from both nodes could confirm).  What happens is:

1. members 1,2
2. revolver reboots 1
3. 2 does not notice 1 leaving (timeout is large)
4. 1 rejoins before 2 notices 1 has failed/left
5. fenced on 2 tells 1 to die

> Sep 20 07:20:30 qe-x3650-02 fenced[2841]: telling cman to remove nodeid 1 from cluster

6. cman/corosync on 1 exit due to step 5
7. fenced on 2 sees 1 fail
8. fenced on 2 fences 1

>  Sep 20 07:25:52 qe-x3650-02 fenced[2841]: fencing node bz1

Comment 3 David Teigland 2011-09-20 14:04:50 UTC
I didn't notice the fencing at the start of the logs above, which I can't really explain, so I shouldn't say they look correct.  We'll need to see fence_tool dump from the nodes to explain what's happening.  I still suspect that revolver is getting out of sync with what the cluster is doing.

Comment 4 Martin Juricek 2011-09-20 14:31:29 UTC
Created attachment 524048 [details]
fence_tool dump

Comment 5 David Teigland 2011-09-20 19:47:19 UTC
This does look like a bug, and it's complicated.  The immediate problem we see is that when node1 restarts after a full reboot, node2's fenced kills it because it has restarted "with state".  I.e., when node2 sees node1 rejoin the cpg, node2 says node1 is already a member of the fence domain, which it shouldn't be if it's just joining.  node2 takes the correct action in killing it here.  This is very similar to what happens when two nodes are partitioned and then merge together again.  But this isn't a traditional partition merge...

To say for certain what's happening I'd also need to see the fence_tool dump (or logfile) from the other node.  But what I'm fairly sure is happening is the following.

When node1 restarts after the reboot, it
1. forms a cluster by itself (a very annoying corosync feature)
2. joins the fence domain by itself (that is, it becomes stateful)

Step 1 is not a problem for us as long as it is resolved (i.e. the joining node merges with the existing members) before step 2, where state is created.  We usually insert some waits between steps 1 and 2 to try to ensure that 1 is resolved before 2.

The first method of working around step 1 is waiting for quorum.  We need quorum to really do much anyway, so we might as well wait for it prior to step 2, since it also solves the problem of waiting for corosync to merge with existing nodes.

In the two-node case, waiting for quorum does nothing, so there is a second wait, fence_tool join -m 45, which waits up to 45 seconds for all nodes to be members before doing the fence_tool join.  We need to find out what's happening with this wait in these tests.  This wait should in theory prevent a node from becoming stateful (fence_tool join) by itself.
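The intended ordering (resolve the single-node cluster before creating fence-domain state, with a bounded wait for all members) can be sketched as a small simulation.  The function and helper names here are hypothetical; the real wait lives in fence_tool's C code:

```python
import time

def wait_for_members(get_member_count, expected, timeout_s=45, poll_s=1):
    """Wait up to timeout_s for all expected nodes to appear as cluster
    members, mirroring the `fence_tool join -m 45` wait described above.
    Returns True if all members were seen in time, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_member_count() >= expected:
            return True
        time.sleep(poll_s)
    return get_member_count() >= expected

# Simulated membership query: the peer only shows up after 3 polls.
calls = {"n": 0}
def fake_member_count():
    calls["n"] += 1
    return 2 if calls["n"] > 3 else 1

# With a generous timeout the join proceeds only once both nodes are seen.
print(wait_for_members(fake_member_count, expected=2, timeout_s=10, poll_s=0))
```

If the timeout expires first (as happens later in this bug), the caller proceeds alone and becomes stateful, which is exactly the failure mode under discussion.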

Comment 6 David Teigland 2011-09-20 20:57:42 UTC
The logs do show things working correctly once at the end:

1316524044 cluster node 1 added seq 2812
1316524044 fenced:default ring 1:2812 2 memb 1 2
1316524044 fenced:daemon ring 1:2812 2 memb 1 2
1316524046 fenced:daemon conf 2 1 0 memb 1 2 join 1 left
1316524046 receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
1316524047 fenced:default conf 2 1 0 memb 1 2 join 1 left
(the "run 1.1.1.0" from 1 has a final .0 which means not-stateful)

Here's where it didn't work:

(not sure what these are from, they don't seem to cause any problem, but they only appear when things don't work)
1316523470 fenced:daemon ring 2:2796 1 memb 2
1316523470 fenced:default ring 2:2796 1 memb 2
1316523474 fenced:daemon ring 2:2800 1 memb 2
1316523474 fenced:default ring 2:2800 1 memb 2

1316523519 cluster node 1 added seq 2804
1316523519 fenced:daemon conf 2 1 0 memb 1 2 join 1 left
1316523519 fenced:daemon ring 1:2804 2 memb 1 2
1316523519 fenced:default conf 2 1 0 memb 1 2 join 1 left
1316523519 fenced:default ring 1:2804 2 memb 1 2
1316523519 receive_protocol from 1 max 1.1.1.0 run 1.1.1.1
(result of receiving "run 1.1.1.1" from 1, the final .1 is STATEFUL ...)
1316523519 daemon node 1 max 0.0.0.0 run 0.0.0.0
1316523519 daemon node 1 join 1316523519 left 1316523249 local quorum 1316520586
1316523519 daemon node 1 stateful merge
1316523519 daemon node 1 kill due to stateful merge
1316523519 telling cman to remove nodeid 1 from cluster
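The kill decision above hinges on the last field of the advertised "run" protocol version (1.1.1.1 versus 1.1.1.0 in the log excerpts).  A minimal sketch of that check, with hypothetical function names (fenced implements this in C):

```python
def is_stateful(run_version: str) -> bool:
    """Per the log excerpts above, a trailing .1 in the advertised 'run'
    protocol version flags the sender as stateful; a trailing .0 means a
    clean, stateless join."""
    return run_version.split(".")[-1] == "1"

def should_kill_on_merge(run_version: str) -> bool:
    # A node that rejoins while claiming existing fence-domain state must
    # be removed, which is the "kill due to stateful merge" log message.
    return is_stateful(run_version)

print(should_kill_on_merge("1.1.1.1"))  # stateful rejoin -> kill
print(should_kill_on_merge("1.1.1.0"))  # clean join -> allow
```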

Comment 7 David Teigland 2011-09-20 21:19:56 UTC
I see you have <logging debug="on"/> so you should already have the fenced logs from both nodes in /var/log/cluster/

Comment 8 Martin Juricek 2011-09-21 14:13:51 UTC
Created attachment 524205 [details]
cluster logs, token timeout 300 s

Attaching logs from the new test.

1. killed corosync on node2
2. after token timeout node2 is fenced by node1
3. node2 after reboot tries to join the cluster, but node1 removes node2 from cluster 
4. after token timeout node2 is fenced by node1 again

Comment 9 David Teigland 2011-09-21 14:49:41 UTC
From the node that's reset:

Sep 21 05:19:17 fenced fenced 3.0.12.1 started
...
Sep 21 05:20:04 fenced cpg_join fenced:default ...
Sep 21 05:20:04 fenced fenced:default conf 1 1 0 memb 2 join 2 left
Sep 21 05:20:04 fenced add_change cg 1 joined nodeid 2
Sep 21 05:20:04 fenced add_change cg 1 m 1 j 1 r 0 f 0
Sep 21 05:20:04 fenced add_victims_init nodeid 1
Sep 21 05:20:04 fenced check_ringid cluster 3308 cpg 0:0
Sep 21 05:20:04 fenced fenced:default ring 2:3308 1 memb 2
Sep 21 05:20:04 fenced check_ringid done cluster 3308 cpg 2:3308
Sep 21 05:20:04 fenced check_quorum done
Sep 21 05:20:04 fenced send_start 2:1 flags 1 started 0 m 1 j 1 r 0 f 0
Sep 21 05:20:04 fenced receive_start 2:1 len 152
Sep 21 05:20:04 fenced match_change 2:1 matches cg 1
Sep 21 05:20:04 fenced wait_messages cg 1 got all 1
Sep 21 05:20:04 fenced set_master from 0 to low node 2
Sep 21 05:20:04 fenced delay post_join_delay 20 quorate_from_last_update 1
Sep 21 05:20:05 fenced cluster node 1 added seq 3320

19:17 fenced is started
20:04 fence_tool join is run (47 sec later)
20:05 node 1 is added

So the fence_tool join -m 45 is timing out waiting for all members, proceeds to join the fence domain (creating state), the other node is finally seen, and we get the expected results.

Does it really take corosync that long to see the other node?  If so:
is that due to corosync? due to the big timeout? due to a crap network?

> Sep 21 05:19:14 qe-x3650-02 corosync[1880]:   [MAIN  ] Corosync Cluster Engine ('1.4.1'): started
> Sep 21 05:19:14 qe-x3650-02 corosync[1880]:   [QUORUM] Members[1]: 2
> Sep 21 05:19:17 qe-x3650-02 fenced[1937]: fenced 3.0.12.1 started
> Sep 21 05:20:04 qe-x3650-02 corosync[1880]:   [QUORUM] Members[2]: 1 2

corosync is taking 50 seconds to merge with node1 after starting.
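The arithmetic behind the timeout can be checked directly from the timestamps quoted above: fenced started at 05:19:17, and the fence_tool join proceeded at 05:20:04, so the member wait exceeded its 45-second window.

```python
from datetime import datetime

FMT = "%H:%M:%S"
fenced_started = datetime.strptime("05:19:17", FMT)   # "fenced 3.0.12.1 started"
join_proceeded = datetime.strptime("05:20:04", FMT)   # "cpg_join fenced:default"

elapsed = (join_proceeded - fenced_started).total_seconds()
member_wait = 45  # fence_tool join -m 45

print(f"waited {elapsed:.0f}s; -m window is {member_wait}s")
print("timed out waiting for members:", elapsed > member_wait)
```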

Comment 10 David Teigland 2011-09-21 14:56:58 UTC
> Sep 21 05:19:16 qe-x3650-01 corosync[6872]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 21 05:19:16 qe-x3650-01 corosync[6872]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.43.29) ; members(old:1 left:0)
> Sep 21 05:19:16 qe-x3650-01 corosync[6872]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Sep 21 05:19:19 qe-x3650-01 corosync[6872]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 21 05:19:19 qe-x3650-01 corosync[6872]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.43.29) ; members(old:1 left:0)
> Sep 21 05:19:19 qe-x3650-01 corosync[6872]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]:   [QUORUM] Members[2]: 1 2
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]:   [QUORUM] Members[2]: 1 2
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.43.29) ; members(old:1 left:0)
> Sep 21 05:20:04 qe-x3650-01 corosync[6872]:   [MAIN  ] Completed service synchronization, ready to provide service.

From the other node.  Here it looks like corosync saw the other node at 19:16, but didn't finish adding it until 20:04 (45 seconds later).  This looks more like a corosync problem than a bad network.

Comment 11 Steven Dake 2011-09-21 15:24:23 UTC
Run corosync-blackbox > bb after the situation in comment #10 occurs.

Attach the bb output for each node to the bugzilla.

Comment 12 Steven Dake 2011-09-21 15:31:05 UTC
Was this the problem where the customer switched to broadcast and the system worked properly afterward?

Comment 13 Martin Juricek 2011-09-21 16:33:19 UTC
Created attachment 524244 [details]
cluster logs, token timeout 240 s, corosync-blackbox

Cluster logs with token timeout 240 seconds

1. killed corosync on node2
2. after token timeout node2 is fenced by node1
3. node2 after reboot tries to join the cluster, but node1 removes node2 from cluster
4. run corosync-blackbox > bb on both nodes, but it failed on node2

Comment 14 Steven Dake 2011-09-21 17:26:57 UTC
Please answer Comment #12.

Thanks
-steve

Comment 15 Steven Dake 2011-09-21 17:56:53 UTC
To eliminate network problems, please use broadcast or udpu mode and reproduce.

Thanks
-steve

Comment 16 Martin Juricek 2011-09-22 07:27:57 UTC
(In reply to comment #12)
> Was this the problem where the customer switched to broadcast and the system
> worked properly afterward?

No, this was not a customer's problem; I hit this while testing bz663397.

Comment 17 Martin Juricek 2011-09-22 08:47:05 UTC
Created attachment 524352 [details]
cluster logs, token timeout 300 s, broadcast enabled

With broadcast enabled there is no problem and fenced node successfully rejoins the cluster.
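For reference, the switch amounts to one attribute on the cman element of cluster.conf.  This is a sketch based on the cman configuration schema, not a config taken from this cluster; udpu transport is the other option Steven suggests in comment #15:

```xml
<!-- Sketch: enable broadcast instead of multicast (cman, RHEL 6) -->
<cman two_node="1" expected_votes="1" broadcast="yes"/>

<!-- Alternatively, unicast UDP transport (RHEL 6.2 and later):
<cman two_node="1" expected_votes="1" transport="udpu"/>
-->
```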

Comment 18 Steven Dake 2011-09-22 15:14:20 UTC
Martin,

Thanks for testing broadcast.  I will likely close this bug as not a defect, since this implies you have a problem with your network environment.  There are a couple of possible explanations for why multicast isn't working in your network environment.  What infrastructure are you in?

Comment 19 Martin Juricek 2011-09-23 12:39:22 UTC
Steven,

Thank you, I think you are right; this seems to be a multicast problem somewhere in the network.  I don't have access to the network infrastructure, so I cannot provide detailed information about the network switches.

Comment 20 Lon Hohberger 2011-10-18 14:14:05 UTC
This looks like it was a configuration issue.

