Bug 843160

Summary: dlm_controld recovery stuck in check_fencing
Product: Red Hat Enterprise Linux 6 Reporter: David Teigland <teigland>
Component: clusterAssignee: David Teigland <teigland>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: low    
Version: 6.3CC: ccaulfie, cluster-maint, djansa, fdinitto, jruemker, lhh, mjuricek, rpeterso, teigland
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: cluster-3.0.12.1-67.el6 Doc Type: Bug Fix
Doc Text:
Cause: Fencing time comparison does not work as expected when a fence agent completes quickly or the corosync callback is delayed. Consequence: dlm recovery will be stuck waiting for fencing to complete. Fix: Save and compare different time stamps that are not affected by the sequence of fencing and corosync callbacks. Result: dlm recovery will not be stuck waiting for fencing to complete in this case.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-14 04:44:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description David Teigland 2012-07-25 18:24:50 UTC
Description of problem:

This commit
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=4039bf4817a96b6aab20de948389f43b89ce4a8e

fixing bug 594511 was not a complete solution.  The chance remains that dlm_controld will get stuck waiting for fencing that has already happened.

dlm_controld is checking that the last_fenced_time is >= the time the failed node was removed from the cpg (fail_time).  But, a fast fencing agent and a delayed cpg callback can cause the fail_time to be later than the last_fenced_time.  Comparing last_fenced_time to the time the failed node was removed via the cman api probably has a similar problem.

To test this I added usleep(1000000) in del_configfs_node() before rmdir.

fence_tool dump

>1343237151 cluster node 2 removed seq 488
>1343237151 fenced:daemon conf 1 0 1 memb 1 join left 2
>1343237151 fenced:daemon ring 1:488 1 memb 1
>1343237151 fenced:default conf 1 0 1 memb 1 join left 2
>1343237151 add_change cg 4 remove nodeid 2 reason 3
>1343237151 add_change cg 4 m 1 j 0 r 1 f 1
>1343237151 add_victims node 2
>1343237151 check_ringid cluster 488 cpg 1:484
>1343237151 fenced:default ring 1:488 1 memb 1
>1343237151 check_ringid done cluster 488 cpg 1:488
>1343237151 check_quorum done
>1343237151 send_start 1:4 flags 2 started 3 m 1 j 0 r 1 f 1
>1343237151 receive_start 1:4 len 152
>1343237151 match_change 1:4 matches cg 4
>1343237151 wait_messages cg 4 got all 1
>1343237151 set_master from 1 to complete node 1
>1343237151 delay post_fail_delay 0 quorate_from_last_update 0
>1343237151 dct-xen-11 not a cluster member after 0 sec post_fail_delay
>1343237151 fencing node dct-xen-11
>1343237151 fence dct-xen-11 dev 0.0 agent /bin/true result: success
>1343237151 fence dct-xen-11 success
>1343237151 send_victim_done cg 4 flags 2 victim nodeid 2
>1343237151 send_complete 1:4 flags 2 started 3 m 1 j 0 r 1 f 1
>1343237151 receive_victim_done 1:4 flags 2 len 80
>1343237151 receive_victim_done 1:4 remove victim 2 time 1343237151 how 1
>1343237151 receive_complete 1:4 len 152

dlm_tool dump

>1343237151 cluster node 2 removed seq 488
>1343237151 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
>1343237152 dlm:controld conf 1 0 1 memb 1 join left 2
>1343237152 dlm:ls:dlm_master conf 1 0 1 memb 1 join left 2
>1343237152 dlm_master add_change cg 3 remove nodeid 2 reason 3
>1343237152 dlm_master add_change cg 3 counts member 1 joined 0 remove 1 failed 1
>1343237152 dlm_master stop_kernel cg 3
>1343237152 write "0" to "/sys/kernel/dlm/dlm_master/control"
>1343237152 dlm_master check_fencing 2 wait add 1343237075 fail 1343237152 last 1343237151


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 David Teigland 2012-07-25 18:26:47 UTC
The symptom of this problem is fencing being done but dlm still waiting for fencing:

# group_tool
fence domain
member count  1
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1

dlm lockspaces
name          dlm_master
id            0x8321e41a
flags         0x00000004 kern_stop
change        member 2 joined 1 remove 0 failed 0 seq 2,2
members       1 2
new change    member 1 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   1

Comment 2 David Teigland 2012-07-25 18:52:30 UTC
A possible solution might be to save the add_time that preceded the last failure as need_fenced_time and compare last_fenced_time to that.

Comment 4 David Teigland 2012-07-31 14:32:06 UTC
No one has reported seeing this except me.

Comment 6 David Teigland 2012-10-24 15:32:31 UTC
Pushing this out since I don't have time to work on it for 6.4

Comment 8 David Teigland 2014-01-10 22:24:52 UTC
pushed patch to RHEL6 branch
https://git.fedorahosted.org/cgit/cluster.git/commit/?h=RHEL6&id=2d06dd478c27bf864ba1a5ac0cbb1ba6c3ed947f

I tested and verified this in two different ways:

1. added usleep(1000000) in del_configfs_node() before rmdir, recompiled.
   killed nodeB, nodeA recovered properly based on the adjusted time comparison.

2. (no code change necessary) suspended the dlm_controld process on nodeA,
   killed nodeB, waited for fenced to complete fencing on nodeA, resumed
   dlm_controld on nodeA, and nodeA recovered properly.

Comment 13 errata-xmlrpc 2014-10-14 04:44:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1420.html