Description of problem:

This commit
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=4039bf4817a96b6aab20de948389f43b89ce4a8e
fixing bug 594511 was not a complete solution. The chance remains that
dlm_controld will get stuck waiting for fencing that has already happened.

dlm_controld checks that last_fenced_time is >= the time the failed node was
removed from the cpg (fail_time). But a fast fencing agent and a delayed cpg
callback can cause fail_time to be later than last_fenced_time. Comparing
last_fenced_time to the time the failed node was removed via the cman api
probably has a similar problem.

To test this I added usleep(1000000) in del_configfs_node() before rmdir.

fence_tool dump:

>1343237151 cluster node 2 removed seq 488
>1343237151 fenced:daemon conf 1 0 1 memb 1 join left 2
>1343237151 fenced:daemon ring 1:488 1 memb 1
>1343237151 fenced:default conf 1 0 1 memb 1 join left 2
>1343237151 add_change cg 4 remove nodeid 2 reason 3
>1343237151 add_change cg 4 m 1 j 0 r 1 f 1
>1343237151 add_victims node 2
>1343237151 check_ringid cluster 488 cpg 1:484
>1343237151 fenced:default ring 1:488 1 memb 1
>1343237151 check_ringid done cluster 488 cpg 1:488
>1343237151 check_quorum done
>1343237151 send_start 1:4 flags 2 started 3 m 1 j 0 r 1 f 1
>1343237151 receive_start 1:4 len 152
>1343237151 match_change 1:4 matches cg 4
>1343237151 wait_messages cg 4 got all 1
>1343237151 set_master from 1 to complete node 1
>1343237151 delay post_fail_delay 0 quorate_from_last_update 0
>1343237151 dct-xen-11 not a cluster member after 0 sec post_fail_delay
>1343237151 fencing node dct-xen-11
>1343237151 fence dct-xen-11 dev 0.0 agent /bin/true result: success
>1343237151 fence dct-xen-11 success
>1343237151 send_victim_done cg 4 flags 2 victim nodeid 2
>1343237151 send_complete 1:4 flags 2 started 3 m 1 j 0 r 1 f 1
>1343237151 receive_victim_done 1:4 flags 2 len 80
>1343237151 receive_victim_done 1:4 remove victim 2 time 1343237151 how 1
>1343237151 receive_complete 1:4 len 152

dlm_tool dump:

>1343237151 cluster node 2 removed seq 488
>1343237151 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
>1343237152 dlm:controld conf 1 0 1 memb 1 join left 2
>1343237152 dlm:ls:dlm_master conf 1 0 1 memb 1 join left 2
>1343237152 dlm_master add_change cg 3 remove nodeid 2 reason 3
>1343237152 dlm_master add_change cg 3 counts member 1 joined 0 remove 1 failed 1
>1343237152 dlm_master stop_kernel cg 3
>1343237152 write "0" to "/sys/kernel/dlm/dlm_master/control"
>1343237152 dlm_master check_fencing 2 wait add 1343237075 fail 1343237152 last 1343237151

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
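To make the race concrete, here is a minimal C sketch of the kind of comparison described above. It is not the actual dlm_controld code; the function name is made up, and the parameter names just mirror the "fail" and "last" values in the check_fencing line of the dlm_tool dump.

#include <stdint.h>

/*
 * Illustration only, not the dlm_controld source.
 * fail_time        - when the cpg confchg reporting the failure arrived
 * last_fenced_time - when fenced reports the node was last fenced
 */
static int fencing_complete(uint64_t fail_time, uint64_t last_fenced_time)
{
        /*
         * "Was the node fenced after it failed?"  With a fast fence agent
         * and a delayed cpg callback, fail_time (1343237152 above) ends up
         * later than last_fenced_time (1343237151), so this never returns 1
         * and recovery waits for a fence that already happened.
         */
        if (last_fenced_time >= fail_time)
                return 1;       /* fencing considered complete */
        return 0;               /* keep waiting for fencing */
}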
The symptom of this problem is fencing being done but dlm still waiting for fencing:

# group_tool
fence domain
member count  1
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1

dlm lockspaces
name          dlm_master
id            0x8321e41a
flags         0x00000004 kern_stop
change        member 2 joined 1 remove 0 failed 0 seq 2,2
members       1 2
new change    member 1 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   1
A possible solution might be to save the add_time that preceded the last failure as need_fenced_time and compare last_fenced_time to that.
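A rough sketch of that idea, for illustration only and not the committed patch: need_fenced_time is assumed to hold the add_time recorded before the last failure (the "add 1343237075" value in the check_fencing line above), and the function name is hypothetical.

#include <stdint.h>

/*
 * Sketch of the proposed comparison.  A fence that completed any time
 * after the node was last added satisfies the check, even if the cpg
 * failure callback arrived after the fence finished.
 */
static int fencing_complete_proposed(uint64_t need_fenced_time,
                                     uint64_t last_fenced_time)
{
        if (last_fenced_time >= need_fenced_time)
                return 1;       /* fenced since the node's last add: done */
        return 0;               /* fence predates this membership: wait */
}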
No one has reported seeing this except me.
Pushing this out since I don't have time to work on it for 6.4.
Pushed patch to RHEL6 branch:
https://git.fedorahosted.org/cgit/cluster.git/commit/?h=RHEL6&id=2d06dd478c27bf864ba1a5ac0cbb1ba6c3ed947f

I tested and verified this in two different ways:

1. Added usleep(1000000) in del_configfs_node() before rmdir and recompiled (see the sketch below). Killed nodeB; nodeA recovered properly based on the adjusted time comparison.

2. (No code change necessary.) Suspended the dlm_controld process on nodeA, killed nodeB, waited for fenced to complete fencing on nodeA, resumed dlm_controld on nodeA, and nodeA recovered properly.
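For reference, a rough sketch of the test instrumentation in method 1, assuming del_configfs_node() simply builds the configfs comms path and rmdir()s it, as the del_configfs_node line in the dlm_tool dump suggests; the real function in dlm_controld may differ.

#include <stdio.h>
#include <unistd.h>

#define COMMS_DIR "/sys/kernel/config/dlm/cluster/comms"

/* Sketch only: not the actual del_configfs_node() implementation. */
static void del_configfs_node(int nodeid)
{
        char path[256];

        snprintf(path, sizeof(path), "%s/%d", COMMS_DIR, nodeid);

        /* test-only delay injected before rmdir to widen the race window */
        usleep(1000000);

        if (rmdir(path) < 0)
                fprintf(stderr, "del_configfs_node: rmdir %s failed\n", path);
}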
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1420.html