Red Hat Bugzilla – Bug 1268313
clvmd/dlm resource agent monitor action should recognize it is hung
Last modified: 2016-11-03 14:56:15 EDT
Description of problem:
This is a bug for the same problem as bz1241511, but with component resource-agent.

Summary of bz1241511: All nodes lose and then regain quorum without fencing (caused by network connection loss). This leaves dlm waiting for fencing, causing clvmd to appear hung, and all access to LVs waits forever. The resulting situation is a quorate cluster with clvmd-clone and dlm-clone reported as running correctly while both are actually hung. As concluded in bz1241511, the problem is neither pacemaker's nor dlm's. I believe the correct resolution of this situation is for clvmd to fail its monitor action, leading to fencing.

Version-Release number of selected component (if applicable):
dlm-4.0.2-5.el7.x86_64
lvm2-cluster-2.02.115-3.el7.x86_64
pacemaker-1.1.12-22.el7.x86_64
corosync-2.3.4-4.el7.x86_64

How reproducible:
very frequent

Steps to Reproduce:
1. Have a quorate pacemaker cluster.
2. Check the nodes' uptime.
3. Disable network communication between all nodes with iptables (a sketch follows this comment) and wait for all nodes to become inquorate.
4. Re-enable network communication between the nodes at the same time.
5. Check whether fencing occurred; if it has not, check dlm status and logs.

Actual results:
dlm hanging

Expected results:
dlm happily working

Additional info:
# tail /var/log/messages
...
Jul 9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul 9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul 9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop dlm:1 (virt-018 - blocked)
Jul 9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop dlm:2 (virt-019 - blocked)
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 2 needs fencing
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 1 needs fencing
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul 9 12:19:12 virt-020 dlm_controld[2438]: 183 fence work wait to clear merge 2 clean 1 part 0 gone 0
Jul 9 12:19:39 virt-020 dlm_controld[2438]: 210 clvmd wait for fencing
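(A minimal sketch of step 3, assuming corosync runs on its default UDP port 5405; the port number and the exact rules are assumptions for illustration, not taken from this report.)

# on every node, drop corosync traffic to cut the cluster links
iptables -I INPUT -p udp --dport 5405 -j DROP
iptables -I OUTPUT -p udp --dport 5405 -j DROP
# ... wait until all nodes report they are inquorate ...
# then remove the rules on all nodes at (roughly) the same time
iptables -D INPUT -p udp --dport 5405 -j DROP
iptables -D OUTPUT -p udp --dport 5405 -j DROP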
What dlm and/or clvmd command should be executed to determine that they are hung?
I'm not sure about the exact command. We should somehow find out that controld is in a non-operational state that can only be resolved by fencing. 'dlm_tool ls' at the moment of the problem shows:

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       1 2 3
new change    member 3 joined 1 remove 0 failed 0 seq 2,4
new status    wait fencing
new members   1 2 3

Maybe the 'wait fencing' status gives us this information?
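(A minimal sketch of how that state could be checked from the shell; treating "wait fencing" as the hang indicator is the suggestion above, everything else is illustration.)

# a lockspace stuck waiting for fencing shows "wait fencing" in its status
if dlm_tool ls | grep -q "wait fencing"; then
    echo "dlm is waiting for fencing that has not happened" >&2
    exit 1
fi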
controld is a shell script that calls into dlm/clvmd, so

> We should somehow find out that controld is in the non-operational state that
> can be resolved only by fencing.

has no meaning. We can have it look for 'wait fencing' though. Is there an equivalent command when controld is managing clvmd_controld?

Oyvind: Btw. the controld agent is in the pacemaker tree when you need to find it.
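(A minimal sketch, under assumed OCF conventions, of how the controld agent's monitor could use that check; the function name and return-code choice are illustrative, not the actual patch -- the real change is in the pull request linked below.)

controld_monitor_fencing_check() {
    # dlm_controld keeps running while its lockspaces are stuck, so a plain
    # process check still passes; the "wait fencing" status reveals the hang
    if dlm_tool ls | grep -q "wait fencing"; then
        ocf_log err "DLM lockspaces are waiting for fencing"
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}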
Is there an easy way to reproduce this issue? I only got it to go into "wait fencing" once, by random timing; the other times it went into "wait fencing" only when network connectivity came back, which caused fencing.
Scratch build available at: https://brewweb.devel.redhat.com/taskinfo?taskID=10124358
Created attachment 1097646 [details] Working patch
https://github.com/ClusterLabs/pacemaker/pull/839
The resource agent is in the pacemaker package.
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions
RecoverySwitchFailure test scenario passed, marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2578.html