Bug 1241511

| Summary: | dlm_controld waits for fencing which will never occur causing hang |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Component: | dlm |
| Version: | 7.1 |
| Status: | CLOSED NOTABUG |
| Reporter: | michal novacek <mnovacek> |
| Assignee: | David Teigland <teigland> |
| QA Contact: | cluster-qe <cluster-qe> |
| CC: | cluster-maint, dvossel, jbrassow, jkortus, mnovacek, zren |
| Severity: | unspecified |
| Priority: | unspecified |
| Target Milestone: | rc |
| Keywords: | TestBlocker |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2015-08-14 15:30:22 UTC |
"Fencing will never occur because the cluster is quorate with all nodes" If stateful cluster nodes fail, they need to be fenced. If dlm is in charge of fencing, then stateful cluster merges are a situation where you might need to manually intervene (e.g. if no partition maintained quorum). When pacemaker does fencing, I don't know what's supposed to happen. If you reproduce this with dlm by itself (get rid of pacemaker) then I could explain the behavior. Please either reproduce that way, or reassign to pacemaker. The cluster doesn't require fencing in this situation. If the dlm requires it, then it is up to the dlm to initiate it. The expected dlm behavior here remains the same as it's been in the past (since partition/merge handling was added), and there does not appear to be anything to fix. In the case of a cluster partition that merges, if one partition maintained quorum, then it will kill merged nodes. Otherwise, as in this case, user intervention is required to select and kill merged nodes. Hello Michal, > How reproducible: very frequent > > Steps to Reproduce: > 1. have quorate pacemaker cluster > . check nodes uptime > . disable network communication between all nodes with iptables and wait for > all nodes turning inquorate > . enable at the same time network communication between nodes > . check whether fencing occured and if it has not check dlm status and logs With 3 nodes cluster, unfortunately I cannot reproduce(fencing quickly happens ) if applying iptables manually:-/ Looking at cluster report you attached, I think you may use some automatic method to make a real transient disconnection all of sudden. If so, could you please share your method/scripts to help reproduce? The reason why I'm here is this patch (https://github.com/ClusterLabs/pacemaker/pull/839) has a problem which will cause both nodes to be fenced in 2-nodes cluster unnecessarily in the following case: 1. Bring both nodes up in the cluster and all resources started. 2. Fence one node by issuing "pkill -9 corosync" 3. Watch logs and surviving node fences the other node and then ends up self fencing It will decrease availability in 2-nodes scenario. IMHO, the patch shouldn't let "controld" RA rely on "dlm_tool ls" to get "wait fencing" because this message means there's a node in cluster needing fencing. This commands on each node tell RA the same message, so every node will die. IOW, we need dlm tell RA if this node needs fencing, then that patch should work better. Thanks for your time;-) What I do is that I create /root/iptables.sh on each of the cluster node and then I do run from a node outside of the cluster: for i in 1 2 3; do ssh node$i /root/iptables.sh & done; wait This way I was able to manifest the problem described in like less than ten attempts on a three node cluster. The imortant thing is '&' instead of ';' in the for cycle which will the commands in parallel. Hope this helps. (In reply to michal novacek from comment #10) > What I do is that I create /root/iptables.sh on each of the cluster node and > then I do run from a node outside of the cluster: > > for i in 1 2 3; do ssh node$i /root/iptables.sh & done; wait > > This way I was able to manifest the problem described in like less than ten > attempts on a three node cluster. > > The imortant thing is '&' instead of ';' in the for cycle which will the > commands in parallel. > > Hope this helps. Hi Michal, Thanks a lot for your info! I've reproduced this problem now. In case you may interest: 1. setup ntp (optional); 2. 
What I do is create /root/iptables.sh on each of the cluster nodes, and then run the following from a node outside of the cluster:

    for i in 1 2 3; do ssh node$i /root/iptables.sh & done; wait

This way I was able to manifest the described problem in fewer than ten attempts on a three-node cluster. The important thing is the '&' instead of ';' in the for loop, which runs the commands in parallel.

Hope this helps.

(In reply to michal novacek from comment #10)
> What I do is create /root/iptables.sh on each of the cluster nodes, and
> then run the following from a node outside of the cluster:
>
>     for i in 1 2 3; do ssh node$i /root/iptables.sh & done; wait
>
> This way I was able to manifest the described problem in fewer than ten
> attempts on a three-node cluster. The important thing is the '&' instead
> of ';' in the for loop, which runs the commands in parallel.
>
> Hope this helps.

Hi Michal,

Thanks a lot for your info! I've reproduced this problem now. In case you're interested:

1. Set up ntp (optional).
2. Put this script on every node:

---
#!/bin/sh
PATH=$PATH:/usr/sbin/

has_quorum=
hosts="ocfs2test2,ocfs2test3"    # the other 2 nodes

# Cut off traffic from the other nodes.
iptables -A INPUT -s $hosts -j DROP
echo "iptables: add rules" > /tmp/cron.log

# Busy-wait until this node loses quorum.
while true; do
    has_quorum=`corosync-quorumtool | awk '{if($1=="Quorate:") print $2;}'`
    if [ "$has_quorum" = "No" ]; then
        echo "Quorum lost now" >> /tmp/cron.log
        break
    fi
done

# Restore traffic immediately after quorum is lost.
iptables -D INPUT -s $hosts -j DROP
echo "iptables: remove rules" >> /tmp/cron.log
---

3. Trigger the script to run concurrently on all nodes via crontab (a sketch of such an entry follows below).

Thanks again.
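Step 3 is what makes the partitions land simultaneously. A minimal sketch of what such a trigger could look like, assuming the script above is installed as /root/iptables.sh on every node; the file name /etc/cron.d/quorum-test and the chosen time are illustrative:

```sh
# Hypothetical /etc/cron.d/quorum-test, installed on every node. cron starts
# the job within the same minute on all nodes (NTP keeps the clocks aligned,
# hence step 1), so the iptables DROP rules are applied nearly simultaneously
# cluster-wide. Remove the entry after the test run.
30 14 * * * root /root/iptables.sh
```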
Created attachment 1050232 [details]
pcs cluster report output

Description of problem:

Have a quorate cluster running pacemaker with clvmd and dlm clones. Then, at the same time, disable network communication between the cluster nodes; this leads to all cluster nodes turning inquorate. Re-enabling network communication between all nodes at the same time will most of the time (but not always) lead back to a quorate cluster without any fencing. In this case, dlm_controld expects fencing to happen and will hang until it does. Fencing will never occur because the cluster is quorate with all nodes.

Version-Release number of selected component (if applicable):
dlm-4.0.2-5.el7.x86_64
lvm2-cluster-2.02.115-3.el7.x86_64
pacemaker-1.1.12-22.el7.x86_64
corosync-2.3.4-4.el7.x86_64

How reproducible: very frequent

Steps to Reproduce:
1. have a quorate pacemaker cluster
2. check the nodes' uptime
3. disable network communication between all nodes with iptables and wait for all nodes to turn inquorate
4. re-enable network communication between the nodes at the same time
5. check whether fencing occurred; if it has not, check dlm status and logs

Actual results: dlm hanging

Expected results: dlm happily working

Additional info:

# tail /var/log/messages
...
Jul 9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul 9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul 9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop dlm:1 (virt-018 - blocked)
Jul 9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop dlm:2 (virt-019 - blocked)
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 2 needs fencing
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 1 needs fencing
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul 9 12:19:12 virt-020 dlm_controld[2438]: 183 fence work wait to clear merge 2 clean 1 part 0 gone 0
Jul 9 12:19:39 virt-020 dlm_controld[2438]: 210 clvmd wait for fencing
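The last log line shows the clvmd lockspace blocked behind fencing that will never come. Per the maintainer's comments above, no partition kept quorum here, so recovery is manual: the administrator has to select and kill one of the stateful-merge nodes. A minimal sketch of that intervention, assuming a pacemaker-managed cluster with a working stonith device; virt-018 is one of the merged nodes from the logs above, and depending on the state the other merged node may need the same treatment:

```sh
# Confirm that dlm is still blocked waiting for fencing (cluster-wide state):
dlm_tool ls | grep "wait fencing"

# Manually fence one of the stateful-merge nodes so dlm can clear the merge:
pcs stonith fence virt-018
```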