| Summary: | crm_resource appears stuck while waiting 60 sec for messages | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | pacemaker | Assignee: | Andrew Beekhof <abeekhof> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.0 | CC: | cluster-maint, dvossel, fdinitto, mnovacek, tengel |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-07-03 22:11:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 1111381, 1113520 | | |
Moving this bug to pacemaker, since the issue is caused by crm_resource (which pcs calls) hanging. While I investigate, can you try running `crm_resource -C -r clusterfs-clone` and see whether it behaves the same way? (Never refer to the thing inside the clone/group.)

I tried with a Dummy resource and it behaves the same way.

Another thing to note: the cleanup of the resource(s) does complete, but 1 is returned, presumably because the crmd does not reply. I believe the problem is that crm_resource waits for one more crmd reply than it should:
```
# pcs status
Cluster name: STSRHTS19810
Last updated: Fri Jan 31 14:53:55 2014
Last change: Fri Jan 31 14:51:48 2014 via crmd on virt-139
Stack: corosync
Current DC: virt-139 (2) - partition with quorum
Version: 1.1.10-22.el7-368c726
3 Nodes configured
6 Resources configured

Online: [ virt-138 virt-139 virt-140 ]

Full list of resources:

 virt-fencing   (stonith:fence_xvm):    Started virt-138
 Resource Group: ha-apache-0
     ip-0       (ocf::heartbeat:IPaddr2):       Started virt-140
     lvm-0      (ocf::heartbeat:LVM):           Started virt-140
     fs-0       (ocf::heartbeat:Filesystem):    Started virt-140
     apache-0   (ocf::heartbeat:apache):        Started virt-140
 le-dummy       (ocf::heartbeat:Dummy):         Started virt-139

PCSD Status:
  virt-138: Online
  virt-139: Online
  virt-140: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

# crm_resource -C -r le-dummy
Cleaning up le-dummy on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up le-dummy on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up le-dummy on virt-140.cluster-qe.lab.eng.brq.redhat.com
Waiting for 4 replies from the CRMd...No messages received in 60 seconds.. aborting
```
```
# crm_resource -C -r ha-apache-0
Cleaning up ip-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up ip-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up ip-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Cleaning up lvm-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up lvm-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up lvm-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Cleaning up fs-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up fs-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up fs-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Cleaning up apache-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up apache-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up apache-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Waiting for 13 replies from the CRMd............No messages received in 60 seconds.. aborting
```

12 cleanup requests were sent (4 resources on 3 nodes), yet 13 crmd replies are expected, one more than the number of requests.
I encountered this as well today -- it's a bug that looks to already be fixed upstream in tools/crm_resource.c as of 1.1.11-rc5, but there's a patch in the RPM making it a bit hidden.

https://github.com/ClusterLabs/pacemaker/blob/master/tools/crm_resource.c
https://github.com/ClusterLabs/pacemaker/commit/145c782e432d8108ca865f994640cf5a62406363

In the beta RPM (pacemaker-1.1.10-19.el7.src.rpm) the code sets crmd_replies_needed=0, then loops through and adds one for every node, and then start_mainloop() increments it one more time. The counter is decremented for each valid reply, but we're still left with that one extra increment from start_mainloop() that is never decremented. The patch in the SRPM, bz720543-pcmk-crm_resource_wait_for_all_replies_when_cleaning_up_resources.patch, adds the loop increment, which exposed this bug because of the extra increment in start_mainloop():

```
+ if(node->details->remote_rsc == NULL) {
+     crmd_replies_needed++;
+ }
```

At least, that's my take on it -- I'm new to Pacemaker. :)

(In reply to tengel from comment #5)
> https://github.com/ClusterLabs/pacemaker/commit/145c782e432d8108ca865f994640cf5a62406363

This is definitely the correct patch; we'll need to backport it.

This is still not fixed in RHEL7 RC1 - I have two nodes configured and get:

```
[root@linclus3a heartbeat]# crm_resource -C -r db_ip
Cleaning up db_ip on linclus3a
Cleaning up db_ip on linclus3b
Waiting for 3 replies from the CRMd..No messages received in 60 seconds.. aborting
```

(In reply to tengel from comment #8)
> This is still not fixed in RHEL7 RC1

Right. The fix will be available post GA.

This appears to be fixed - I rebuilt the cluster with the GA and performed a full RPM update:

```
[root@linclus3a ~]# crm_resource -C -r db_ip
Cleaning up db_ip on linclus3a
Cleaning up db_ip on linclus3b
Waiting for 2 replies from the CRMd.. OK
[root@linclus3a ~]# rpm -q pacemaker
pacemaker-1.1.10-31.el7_0.x86_64
```

Looking spiffy, thanks.
Description of problem:

```
[root@virt-068 ~]# pcs resource cleanup clusterfs-clone
Resource: clusterfs-clone successfully cleaned up
[root@virt-068 ~]# pcs status
Cluster name: COREY
Last updated: Mon Dec 9 22:11:35 2013
Last change: Mon Dec 9 22:11:31 2013 via crmd on virt-066.cluster-qe.lab.eng.brq.redhat.com
Stack: corosync
Current DC: virt-068.cluster-qe.lab.eng.brq.redhat.com (3) - partition with quorum
Version: 1.1.10-21.el7-368c726
3 Nodes configured
9 Resources configured

Online: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]
 Clone Set: clusterfs-clone [clusterfs]
     Stopped: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]

PCSD Status:
  virt-066.cluster-qe.lab.eng.brq.redhat.com: Online
  virt-067.cluster-qe.lab.eng.brq.redhat.com: Online
  virt-068.cluster-qe.lab.eng.brq.redhat.com: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@virt-068 ~]# pcs resource cleanup clusterfs
^CTraceback (most recent call last):
  File "/usr/sbin/pcs", line 130, in <module>
    main(sys.argv[1:])
  File "/usr/sbin/pcs", line 109, in main
    resource.resource_cmd(argv)
  File "/usr/lib/python2.7/site-packages/pcs/resource.py", line 164, in resource_cmd
    resource_cleanup(res_id)
  File "/usr/lib/python2.7/site-packages/pcs/resource.py", line 1770, in resource_cleanup
    (output, retval) = utils.run(["crm_resource", "-C", "-r", res_id])
  File "/usr/lib/python2.7/site-packages/pcs/utils.py", line 432, in run
    output,stderror = p.communicate(string_for_stdin)
  File "/usr/lib64/python2.7/subprocess.py", line 798, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
  File "/usr/lib64/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

[root@virt-068 ~]# crm_resource -C -r clusterfs
Cleaning up clusterfs:0 on virt-066.cluster-qe.lab.eng.brq.redhat.com
Cleaning up clusterfs:0 on virt-067.cluster-qe.lab.eng.brq.redhat.com
Cleaning up clusterfs:0 on virt-068.cluster-qe.lab.eng.brq.redhat.com
Waiting for 4 replies from the CRMd...No messages received in 60 seconds.. aborting
```

Version-Release number of selected component (if applicable):
pcs-0.9.100-1.el7.x86_64