
Bug 1039734

Summary: crm_resource appears stuck while waiting 60 sec for messages
Product: Red Hat Enterprise Linux 7
Reporter: Corey Marthaler <cmarthal>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.0
CC: cluster-maint, dvossel, fdinitto, mnovacek, tengel
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-07-03 22:11:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1111381, 1113520

Description Corey Marthaler 2013-12-09 21:20:58 UTC
Description of problem:

[root@virt-068 ~]# pcs resource cleanup clusterfs-clone
Resource: clusterfs-clone successfully cleaned up
[root@virt-068 ~]# pcs status
Cluster name: COREY
Last updated: Mon Dec  9 22:11:35 2013
Last change: Mon Dec  9 22:11:31 2013 via crmd on virt-066.cluster-qe.lab.eng.brq.redhat.com
Stack: corosync
Current DC: virt-068.cluster-qe.lab.eng.brq.redhat.com (3) - partition with quorum
Version: 1.1.10-21.el7-368c726
3 Nodes configured
9 Resources configured


Online: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]
 Clone Set: clusterfs-clone [clusterfs]
     Stopped: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]

PCSD Status:
  virt-066.cluster-qe.lab.eng.brq.redhat.com: Online
  virt-067.cluster-qe.lab.eng.brq.redhat.com: Online
  virt-068.cluster-qe.lab.eng.brq.redhat.com: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@virt-068 ~]# pcs resource cleanup clusterfs
^CTraceback (most recent call last):
  File "/usr/sbin/pcs", line 130, in <module>
    main(sys.argv[1:])
  File "/usr/sbin/pcs", line 109, in main
    resource.resource_cmd(argv)
  File "/usr/lib/python2.7/site-packages/pcs/resource.py", line 164, in resource_cmd
    resource_cleanup(res_id)
  File "/usr/lib/python2.7/site-packages/pcs/resource.py", line 1770, in resource_cleanup
    (output, retval) = utils.run(["crm_resource", "-C", "-r", res_id])
  File "/usr/lib/python2.7/site-packages/pcs/utils.py", line 432, in run
    output,stderror = p.communicate(string_for_stdin)
  File "/usr/lib64/python2.7/subprocess.py", line 798, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
  File "/usr/lib64/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
[root@virt-068 ~]# crm_resource -C -r clusterfs
Cleaning up clusterfs:0 on virt-066.cluster-qe.lab.eng.brq.redhat.com
Cleaning up clusterfs:0 on virt-067.cluster-qe.lab.eng.brq.redhat.com
Cleaning up clusterfs:0 on virt-068.cluster-qe.lab.eng.brq.redhat.com
Waiting for 4 replies from the CRMd...No messages received in 60 seconds.. aborting


Version-Release number of selected component (if applicable):
pcs-0.9.100-1.el7.x86_64
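The traceback above shows why Ctrl-C was needed: pcs blocks in communicate() on the crm_resource child with no timeout, so when crm_resource waits for a reply that never comes, pcs hangs with it. A minimal sketch (Python; the wrapper name and timeout handling are illustrative, not the actual pcs code) of running a child process with an upper bound on the wait:

```python
import subprocess

def run_with_timeout(cmd, timeout=60):
    """Run a command, but give up after `timeout` seconds instead of
    blocking forever on its output (illustrative; not the pcs code)."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        # Caller can report "timed out" instead of appearing stuck.
        return None, ""

# Example with a command that finishes well within the limit:
rc, out = run_with_timeout(["echo", "cleaned up"], timeout=5)
```

With a bound like this, an unresponsive crm_resource would surface as a timeout error rather than an apparent hang.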

Comment 2 Chris Feist 2014-01-07 15:07:33 UTC
Moving bug to pacemaker since this issue is caused by crm_resource (which pcs calls) hanging.

Comment 3 Andrew Beekhof 2014-01-10 01:45:44 UTC
While I investigate, can you try running:

   crm_resource -C -r clusterfs-clone

and see if that behaves the same way?

(Note: never refer to the resource inside the clone/group; use the clone or group name itself.)

Comment 4 michal novacek 2014-01-31 14:01:00 UTC
I tried with a Dummy resource and it behaves the same way.
Another thing to notice: the cleanup of the resource(s) does complete, but
exit code 1 is returned, presumably because crmd appears not to reply.

I believe the problem may be that crm_resource is waiting for one more crmd
reply than it should:

# pcs status
Cluster name: STSRHTS19810
Last updated: Fri Jan 31 14:53:55 2014
Last change: Fri Jan 31 14:51:48 2014 via crmd on virt-139
Stack: corosync
Current DC: virt-139 (2) - partition with quorum
Version: 1.1.10-22.el7-368c726
3 Nodes configured
6 Resources configured


Online: [ virt-138 virt-139 virt-140 ]

Full list of resources:

 virt-fencing   (stonith:fence_xvm):    Started virt-138 
 Resource Group: ha-apache-0
     ip-0       (ocf::heartbeat:IPaddr2):       Started virt-140 
     lvm-0      (ocf::heartbeat:LVM):   Started virt-140 
     fs-0       (ocf::heartbeat:Filesystem):    Started virt-140 
     apache-0   (ocf::heartbeat:apache):        Started virt-140 
 le-dummy       (ocf::heartbeat:Dummy): Started virt-139 

PCSD Status:
  virt-138: Online
  virt-139: Online
  virt-140: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

# crm_resource -C -r le-dummy
Cleaning up le-dummy on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up le-dummy on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up le-dummy on virt-140.cluster-qe.lab.eng.brq.redhat.com
Waiting for 4 replies from the CRMd...No messages received in 60 seconds.. aborting

# crm_resource -C -r ha-apache-0
Cleaning up ip-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up ip-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up ip-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Cleaning up lvm-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up lvm-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up lvm-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Cleaning up fs-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up fs-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up fs-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Cleaning up apache-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up apache-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up apache-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Waiting for 13 replies from the CRMd............No messages received in 60 seconds.. aborting
12 resource cleanups were performed, yet 13 crmd replies were expected.
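The arithmetic in both runs above can be checked directly: one cleanup request, and hence one crmd reply, is issued per resource per node, yet crm_resource waits for one reply more than that. A quick sketch of the counting (Python, purely illustrative):

```python
def expected_replies(num_resources, num_nodes):
    # One cleanup request, and hence one crmd reply, per resource per node.
    return num_resources * num_nodes

# le-dummy: a single resource on three nodes -> 3 replies, tool waited for 4.
single = expected_replies(1, 3)    # 3

# ha-apache-0 group: ip-0, lvm-0, fs-0, apache-0 on virt-138/139/140
# -> 12 replies, tool waited for 13.
group = expected_replies(4, 3)     # 12
```

In both cases the observed count is exactly one higher than this, pointing at a single stray increment rather than a per-resource or per-node miscount.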

Comment 5 tengel 2014-03-28 22:58:26 UTC
I encountered this as well today. It looks to already be fixed upstream in tools/crm_resource.c as of 1.1.11-rc5, but a patch in the RPM makes it a bit hidden.

https://github.com/ClusterLabs/pacemaker/blob/master/tools/crm_resource.c

https://github.com/ClusterLabs/pacemaker/commit/145c782e432d8108ca865f994640cf5a62406363

In the beta RPM (pacemaker-1.1.10-19.el7.src.rpm) the code sets crmd_replies_needed = 0, then loops through the nodes and adds one for each, and then start_mainloop() increments it one more time. The counter is decremented for each valid reply; however, we are still left with the extra increment from start_mainloop(), which is never decremented.

The patch in the SRPM:

bz720543-pcmk-crm_resource_wait_for_all_replies_when_cleaning_up_resources.patch

...is adding the per-node loop increment, which resulted in this bug because of the extra increment in start_mainloop():

+        if(node->details->remote_rsc == NULL) {
+            crmd_replies_needed++;
+        }

At least, that's my take on it -- I'm new to Pacemaker. :)
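The double increment described above can be modeled in a few lines (Python here purely for illustration; the real logic is C in crm_resource.c). The loop adds one per node, start_mainloop() adds one more, and each reply decrements the counter, so it bottoms out at 1 instead of 0 and the 60-second wait expires:

```python
def replies_left(num_nodes, replies_received):
    """Model of the reply counter in the patched beta crm_resource."""
    crmd_replies_needed = 0
    for _ in range(num_nodes):       # bz720543 patch: +1 per (non-remote) node
        crmd_replies_needed += 1
    crmd_replies_needed += 1         # extra increment in start_mainloop()
    return crmd_replies_needed - replies_received

# Three nodes each reply once, yet one "reply" is still outstanding,
# so crm_resource keeps waiting until the 60-second timeout:
leftover = replies_left(3, 3)        # 1
```

The same model reproduces comment 8 below: with two nodes (two replies possible), the tool announces "Waiting for 3 replies".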

Comment 6 Andrew Beekhof 2014-04-15 05:18:03 UTC
(In reply to tengel from comment #5)
> https://github.com/ClusterLabs/pacemaker/commit/145c782e432d8108ca865f994640cf5a62406363

This is definitely the correct patch; we'll need to backport it.

Comment 8 tengel 2014-05-02 20:50:25 UTC
This is still not fixed in RHEL7 RC1 - I have two nodes configured and get:

[root@linclus3a heartbeat]# crm_resource -C -r db_ip
Cleaning up db_ip on linclus3a
Cleaning up db_ip on linclus3b
Waiting for 3 replies from the CRMd..No messages received in 60 seconds.. aborting

Comment 9 Andrew Beekhof 2014-05-06 06:18:44 UTC
(In reply to tengel from comment #8)
> This is still not fixed in RHEL7 RC1

Right. The fix will be available post GA.

Comment 10 tengel 2014-07-02 13:31:11 UTC
This appears to be fixed - I rebuilt the cluster with the GA and performed a full RPM update:

[root@linclus3a ~]# crm_resource -C -r db_ip
Cleaning up db_ip on linclus3a
Cleaning up db_ip on linclus3b
Waiting for 2 replies from the CRMd.. OK

[root@linclus3a ~]# rpm -q pacemaker
pacemaker-1.1.10-31.el7_0.x86_64

Looking spiffy, thanks.