Bug 1039734

Summary: crm_resource appears stuck while waiting 60 sec for messages
Product: Red Hat Enterprise Linux 7
Reporter: Corey Marthaler <cmarthal>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 7.0
CC: cluster-maint, dvossel, fdinitto, mnovacek, tengel
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2014-07-03 22:11:24 UTC
Type: Bug
Bug Blocks: 1111381, 1113520

Description Corey Marthaler 2013-12-09 21:20:58 UTC
Description of problem:

[root@virt-068 ~]# pcs resource cleanup clusterfs-clone
Resource: clusterfs-clone successfully cleaned up
[root@virt-068 ~]# pcs status
Cluster name: COREY
Last updated: Mon Dec  9 22:11:35 2013
Last change: Mon Dec  9 22:11:31 2013 via crmd on virt-066.cluster-qe.lab.eng.brq.redhat.com
Stack: corosync
Current DC: virt-068.cluster-qe.lab.eng.brq.redhat.com (3) - partition with quorum
Version: 1.1.10-21.el7-368c726
3 Nodes configured
9 Resources configured


Online: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]
 Clone Set: clusterfs-clone [clusterfs]
     Stopped: [ virt-066.cluster-qe.lab.eng.brq.redhat.com virt-067.cluster-qe.lab.eng.brq.redhat.com virt-068.cluster-qe.lab.eng.brq.redhat.com ]

PCSD Status:
virt-066.cluster-qe.lab.eng.brq.redhat.com: 
  virt-066.cluster-qe.lab.eng.brq.redhat.com: Online
virt-067.cluster-qe.lab.eng.brq.redhat.com: 
  virt-067.cluster-qe.lab.eng.brq.redhat.com: Online
virt-068.cluster-qe.lab.eng.brq.redhat.com: 
  virt-068.cluster-qe.lab.eng.brq.redhat.com: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@virt-068 ~]# pcs resource cleanup clusterfs
^CTraceback (most recent call last):
  File "/usr/sbin/pcs", line 130, in <module>
    main(sys.argv[1:])
  File "/usr/sbin/pcs", line 109, in main
    resource.resource_cmd(argv)
  File "/usr/lib/python2.7/site-packages/pcs/resource.py", line 164, in resource_cmd
    resource_cleanup(res_id)
  File "/usr/lib/python2.7/site-packages/pcs/resource.py", line 1770, in resource_cleanup
    (output, retval) = utils.run(["crm_resource", "-C", "-r", res_id])
  File "/usr/lib/python2.7/site-packages/pcs/utils.py", line 432, in run
    output,stderror = p.communicate(string_for_stdin)
  File "/usr/lib64/python2.7/subprocess.py", line 798, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
  File "/usr/lib64/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
[root@virt-068 ~]# crm_resource -C -r clusterfs
Cleaning up clusterfs:0 on virt-066.cluster-qe.lab.eng.brq.redhat.com
Cleaning up clusterfs:0 on virt-067.cluster-qe.lab.eng.brq.redhat.com
Cleaning up clusterfs:0 on virt-068.cluster-qe.lab.eng.brq.redhat.com
Waiting for 4 replies from the CRMd...No messages received in 60 seconds.. aborting


Version-Release number of selected component (if applicable):
pcs-0.9.100-1.el7.x86_64

Comment 2 Chris Feist 2014-01-07 15:07:33 UTC
Moving bug to pacemaker since this issue is caused by crm_resource (which pcs calls) hanging.

Comment 3 Andrew Beekhof 2014-01-10 01:45:44 UTC
While I investigate, can you try running:

   crm_resource -C -r clusterfs-clone

and see if that behaves the same way?

(Never refer to the thing inside the clone/group directly.)

Comment 4 michal novacek 2014-01-31 14:01:00 UTC
I tried with a Dummy resource and it behaves the same way.
Another thing to notice: the cleanup of the resource(s) does complete, but exit
status 1 is returned, presumably because of the crmd not replying.

I believe the problem might be that crm_resource is waiting for one more crmd
reply than it should:

# pcs status
Cluster name: STSRHTS19810
Last updated: Fri Jan 31 14:53:55 2014
Last change: Fri Jan 31 14:51:48 2014 via crmd on virt-139
Stack: corosync
Current DC: virt-139 (2) - partition with quorum
Version: 1.1.10-22.el7-368c726
3 Nodes configured
6 Resources configured


Online: [ virt-138 virt-139 virt-140 ]

Full list of resources:

 virt-fencing   (stonith:fence_xvm):    Started virt-138 
 Resource Group: ha-apache-0
     ip-0       (ocf::heartbeat:IPaddr2):       Started virt-140 
     lvm-0      (ocf::heartbeat:LVM):   Started virt-140 
     fs-0       (ocf::heartbeat:Filesystem):    Started virt-140 
     apache-0   (ocf::heartbeat:apache):        Started virt-140 
 le-dummy       (ocf::heartbeat:Dummy): Started virt-139 

PCSD Status:
  virt-138: Online
  virt-139: Online
  virt-140: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

# crm_resource -C -r le-dummy
Cleaning up le-dummy on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up le-dummy on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up le-dummy on virt-140.cluster-qe.lab.eng.brq.redhat.com
Waiting for 4 replies from the CRMd...No messages received in 60 seconds.. aborting

# crm_resource -C -r ha-apache-0
Cleaning up ip-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up ip-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up ip-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Cleaning up lvm-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up lvm-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up lvm-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Cleaning up fs-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up fs-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up fs-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Cleaning up apache-0 on virt-138.cluster-qe.lab.eng.brq.redhat.com
Cleaning up apache-0 on virt-139.cluster-qe.lab.eng.brq.redhat.com
Cleaning up apache-0 on virt-140.cluster-qe.lab.eng.brq.redhat.com
Waiting for 13 replies from the CRMd............No messages received in 60 seconds.. aborting
Four resources cleaned up on three nodes gives 12 cleanup messages, yet 13 crmd replies are expected.

Comment 5 tengel 2014-03-28 22:58:26 UTC
I encountered this as well today. It looks like the bug is already fixed upstream in tools/crm_resource.c as of 1.1.11-rc5, but a patch in the RPM makes that a bit hidden.

https://github.com/ClusterLabs/pacemaker/blob/master/tools/crm_resource.c

https://github.com/ClusterLabs/pacemaker/commit/145c782e432d8108ca865f994640cf5a62406363

In the beta RPM (pacemaker-1.1.10-19.el7.src.rpm) the code sets crmd_replies_needed=0, loops through the nodes adding one per node, and then increments it one more time in start_mainloop(). The counter is decremented for each valid reply, but that leaves the extra increment from start_mainloop() undecremented, so the tool waits for a reply that never arrives.

The patch in the SRPM:

bz720543-pcmk-crm_resource_wait_for_all_replies_when_cleaning_up_resources.patch

...adds the per-node loop increment, which exposed this bug because of the pre-existing extra increment in start_mainloop():

+        if(node->details->remote_rsc == NULL) {
+            crmd_replies_needed++;
+        }

At least, that's my take on it -- I'm new to Pacemaker. :)

Comment 6 Andrew Beekhof 2014-04-15 05:18:03 UTC
(In reply to tengel from comment #5)
> https://github.com/ClusterLabs/pacemaker/commit/
> 145c782e432d8108ca865f994640cf5a62406363

This is definitely the correct patch; we'll need to backport it.

Comment 8 tengel 2014-05-02 20:50:25 UTC
This is still not fixed in RHEL7 RC1 - I have two nodes configured and get:

[root@linclus3a heartbeat]# crm_resource -C -r db_ip
Cleaning up db_ip on linclus3a
Cleaning up db_ip on linclus3b
Waiting for 3 replies from the CRMd..No messages received in 60 seconds.. aborting

Comment 9 Andrew Beekhof 2014-05-06 06:18:44 UTC
(In reply to tengel from comment #8)
> This is still not fixed in RHEL7 RC1

Right. The fix will be available post GA.

Comment 10 tengel 2014-07-02 13:31:11 UTC
This appears to be fixed - I rebuilt the cluster with the GA and performed a full RPM update:

[root@linclus3a ~]# crm_resource -C -r db_ip
Cleaning up db_ip on linclus3a
Cleaning up db_ip on linclus3b
Waiting for 2 replies from the CRMd.. OK

[root@linclus3a ~]# rpm -q pacemaker
pacemaker-1.1.10-31.el7_0.x86_64

Looking spiffy, thanks.