1519812 – pcs node standby --wait=... never terminates when using bundles

Bug 1519812 - pcs node standby --wait=... never terminates when using bundles

Summary: pcs node standby --wait=... never terminates when using bundles

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	7.4
Hardware:	All
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	7.5
Assignee:	Ken Gaillot
QA Contact:	pkomarov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1520798

Reported:	2017-12-01 13:40 UTC by Michele Baldessari
Modified:	2018-04-10 15:35 UTC
CC List:	14 users
Fixed In Version:	pacemaker-1.1.18-10.el7
Doc Type:	Bug Fix
Doc Text:	Previously, the --wait option of the pcs utility sometimes blocked pcs commands indefinitely if clone notifications were not immediately runnable. This happened because Pacemaker unnecessarily waited until clone notifications had been completed when waiting for the cluster to stabilize. With this update, Pacemaker now ignores clone notification actions when waiting for cluster stability. As a result, pending clone notifications no longer block pcs commands when the cluster is stable.
Clone Of:
Clones:	1520798
Environment:
Last Closed:	2018-04-10 15:34:42 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2018:0860	0	None	None	None	2018-04-10 15:35:20 UTC

Description Michele Baldessari 2017-12-01 13:40:40 UTC

Description of problem:
Calling "pcs node standby overcloud-controller-1 --wait=900" hangs until the timeout hits.

The reason seems to be that the underlying call (logged as "Running: /usr/sbin/crm_resource --wait --timeout=900") never returns, which in turn is likely because crm_simulate -Ls shows the following:
Transition Summary:
 * Start      rabbitmq-bundle-2     ( overcloud-controller-0 )   due to unrunnable rabbitmq-bundle-docker-2 start (blocked)
 * Start      rabbitmq:2            (      rabbitmq-bundle-2 )   due to unrunnable rabbitmq-bundle-docker-2 start (blocked)
 * Start      galera-bundle-2       ( overcloud-controller-2 )   due to unrunnable galera-bundle-docker-2 start (blocked)
 * Start      galera:2              (        galera-bundle-2 )   due to unrunnable galera-bundle-docker-2 start (blocked)
 * Start      redis-bundle-2        ( overcloud-controller-0 )   due to unrunnable redis-bundle-docker-2 start (blocked)
 * Start      redis:2               (         redis-bundle-2 )   due to unrunnable redis-bundle-docker-2 start (blocked)

Note that the node does seem to go into standby mode correctly. It just seems that the bundles somehow stay in the transition graph, so crm_resource --wait never returns.
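To spot this condition without reading the summary by eye, the blocked actions can be pulled out of the crm_simulate -Ls text. The following is a minimal sketch, assuming the summary lines have exactly the shape shown above; the regular expression, the BlockedAction type, and the blocked_actions helper are illustrative names and not part of Pacemaker or pcs, and other Pacemaker versions may format these lines differently.

```python
import re
from typing import List, NamedTuple, Optional


class BlockedAction(NamedTuple):
    action: str            # e.g. "Start"
    resource: str          # e.g. "rabbitmq-bundle-2"
    node: str              # text between the parentheses
    reason: Optional[str]  # text after "due to", if present


# Pattern derived only from the summary lines quoted in this report (an assumption).
ACTION_RE = re.compile(
    r"^\s*\*\s+(?P<action>\S+)\s+(?P<resource>\S+)\s+\(\s*(?P<node>[^)]*?)\s*\)"
    r"(?:\s+due to (?P<reason>.*?))?\s*(?P<blocked>\(blocked\))?\s*$"
)


def blocked_actions(summary_text: str) -> List[BlockedAction]:
    """Return the actions that the transition summary marks as (blocked)."""
    found = []
    for line in summary_text.splitlines():
        match = ACTION_RE.match(line)
        if match and match.group("blocked"):
            found.append(
                BlockedAction(
                    match.group("action"),
                    match.group("resource"),
                    match.group("node"),
                    match.group("reason"),
                )
            )
    return found


# Two lines copied from the report, used here as sample input.
SAMPLE = """Transition Summary:
 * Start      rabbitmq-bundle-2     ( overcloud-controller-0 )   due to unrunnable rabbitmq-bundle-docker-2 start (blocked)
 * Start      rabbitmq:2            (      rabbitmq-bundle-2 )   due to unrunnable rabbitmq-bundle-docker-2 start (blocked)
"""

for item in blocked_actions(SAMPLE):
    print(item.action, item.resource, "on", item.node, "<-", item.reason)
```

On a live cluster the input text would come from running crm_simulate -Ls; a non-empty result while the node is already in standby would match the symptom described here.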

Version-Release number of selected component (if applicable):
I reproduced this in my environment, which had:
[root@overcloud-controller-0 ~]# rpm -q pcs pacemaker
pcs-0.9.158-6.el7.centos.x86_64
pacemaker-1.1.16-12.el7_4.4.x86_64

Yuri also reproduced it with the latest z-stream:
pacemaker-1.1.16-12.el7_4.5.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP12
2. Call 'pcs node standby --wait=900'
3. Timeout

Actual results:
Timeout

Expected results:
No timeout

Comment 5 Michele Baldessari 2017-12-11 18:44:33 UTC

Setting needinfo on myself, as Ken needs confirmation that 7.5 works correctly in this regard.

Comment 9 Ken Gaillot 2017-12-12 21:11:41 UTC

Can you attach a pe-input (or pcs cluster report) from when it was blocked? It doesn't happen in my simple tests, so I suspect it's going to be specific to something in your configuration.

Comment 11 Michele Baldessari 2017-12-13 12:52:34 UTC

As potential additional info: the unstandby command works as expected.

[root@overcloud-controller-2 tmp]# pcs node unstandby overcloud-controller-1 --wait=900
[root@overcloud-controller-2 tmp]# echo $?
0

Comment 12 Andrew Beekhof 2018-01-16 08:07:28 UTC

Fix:
- https://github.com/beekhof/pacemaker/commit/dc404dc

Comment 13 Andrew Beekhof 2018-01-16 22:07:18 UTC

Assigning back to HA team for builds and z-streams

Comment 15 pkomarov 2018-02-14 16:36:08 UTC

Verified.

# cat /etc/rhosp-release 
Red Hat OpenStack Platform release 12.0 (Pike)


# pcs status |head -n 3
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-10.el7-2b07d5c5a9) - partition with quorum

# time pcs node standby --wait=900

real	0m48.888s
user	0m2.034s
sys	0m0.138s
[root@controller-0 ~]# echo $?
0
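The timing check above can also be scripted. The following is a minimal sketch, assuming only the pcs invocation already shown in this report; standby_command and run_with_deadline are hypothetical helper names, and the outer deadline is an extra safety net on top of --wait, not something pcs itself provides.

```python
import subprocess
import time
from typing import List, Optional, Tuple


def standby_command(node: str, wait_s: int) -> List[str]:
    """Build the pcs call used in this report, e.g. pcs node standby <node> --wait=900."""
    return ["pcs", "node", "standby", node, "--wait={}".format(wait_s)]


def run_with_deadline(cmd: List[str], deadline_s: float) -> Tuple[Optional[int], float]:
    """Run cmd and return (exit status, elapsed seconds).

    The exit status is None if the command was still running at the deadline,
    in which case subprocess.run kills it before raising TimeoutExpired.
    """
    start = time.monotonic()
    try:
        status = subprocess.run(cmd, timeout=deadline_s).returncode
    except subprocess.TimeoutExpired:
        status = None
    return status, time.monotonic() - start


# Example use on a cluster node (not executed here; the node name is taken from the report):
#   status, elapsed = run_with_deadline(standby_command("overcloud-controller-1", 900), 960)
#   print("exit status:", status, "elapsed: %.1fs" % elapsed)
```

With the fixed packages, the expectation from the transcript above is an exit status of 0 and an elapsed time well below the --wait value; an elapsed time close to the --wait value would suggest the original symptom.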

Comment 16 pkomarov 2018-02-14 16:47:07 UTC

Verified.

# pcs status |head -n 3
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-10.el7-2b07d5c5a9) - partition with quorum

pcs config:

Colocation Constraints:
  ip-172.17.0.11 with ovndb_servers-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)

Before ovndb_servers-master standby: 

 ip-172.17.0.11	(ocf::heartbeat:IPaddr2):	Started controller-0

 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ controller-0 ]
     Slaves: [ controller-1 controller-2 ]

After ovndb_servers-master standby: 
 ip-172.17.0.11	(ocf::heartbeat:IPaddr2):	Started controller-2

 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ controller-2 ]
     Slaves: [ controller-1 ]
     Stopped: [ controller-0 ]

Comment 19 errata-xmlrpc 2018-04-10 15:34:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0860