Red Hat Bugzilla – Bug 1188571
The --wait functionality implementation needs an overhaul
Last modified: 2016-05-27 09:23:26 EDT
In bug 1156311 a new --wait flag has been integrated into various pcs commands. The implementation, however, is not ideal and fails in various complex situations. Multiple issues have been discovered while testing the current implementation, and given the relatively high number of these issues an overhaul is warranted. The following problems have been discovered so far as of pcs-0.9.137-13.el7 (some disclosed in bug 1156311 and some also reported as bug 1186527).

> Case sensitivity is not taken into account:
[root@virt-010 ~]# time pcs resource create dummy1 Dummy --wait meta target-role=stopped
Error: unable to start: 'dummy1', please check logs for failure information
waiting timed out
Resource 'dummy1' is not running on any node

real    0m20.690s
user    0m4.094s
sys     0m1.669s

> When the number of clones / master roles is limited to 0, waiting fails. This should be filtered the same way as creating a disabled resource:
[root@virt-010 ~]# pcs resource create dummy1 Dummy --clone meta clone-max=0 --wait
Error: unable to start: 'dummy1', please check logs for failure information
waiting timed out
Resource 'dummy1' is not running on any node

[root@virt-010 ~]# pcs resource create stateful0 Stateful --master meta master-max=1 clone-max=0 --wait=10
Error: unable to start: 'stateful0', please check logs for failure information
waiting timed out
Resource 'stateful0' is not running on any node

> When only slaves are allowed, waiting fails as well, although the resource is actually running, as seen on the last line:
[root@virt-010 ~]# pcs resource create stateful0 Stateful --master meta master-max=0 clone-max=1 --wait=10
Error: unable to start: 'stateful0', please check logs for failure information
waiting timed out
Resource 'stateful0' is slave on node virt-012.

> When one of the nodes is stopped, --wait reports a fake failure because one or more nodes are not running (i.e. configured, but stopped):
[root@virt-010 ~]# pcs resource create dummy0 Dummy --clone --wait
Error: unable to start: 'dummy0', please check logs for failure information
waiting timed out
Resource 'dummy0' is running on nodes virt-010, virt-011.

> When moving from the "banned-on-all-nodes" state to a specific node, which clears the ban on that node and therefore allows the resource to start again, --wait still complains and doesn't wait:
[root@virt-010 ~]# time pcs resource move delay0 virt-011 --wait
Warning: Cannot use '--wait' on non-running resources

real    0m0.494s
user    0m0.194s
sys     0m0.043s

> If the resource is allowed to run on a single node and banned on all other nodes, using the --wait flag still causes pcs to wait for it to start elsewhere, which will never happen:
[root@virt-010 ~]# time pcs resource move dummy0 --wait
Error: Unable to start 'dummy0'
waiting timed out
Resource 'dummy0' is not running on any node

real    0m40.899s
user    0m8.723s
sys     0m3.331s

[root@virt-010 ~]# time pcs resource ban dummy0 virt-012 --wait
Error: Unable to start 'dummy0'
waiting timed out
Resource 'dummy0' is not running on any node

real    0m40.962s
user    0m8.694s
sys     0m3.342s

> Cloning a resource when one of the nodes is stopped fails:
[root@virt-010 ~]# pcs resource create delay0 Delay startdelay=3
[root@virt-010 ~]# time pcs resource clone delay0 --wait
Error: Unable to start clones of 'delay0'
waiting timed out
Resource 'delay0' is running on nodes virt-010, virt-011, virt-012.

real    0m30.747s
user    0m6.272s
sys     0m2.376s

> When the number of clones / master roles is limited to 0, waiting fails:
[root@virt-010 ~]# time pcs resource master stateful0 master-max=0 --wait
Error: unable to promote 'stateful0'
waiting timed out
Resource 'stateful0' is slave on nodes virt-010, virt-011, virt-012.

real    0m20.916s
user    0m4.376s
sys     0m1.673s

[root@virt-010 ~]# time pcs resource clone dummy0 meta clone-max=0 --wait
Error: Unable to start clones of 'dummy0'
waiting timed out
Resource 'dummy0' is not running on any node

real    0m20.861s
user    0m4.163s
sys     0m1.555s

> Referencing the clone itself should not wait for something that never happens; also, the error given is a bit misleading:
[root@virt-010 ~]# time pcs resource meta group0-clone target-role=Stopped --wait
Error: Unable to start 'group0-clone'
waiting timed out
Resource 'group0-clone' is running on nodes virt-010, virt-011, virt-012, virt-016.

real    0m21.240s
user    0m4.818s
sys     0m1.713s

> The usual problem arises with a clone and one node stopped:
[root@virt-010 ~]# time pcs resource meta group0 target-role=Stopped --wait
Resource 'group0' is not running on any node

real    0m10.792s
user    0m2.655s
sys     0m0.936s

[root@virt-010 ~]# time pcs resource meta group0 target-role=Started --wait
Error: Unable to start 'group0'
waiting timed out
Resource 'group0' is running on nodes virt-010, virt-011, virt-012.

real    0m21.165s
user    0m4.764s
sys     0m1.751s
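Most of the cases above come down to two gaps in the wait logic: it does not recognize configurations in which the wait can never succeed (target-role=stopped in any letter case, clone-max=0, a ban on every node), and for clones it keeps waiting for instances on nodes that are offline. Below is a minimal Python sketch of the kind of pre-check and polling loop this implies; nodes_running_on and online_nodes are hypothetical inputs, and this is not the actual pcs code.

import time

def normalize_meta(meta):
    # pcs meta attributes are effectively case-insensitive, so
    # 'target-role=stopped' has to be treated like 'target-role=Stopped'.
    return {k.lower(): str(v).lower() for k, v in meta.items()}

def wait_is_pointless(meta):
    # Return the offending attribute if waiting for a start can never
    # succeed with these meta attributes, otherwise None.
    meta = normalize_meta(meta)
    if meta.get("target-role") == "stopped":
        return "target-role=stopped"
    if meta.get("clone-max") == "0":
        return "clone-max=0"
    return None

def wait_for_start(resource_id, nodes_running_on, online_nodes,
                   meta=None, is_clone=False, timeout=60, interval=2):
    # nodes_running_on is a hypothetical callback returning the set of
    # nodes the resource is currently active on.
    reason = wait_is_pointless(meta or {})
    if reason is not None:
        raise ValueError("Cannot use '--wait' together with '%s'" % reason)
    deadline = time.time() + timeout
    while True:
        running = set(nodes_running_on(resource_id))
        if is_clone:
            # A clone counts as started once it runs on every *online*
            # node; configured-but-stopped nodes must not be waited for.
            done = bool(online_nodes) and running >= set(online_nodes)
        else:
            done = bool(running)
        if done:
            return sorted(running)
        if time.time() >= deadline:
            raise TimeoutError("waiting timed out")
        time.sleep(interval)

Rejecting impossible combinations up front (rather than timing out) matches how a disabled resource is already handled, and is what several of the cases above ask for.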
Created attachment 1002834 [details]
proposed fix

Tests:

> Case sensitivity is not taken into account:
[root@rh70-node1:~]# time pcs resource create dummy1 Dummy --wait meta target-role=stopped
Error: Cannot use '--wait' together with 'target-role=stopped'

real    0m0.104s
user    0m0.084s
sys     0m0.020s

> When the number of clones / master roles is limited to 0, waiting fails. This should be filtered the same way as creating a disabled resource:
[root@rh70-node1:~]# pcs resource create dummy1 Dummy --clone meta clone-max=0 --wait
Error: Cannot use '--wait' together with 'clone-max=0'
[root@rh70-node1:~]# pcs resource create stateful0 Stateful --master meta master-max=1 clone-max=0 --wait=10
Error: Cannot use '--wait' together with 'clone-max=0'

> When only slaves are allowed, waiting fails as well, although the resource is actually running, as seen on the last line:
[root@rh70-node1:~]# pcs resource create stateful0 Stateful --master meta master-max=0 clone-max=1 --wait=10
Warning: changing a monitor operation interval from 10 to 11 to make the operation unique
Resource 'stateful0' is slave on node rh70-node3.
[root@rh70-node1:~]# echo $?
0

> When one of the nodes is stopped, --wait reports a fake failure because one or more nodes are not running (i.e. configured, but stopped):
[root@rh70-node1:~]# pcs status nodes
Pacemaker Nodes:
 Online: rh70-node1 rh70-node2
 Standby:
 Offline: rh70-node3
[root@rh70-node1:~]# pcs resource create dummy0 Dummy --clone --wait
Resource 'dummy0' is running on nodes rh70-node1, rh70-node2.
[root@rh70-node1:~]# echo $?
0

> When moving from the "banned-on-all-nodes" state to a specific node, which clears the ban on that node and therefore allows the resource to start again, --wait still complains and doesn't wait:
[root@rh70-node1:~]# time pcs resource move delay0 rh70-node1 --wait
Resource 'delay0' is running on node rh70-node1.

real    0m8.257s
user    0m0.119s
sys     0m0.042s
[root@rh70-node1:~]# echo $?
0

> If the resource is allowed to run on a single node and banned on all other nodes, using the --wait flag still causes pcs to wait for it to start elsewhere, which will never happen:
[root@rh70-node1:~]# time pcs resource move dummy0 --wait
Error: Resource 'dummy0' is not running on any node

real    0m2.177s
user    0m0.113s
sys     0m0.043s
[root@rh70-node1:~]# time pcs resource ban dummy0 --wait
Error: Resource 'dummy0' is not running on any node

real    0m2.166s
user    0m0.104s
sys     0m0.041s

> Cloning a resource when one of the nodes is stopped fails:
[root@rh70-node1:~]# pcs status nodes
Pacemaker Nodes:
 Online: rh70-node1 rh70-node2
 Standby:
 Offline: rh70-node3
[root@rh70-node1:~]# pcs resource create delay0 Delay startdelay=3
[root@rh70-node1:~]# time pcs resource clone delay0 --wait
Resource 'delay0-clone' is running on nodes rh70-node1, rh70-node2.

real    0m8.201s
user    0m0.136s
sys     0m0.038s
[root@rh70-node1:~]# echo $?
0

> When the number of clones / master roles is limited to 0, waiting fails:
[root@rh70-node1:~]# pcs resource create stateful0 stateful
[root@rh70-node1:~]# time pcs resource master stateful0 master-max=0 --wait
Resource 'stateful0-master' is slave on nodes rh70-node1, rh70-node2, rh70-node3.

real    0m2.164s
user    0m0.108s
sys     0m0.034s
[root@rh70-node1:~]# echo $?
0

> Referencing the clone itself should not wait for something that never happens; also, the error given is a bit misleading:
[root@rh70-node1:~]# pcs resource create dummy0 dummy
[root@rh70-node1:~]# pcs resource group add group0 dummy0
[root@rh70-node1:~]# pcs resource clone group0
[root@rh70-node1:~]# time pcs resource meta group0-clone target-role=Stopped --wait
Resource 'group0-clone' is not running on any node

real    0m2.169s
user    0m0.109s
sys     0m0.033s
[root@rh70-node1:~]# echo $?
0

> The usual problem arises with a clone and one node stopped:
[root@rh70-node1:~]# pcs status nodes
Pacemaker Nodes:
 Online: rh70-node1 rh70-node2
 Standby:
 Offline: rh70-node3
[root@rh70-node1:~]# pcs resource create dummy0 dummy --group group0
[root@rh70-node1:~]# pcs resource clone group0
[root@rh70-node1:~]# time pcs resource meta group0 target-role=Stopped --wait
Resource 'group0' is not running on any node

real    0m2.172s
user    0m0.121s
sys     0m0.023s
[root@rh70-node1:~]# time pcs resource meta group0 target-role=Started --wait
Resource 'group0' is running on nodes rh70-node1, rh70-node2.

real    0m2.153s
user    0m0.100s
sys     0m0.036s
[root@rh70-node1:~]# echo $?
0
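The fixed behaviour above reports where the resource actually ended up (started on the online nodes only, or slave when master-max=0) and exits 0 instead of timing out. That information is available from Pacemaker's status XML. The following is a rough sketch of pulling it out with Python's standard library; 'pcs status xml' wraps 'crm_mon --as-xml', and the element and attribute names used here are assumptions about that output that may differ between Pacemaker versions.

import subprocess
import xml.etree.ElementTree as ET

def cluster_status():
    # Dump Pacemaker's status XML instead of scraping human-readable output.
    return ET.fromstring(subprocess.check_output(["pcs", "status", "xml"]))

def online_nodes(status):
    # Node elements carry online/standby flags (assumed attribute names).
    return [
        n.get("name")
        for n in status.iter("node")
        if n.get("online") == "true" and n.get("standby") == "false"
    ]

def resource_placement(status, resource_id):
    # Map role ("Started", "Slave", "Master", ...) to the nodes hosting
    # each active instance; clone instances appear as multiple
    # <resource> elements, possibly with ':N' suffixes on the id.
    placement = {}
    for res in status.iter("resource"):
        if not res.get("id", "").startswith(resource_id):
            continue
        if res.get("active") != "true":
            continue
        role = res.get("role", "Started")
        for node in res.iter("node"):
            placement.setdefault(role, []).append(node.get("name"))
    return placement

With such a mapping, messages like "Resource 'stateful0' is slave on node rh70-node3" follow directly, and the exit status can be decided by comparing the observed role against what the command and its meta attributes actually allow.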
Created attachment 1003170 [details]
proposed fix 2
*** Bug 1186527 has been marked as a duplicate of this bug. ***
*** Bug 1191272 has been marked as a duplicate of this bug. ***
Tests:
https://bugzilla.redhat.com/show_bug.cgi?id=1188571#c2
https://bugzilla.redhat.com/show_bug.cgi?id=1186527#c1
https://bugzilla.redhat.com/show_bug.cgi?id=1191272#c2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-2290.html