Red Hat Bugzilla – Bug 1229822
[RFE] make "cluster setup --start", "cluster start" and "cluster standby" support --wait as well
Last modified: 2017-06-30 11:17:15 EDT
In clufter's pcs-commands output (meant for little-to-no-edit execution), an obstacle was found with "cluster setup --start" followed immediately by "cluster cib-push": apparently the cluster stack is not yet up and running at that point. The workaround is to put a fixed-time sleep in between: https://github.com/jnpkrn/clufter/commit/a197f61d4ebe404902a8693473ec2c71bd0967c0

A better solution would be to make pcs do a proper job and *not* return until the cluster has started (or to return with an appropriate exit status as soon as a failure is detected). The original bug, which requested traceability of the "Pacemaker being started" state on a node, might also be helpful internally for pcs to implement this feature:

+++ This bug was initially created as a clone of Bug #1195703 +++
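For illustration only, a schematic of the two flows (cluster name, node names, CIB file name and the sleep length are placeholders, not taken from the actual clufter output):

# current workaround: fixed-time sleep between starting the cluster and pushing the CIB
pcs cluster setup --start --name mycluster node1 node2
sleep 60
pcs cluster cib-push cib.xml

# desired behaviour: pcs itself blocks until the cluster is up, no arbitrary sleep needed
pcs cluster setup --start --wait --name mycluster node1 node2
pcs cluster cib-push cib.xml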
Also "pcs cluster standby <node>" might deserve --wait as per the wish expressed at #clusterlabs Freenode's channel (times in CEST): 17:02 < larsks> After putting a node into standby mode with 'pcs cluster standby <node>', is there a way to confirm that resources are no longer running on that node? Specifically, a way that could be used in an automated script (as opposed to, say, "visual inspection of pcs status output")... 17:04 < lge> depends on your resources, right? 17:04 < lge> confirm as in how: ask pacemaker if it believes it is done? 17:04 < lge> or double check all resources? 17:05 < larsks> lge: confirm as in "have pacemaker confirm that there are no longer any active resources on the node that was put into standby mode" 17:05 < lge> "poll" crm_mon. 17:06 < larsks> Hmm. crm_mon produces largely human-readable output, as opposed to something machine-parseable. I guess I can run 'pcs status xml' or 'cibadmin -Q' through an XML parser... 17:06 < larsks> All I really want is a --wait flag to 'pcs cluster standby' :) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 17:06 < lge> once crmadmin -S returns S_IDLE, pacemaker thinks it has nothing more to do. if that works for you... 17:07 < larsks> lge: huh, interesting. I'll try that out and see how that works. 17:08 < lge> need to point it against the DC, though. 17:11 < lge> and that does not tell you if all stop operations have been successful. 17:12 < lge> though, if stop fails, that is supposed to be escalated to fencing... and once it goes S_IDLE after fencing, you can be sure everything is stopped ;-) 17:12 < lge> that's the whole point of fencing, after all.
Another data point (IMHO) supporting "cluster start --wait": http://clusterlabs.org/pipermail/users/2015-August/001029.html
We probably want to add a "cluster stop --wait" as well (from bz1238874).
*** Bug 1238874 has been marked as a duplicate of this bug. ***
Created attachment 1133201 [details] proposed fix - standby
Created attachment 1133202 [details] proposed fix - start
For standby:

[root@rh72-node1:~]# pcs resource create delay delay startdelay=0 stopdelay=10
[root@rh72-node1:~]# pcs resource
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# time pcs node standby rh72-node2; pcs resource
real    0m0.172s
user    0m0.109s
sys     0m0.028s
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# pcs node unstandby --all
[root@rh72-node1:~]# pcs resource
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# time pcs node standby rh72-node2 --wait; pcs resource
real    0m12.226s
user    0m0.135s
sys     0m0.039s
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node1

Similarly for unstandby.

For start:

[root@rh72-node1:~]# pcs cluster stop --all
rh72-node1: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (corosync)...
rh72-node1: Stopping Cluster (corosync)...
[root@rh72-node1:~]# time pcs cluster start --all; pcs status nodes
rh72-node2: Starting Cluster...
rh72-node1: Starting Cluster...
real    0m1.264s
user    0m0.404s
sys     0m0.068s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: rh72-node1 rh72-node2
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
[root@rh72-node1:~]# pcs cluster stop --all
rh72-node2: Stopping Cluster (pacemaker)...
rh72-node1: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (corosync)...
rh72-node1: Stopping Cluster (corosync)...
[root@rh72-node1:~]# time pcs cluster start --all --wait; pcs status nodes
rh72-node1: Starting Cluster...
rh72-node2: Starting Cluster...
Waiting for node(s) to start...
rh72-node2: Started
rh72-node1: Started
real    0m24.463s
user    0m4.943s
sys     0m0.796s
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2
 Standby:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:

Similarly for 'pcs cluster start --wait', 'pcs cluster start node --wait', 'pcs cluster setup --start --wait' and 'pcs cluster node add node --start --wait'. In all cases it is possible to specify a timeout like this: --wait=<timeout>.

For stop: pcs stops the pacemaker service using a systemd or service call, which does not return until pacemaker is fully stopped. So I believe there is nothing to be done. If it does not work, please file another bz with a reproducer.
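For completeness, a usage sketch of the --wait=<timeout> form described above (60 seconds is an arbitrary example value):

pcs cluster start --all --wait=60
pcs node standby rh72-node2 --wait=60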
Great to see this addressed :) Upstream references for posterity:
https://github.com/feist/pcs/commit/9dc37994c94c6c8d03accd52c3c2f85df431f3ea
https://github.com/feist/pcs/commit/5ccd54fe7193683d4c161e1d2ce4ece66d5d881e

Can you advise on how to figure out, in a non-intrusive way, whether "cluster start" (which, I hope, implies the same handling for "cluster setup --start") indeed supports --wait? Unfortunately, "pcs cluster start --wait" does not complain when passed an unsupported switch, so I cannot look for an error exit code.
If no such compatibility mechanism is easily achieved, I'd request that specifying --wait=-1 (or any negative number) fail immediately, so that I can do at least some reasoning about what to use in a backward-compatible way.
I think the easiest thing to do is to check the pcs version by running 'pcs --version'. Any version higher than 0.9.149 should support this. Specifying --wait=-1 (or any invalid timeout) fails immediately:

[root@rh72-node1:~]# pcs cluster start --all --wait=-1
Error: -1 is not a valid number of seconds to wait
[root@rh72-node1:~]# echo $?
1
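A hedged sketch of such a version check (it assumes 'pcs --version' prints just the bare version number and uses 'sort -V' for the comparison):

# --wait is assumed to be available from 0.9.150 on ("higher than 0.9.149")
installed=$(pcs --version)
if [ "$(printf '%s\n' 0.9.150 "$installed" | sort -V | head -n1)" = "0.9.150" ]; then
    echo "pcs $installed supports --wait"
else
    echo "pcs $installed: fall back to a fixed sleep"
fi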
Thanks, here's the reflection on the clufter side: https://pagure.io/clufter/bfb870495ca4619d9f858cd9483cea83642dca58

Does a change in the emitted pcs commands akin to https://pagure.io/clufter/bfb870495ca4619d9f858cd9483cea83642dca58#diff-file-1 look sane in your opinion (the double backslashes are just a matter of notation)?
Yes, that looks fine to me.
This bug was accidentally moved from POST to MODIFIED via an error in automation; please contact mmccune@redhat.com with any questions.
Setup:
[vm-rhel72-1 ~] $ pcs resource create delay delay startdelay=0 stopdelay=10

Before fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.143-15.el7.x86_64

A) standby

A1) pcs node standby
There is no command "pcs node".

A2) pcs cluster standby
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-1; pcs resource
real    0,23s
user    0,12s
sys     0,07s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ pcs cluster unstandby --all
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-1 --wait; pcs resource
real    0,18s
user    0,10s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3

B) start
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
real    1,40s
user    0,39s
sys     0,07s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all --wait; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
real    1,43s
user    0,36s
sys     0,09s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:

After fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.151-1.el7.x86_64

A) standby

A1) pcs node standby
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs node standby vm-rhel72-3; pcs resource
real    0,24s
user    0,13s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs node unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs node standby vm-rhel72-3 --wait; pcs resource
real    12,27s
user    0,16s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1

A2) pcs cluster standby
[vm-rhel72-1 ~] $ pcs node unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-3; pcs resource
real    0,24s
user    0,14s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs cluster unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-3 --wait; pcs resource
real    12,28s
user    0,16s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1

B) start
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
real    1,70s
user    0,54s
sys     0,12s
Pacemaker Nodes:
 Online:
 Standby:
 Maintenance:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all --wait; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
Waiting for node(s) to start...
vm-rhel72-3: Started
vm-rhel72-1: Started
real    26,08s
user    4,44s
sys     0,77s
Pacemaker Nodes:
 Online: vm-rhel72-1
 Standby: vm-rhel72-3
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2596.html