Bug 1229822

Summary: [RFE] make "cluster setup --start", "cluster start" and "cluster standby" support --wait as well
Product: Red Hat Enterprise Linux 7
Reporter: Jan Pokorný [poki] <jpokorny>
Component: pcs
Assignee: Tomas Jelinek <tojeline>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Docs Contact:
Priority: medium
Version: 7.2
CC: abeekhof, cfeist, cluster-maint, cluster-qe, fdinitto, idevat, michele, phagara, royoung, rsteiger, tojeline
Target Milestone: rc
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pcs-0.9.151-1.el7
Doc Type: Enhancement
Doc Text:
Feature: Add an option which makes pcs wait for nodes to start or go to/from standby mode.
Reason: If one wants to be sure nodes have fully started or gone to/from standby mode, it is required to periodically check output of 'pcs status', which cannot be easily done e.g. in scripts which call pcs.
Result: --wait option added to 'pcs cluster start', 'pcs cluster setup --start', 'pcs cluster node add --start', 'pcs node standby' and 'pcs node unstandby' commands.
Story Points: ---
Clone Of: 1195703
: 1229826 (view as bug list)
Environment:
Last Closed: 2016-11-03 20:54:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1229826, 1251196
Attachments:
Description             Flags
proposed fix - standby  none
proposed fix - start    none

Description Jan Pokorný [poki] 2015-06-09 17:42:51 UTC
In clufter's pcs-commands output (meant for little-to-no-edit execution),
an obstacle was found with "cluster setup --start" followed immediately
by "cluster cib-push": apparently the cluster stack is not yet up and
running by the time the cib-push is issued.

The workaround is to put a fixed-time sleep in between:
https://github.com/jnpkrn/clufter/commit/a197f61d4ebe404902a8693473ec2c71bd0967c0

A better solution would be to make pcs do a proper job and *not* return
until the cluster has started (or return with an appropriate exit status
as soon as a failure is detected).
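
For concreteness, a hedged sketch of both shapes of the emitted command
sequence (cluster name, node names, CIB file name and the sleep duration
are all illustrative):

# workaround: fixed-time sleep between starting the cluster and pushing the CIB
pcs cluster setup --start --name mycluster node1 node2
sleep 60
pcs cluster cib-push cib.xml

# requested behavior: pcs itself blocks until the cluster stack is up
pcs cluster setup --start --wait --name mycluster node1 node2
pcs cluster cib-push cib.xml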


The original bug requested traceability of the "Pacemaker being started"
state on the node, which might be helpful internally for pcs when
implementing this feature as well:

+++ This bug was initially created as a clone of Bug #1195703 +++

Comment 2 Jan Pokorný [poki] 2015-07-27 16:31:21 UTC
Also "pcs cluster standby <node>" might deserve --wait as per the wish
expressed at #clusterlabs Freenode's channel (times in CEST):

17:02 < larsks> After putting a node into standby mode with 'pcs cluster
                standby <node>', is there a way to confirm that
                resources are no longer running on that node?
                Specifically, a way that could be used in an automated
                script (as opposed to, say, "visual inspection of pcs
                status output")...
17:04 < lge> depends on your resources, right?
17:04 < lge> confirm as in how: ask pacemaker if it believes it is done?
17:04 < lge> or double check all resources?
17:05 < larsks> lge: confirm as in "have pacemaker confirm that there are
                no longer any active resources on the node that was put
                into standby mode"
17:05 < lge> "poll" crm_mon.
17:06 < larsks> Hmm.  crm_mon produces largely human-readable output, as
                opposed to something machine-parseable.  I guess I can
                run 'pcs status xml' or 'cibadmin -Q' through an XML
                parser...
17:06 < larsks> All I really want is a --wait flag to
                'pcs cluster standby' :)
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
17:06 < lge> once crmadmin -S returns S_IDLE, pacemaker thinks it has
             nothing more to do. if that works for you...
17:07 < larsks> lge: huh, interesting.  I'll try that out and see how
                that works.
17:08 < lge> need to point it against the DC, though.
17:11 < lge> and that does not tell you if all stop operations have been
             successful.
17:12 < lge> though, if stop fails, that is supposed to be escalated to
             fencing... and once it goes S_IDLE after fencing, you can
             be sure everything is stopped ;-)
17:12 < lge> that's the whole point of fencing, after all.
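
For reference, the polling idea from the log could be scripted roughly as
follows until a real --wait exists (a hedged sketch: the node name is
illustrative and the grep/awk usage assumes the usual crmadmin output
wording):

NODE=rh72-node2                        # node being put into standby
pcs cluster standby "$NODE"
DC=$(crmadmin -D | awk '{print $NF}')  # crmadmin -S has to be pointed at the DC
until crmadmin -S "$DC" | grep -q S_IDLE; do
    sleep 2                            # poll until the DC reports it has nothing left to do
done
# caveat from the log: S_IDLE alone does not prove every stop succeeded,
# only that pacemaker (after any fencing) believes it is done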

Comment 3 Jan Pokorný [poki] 2015-08-13 08:05:10 UTC
Another data point (IMHO) supporting "cluster start --wait":
http://clusterlabs.org/pipermail/users/2015-August/001029.html

Comment 4 Chris Feist 2015-09-23 22:12:37 UTC
We probably want to add a 'cluster stop --wait' as well (from bz1238874).

Comment 5 Radek Steiger 2016-02-03 15:49:37 UTC
*** Bug 1238874 has been marked as a duplicate of this bug. ***

Comment 6 Tomas Jelinek 2016-03-04 15:30:17 UTC
Created attachment 1133201 [details]
proposed fix - standby

Comment 7 Tomas Jelinek 2016-03-04 15:30:40 UTC
Created attachment 1133202 [details]
proposed fix - start

Comment 8 Tomas Jelinek 2016-03-04 15:46:44 UTC
For standby:

[root@rh72-node1:~]# pcs resource create delay delay startdelay=0 stopdelay=10
[root@rh72-node1:~]# pcs resource
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# time pcs node standby rh72-node2; pcs resource
real    0m0.172s
user    0m0.109s
sys     0m0.028s
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# pcs node unstandby --all
[root@rh72-node1:~]# pcs resource
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# time pcs node standby rh72-node2 --wait; pcs resource
real    0m12.226s
user    0m0.135s
sys     0m0.039s
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node1

Similarly for unstandby.



For start:

[root@rh72-node1:~]# pcs cluster stop --all
rh72-node1: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (corosync)...
rh72-node1: Stopping Cluster (corosync)...
[root@rh72-node1:~]# time pcs cluster start --all; pcs status nodes
rh72-node2: Starting Cluster...
rh72-node1: Starting Cluster...
real    0m1.264s
user    0m0.404s
sys     0m0.068s
Pacemaker Nodes:
 Online: 
 Standby: 
 Offline: rh72-node1 rh72-node2 
Pacemaker Remote Nodes:
 Online: 
 Standby: 
 Offline: 
[root@rh72-node1:~]# pcs cluster stop --all
rh72-node2: Stopping Cluster (pacemaker)...
rh72-node1: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (corosync)...
rh72-node1: Stopping Cluster (corosync)...
[root@rh72-node1:~]# time pcs cluster start --all --wait; pcs status nodes
rh72-node1: Starting Cluster...
rh72-node2: Starting Cluster...
Waiting for node(s) to start...
rh72-node2: Started
rh72-node1: Started
real    0m24.463s
user    0m4.943s
sys     0m0.796s
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2 
 Standby: 
 Offline: 
Pacemaker Remote Nodes:
 Online: 
 Standby: 
 Offline:

Similarly for 'pcs cluster start --wait', 'pcs cluster start node --wait', 'pcs cluster setup --start --wait' and 'pcs cluster node add node --start --wait'.


In all cases it is possible to specify a timeout like this: --wait=<timeout>.
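
For example (the 60-second value is purely illustrative):
pcs cluster start --all --wait=60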



For stop:
Pcs stops the pacemaker service using a systemd (or service) call, which does not return until pacemaker is fully stopped, so I believe there is nothing to be done here. If it does not work, please file another bz with a reproducer.
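
A quick way to see this on a systemd host (hedged sketch; 'is-active' just queries the unit states right after the stop returns):
pcs cluster stop
systemctl is-active pacemaker corosync   # both should report 'inactive' immediately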

Comment 9 Jan Pokorný [poki] 2016-03-07 18:18:10 UTC
Great to see this addressed :)

Upstream references for posterity:
https://github.com/feist/pcs/commit/9dc37994c94c6c8d03accd52c3c2f85df431f3ea
https://github.com/feist/pcs/commit/5ccd54fe7193683d4c161e1d2ce4ece66d5d881e

Can you advise on how to figure out, in a non-intrusive way, that
"cluster start" (which, I hope, implies the same handling for
"cluster setup --start") indeed supports --wait?
Unfortunately "pcs cluster start --wait" is pretty dumb about being passed
an unsupported switch, so I cannot look for a possible error exit code.

Comment 10 Jan Pokorný [poki] 2016-03-07 18:28:56 UTC
If no such compatibility mechanism is easily achievable, I'd request that
specifying --wait=-1 (or any negative number) fail immediately, so that
I can do at least some reasoning about what to use in a backward-compatible
way.

Comment 11 Tomas Jelinek 2016-03-08 08:12:13 UTC
I think the easiest thing to do is to check the pcs version by running 'pcs --version'. Any version higher than 0.9.149 should support this.

Specifying --wait=-1 (or any invalid timeout) fails immediately:
[root@rh72-node1:~]# pcs cluster start --all --wait=-1
Error: -1 is not a valid number of seconds to wait
[root@rh72-node1:~]# echo $?
1
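
For scripting, the version check could look roughly like this (a hedged
sketch: it assumes 'pcs --version' prints a bare version string, that
sort -V is available, and uses 0.9.150 as the threshold per "higher than
0.9.149"):

installed=$(pcs --version)
lowest=$(printf '%s\n%s\n' "$installed" 0.9.150 | sort -V | head -n1)
if [ "$lowest" = 0.9.150 ]; then
    wait_opt="--wait"   # installed pcs is 0.9.150 or newer
else
    wait_opt=""         # older pcs: fall back to a fixed sleep after start
fi
pcs cluster start --all $wait_opt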

Comment 12 Jan Pokorný [poki] 2016-03-09 19:39:45 UTC
Thanks, here's the reflection on the clufter side:
https://pagure.io/clufter/bfb870495ca4619d9f858cd9483cea83642dca58

Does a change in the emitted pcs commands akin to
https://pagure.io/clufter/bfb870495ca4619d9f858cd9483cea83642dca58#diff-file-1
look sane to you (the double backslashes are just a matter of notation)?

Comment 13 Tomas Jelinek 2016-03-10 13:57:24 UTC
Yes, that looks fine to me.

Comment 14 Mike McCune 2016-03-28 23:40:50 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation; please contact mmccune with any questions.

Comment 15 Ivan Devat 2016-05-31 12:27:08 UTC
Setup:
[vm-rhel72-1 ~] $ pcs resource create delay delay startdelay=0 stopdelay=10

Before fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.143-15.el7.x86_64

A) standby
A1) pcs node standby

There is no command "pcs node".

A2) pcs cluster standby

[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-1; pcs resource

real    0,23s
user    0,12s
sys     0,07s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ pcs cluster unstandby --all
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-1 --wait; pcs resource

real    0,18s
user    0,10s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3

B) start

[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...

real    1,40s
user    0,39s
sys     0,07s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all --wait; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...

real    1,43s
user    0,36s
sys     0,09s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:

After fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.151-1.el7.x86_64

A) standby
A1) pcs node standby

[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs node standby vm-rhel72-3; pcs resource

real    0,24s
user    0,13s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs node unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs node standby vm-rhel72-3 --wait; pcs resource

real    12,27s
user    0,16s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1


A2) pcs cluster standby

[vm-rhel72-1 ~] $ pcs node unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource                   
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-3; pcs resource                                                                                                                                                                                                       

real    0,24s
user    0,14s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs cluster unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-3 --wait; pcs resource

real    12,28s
user    0,16s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1


B) start
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...

real    1,70s
user    0,54s
sys     0,12s
Pacemaker Nodes:
 Online:
 Standby:
 Maintenance:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all --wait; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
Waiting for node(s) to start...
vm-rhel72-3: Started
vm-rhel72-1: Started

real    26,08s
user    4,44s
sys     0,77s
Pacemaker Nodes:
 Online: vm-rhel72-1
 Standby: vm-rhel72-3
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

Comment 22 errata-xmlrpc 2016-11-03 20:54:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html