Bug 1229822
Summary: [RFE] make "cluster setup --start", "cluster start" and "cluster standby" support --wait as well

| Field | Value |
| --- | --- |
| Product | Red Hat Enterprise Linux 7 |
| Component | pcs |
| Version | 7.2 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | medium |
| Reporter | Jan Pokorný [poki] <jpokorny> |
| Assignee | Tomas Jelinek <tojeline> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | abeekhof, cfeist, cluster-maint, cluster-qe, fdinitto, idevat, michele, phagara, royoung, rsteiger, tojeline |
| Target Milestone | rc |
| Keywords | FutureFeature |
| Fixed In Version | pcs-0.9.151-1.el7 |
| Doc Type | Enhancement |
| Clone Of | 1195703 |
| Clones | 1229826 (view as bug list) |
| Bug Blocks | 1229826, 1251196 |
| Type | Bug |
| Last Closed | 2016-11-03 20:54:29 UTC |
Doc Text:
Feature: Add an option that makes pcs wait for nodes to start or to go to/from standby mode.
Reason: To be sure nodes have fully started or gone to/from standby mode, one otherwise has to periodically check the output of 'pcs status', which is not easily done e.g. in scripts that call pcs.
Result: The --wait option was added to the 'pcs cluster start', 'pcs cluster setup --start', 'pcs cluster node add --start', 'pcs node standby' and 'pcs node unstandby' commands.
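For orientation, this is the usage shape of the resulting option on the commands the Result text lists; the node name and the 60-second timeout are illustrative values only:

pcs cluster start --all --wait         # block until the nodes have started
pcs cluster start --all --wait=60      # same, but give up after 60 seconds
pcs node standby rh72-node2 --wait     # block until resources have moved off
pcs node unstandby --all --wait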
Description (Jan Pokorný [poki], 2015-06-09 17:42:51 UTC):
Also "pcs cluster standby <node>" might deserve --wait as per the wish
expressed at #clusterlabs Freenode's channel (times in CEST):
17:02 < larsks> After putting a node into standby mode with 'pcs cluster standby <node>', is there a way to confirm that resources are no longer running on that node? Specifically, a way that could be used in an automated script (as opposed to, say, "visual inspection of pcs status output")...
17:04 < lge> depends on your resources, right?
17:04 < lge> confirm as in how: ask pacemaker if it believes it is done?
17:04 < lge> or double check all resources?
17:05 < larsks> lge: confirm as in "have pacemaker confirm that there are no longer any active resources on the node that was put into standby mode"
17:05 < lge> "poll" crm_mon.
17:06 < larsks> Hmm. crm_mon produces largely human-readable output, as opposed to something machine-parseable. I guess I can run 'pcs status xml' or 'cibadmin -Q' through an XML parser...
17:06 < larsks> All I really want is a --wait flag to
'pcs cluster standby' :)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
17:06 < lge> once crmadmin -S returns S_IDLE, pacemaker thinks it has nothing more to do. if that works for you...
17:07 < larsks> lge: huh, interesting. I'll try that out and see how that works.
17:08 < lge> need to point it against the DC, though.
17:11 < lge> and that does not tell you if all stop operations have been successful.
17:12 < lge> though, if stop fails, that is supposed to be escalated to fencing... and once it goes S_IDLE after fencing, you can be sure everything is stopped ;-)
17:12 < lge> that's the whole point of fencing, after all.
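For posterity, a minimal shell sketch of the two workarounds discussed above, as one could have used them before --wait existed. It is illustrative only: the one-line output format of 'crmadmin -D' that the awk call assumes, and the crm_mon XML attributes queried from the 'pcs status xml' output, are assumptions about those tools, not anything this bug guarantees.

# Variant 1 (lge): poll the DC's crmd until it reports S_IDLE.
dc=$(crmadmin -D | awk '{print $NF}')   # assumes "Designated Controller is: <node>"
until crmadmin -S "$dc" | grep -q S_IDLE; do
    sleep 1
done

# Variant 2 (larsks): run 'pcs status xml' through an XML parser and wait
# until no active resource instance reports the node any more.
node=rh72-node2
until [ "$(pcs status xml 2>/dev/null | xmllint --xpath \
    "count(//resource[@active='true']/node[@name='$node'])" -)" = "0" ]; do
    sleep 1
done

As lge points out, S_IDLE alone does not prove every stop succeeded; a failed stop is escalated to fencing first.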
Another data point (IMHO) supporting "cluster start --wait":
http://clusterlabs.org/pipermail/users/2015-August/001029.html

We also probably want to add a "cluster stop --wait" as well (from bz1238874).

*** Bug 1238874 has been marked as a duplicate of this bug. ***

Created attachment 1133201 [details]
proposed fix - standby
Created attachment 1133202 [details]
proposed fix - start
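The attachments above carry the actual patches. Purely as a hypothetical illustration of what "waiting for nodes to start" involves, a loop like the following could poll the status XML until each node reports itself online; the online="true" attribute and its placement in the 'pcs status xml' output are assumptions of this sketch, not taken from the patches:

# Hypothetical sketch: wait until every listed node is reported online.
for n in rh72-node1 rh72-node2; do
    until pcs status xml 2>/dev/null | grep -Eq "<node name=\"$n\"[^>]*online=\"true\""; do
        sleep 1
    done
done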
For standby:

[root@rh72-node1:~]# pcs resource create delay delay startdelay=0 stopdelay=10
[root@rh72-node1:~]# pcs resource
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# time pcs node standby rh72-node2; pcs resource
real    0m0.172s
user    0m0.109s
sys     0m0.028s
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# pcs node unstandby --all
[root@rh72-node1:~]# pcs resource
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# time pcs node standby rh72-node2 --wait; pcs resource
real    0m12.226s
user    0m0.135s
sys     0m0.039s
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node1

Similarly for unstandby.

For start:

[root@rh72-node1:~]# pcs cluster stop --all
rh72-node1: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (corosync)...
rh72-node1: Stopping Cluster (corosync)...
[root@rh72-node1:~]# time pcs cluster start --all; pcs status nodes
rh72-node2: Starting Cluster...
rh72-node1: Starting Cluster...
real    0m1.264s
user    0m0.404s
sys     0m0.068s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: rh72-node1 rh72-node2
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
[root@rh72-node1:~]# pcs cluster stop --all
rh72-node2: Stopping Cluster (pacemaker)...
rh72-node1: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (corosync)...
rh72-node1: Stopping Cluster (corosync)...
[root@rh72-node1:~]# time pcs cluster start --all --wait; pcs status nodes
rh72-node1: Starting Cluster...
rh72-node2: Starting Cluster...
Waiting for node(s) to start...
rh72-node2: Started
rh72-node1: Started
real    0m24.463s
user    0m4.943s
sys     0m0.796s
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2
 Standby:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:

Similarly for 'pcs cluster start --wait', 'pcs cluster start node --wait', 'pcs cluster setup --start --wait' and 'pcs cluster node add node --start --wait'. In all cases it is possible to specify a timeout like this: --wait=<timeout>.

For stop: pcs stops the pacemaker service using a systemd or service call, which does not return until pacemaker is fully stopped. So I believe there is nothing to be done. If it does not work, please file another bz with a reproducer.

Great to see this addressed :)

Upstream references for posterity:
https://github.com/feist/pcs/commit/9dc37994c94c6c8d03accd52c3c2f85df431f3ea
https://github.com/feist/pcs/commit/5ccd54fe7193683d4c161e1d2ce4ece66d5d881e

Can you advise on how to figure out, in a non-intrusive way, that "cluster start" (which, I hope, implies the same handling for "cluster setup --start") indeed supports --wait? Unfortunately "pcs cluster start --wait" is pretty dumb about being passed an unsupported switch, so I cannot look for a possible error exit code. If no such compatibility mechanism is easily achievable, I'd request that specifying --wait=-1 (or any negative number) fail immediately, so that I can do at least some reasoning about what to use in a backward compatible way.

I think the easiest thing to do is to check the pcs version by running 'pcs --version'. Any version higher than 0.9.149 should support this. Specifying --wait=-1 (or any invalid timeout) fails immediately:

[root@rh72-node1:~]# pcs cluster start --all --wait=-1
Error: -1 is not a valid number of seconds to wait
[root@rh72-node1:~]# echo $?
1
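A sketch of the suggested capability check; the 'sort -V' comparison and the bare version string assumed from 'pcs --version' are conveniences of this example, not something pcs documents here:

# Use --wait only when the installed pcs is newer than 0.9.149.
ver=$(pcs --version)    # e.g. "0.9.151"
if [ "$ver" != "0.9.149" ] && \
   [ "$(printf '%s\n' "$ver" 0.9.149 | sort -V | head -n1)" = "0.9.149" ]; then
    pcs cluster start --all --wait
else
    pcs cluster start --all
fi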
Thanks, here's the reflection on the clufter side:
https://pagure.io/clufter/bfb870495ca4619d9f858cd9483cea83642dca58

Does a change in emitted pcs commands akin to
https://pagure.io/clufter/bfb870495ca4619d9f858cd9483cea83642dca58#diff-file-1
look sane in your opinion (double backslashes are just a matter of notation)?

Yes, that looks fine to me.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Setup:

[vm-rhel72-1 ~] $ pcs resource create delay delay startdelay=0 stopdelay=10

Before fix:

[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.143-15.el7.x86_64

A) standby

A1) pcs node standby
There is no command "pcs node".

A2) pcs cluster standby
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-1; pcs resource
real    0,23s
user    0,12s
sys     0,07s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ pcs cluster unstandby --all
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-1 --wait; pcs resource
real    0,18s
user    0,10s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3

B) start

[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
real    1,40s
user    0,39s
sys     0,07s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all --wait; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
real    1,43s
user    0,36s
sys     0,09s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:

After fix:

[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.151-1.el7.x86_64

A) standby

A1) pcs node standby
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs node standby vm-rhel72-3; pcs resource
real    0,24s
user    0,13s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs node unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs node standby vm-rhel72-3 --wait; pcs resource
real    12,27s
user    0,16s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1

A2) pcs cluster standby
[vm-rhel72-1 ~] $ pcs node unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-3; pcs resource
real    0,24s
user    0,14s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs cluster unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-3 --wait; pcs resource
real    12,28s
user    0,16s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1

B) start

[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
real    1,70s
user    0,54s
sys     0,12s
Pacemaker Nodes:
 Online:
 Standby:
 Maintenance:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all --wait; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
Waiting for node(s) to start...
vm-rhel72-3: Started
vm-rhel72-1: Started
real    26,08s
user    4,44s
sys     0,77s
Pacemaker Nodes:
 Online: vm-rhel72-1
 Standby: vm-rhel72-3
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html