Red Hat Bugzilla – Bug 1188571
The --wait functionality implementation needs an overhaul
Last modified: 2016-05-27 09:23:26 EDT
In bug 1156311 a new --wait flag has been integrated into various pcs commands. The implementation, however, is not ideal and fails in various complex situations. Multiple issues have been discovered while testing the current implementation, and given the relatively high number of these issues an overhaul is warranted. The following problems have been discovered so far as of pcs-0.9.137-13.el7 (some disclosed in bug 1156311 and some also reported as bug 1186527).

> Case sensitivity is not taken into account:
[root@virt-010 ~]# time pcs resource create dummy1 Dummy --wait meta target-role=stopped
Error: unable to start: 'dummy1', please check logs for failure information
waiting timed out
Resource 'dummy1' is not running on any node

real    0m20.690s
user    0m4.094s
sys     0m1.669s

> When the number of clones / master roles is limited to 0, waiting fails. This should be filtered the same way as creating a disabled resource:
[root@virt-010 ~]# pcs resource create dummy1 Dummy --clone meta clone-max=0 --wait
Error: unable to start: 'dummy1', please check logs for failure information
waiting timed out
Resource 'dummy1' is not running on any node

[root@virt-010 ~]# pcs resource create stateful0 Stateful --master meta master-max=1 clone-max=0 --wait=10
Error: unable to start: 'stateful0', please check logs for failure information
waiting timed out
Resource 'stateful0' is not running on any node

> When only slaves are allowed, waiting fails as well, although the resource is actually running, as seen on the last line:
[root@virt-010 ~]# pcs resource create stateful0 Stateful --master meta master-max=0 clone-max=1 --wait=10
Error: unable to start: 'stateful0', please check logs for failure information
waiting timed out
Resource 'stateful0' is slave on node virt-012.

> When one of the nodes is stopped, --wait reports a fake failure because one or more nodes are not running (i.e. configured, but stopped):
[root@virt-010 ~]# pcs resource create dummy0 Dummy --clone --wait
Error: unable to start: 'dummy0', please check logs for failure information
waiting timed out
Resource 'dummy0' is running on nodes virt-010, virt-011.

> When moving from the "banned-on-all-nodes" state to a specific node, which clears the ban on that node and therefore allows the resource to start again, --wait still complains and doesn't wait:
[root@virt-010 ~]# time pcs resource move delay0 virt-011 --wait
Warning: Cannot use '--wait' on non-running resources

real    0m0.494s
user    0m0.194s
sys     0m0.043s

> If the resource is allowed to run on a single node and banned on all other nodes, using the --wait flag still causes pcs to wait for it to start elsewhere, which will never happen:
[root@virt-010 ~]# time pcs resource move dummy0 --wait
Error: Unable to start 'dummy0'
waiting timed out
Resource 'dummy0' is not running on any node

real    0m40.899s
user    0m8.723s
sys     0m3.331s

[root@virt-010 ~]# time pcs resource ban dummy0 virt-012 --wait
Error: Unable to start 'dummy0'
waiting timed out
Resource 'dummy0' is not running on any node

real    0m40.962s
user    0m8.694s
sys     0m3.342s

> Cloning a resource when one of the nodes is stopped fails:
[root@virt-010 ~]# pcs resource create delay0 Delay startdelay=3
[root@virt-010 ~]# time pcs resource clone delay0 --wait
Error: Unable to start clones of 'delay0'
waiting timed out
Resource 'delay0' is running on nodes virt-010, virt-011, virt-012.

real    0m30.747s
user    0m6.272s
sys     0m2.376s

> When the number of clones / master roles is limited to 0, waiting fails:
[root@virt-010 ~]# time pcs resource master stateful0 master-max=0 --wait
Error: unable to promote 'stateful0'
waiting timed out
Resource 'stateful0' is slave on nodes virt-010, virt-011, virt-012.

real    0m20.916s
user    0m4.376s
sys     0m1.673s

[root@virt-010 ~]# time pcs resource clone dummy0 meta clone-max=0 --wait
Error: Unable to start clones of 'dummy0'
waiting timed out
Resource 'dummy0' is not running on any node

real    0m20.861s
user    0m4.163s
sys     0m1.555s

> Referencing the clone itself should not wait for something that never happens; also, the error given is a bit misleading:
[root@virt-010 ~]# time pcs resource meta group0-clone target-role=Stopped --wait
Error: Unable to start 'group0-clone'
waiting timed out
Resource 'group0-clone' is running on nodes virt-010, virt-011, virt-012, virt-016.

real    0m21.240s
user    0m4.818s
sys     0m1.713s

> The usual problem arises with a clone and one node stopped:
[root@virt-010 ~]# time pcs resource meta group0 target-role=Stopped --wait
Resource 'group0' is not running on any node

real    0m10.792s
user    0m2.655s
sys     0m0.936s

[root@virt-010 ~]# time pcs resource meta group0 target-role=Started --wait
Error: Unable to start 'group0'
waiting timed out
Resource 'group0' is running on nodes virt-010, virt-011, virt-012.

real    0m21.165s
user    0m4.764s
sys     0m1.751s
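Most of the cases above come down to two gaps in the wait logic: it does not recognize configurations in which the wait can never succeed (target-role=stopped in any letter case, clone-max=0, a ban on every node), and for clones it keeps waiting for instances on nodes that are offline. Below is a minimal Python sketch of the kind of pre-check and polling loop this implies; nodes_running_on and online_nodes are hypothetical inputs, and this is not the actual pcs code.

import time

def normalize_meta(meta):
    # pcs meta attributes are effectively case-insensitive, so
    # 'target-role=stopped' has to be treated like 'target-role=Stopped'.
    return {k.lower(): str(v).lower() for k, v in meta.items()}

def wait_is_pointless(meta):
    # Return the offending attribute if waiting for a start can never
    # succeed with these meta attributes, otherwise None.
    meta = normalize_meta(meta)
    if meta.get("target-role") == "stopped":
        return "target-role=stopped"
    if meta.get("clone-max") == "0":
        return "clone-max=0"
    return None

def wait_for_start(resource_id, nodes_running_on, online_nodes,
                   meta=None, is_clone=False, timeout=60, interval=2):
    # nodes_running_on is a hypothetical callback returning the set of
    # nodes the resource is currently active on.
    reason = wait_is_pointless(meta or {})
    if reason is not None:
        raise ValueError("Cannot use '--wait' together with '%s'" % reason)
    deadline = time.time() + timeout
    while True:
        running = set(nodes_running_on(resource_id))
        if is_clone:
            # A clone counts as started once it runs on every *online*
            # node; configured-but-stopped nodes must not be waited for.
            done = bool(online_nodes) and running >= set(online_nodes)
        else:
            done = bool(running)
        if done:
            return sorted(running)
        if time.time() >= deadline:
            raise TimeoutError("waiting timed out")
        time.sleep(interval)

Rejecting impossible combinations up front (rather than timing out) matches how a disabled resource is already handled, and is what several of the cases above ask for.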
Created attachment 1002834 [details]
proposed fix

Tests:

> Case sensitivity is not taken into account:
[root@rh70-node1:~]# time pcs resource create dummy1 Dummy --wait meta target-role=stopped
Error: Cannot use '--wait' together with 'target-role=stopped'

real    0m0.104s
user    0m0.084s
sys     0m0.020s

> When the number of clones / master roles is limited to 0, waiting fails. This should be filtered the same way as creating a disabled resource:
[root@rh70-node1:~]# pcs resource create dummy1 Dummy --clone meta clone-max=0 --wait
Error: Cannot use '--wait' together with 'clone-max=0'
[root@rh70-node1:~]# pcs resource create stateful0 Stateful --master meta master-max=1 clone-max=0 --wait=10
Error: Cannot use '--wait' together with 'clone-max=0'

> When only slaves are allowed, waiting fails as well, although the resource is actually running, as seen on the last line:
[root@rh70-node1:~]# pcs resource create stateful0 Stateful --master meta master-max=0 clone-max=1 --wait=10
Warning: changing a monitor operation interval from 10 to 11 to make the operation unique
Resource 'stateful0' is slave on node rh70-node3.
[root@rh70-node1:~]# echo $?
0

> When one of the nodes is stopped, --wait reports a fake failure because one or more nodes are not running (i.e. configured, but stopped):
[root@rh70-node1:~]# pcs status nodes
Pacemaker Nodes:
 Online: rh70-node1 rh70-node2
 Standby:
 Offline: rh70-node3
[root@rh70-node1:~]# pcs resource create dummy0 Dummy --clone --wait
Resource 'dummy0' is running on nodes rh70-node1, rh70-node2.
[root@rh70-node1:~]# echo $?
0

> When moving from the "banned-on-all-nodes" state to a specific node, which clears the ban on that node and therefore allows the resource to start again, --wait still complains and doesn't wait:
[root@rh70-node1:~]# time pcs resource move delay0 rh70-node1 --wait
Resource 'delay0' is running on node rh70-node1.

real    0m8.257s
user    0m0.119s
sys     0m0.042s
[root@rh70-node1:~]# echo $?
0

> If the resource is allowed to run on a single node and banned on all other nodes, using the --wait flag still causes pcs to wait for it to start elsewhere, which will never happen:
[root@rh70-node1:~]# time pcs resource move dummy0 --wait
Error: Resource 'dummy0' is not running on any node

real    0m2.177s
user    0m0.113s
sys     0m0.043s
[root@rh70-node1:~]# time pcs resource ban dummy0 --wait
Error: Resource 'dummy0' is not running on any node

real    0m2.166s
user    0m0.104s
sys     0m0.041s

> Cloning a resource when one of the nodes is stopped fails:
[root@rh70-node1:~]# pcs status nodes
Pacemaker Nodes:
 Online: rh70-node1 rh70-node2
 Standby:
 Offline: rh70-node3
[root@rh70-node1:~]# pcs resource create delay0 Delay startdelay=3
[root@rh70-node1:~]# time pcs resource clone delay0 --wait
Resource 'delay0-clone' is running on nodes rh70-node1, rh70-node2.

real    0m8.201s
user    0m0.136s
sys     0m0.038s
[root@rh70-node1:~]# echo $?
0

> When the number of clones / master roles is limited to 0, waiting fails:
[root@rh70-node1:~]# pcs resource create stateful0 stateful
[root@rh70-node1:~]# time pcs resource master stateful0 master-max=0 --wait
Resource 'stateful0-master' is slave on nodes rh70-node1, rh70-node2, rh70-node3.

real    0m2.164s
user    0m0.108s
sys     0m0.034s
[root@rh70-node1:~]# echo $?
0

> Referencing the clone itself should not wait for something that never happens; also, the error given is a bit misleading:
[root@rh70-node1:~]# pcs resource create dummy0 dummy
[root@rh70-node1:~]# pcs resource group add group0 dummy0
[root@rh70-node1:~]# pcs resource clone group0
[root@rh70-node1:~]# time pcs resource meta group0-clone target-role=Stopped --wait
Resource 'group0-clone' is not running on any node

real    0m2.169s
user    0m0.109s
sys     0m0.033s
[root@rh70-node1:~]# echo $?
0

> The usual problem arises with a clone and one node stopped:
[root@rh70-node1:~]# pcs status nodes
Pacemaker Nodes:
 Online: rh70-node1 rh70-node2
 Standby:
 Offline: rh70-node3
[root@rh70-node1:~]# pcs resource create dummy0 dummy --group group0
[root@rh70-node1:~]# pcs resource clone group0
[root@rh70-node1:~]# time pcs resource meta group0 target-role=Stopped --wait
Resource 'group0' is not running on any node

real    0m2.172s
user    0m0.121s
sys     0m0.023s
[root@rh70-node1:~]# time pcs resource meta group0 target-role=Started --wait
Resource 'group0' is running on nodes rh70-node1, rh70-node2.

real    0m2.153s
user    0m0.100s
sys     0m0.036s
[root@rh70-node1:~]# echo $?
0
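The fixed behaviour above reports where the resource actually ended up (started on the online nodes only, or slave when master-max=0) and exits 0 instead of timing out. That information is available from Pacemaker's status XML. The following is a rough sketch of pulling it out with Python's standard library; 'pcs status xml' wraps 'crm_mon --as-xml', and the element and attribute names used here are assumptions about that output that may differ between Pacemaker versions.

import subprocess
import xml.etree.ElementTree as ET

def cluster_status():
    # Dump Pacemaker's status XML instead of scraping human-readable output.
    return ET.fromstring(subprocess.check_output(["pcs", "status", "xml"]))

def online_nodes(status):
    # Node elements carry online/standby flags (assumed attribute names).
    return [
        n.get("name")
        for n in status.iter("node")
        if n.get("online") == "true" and n.get("standby") == "false"
    ]

def resource_placement(status, resource_id):
    # Map role ("Started", "Slave", "Master", ...) to the nodes hosting
    # each active instance; clone instances appear as multiple
    # <resource> elements, possibly with ':N' suffixes on the id.
    placement = {}
    for res in status.iter("resource"):
        if not res.get("id", "").startswith(resource_id):
            continue
        if res.get("active") != "true":
            continue
        role = res.get("role", "Started")
        for node in res.iter("node"):
            placement.setdefault(role, []).append(node.get("name"))
    return placement

With such a mapping, messages like "Resource 'stateful0' is slave on node rh70-node3" follow directly, and the exit status can be decided by comparing the observed role against what the command and its meta attributes actually allow.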
Created attachment 1003170 [details]
proposed fix 2
*** Bug 1186527 has been marked as a duplicate of this bug. ***
*** Bug 1191272 has been marked as a duplicate of this bug. ***
Tests:
https://bugzilla.redhat.com/show_bug.cgi?id=1188571#c2
https://bugzilla.redhat.com/show_bug.cgi?id=1186527#c1
https://bugzilla.redhat.com/show_bug.cgi?id=1191272#c2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-2290.html