Description of problem:

Pacemaker doesn't allow operation timeouts to be defined specifically on the non-primitive resources in a bundle. Because of this, the default of 20 seconds is inherited, and it's not enough to stop most resources (galera, rabbitmq, cinder, etc.). When we shut down a controller node without standbying it or proactively stopping Pacemaker first, Pacemaker waits for the Shutdown Escalation timeout to expire, and it takes more than 15 minutes for the node to go down. When fencing is enabled, the node is fenced before this, but our default OSP13 deployment doesn't have fencing enabled. This is a bad user experience for our customers and can easily be worked around by defining a resource op defaults timeout of 120 or 180 seconds without any significant impact. We could also decrease the Shutdown Escalation timeout.

The workaround is to standby the node or stop the cluster on it before the reboot, but a better mitigation is to increase the default operation timeout: if someone forgets to standby the node or stop Pacemaker, they won't have to wait 15 minutes to get it back. (A short sketch of these workarounds follows the log excerpts below.)

~~~
[root@controller-0 ~]# pcs resource op defaults timeout=120s
Warning: Defaults do not apply to resources which override them with their own defined values
[root@controller-0 ~]# pcs resource op defaults
timeout: 120s
~~~

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.4-20

How reproducible:
All the time

Steps to Reproduce:
1. Deploy with control-scale=3
2. Shutdown or reboot one of the controllers
3.

Actual results:
Takes at least 15 minutes for the node to go down

Expected results:
Should go down faster than that.

Additional info:
[1] This article describes the situation.
[2] The monitor operations for the docker resources time out in a loop
[3] until the Shutdown Escalation timeout is reached.
[1] https://access.redhat.com/solutions/3612971

[2]
~~~
[root@controller-0 ~]# journalctl --since "1h ago" --all -o short-precise | grep -iP "Shutting down Pacemaker|Shutdown Escalation|incomplete at shutdown|resource was active at shutdown|Timed Out"
Sep 13 17:53:00.821030 controller-0.localdomain pacemakerd[6147]: notice: Shutting down Pacemaker
Sep 13 17:54:15.798026 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000 process (PID 226329) timed out
Sep 13 17:54:15.799004 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000:226329 - timed out after 20000ms
Sep 13 17:54:15.799631 controller-0.localdomain crmd[6298]: error: Result of monitor operation for galera-bundle-docker-0 on controller-0: Timed Out
Sep 13 17:54:17.181860 controller-0.localdomain lrmd[6284]: warning: redis-bundle-docker-0_monitor_60000 process (PID 226364) timed out
Sep 13 17:54:17.183731 controller-0.localdomain lrmd[6284]: warning: redis-bundle-docker-0_monitor_60000:226364 - timed out after 20000ms
Sep 13 17:54:17.184597 controller-0.localdomain crmd[6298]: error: Result of monitor operation for redis-bundle-docker-0 on controller-0: Timed Out
Sep 13 17:54:19.006839 controller-0.localdomain lrmd[6284]: warning: haproxy-bundle-docker-0_monitor_60000 process (PID 226584) timed out
Sep 13 17:54:19.009043 controller-0.localdomain lrmd[6284]: warning: haproxy-bundle-docker-0_monitor_60000:226584 - timed out after 20000ms
Sep 13 17:54:19.009577 controller-0.localdomain crmd[6298]: error: Result of monitor operation for haproxy-bundle-docker-0 on controller-0: Timed Out
Sep 13 17:54:19.724069 controller-0.localdomain lrmd[6284]: warning: openstack-cinder-volume-docker-0_monitor_60000 process (PID 226656) timed out
Sep 13 17:54:19.725534 controller-0.localdomain lrmd[6284]: warning: openstack-cinder-volume-docker-0_monitor_60000:226656 - timed out after 20000ms
Sep 13 17:54:19.726156 controller-0.localdomain crmd[6298]: error: Result of monitor operation for openstack-cinder-volume-docker-0 on controller-0: Timed Out
Sep 13 17:54:21.641865 controller-0.localdomain lrmd[6284]: warning: rabbitmq-bundle-docker-0_stop_0 process (PID 226749) timed out
Sep 13 17:54:21.642235 controller-0.localdomain lrmd[6284]: warning: rabbitmq-bundle-docker-0_stop_0:226749 - timed out after 20000ms
Sep 13 17:54:21.643134 controller-0.localdomain crmd[6298]: error: Result of stop operation for rabbitmq-bundle-docker-0 on controller-0: Timed Out
[...]
~~~

[3]
~~~
Sep 13 18:13:00.834319 controller-0.localdomain crmd[6298]: error: Shutdown Escalation (I_STOP) just popped in state S_NOT_DC! (1200000ms)
Sep 13 18:13:00.838326 controller-0.localdomain crmd[6298]: notice: Recurring action redis-bundle-0:18 (redis-bundle-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838342 controller-0.localdomain crmd[6298]: notice: Recurring action haproxy-bundle-docker-0:125 (haproxy-bundle-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838357 controller-0.localdomain crmd[6298]: notice: Recurring action openstack-cinder-volume-docker-0:127 (openstack-cinder-volume-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838370 controller-0.localdomain crmd[6298]: notice: Recurring action galera-bundle-0:17 (galera-bundle-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838381 controller-0.localdomain crmd[6298]: notice: Recurring action redis-bundle-docker-0:123 (redis-bundle-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838395 controller-0.localdomain crmd[6298]: notice: Recurring action galera-bundle-docker-0:121 (galera-bundle-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.987997 controller-0.localdomain crmd[6298]: notice: Recurring action redis:58 (redis_monitor_45000) incomplete at shutdown
Sep 13 18:13:00.988008 controller-0.localdomain crmd[6298]: notice: Recurring action redis:52 (redis_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.988031 controller-0.localdomain crmd[6298]: error: 1 resource was active at shutdown
Sep 13 18:13:00.988053 controller-0.localdomain crmd[6298]: error: 1 resource was active at shutdown
Sep 13 18:13:00.988264 controller-0.localdomain crmd[6298]: error: 1 resource was active at shutdown
Sep 13 18:13:00.988277 controller-0.localdomain crmd[6298]: error: 1 resource was active at shutdown
~~~
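For completeness, a rough sketch of the immediate workarounds mentioned in the description, using the pcs 0.9 syntax shipped with OSP13 (the 10min value for shutdown-escalation is only an illustrative assumption, not a recommended setting):

~~~
# Preferred workaround: standby the node (or stop the cluster on it) before rebooting,
# so Pacemaker stops the bundle resources cleanly instead of racing the shutdown.
[root@controller-0 ~]# pcs cluster standby controller-0
[root@controller-0 ~]# reboot
# ...and once the node is back:
[root@controller-0 ~]# pcs cluster unstandby controller-0

# Alternatively, the escalation limit itself is the "shutdown-escalation" cluster
# property (the 1200000ms seen above, i.e. 20min) and could be lowered, e.g.:
[root@controller-0 ~]# pcs property set shutdown-escalation=10min
~~~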
An RFE bug 1628701 was opened against pcs/pacemaker to allow configuring operation timeouts on bundled resources. This BZ is opened to analyze whether we could implement a default operation timeout during deployment, or any other workaround.
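If such a cluster-wide default were applied at deployment time, it would use the same op defaults mechanism shown in the description and would end up in the op_defaults section of the CIB. A hedged sketch of how this could be set and checked manually (the 200s value is only an example, not the value chosen by any eventual fix):

~~~
[root@controller-0 ~]# pcs resource op defaults timeout=200s
Warning: Defaults do not apply to resources which override them with their own defined values
[root@controller-0 ~]# pcs resource op defaults
timeout: 200s
# raw CIB view of the same section
[root@controller-0 ~]# cibadmin --query --scope op_defaults
~~~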
Verified,

(undercloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"rpm -qa|grep 'puppet-tripleo\|python-paunch'"
compute-0 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

compute-1 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

controller-2 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

controller-1 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

controller-0 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

(undercloud) [stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-7.6):
  13
  ceph-3
  ceph-osd-3
  rhel-7.6

(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
2018-11-05.3

#fix check:
[stack@undercloud-0 ~]$ ansible controller -mshell -b -a'ls -l /etc/systemd/system/resource-agents-deps.target.wants'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
total 0
lrwxrwxrwx. 1 root root 38 Nov 6 21:10 docker.service -> /usr/lib/systemd/system/docker.service
lrwxrwxrwx. 1 root root 48 Nov 6 21:10 rhel-push-plugin.service -> /usr/lib/systemd/system/rhel-push-plugin.service

controller-0 | SUCCESS | rc=0 >>
total 0
lrwxrwxrwx. 1 root root 38 Nov 6 21:10 docker.service -> /usr/lib/systemd/system/docker.service
lrwxrwxrwx. 1 root root 48 Nov 6 21:10 rhel-push-plugin.service -> /usr/lib/systemd/system/rhel-push-plugin.service

controller-1 | SUCCESS | rc=0 >>
total 0
lrwxrwxrwx. 1 root root 38 Nov 6 21:10 docker.service -> /usr/lib/systemd/system/docker.service
lrwxrwxrwx. 1 root root 48 Nov 6 21:10 rhel-push-plugin.service -> /usr/lib/systemd/system/rhel-push-plugin.service

#disable stonith as in reproducer
[root@controller-1 ~]# pcs property set stonith-enabled=false
[root@controller-1 ~]# pcs config|grep stonith-enabled
 stonith-enabled: false

#check controller reboot times:
[root@controller-1 ~]# reboot
Connection to 192.168.24.9 closed by remote host.
Connection to 192.168.24.9 closed.

#reboot time is no more than 6 min:
[stack@undercloud-0 ~]$ date;until `nc -zv 192.168.24.9 22`;do date;sleep 1m;done
Wed Nov 7 03:18:52 EST 2018
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection refused.
Wed Nov 7 03:18:53 EST 2018
....
Wed Nov 7 03:24:03 EST 2018
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.24.9:22.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
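As a supplementary check (not part of the original verification steps above), the new symlinks can also be confirmed from systemd's point of view; resource-agents-deps.target now pulls in docker.service, which is presumably what keeps docker up until Pacemaker has finished stopping the bundle resources on shutdown:

~~~
# docker.service and rhel-push-plugin.service are expected in the output
[root@controller-0 ~]# systemctl list-dependencies resource-agents-deps.target
~~~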
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587