Bug 1628705

Summary: [OSP13] Rebooting a clustered control node without previously stopping pacemaker takes more than 15 minutes
Product: Red Hat OpenStack
Reporter: David Vallee Delisle <dvd>
Component: puppet-tripleo
Assignee: Michele Baldessari <michele>
Status: CLOSED ERRATA
QA Contact: pkomarov
Severity: urgent
Docs Contact:
Priority: urgent
Version: 13.0 (Queens)
CC: chjones, dpeacock, dvd, emacchi, jjoyce, jschluet, lmarsh, mburns, michele, nwahl, sbradley, shdunne, slinaber, tvignaud
Target Milestone: z3
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: puppet-tripleo-8.3.4-14.el7ost python-paunch-2.5.0-3.el7ost
Doc Type: Bug Fix
Doc Text:
A faulty interaction between rhel-push-plugin.service and the Docker service occurred during system shutdown, which caused controller reboots to take a long time. With this release, the correct shutdown ordering is enforced for these two services, and rebooting a controller now takes less time.
Story Points: ---
Clone Of:
: 1634810 (view as bug list)
Environment:
Last Closed: 2018-11-13 22:28:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1634810

Description David Vallee Delisle 2018-09-13 18:36:07 UTC
Description of problem:
Pacemaker doesn't allow an operation timeout to be defined specifically on the non-primitive resources in a bundle.

Because of this, the default of 20 seconds is inherited, which is not enough to stop most resources (galera, rabbit, cinder, etc.).

When we shut down a controller node without first putting Pacemaker in standby or proactively stopping it, Pacemaker waits for the Shutdown Escalation timeout to expire, and it takes more than 15 minutes for the node to go down.

When fencing is enabled, the node is fenced before this, but our default OSP13 deployment doesn't have fencing enabled.

This is a bad user experience for our customers and can be easily worked around by defining a resource op defaults timeout of 120 or 180 seconds without any significant impact. We could also decrease the shutdown escalation timeout. 
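
For illustration, a minimal sketch of that second option, assuming pcs accepts the crmd property name directly (append --force if it refuses); the 10-minute value is only an example:
~~~
# Lower the Shutdown Escalation timeout from its 20-minute default so a node
# whose resources cannot be stopped cleanly still goes down sooner.
pcs property set shutdown-escalation=10min
# Confirm the value now in effect:
pcs property list --all | grep shutdown-escalation
~~~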

The workaround is to put the node in standby or stop it in the cluster before the reboot, but a better mitigation is to increase the default operation timeout: if someone forgets to standby or stop Pacemaker, they won't have to wait 15 minutes to get the node back.
~~~
[root@controller-0 ~]# pcs resource op defaults timeout=120s
Warning: Defaults do not apply to resources which override them with their own defined values
[root@controller-0 ~]# pcs resource op defaults
timeout: 120s
~~~
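
For completeness, a minimal sketch of the proactive workaround mentioned above (assuming controller-0 is the node being rebooted and pcs 0.9.x command names):
~~~
# Put the node in standby so Pacemaker stops/relocates its resources cleanly
# (alternatively, stop the cluster services on it with "pcs cluster stop").
pcs cluster standby controller-0
reboot
# Once the node is back up:
pcs cluster unstandby controller-0
~~~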

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.4-20


How reproducible:
All the time

Steps to Reproduce:
1. Deploy with control-scale=3
2. Shut down or reboot one of the controllers
3.

Actual results:
Takes at least 15 minutes to go down

Expected results:
Should go down faster than that.

Additional info:
[1] This article describes the situation.
[2] The monitor operations for the docker resources loop indefinitely,
[3] until the Shutdown Escalation timeout is reached.

[1] https://access.redhat.com/solutions/3612971
[2]
~~~
[root@controller-0 ~]# journalctl --since "1h ago" --all -o short-precise | grep -iP "Shutting down Pacemaker|Shutdown Escalation|incomplete at shutdown|resource was active at shutdown|Timed Out"
Sep 13 17:53:00.821030 controller-0.localdomain pacemakerd[6147]:   notice: Shutting down Pacemaker
Sep 13 17:54:15.798026 controller-0.localdomain lrmd[6284]:  warning: galera-bundle-docker-0_monitor_60000 process (PID 226329) timed out
Sep 13 17:54:15.799004 controller-0.localdomain lrmd[6284]:  warning: galera-bundle-docker-0_monitor_60000:226329 - timed out after 20000ms
Sep 13 17:54:15.799631 controller-0.localdomain crmd[6298]:    error: Result of monitor operation for galera-bundle-docker-0 on controller-0: Timed Out
Sep 13 17:54:17.181860 controller-0.localdomain lrmd[6284]:  warning: redis-bundle-docker-0_monitor_60000 process (PID 226364) timed out
Sep 13 17:54:17.183731 controller-0.localdomain lrmd[6284]:  warning: redis-bundle-docker-0_monitor_60000:226364 - timed out after 20000ms
Sep 13 17:54:17.184597 controller-0.localdomain crmd[6298]:    error: Result of monitor operation for redis-bundle-docker-0 on controller-0: Timed Out
Sep 13 17:54:19.006839 controller-0.localdomain lrmd[6284]:  warning: haproxy-bundle-docker-0_monitor_60000 process (PID 226584) timed out
Sep 13 17:54:19.009043 controller-0.localdomain lrmd[6284]:  warning: haproxy-bundle-docker-0_monitor_60000:226584 - timed out after 20000ms
Sep 13 17:54:19.009577 controller-0.localdomain crmd[6298]:    error: Result of monitor operation for haproxy-bundle-docker-0 on controller-0: Timed Out
Sep 13 17:54:19.724069 controller-0.localdomain lrmd[6284]:  warning: openstack-cinder-volume-docker-0_monitor_60000 process (PID 226656) timed out
Sep 13 17:54:19.725534 controller-0.localdomain lrmd[6284]:  warning: openstack-cinder-volume-docker-0_monitor_60000:226656 - timed out after 20000ms
Sep 13 17:54:19.726156 controller-0.localdomain crmd[6298]:    error: Result of monitor operation for openstack-cinder-volume-docker-0 on controller-0: Timed Out
Sep 13 17:54:21.641865 controller-0.localdomain lrmd[6284]:  warning: rabbitmq-bundle-docker-0_stop_0 process (PID 226749) timed out
Sep 13 17:54:21.642235 controller-0.localdomain lrmd[6284]:  warning: rabbitmq-bundle-docker-0_stop_0:226749 - timed out after 20000ms
Sep 13 17:54:21.643134 controller-0.localdomain crmd[6298]:    error: Result of stop operation for rabbitmq-bundle-docker-0 on controller-0: Timed Out
[...]
~~~

[3]
~~~
Sep 13 18:13:00.834319 controller-0.localdomain crmd[6298]:    error: Shutdown Escalation (I_STOP) just popped in state S_NOT_DC! (1200000ms)
Sep 13 18:13:00.838326 controller-0.localdomain crmd[6298]:   notice: Recurring action redis-bundle-0:18 (redis-bundle-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838342 controller-0.localdomain crmd[6298]:   notice: Recurring action haproxy-bundle-docker-0:125 (haproxy-bundle-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838357 controller-0.localdomain crmd[6298]:   notice: Recurring action openstack-cinder-volume-docker-0:127 (openstack-cinder-volume-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838370 controller-0.localdomain crmd[6298]:   notice: Recurring action galera-bundle-0:17 (galera-bundle-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838381 controller-0.localdomain crmd[6298]:   notice: Recurring action redis-bundle-docker-0:123 (redis-bundle-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838395 controller-0.localdomain crmd[6298]:   notice: Recurring action galera-bundle-docker-0:121 (galera-bundle-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.987997 controller-0.localdomain crmd[6298]:   notice: Recurring action redis:58 (redis_monitor_45000) incomplete at shutdown
Sep 13 18:13:00.988008 controller-0.localdomain crmd[6298]:   notice: Recurring action redis:52 (redis_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.988031 controller-0.localdomain crmd[6298]:    error: 1 resource was active at shutdown
Sep 13 18:13:00.988053 controller-0.localdomain crmd[6298]:    error: 1 resource was active at shutdown
Sep 13 18:13:00.988264 controller-0.localdomain crmd[6298]:    error: 1 resource was active at shutdown
Sep 13 18:13:00.988277 controller-0.localdomain crmd[6298]:    error: 1 resource was active at shutdown
~~~

Comment 1 David Vallee Delisle 2018-09-14 13:08:23 UTC
An RFE bug 1628701 was opened against pcs/pacemaker to allow the configuration of operation timeouts on bundled resources. This current BZ is opened to analyze whether we could implement a default operation timeout during deployment, or any other workaround.

Comment 21 pkomarov 2018-11-07 08:29:57 UTC
Verified, 
(undercloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"rpm -qa|grep 'puppet-tripleo\|python-paunch'"

compute-0 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

compute-1 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

controller-2 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

controller-1 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

controller-0 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

(undercloud) [stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-7.6):
  13
  ceph-3
  ceph-osd-3
  rhel-7.6
(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version 
2018-11-05.3

#fix check : 
[stack@undercloud-0 ~]$ ansible controller -mshell -b -a'ls -l /etc/systemd/system/resource-agents-deps.target.wants'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
total 0
lrwxrwxrwx. 1 root root 38 Nov  6 21:10 docker.service -> /usr/lib/systemd/system/docker.service
lrwxrwxrwx. 1 root root 48 Nov  6 21:10 rhel-push-plugin.service -> /usr/lib/systemd/system/rhel-push-plugin.service

controller-0 | SUCCESS | rc=0 >>
total 0
lrwxrwxrwx. 1 root root 38 Nov  6 21:10 docker.service -> /usr/lib/systemd/system/docker.service
lrwxrwxrwx. 1 root root 48 Nov  6 21:10 rhel-push-plugin.service -> /usr/lib/systemd/system/rhel-push-plugin.service

controller-1 | SUCCESS | rc=0 >>
total 0
lrwxrwxrwx. 1 root root 38 Nov  6 21:10 docker.service -> /usr/lib/systemd/system/docker.service
lrwxrwxrwx. 1 root root 48 Nov  6 21:10 rhel-push-plugin.service -> /usr/lib/systemd/system/rhel-push-plugin.service
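
As I understand the fix, pacemaker.service is ordered After=resource-agents-deps.target, so pulling docker.service and rhel-push-plugin.service into that target's wants directory makes systemd stop Pacemaker before Docker at shutdown. A quick, read-only way to double-check that ordering on a controller could look like this (assuming the stock unit names shown above):
~~~
# List what resource-agents-deps.target now pulls in (should include
# docker.service and rhel-push-plugin.service after the fix):
systemctl list-dependencies resource-agents-deps.target
# Confirm pacemaker is ordered after the target, i.e. it stops first on shutdown:
systemctl show pacemaker.service -p After | tr ' ' '\n' | grep resource-agents-deps
~~~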

#disable stonith as in reproducer
[root@controller-1 ~]#  pcs property set stonith-enabled=false
[root@controller-1 ~]# pcs config|grep stonith-enabled
 stonith-enabled: false

#check controller reboot times: 
[root@controller-1 ~]# reboot
Connection to 192.168.24.9 closed by remote host.
Connection to 192.168.24.9 closed.

#reboot time is no more than 6 min: 
[stack@undercloud-0 ~]$ date;until `nc -zv 192.168.24.9 22`;do date;sleep 1m;done
Wed Nov  7 03:18:52 EST 2018
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection refused.
Wed Nov  7 03:18:53 EST 2018

....

Wed Nov  7 03:24:03 EST 2018
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.24.9:22.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

Comment 25 errata-xmlrpc 2018-11-13 22:28:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587