Bug 1628705 - [OSP13] Rebooting a clustered control node without previously stopping pacemaker takes more than 15 minutes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z3
Target Release: 13.0 (Queens)
Assignee: Michele Baldessari
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks: 1634810
 
Reported: 2018-09-13 18:36 UTC by David Vallee Delisle
Modified: 2021-12-10 17:37 UTC

Fixed In Version: puppet-tripleo-8.3.4-14.el7ost python-paunch-2.5.0-3.el7ost
Doc Type: Bug Fix
Doc Text:
A faulty interaction between rhel-push-plugin.service and the Docker service occurred during system shutdown, which caused the controller reboot to take a long time. With this release, the correct shutdown ordering is enforced for these two services, and rebooting a controller now takes less time.
Clone Of:
: 1634810
Environment:
Last Closed: 2018-11-13 22:28:50 UTC
Target Upstream Version:
Embargoed:




Links
System ID | Status | Summary | Last Updated
Launchpad 1792701 | None | None | 2018-09-15 13:06:11 UTC
OpenStack gerrit 606849 | MERGED | Make sure rhel-plugin-push.service is stopped after pacemaker stops | 2021-02-04 06:19:17 UTC
RDO 16342 | MERGED | queens-rdo: openstack/paunch-distgit: Make sure paunch-container-shutdown stops before pacemaker (I669e4ea319268fb478dd523b676b3581a... | 2018-10-15 11:41:49 UTC
Red Hat Issue Tracker OSP-11627 | None | None | 2021-12-10 17:37:12 UTC
Red Hat Knowledge Base (Solution) 3612971 | None | None | 2018-09-13 18:37:00 UTC
Red Hat Product Errata RHBA-2018:3587 | None | None | 2018-11-13 22:29:51 UTC

Description David Vallee Delisle 2018-09-13 18:36:07 UTC
Description of problem:
Pacemaker doesn't allow defining an operation timeout specifically on the non-primitive resources in a bundle.

Because of this, the default of 20 seconds is inherited, and that is not enough to stop most resources (galera, rabbit, cinder, etc.).

When we shut down a controller node without first putting pacemaker in standby or stopping it, pacemaker waits for the Shutdown Escalation timeout to expire, and it takes more than 15 minutes for the node to go down.

When fencing is enabled, the node is fenced before this, but our default OSP13 deployment doesn't have fencing enabled.

This is a bad user experience for our customers and can be easily worked around by defining a resource op defaults timeout of 120 or 180 seconds without any significant impact. We could also decrease the shutdown escalation timeout. 

The workaround is to put the node in standby or stop the cluster on it before the reboot, but a better mitigation is to increase the default operation timeout: if someone forgets to standby or stop pacemaker, they won't have to wait 15 minutes to get the node back.
~~~
[root@controller-0 ~]# pcs resource op defaults timeout=120s
Warning: Defaults do not apply to resources which override them with their own defined values
[root@controller-0 ~]# pcs resource op defaults
timeout: 120s
~~~
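For completeness, the proactive workaround mentioned above (taking the node out of the cluster before rebooting) would look roughly like this; the node name is illustrative and the exact `pcs` subcommand form depends on the installed pcs version:

```shell
# Put the node in standby so pacemaker moves resources off it cleanly
# (alternatively, "pcs cluster stop" stops the cluster on the node entirely)
pcs cluster standby controller-0

# Reboot the node; with no resources running locally, shutdown is quick
reboot

# After the node is back up, let it run resources again
pcs cluster unstandby controller-0
```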

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.4-20


How reproducible:
All the time

Steps to Reproduce:
1. Deploy with control-scale=3
2. Shut down or reboot one of the controllers

Actual results:
Takes at least 15 minutes to go down

Expected results:
Should go down faster than that.

Additional info:
[1] This article describes the situation.
[2] The monitor operations for the docker resources keep timing out in a loop,
[3] until the Shutdown Escalation timeout is reached.

[1] https://access.redhat.com/solutions/3612971
[2]
~~~
[root@controller-0 ~]# journalctl --since "1h ago" --all -o short-precise | grep -iP "Shutting down Pacemaker|Shutdown Escalation|incomplete at shutdown|resource was active at shutdown|Timed Out"
Sep 13 17:53:00.821030 controller-0.localdomain pacemakerd[6147]:   notice: Shutting down Pacemaker
Sep 13 17:54:15.798026 controller-0.localdomain lrmd[6284]:  warning: galera-bundle-docker-0_monitor_60000 process (PID 226329) timed out
Sep 13 17:54:15.799004 controller-0.localdomain lrmd[6284]:  warning: galera-bundle-docker-0_monitor_60000:226329 - timed out after 20000ms
Sep 13 17:54:15.799631 controller-0.localdomain crmd[6298]:    error: Result of monitor operation for galera-bundle-docker-0 on controller-0: Timed Out
Sep 13 17:54:17.181860 controller-0.localdomain lrmd[6284]:  warning: redis-bundle-docker-0_monitor_60000 process (PID 226364) timed out
Sep 13 17:54:17.183731 controller-0.localdomain lrmd[6284]:  warning: redis-bundle-docker-0_monitor_60000:226364 - timed out after 20000ms
Sep 13 17:54:17.184597 controller-0.localdomain crmd[6298]:    error: Result of monitor operation for redis-bundle-docker-0 on controller-0: Timed Out
Sep 13 17:54:19.006839 controller-0.localdomain lrmd[6284]:  warning: haproxy-bundle-docker-0_monitor_60000 process (PID 226584) timed out
Sep 13 17:54:19.009043 controller-0.localdomain lrmd[6284]:  warning: haproxy-bundle-docker-0_monitor_60000:226584 - timed out after 20000ms
Sep 13 17:54:19.009577 controller-0.localdomain crmd[6298]:    error: Result of monitor operation for haproxy-bundle-docker-0 on controller-0: Timed Out
Sep 13 17:54:19.724069 controller-0.localdomain lrmd[6284]:  warning: openstack-cinder-volume-docker-0_monitor_60000 process (PID 226656) timed out
Sep 13 17:54:19.725534 controller-0.localdomain lrmd[6284]:  warning: openstack-cinder-volume-docker-0_monitor_60000:226656 - timed out after 20000ms
Sep 13 17:54:19.726156 controller-0.localdomain crmd[6298]:    error: Result of monitor operation for openstack-cinder-volume-docker-0 on controller-0: Timed Out
Sep 13 17:54:21.641865 controller-0.localdomain lrmd[6284]:  warning: rabbitmq-bundle-docker-0_stop_0 process (PID 226749) timed out
Sep 13 17:54:21.642235 controller-0.localdomain lrmd[6284]:  warning: rabbitmq-bundle-docker-0_stop_0:226749 - timed out after 20000ms
Sep 13 17:54:21.643134 controller-0.localdomain crmd[6298]:    error: Result of stop operation for rabbitmq-bundle-docker-0 on controller-0: Timed Out
[...]
~~~

[3]
~~~
Sep 13 18:13:00.834319 controller-0.localdomain crmd[6298]:    error: Shutdown Escalation (I_STOP) just popped in state S_NOT_DC! (1200000ms)
Sep 13 18:13:00.838326 controller-0.localdomain crmd[6298]:   notice: Recurring action redis-bundle-0:18 (redis-bundle-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838342 controller-0.localdomain crmd[6298]:   notice: Recurring action haproxy-bundle-docker-0:125 (haproxy-bundle-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838357 controller-0.localdomain crmd[6298]:   notice: Recurring action openstack-cinder-volume-docker-0:127 (openstack-cinder-volume-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838370 controller-0.localdomain crmd[6298]:   notice: Recurring action galera-bundle-0:17 (galera-bundle-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838381 controller-0.localdomain crmd[6298]:   notice: Recurring action redis-bundle-docker-0:123 (redis-bundle-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.838395 controller-0.localdomain crmd[6298]:   notice: Recurring action galera-bundle-docker-0:121 (galera-bundle-docker-0_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.987997 controller-0.localdomain crmd[6298]:   notice: Recurring action redis:58 (redis_monitor_45000) incomplete at shutdown
Sep 13 18:13:00.988008 controller-0.localdomain crmd[6298]:   notice: Recurring action redis:52 (redis_monitor_60000) incomplete at shutdown
Sep 13 18:13:00.988031 controller-0.localdomain crmd[6298]:    error: 1 resource was active at shutdown
Sep 13 18:13:00.988053 controller-0.localdomain crmd[6298]:    error: 1 resource was active at shutdown
Sep 13 18:13:00.988264 controller-0.localdomain crmd[6298]:    error: 1 resource was active at shutdown
Sep 13 18:13:00.988277 controller-0.localdomain crmd[6298]:    error: 1 resource was active at shutdown
~~~
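The timestamps above line up exactly with the 1200000 ms figure printed in the escalation message: pacemakerd logs "Shutting down Pacemaker" at 17:53:00 and the escalation pops at 18:13:00. A quick sanity check (second-resolution timestamps taken from the journal excerpts, sub-second part dropped):

```python
from datetime import datetime

# Timestamps from the journal excerpts above
shutdown_start = datetime.strptime("17:53:00", "%H:%M:%S")
escalation_pop = datetime.strptime("18:13:00", "%H:%M:%S")

elapsed_ms = (escalation_pop - shutdown_start).total_seconds() * 1000
print(elapsed_ms)  # 1200000.0, i.e. the 20-minute Shutdown Escalation default
```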

Comment 1 David Vallee Delisle 2018-09-14 13:08:23 UTC
An RFE bug 1628701 was opened against pcs/pacemaker to allow configuring operation timeouts on bundled resources. This BZ is open to analyze whether we could implement a default operation timeout during deployment, or any other workaround.

Comment 21 pkomarov 2018-11-07 08:29:57 UTC
Verified:
(undercloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"rpm -qa|grep 'puppet-tripleo\|python-paunch'"

compute-0 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

compute-1 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

controller-2 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

controller-1 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

controller-0 | SUCCESS | rc=0 >>
puppet-tripleo-8.3.6-2.el7ost.noarch
python-paunch-2.5.0-3.el7ost.noarch

(undercloud) [stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-7.6):
  13
  ceph-3
  ceph-osd-3
  rhel-7.6
(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version 
2018-11-05.3

#fix check : 
[stack@undercloud-0 ~]$ ansible controller -mshell -b -a'ls -l /etc/systemd/system/resource-agents-deps.target.wants'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
total 0
lrwxrwxrwx. 1 root root 38 Nov  6 21:10 docker.service -> /usr/lib/systemd/system/docker.service
lrwxrwxrwx. 1 root root 48 Nov  6 21:10 rhel-push-plugin.service -> /usr/lib/systemd/system/rhel-push-plugin.service

controller-0 | SUCCESS | rc=0 >>
total 0
lrwxrwxrwx. 1 root root 38 Nov  6 21:10 docker.service -> /usr/lib/systemd/system/docker.service
lrwxrwxrwx. 1 root root 48 Nov  6 21:10 rhel-push-plugin.service -> /usr/lib/systemd/system/rhel-push-plugin.service

controller-1 | SUCCESS | rc=0 >>
total 0
lrwxrwxrwx. 1 root root 38 Nov  6 21:10 docker.service -> /usr/lib/systemd/system/docker.service
lrwxrwxrwx. 1 root root 48 Nov  6 21:10 rhel-push-plugin.service -> /usr/lib/systemd/system/rhel-push-plugin.service

#disable stonith as in reproducer
[root@controller-1 ~]#  pcs property set stonith-enabled=false
[root@controller-1 ~]# pcs config|grep stonith-enabled
 stonith-enabled: false

#check controller reboot times: 
[root@controller-1 ~]# reboot
Connection to 192.168.24.9 closed by remote host.
Connection to 192.168.24.9 closed.

#reboot time is no more than 6 min: 
[stack@undercloud-0 ~]$ date;until `nc -zv 192.168.24.9 22`;do date;sleep 1m;done
Wed Nov  7 03:18:52 EST 2018
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection refused.
Wed Nov  7 03:18:53 EST 2018

....

Wed Nov  7 03:24:03 EST 2018
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.24.9:22.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
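From the `date` output in the nc loop above, the controller was reachable over SSH again about five minutes after the connection dropped, well under the 15-minute failure mode this bug describes:

```python
from datetime import datetime

# Timestamps from the "date" output in the nc loop above
went_down = datetime.strptime("03:18:52", "%H:%M:%S")
came_back = datetime.strptime("03:24:03", "%H:%M:%S")

downtime_s = (came_back - went_down).total_seconds()
print(downtime_s)  # 311.0 seconds, roughly 5 minutes 11 seconds
```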

Comment 25 errata-xmlrpc 2018-11-13 22:28:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587

