Bug 1656546 - Docker restart in deploy steps causing malfunction in pacemaker services
Summary: Docker restart in deploy steps causing malfunction in pacemaker services
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z1
Target Release: 14.0 (Rocky)
Assignee: Jose Luis Franco
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-05 18:18 UTC by Jose Luis Franco
Modified: 2019-04-30 17:51 UTC (History)
6 users

Fixed In Version: openstack-tripleo-heat-templates-9.0.1-0.20181013060914.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-30 17:51:14 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1807418 0 None None None 2018-12-07 16:21:35 UTC
OpenStack gerrit 622969 0 None MERGED Perform docker reconfiguration on upgrade_tasks. 2020-03-17 06:58:23 UTC
OpenStack gerrit 625821 0 None MERGED Perform docker reconfiguration on upgrade_tasks. 2020-03-17 06:58:23 UTC
Red Hat Product Errata RHBA-2019:0878 0 None None None 2019-04-30 17:51:25 UTC

Description Jose Luis Franco 2018-12-05 18:18:25 UTC
Description of problem:

Some pacemaker resources are stopped during the deploy tasks, causing the upgrade to fail. In this particular case it was rabbitmq-bundle, and the upgrade failed with:

        "Error: rabbitmqctl status | grep -F \"{rabbit,\" returned 1 instead of one of [0]",
        "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: change from notrun to 0 failed: rabbitmqctl status | grep -F \"{rabbit,\" returned 1 instead of one of [0]",
        "Error: Failed to apply catalog: Command is still failing after 180 seconds expired!",

When checking the resource status via crm_mon, we could see that the stop operation for rabbitmq-bundle-docker-0 had timed out:

[root@controller-0 ~]# crm_mon -1
Stack: corosync
Current DC: controller-0 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Tue Dec  4 11:42:18 2018
Last change: Tue Dec  4 10:24:32 2018 by hacluster via crmd on controller-0

4 nodes configured
17 resources configured

Online: [ controller-0 ]
GuestOnline: [ galera-bundle-0@controller-0 redis-bundle-0@controller-0 ]

Active resources:

 Docker container: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped controller-0
 Docker container: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
 Docker container: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
 ip-192.168.24.10       (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.11 (ocf::heartbeat:IPaddr2):       Started controller-0
 Docker container: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller-0
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller-0

Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=72, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec  3 21:09:22 2018', queued=0ms, exec=20007ms
* galera-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=78, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec  3 21:13:43 2018', queued=0ms, exec=0ms
* redis-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=51, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec  3 21:09:20 2018', queued=0ms, exec=0ms


This appeared to be caused by a restart of the Docker service, logged 5 seconds before the timestamp of the first timed-out operation:


[root@controller-0 ~]# journalctl | grep 'ing Docker'
Dec 03 21:05:28 controller-0 systemd[1]: Stopping Docker Application Container Engine...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Storage Setup...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Application Container Engine...
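
For reference, a quick way to line up the Docker restart window with the failed pacemaker actions (a sketch, assuming the pcs and docker CLIs are available on the controller):

[root@controller-0 ~]# journalctl -u docker.service --no-pager | grep -E 'Stopping|Starting'   # docker restart window
[root@controller-0 ~]# pcs status --full | grep -A 3 'Failed Actions'                          # timed-out pacemaker operations
[root@controller-0 ~]# docker ps --format '{{.Names}}\t{{.Status}}' | grep bundle              # which bundle containers are still up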

Looking through the upgrade tasks for the one that performed this restart, package_update.log shows that it is triggered by the container-registry Ansible role:

2018-12-03 16:05:26,942 p=16722 u=mistral |  TASK [container-registry : add deployment user to docker group] ****************
2018-12-03 16:05:26,942 p=16722 u=mistral |  Monday 03 December 2018  16:05:26 -0500 (0:00:00.568)       0:01:44.899 *******
2018-12-03 16:05:26,974 p=16722 u=mistral |  skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-12-03 16:05:26,975 p=16722 u=mistral |  RUNNING HANDLER [container-registry : restart docker] **************************
2018-12-03 16:05:26,976 p=16722 u=mistral |  Monday 03 December 2018  16:05:26 -0500 (0:00:00.033)       0:01:44.932 *******
2018-12-03 16:05:27,412 p=16722 u=mistral |  changed: [controller-0] => {"changed": true, "cmd": ["/bin/true"], "delta": "0:00:00.014320", "end": "2018-12-03 21:05:27.337391", "rc": 0, "start": "2018-12-03 21:05:27.323071", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
2018-12-03 16:05:27,414 p=16722 u=mistral |  RUNNING HANDLER [container-registry : Docker | reload systemd] *****************
2018-12-03 16:05:27,414 p=16722 u=mistral |  Monday 03 December 2018  16:05:27 -0500 (0:00:00.437)       0:01:45.370 *******
2018-12-03 16:05:27,910 p=16722 u=mistral |  ok: [controller-0] => {"changed": false, "name": null, "status": {}}
2018-12-03 16:05:27,911 p=16722 u=mistral |  RUNNING HANDLER [container-registry : Docker | reload docker] ******************
2018-12-03 16:05:27,911 p=16722 u=mistral |  Monday 03 December 2018  16:05:27 -0500 (0:00:00.497)       0:01:45.868 *******

This matches the controller-0 timestamps observed above: the Ansible log carries a -0500 offset, so 16:05:27 local time is 21:05:27 UTC, immediately before Docker began stopping at 21:05:28.
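
The linked Gerrit changes ("Perform docker reconfiguration on upgrade_tasks.") appear to move the docker reconfiguration, and therefore the restart, into the upgrade_tasks so it no longer fires in the middle of the deploy steps while the pacemaker bundles are running. As a purely hypothetical manual workaround (not the actual fix), the restart could be wrapped in pacemaker maintenance mode so the cluster does not react to it:

[root@controller-0 ~]# pcs property set maintenance-mode=true    # pacemaker leaves resources unmanaged
[root@controller-0 ~]# systemctl restart docker                  # the restart the container-registry handler triggers
[root@controller-0 ~]# pcs property set maintenance-mode=false   # resume resource management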



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Mikey Ariel 2019-02-20 12:44:30 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Comment 10 errata-xmlrpc 2019-04-30 17:51:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0878

