Description of problem:

Some pacemaker resources are stopped during deploy tasks, causing the upgrade to fail. In this exact case it was rabbitmq-bundle, and the upgrade failed with:

"Error: rabbitmqctl status | grep -F \"{rabbit,\" returned 1 instead of one of [0]",
"Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: change from notrun to 0 failed: rabbitmqctl status | grep -F \"{rabbit,\" returned 1 instead of one of [0]",
"Error: Failed to apply catalog: Command is still failing after 180 seconds expired!",

When checking the resource status via crm_mon, we could see that rabbitmq-bundle-docker-0 had timed out while being stopped:

[root@controller-0 ~]# crm_mon -1
Stack: corosync
Current DC: controller-0 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Tue Dec  4 11:42:18 2018
Last change: Tue Dec  4 10:24:32 2018 by hacluster via crmd on controller-0

4 nodes configured
17 resources configured

Online: [ controller-0 ]
GuestOnline: [ galera-bundle-0@controller-0 redis-bundle-0@controller-0 ]

Active resources:

 Docker container: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped controller-0
 Docker container: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
 Docker container: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
 ip-192.168.24.10       (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.11 (ocf::heartbeat:IPaddr2):       Started controller-0
 Docker container: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller-0
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller-0

Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=72, status=Timed Out, exitreason='', last-rc-change='Mon Dec  3 21:09:22 2018', queued=0ms, exec=20007ms
* galera-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=78, status=Timed Out, exitreason='', last-rc-change='Mon Dec  3 21:13:43 2018', queued=0ms, exec=0ms
* redis-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=51, status=Timed Out, exitreason='', last-rc-change='Mon Dec  3 21:09:20 2018', queued=0ms, exec=0ms

This appeared to be caused by a restart of the Docker service, visible in the journal before the pacemaker operations started timing out:

[root@controller-0 ~]# journalctl | grep 'ing Docker'
Dec 03 21:05:28 controller-0 systemd[1]: Stopping Docker Application Container Engine...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Storage Setup...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Application Container Engine...
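As a side note, once the Docker daemon is back up, the failed state can normally be cleared so that pacemaker retries the stopped bundles. A minimal recovery sketch using standard pcs commands (resource and container names are taken from the crm_mon output above; adjust to the environment):

# clear the recorded failures so pacemaker re-evaluates the bundles
pcs resource cleanup rabbitmq-bundle
pcs resource cleanup galera-bundle
pcs resource cleanup redis-bundle
# confirm the bundles recover
crm_mon -1
# same readiness check that Exec[rabbitmq-ready] runs, executed inside the bundle container
docker exec rabbitmq-bundle-docker-0 rabbitmqctl status | grep -F "{rabbit,"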
Looking in the upgrade tasks for the task which performed this restart, we find in package_update.log that it is caused by the ansible role container-registry:

2018-12-03 16:05:26,942 p=16722 u=mistral |  TASK [container-registry : add deployment user to docker group] ****************
2018-12-03 16:05:26,942 p=16722 u=mistral |  Monday 03 December 2018  16:05:26 -0500 (0:00:00.568)       0:01:44.899 *******
2018-12-03 16:05:26,974 p=16722 u=mistral |  skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-12-03 16:05:26,975 p=16722 u=mistral |  RUNNING HANDLER [container-registry : restart docker] **************************
2018-12-03 16:05:26,976 p=16722 u=mistral |  Monday 03 December 2018  16:05:26 -0500 (0:00:00.033)       0:01:44.932 *******
2018-12-03 16:05:27,412 p=16722 u=mistral |  changed: [controller-0] => {"changed": true, "cmd": ["/bin/true"], "delta": "0:00:00.014320", "end": "2018-12-03 21:05:27.337391", "rc": 0, "start": "2018-12-03 21:05:27.323071", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
2018-12-03 16:05:27,414 p=16722 u=mistral |  RUNNING HANDLER [container-registry : Docker | reload systemd] *****************
2018-12-03 16:05:27,414 p=16722 u=mistral |  Monday 03 December 2018  16:05:27 -0500 (0:00:00.437)       0:01:45.370 *******
2018-12-03 16:05:27,910 p=16722 u=mistral |  ok: [controller-0] => {"changed": false, "name": null, "status": {}}
2018-12-03 16:05:27,911 p=16722 u=mistral |  RUNNING HANDLER [container-registry : Docker | reload docker] ******************
2018-12-03 16:05:27,911 p=16722 u=mistral |  Monday 03 December 2018  16:05:27 -0500 (0:00:00.497)       0:01:45.868 *******

The 21:05:27 start/end timestamps in the handler output match the Docker restart we observed in the controller-0 journal above.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
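One way to confirm the correlation on an affected controller is to line up the Docker unit restart with the pacemaker failures and with the upgrade task that notified the handler. A rough sketch (log locations are assumptions based on RHEL 7 defaults and the undercloud mistral logs):

# when did the docker daemon get restarted on the controller?
journalctl -u docker.service --since "2018-12-03 21:00" | grep -E 'Stopping|Starting'
# when did pacemaker record the timed-out stop/monitor operations?
grep -E 'Timed Out|rabbitmq-bundle-docker-0' /var/log/messages
# on the undercloud: which upgrade task notified the 'restart docker' handler?
grep -n 'restart docker' package_update.log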
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to -.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0878