1729192 – OSP 14->15: upgrade of controller-1 fails due to cinder_volume_restart_bundle

Summary: OSP 14->15: upgrade of controller-1 fails due to cinder_volume_restart_bundle

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	15.0 (Stein)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Jiri Stransky
QA Contact:	pkomarov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1623864 1727807

Reported:	2019-07-11 14:23 UTC by Jiri Stransky
Modified:	2019-12-02 10:11 UTC
CC List:	3 users
Fixed In Version:	openstack-tripleo-heat-templates-10.6.1-0.20190815230440.9adae50.el8ost.noarch
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-12-02 10:11:16 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	674638	'None'	MERGED	Pacemaker resource upgrade tasks compatible with staged upgrade	2021-02-09 08:23:40 UTC
OpenStack gerrit	676169	'None'	MERGED	Check for rc instead of |succeeded	2021-02-09 08:23:40 UTC
Red Hat Product Errata	RHBA-2019:4030	None	None	None	2019-12-02 10:11:54 UTC

Description Jiri Stransky 2019-07-11 14:23:52 UTC

Running the upgrade with all latest workarounds [1], controller-0 upgraded ok but when upgrading controller-1, it failed on cinder_volume_restart_bundle (the container itself being executed on controller-0).

Attaching full log of `openstack overcloud upgrade run --limit controller-0,controller-1`.

[1] https://gitlab.cee.redhat.com/osp15/osp-upgrade-el8/blob/master/README.md

Comment 2 Jiri Stransky 2019-07-11 14:28:29 UTC

After the upgrade, pcs status looks like this -- there are some failed actions from the past, but in the end everything is running.


[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Thu Jul 11 14:26:49 2019
Last change: Thu Jul 11 13:59:37 2019 by root via cibadmin on controller-0

8 nodes configured
28 resources configured

Online: [ controller-0 controller-1 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 redis-bundle-0@controller-0 redis-bundle-1@controller-1 ]

Full list of resources:

 podman container set: galera-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
 podman container set: rabbitmq-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
 podman container set: redis-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1
 ip-192.168.24.7        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.115  (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.13 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.20 (ocf::heartbeat:IPaddr2):       Started controller-0
 podman container set: haproxy-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-0
   haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Started controller-1
   haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Stopped
 podman container: openstack-cinder-volume [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Started controller-1

Failed Resource Actions:
* rabbitmq-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=22, status=complete, exitreason='',
    last-rc-change='Thu Jul 11 13:26:19 2019', queued=0ms, exec=0ms
* redis-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=9, status=Error, exitreason='',
    last-rc-change='Thu Jul 11 13:26:55 2019', queued=0ms, exec=0ms
* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=93, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest',
    last-rc-change='Thu Jul 11 13:23:45 2019', queued=1ms, exec=2069ms
* galera-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=11, status=complete, exitreason='',
    last-rc-change='Thu Jul 11 13:25:18 2019', queued=0ms, exec=0ms
* haproxy-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=84, status=complete, exitreason='',
    last-rc-change='Thu Jul 11 13:24:29 2019', queued=0ms, exec=0ms
* redis-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=33, status=complete, exitreason='',
    last-rc-change='Thu Jul 11 13:27:31 2019', queued=0ms, exec=0ms
* rabbitmq-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=6, status=Error, exitreason='',
    last-rc-change='Thu Jul 11 13:26:19 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 3 Michele Baldessari 2019-07-11 15:08:16 UTC

Ok so the *_restart_bundle containers fundamentally do this:
1) Get invoked when paunch detects a change
2) Do something like the following:
if [ x"${TRIPLEO_MINOR_UPDATE,,}" != x"true" ] &&  /usr/sbin/pcs resource show openstack-cinder-volume; then /usr/sbin/pcs resource restart --wait=PCMKTIMEOUT openstack-cinder-volume; echo "openstack-cinder-volume restart invoked"; fi

I.e. if a resource exists we restart it.

In this case it fails with:
TASK [Debug output for task: Start containers for step 5] **********************
Thursday 11 July 2019  13:59:52 +0000 (0:00:55.891)       0:29:35.127 *********
fatal: [controller-0]: FAILED! => {
    "failed_when_result": true,
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "Error running ['podman', 'run', '--name', 'cinder_volume_restart_bundle', '--label', 'config_id=tripleo_step5', '--label', 'container_name=cinder_volume_restart_bundle', '--label', 'managed_by=paunch', '--label', 'config_data={\"command\": [\"/usr/bin/bootstrap_host_exec\", \"cinder_volume\", \"if [ x\\\\\"${TRIPLEO_MINOR_UPDATE,,}\\\\\" != x\\\\\"true\\\\\" ] &&  /usr/sbin/pcs resource show openstack-cinder-volume; then /usr/sbin/pcs resource restart --wait=600 openstack-cinder-volume; echo \\\\\"openstack-cinder-volume restart invoked\\\\\"; fi\"], \"config_volume\": \"cinder\", \"detach\": false, \"environment\": [\"TRIPLEO_MINOR_UPDATE\", \"TRIPLEO_CONFIG_HASH=a8f699fd80eb5a32ffa283b5229704c0\"], \"image\": \"brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:latest\", \"ipc\": \"host\", \"net\": \"host\", \"start_order\": 0, \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro\", \"/var/lib/config-data/puppet-generated/cinder/:/var/lib/kolla/config_files/src:ro\"]}', '--conmon-pidfile=/var/run/cinder_volume_restart_bundle.pid', '--log-driver', 'json-file', '--log-opt', 'path=/var/log/containers/stdouts/cinder_volume_restart_bundle.log', '--env=TRIPLEO_MINOR_UPDATE', '--env=TRIPLEO_CONFIG_HASH=a8f699fd80eb5a32ffa283b5229704c0', '--net=host', '--ipc=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', 
'--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro', '--volume=/var/lib/config-data/puppet-generated/cinder/:/var/lib/kolla/config_files/src:ro', 'brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:latest', '/usr/bin/bootstrap_host_exec', 'cinder_volume', 'if [ x\"${TRIPLEO_MINOR_UPDATE,,}\" != x\"true\" ] &&  /usr/sbin/pcs resource show openstack-cinder-volume; then /usr/sbin/pcs resource restart --wait=600 openstack-cinder-volume; echo \"openstack-cinder-volume restart invoked\"; fi']. [1]",
        "",                
        "stdout: Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.",
        " Bundle: openstack-cinder-volume",
        "  Podman: image=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest network=host options=\"--ipc=host --privileged=true --user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\" replicas=1 run-command=\"/bin/bash /usr/local/bin/kolla_start\"",
        "  Storage Mapping:",
        "   options=ro source-dir=/etc/hosts target-dir=/etc/hosts (cinder-volume-etc-hosts)",
        "   options=ro source-dir=/etc/localtime target-dir=/etc/localtime (cinder-volume-etc-localtime)",
        "   options=ro source-dir=/etc/pki/ca-trust/extracted target-dir=/etc/pki/ca-trust/extracted (cinder-volume-etc-pki-ca-trust-extracted)",
        "   options=ro source-dir=/etc/pki/ca-trust/source/anchors target-dir=/etc/pki/ca-trust/source/anchors (cinder-volume-etc-pki-ca-trust-source-anchors)",
        "   options=ro source-dir=/etc/pki/tls/certs/ca-bundle.crt target-dir=/etc/pki/tls/certs/ca-bundle.crt (cinder-volume-etc-pki-tls-certs-ca-bundle.crt)",
        "   options=ro source-dir=/etc/pki/tls/certs/ca-bundle.trust.crt target-dir=/etc/pki/tls/certs/ca-bundle.trust.crt (cinder-volume-etc-pki-tls-certs-ca-bundle.trust.crt)",
        "   options=ro source-dir=/etc/pki/tls/cert.pem target-dir=/etc/pki/tls/cert.pem (cinder-volume-etc-pki-tls-cert.pem)",
        "   options=rw source-dir=/dev/log target-dir=/dev/log (cinder-volume-dev-log)",
        "   options=ro source-dir=/etc/ssh/ssh_known_hosts target-dir=/etc/ssh/ssh_known_hosts (cinder-volume-etc-ssh-ssh_known_hosts)",
        "   options=ro source-dir=/etc/puppet target-dir=/etc/puppet (cinder-volume-etc-puppet)",
        "   options=ro source-dir=/var/lib/kolla/config_files/cinder_volume.json target-dir=/var/lib/kolla/config_files/config.json (cinder-volume-var-lib-kolla-config_files-cinder_volume.json)",
        "   options=ro source-dir=/var/lib/config-data/puppet-generated/cinder/ target-dir=/var/lib/kolla/config_files/src (cinder-volume-var-lib-config-data-puppet-generated-cinder-)",
        "   options=ro source-dir=/etc/iscsi target-dir=/var/lib/kolla/config_files/src-iscsid (cinder-volume-etc-iscsi)",
        "   options=ro source-dir=/etc/ceph target-dir=/var/lib/kolla/config_files/src-ceph (cinder-volume-etc-ceph)",
        "   options=ro source-dir=/lib/modules target-dir=/lib/modules (cinder-volume-lib-modules)",
        "   options=rw source-dir=/dev/ target-dir=/dev/ (cinder-volume-dev-)",
        "   options=rw source-dir=/run/ target-dir=/run/ (cinder-volume-run-)",
        "   options=rw source-dir=/sys target-dir=/sys (cinder-volume-sys)",
        "   options=z source-dir=/var/lib/cinder target-dir=/var/lib/cinder (cinder-volume-var-lib-cinder)",
        "   options=z source-dir=/var/lib/iscsi target-dir=/var/lib/iscsi (cinder-volume-var-lib-iscsi)",
        "   options=z source-dir=/var/log/containers/cinder target-dir=/var/log/cinder (cinder-volume-var-log-containers-cinder)",
        "stderr: Error: Error performing operation: No such device or address",                                                                                                                                                                                                
        "openstack-cinder-volume is not running anywhere and so cannot be restarted",

That error comes from crm_resource as invoked by pcs and we end up in this function in tools/crm_resource_runtime.c:
static bool resource_is_running_on(resource_t *rsc, const char *host)
{                               
    bool found = TRUE;          
    GListPtr hIter = NULL;      
    GListPtr hosts = NULL;      
                                
    if(rsc == NULL) {           
        return FALSE;           
    }                           
                                
    rsc->fns->location(rsc, &hosts, TRUE);
    for (hIter = hosts; host != NULL && hIter != NULL; hIter = hIter->next) {
        pe_node_t *node = (pe_node_t *) hIter->data;
                                
        if(strcmp(host, node->details->uname) == 0) {
            crm_trace("Resource %s is running on %s\n", rsc->id, host);
            goto done;          
        } else if(strcmp(host, node->details->id) == 0) {
            crm_trace("Resource %s is running on %s\n", rsc->id, host);
            goto done;          
        }                       
    }                           
                                
    if(host != NULL) {          
        crm_trace("Resource %s is not running on: %s\n", rsc->id, host);                                                                                                                                                                                                       
        found = FALSE;          
                                
    } else if(host == NULL && hosts == NULL) {
        crm_trace("Resource %s is not running\n", rsc->id);
        found = FALSE;          
    }                           
                                
  done:                         
                                
    g_list_free(hosts);         
    return found;               
}           

So to me the most likely hypothesis is that:
A) pcs resource show openstack-cinder-volume did return 0
B) The openstack-cinder-volume resource was indeed not running anywhere and pcs/pcmk refuse to restart something that is not running.

Testing this theory on OSP15:
pcs resource disable openstack-cinder-volume
pcs resource show openstack-cinder-volume > /dev/null && echo $?
0

So we know we get inside the if branch normally even when the resource is down, which is expected. What is not expected is that restarting a resource that is not running barfs:
[root@controller-0 ~]# pcs resource restart openstack-cinder-volume
Error: Error performing operation: No such device or address
openstack-cinder-volume is not running anywhere and so cannot be restarted

So I think the fix here should be that we make the '/usr/sbin/pcs resource show openstack-cinder-volume' check also account for the case where the resource is stopped for whatever reason.
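
To illustrate the idea, here is a minimal sketch of a restart guard that tolerates a stopped resource. It is not the fix that was eventually merged; restart_if_running and the PCS override variable are hypothetical names made up for this example, and the TRIPLEO_MINOR_UPDATE check from the original one-liner is left out for brevity. It only relies on the pcs subcommands and the error message quoted above.

```shell
# Sketch only: restart a pacemaker resource but tolerate the "stopped" case.
# restart_if_running and the PCS override are hypothetical, for illustration.
PCS="${PCS:-/usr/sbin/pcs}"

restart_if_running() {
    resource="$1"
    timeout="${2:-600}"

    # Same existence check as the original one-liner.
    if ! "$PCS" resource show "$resource" >/dev/null 2>&1; then
        echo "$resource is not defined, nothing to restart"
        return 0
    fi

    # Capture stdout and stderr so the failure reason can be inspected.
    if out=$("$PCS" resource restart --wait="$timeout" "$resource" 2>&1); then
        echo "$resource restart invoked"
        return 0
    else
        rc=$?
    fi

    # A stopped resource is not treated as an upgrade failure here.
    case "$out" in
        *"is not running anywhere and so cannot be restarted"*)
            echo "$resource is stopped, skipping restart"
            return 0
            ;;
    esac

    # Any other failure is passed through unchanged.
    echo "$out" >&2
    return "$rc"
}
```

Called as `restart_if_running openstack-cinder-volume 600`, this would return 0 in the situation reproduced above instead of failing the step. Whether skipping is the right reaction (as opposed to trying to start the resource) is a design choice this sketch does not settle.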

Comment 4 Jiri Stransky 2019-07-12 12:42:34 UTC

Before upgrading controller-1, the cinder-volume bundle was running on controller-0 but it got stopped and then errored:

* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=81, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest',
    last-rc-change='Fri Jul 12 11:45:46 2019', queued=0ms, exec=3183ms

However the image it can't pull is present on the node, so i'm not sure why it attempts to pull that name at all:

[root@controller-0 ~]# podman images | grep cinder-volume
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume               pcmklatest   16f3aca78029   12 hours ago   1.2 GB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume               latest       16f3aca78029   12 hours ago   1.2 GB



So i think we have two possibly connected issues here:

1) The cinder-volume resource is probably still trying to upgrade itself on controller-0 despite being already upgraded there when we run with `--limit controller-0,controller-1`. We have a bunch of tasks which deal with the pcmklatest image tagging:

https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L260-L320

We probably need to improve the idempotency of those somehow.

1.A) We shouldn't stop and edit the resource when it in fact does not need to be stopped and edited. These tasks should be a complete no-op during `--limit controller-0,controller-1`, they only need to run once per cluster, and they ran with `--limit controller-0`.

https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L298-L307

1.B) I'm not sure if this is an issue but i wonder if we should also prevent the re-tagging from happening when it's not needed. Perhaps it's atomic enough so that re-execution doesn't matter, but not sure... During `--limit controller-0,controller-1`, these tasks should be a no-op on controller-0 (as they were already applied there during `--limit controller-0`), but they must run on controller-1 where they're being executed for the first time.


2) I'm puzzled why pacemaker tried to pull the image when it was already present. Perhaps it is some momentary interaction with the re-tagging tasks (problem 1.B above) and when we fix that, this issue would disappear...
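
For 1.B, a rough sketch of what an idempotent re-tag could look like, assuming the re-tagging boils down to pointing the pcmklatest tag at the same image as the latest tag. This is not what the templates do today; retag_if_needed, image_id and the PODMAN override variable are hypothetical names used only for illustration.

```shell
# Sketch only: make the pcmklatest re-tagging a no-op when nothing would change.
# retag_if_needed, image_id and the PODMAN override are hypothetical names.
PODMAN="${PODMAN:-podman}"

# Print the image ID for a name:tag; fails if the image is not present locally.
image_id() {
    "$PODMAN" image inspect --format '{{.Id}}' "$1" 2>/dev/null
}

retag_if_needed() {
    src="$1"   # e.g. <registry>/rhosp15/openstack-cinder-volume:latest
    dst="$2"   # e.g. <registry>/rhosp15/openstack-cinder-volume:pcmklatest

    if ! src_id=$(image_id "$src"); then
        echo "source image $src is not present" >&2
        return 1
    fi

    # A missing destination tag simply yields an empty ID.
    dst_id=$(image_id "$dst") || dst_id=""

    # Both tags already resolve to the same image: leave everything alone.
    if [ "$src_id" = "$dst_id" ]; then
        echo "$dst already points at $src, nothing to do"
        return 0
    fi

    "$PODMAN" tag "$src" "$dst" && echo "re-tagged $dst"
}
```

On controller-0 in the state shown above, both tags resolve to image 16f3aca78029, so a guard like this would skip the re-tag there while still tagging on controller-1. If the failed pulls really are caused by a race with the re-tagging, skipping the no-op case should narrow that window, but that remains a hypothesis.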


More services are probably affected by these issues. Right now the cluster state is:

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Fri Jul 12 12:40:17 2019
Last change: Fri Jul 12 12:23:04 2019 by hacluster via crmd on controller-0

5 nodes configured
19 resources configured

Online: [ controller-0 controller-1 ]
GuestOnline: [ galera-bundle-0@controller-0 redis-bundle-0@controller-0 ]

Full list of resources:

 podman container: galera-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
 podman container: rabbitmq-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
 podman container: redis-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
 ip-192.168.24.7        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.115  (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.13 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.20 (ocf::heartbeat:IPaddr2):       Started controller-0
 podman container set: haproxy-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-0
   haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Stopped
   haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Stopped
 podman container: openstack-cinder-volume [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Stopped

Failed Resource Actions:
* rabbitmq-bundle-podman-0_start_0 on controller-0 'unknown error' (1): call=97, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest',
    last-rc-change='Fri Jul 12 11:49:07 2019', queued=0ms, exec=2019ms
* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=81, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest',
    last-rc-change='Fri Jul 12 11:45:46 2019', queued=0ms, exec=3183ms
* redis-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=9, status=Error, exitreason='',
    last-rc-change='Fri Jul 12 11:49:39 2019', queued=0ms, exec=0ms
* galera-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=12, status=complete, exitreason='',
    last-rc-change='Fri Jul 12 11:47:40 2019', queued=0ms, exec=0ms
* haproxy-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=86, status=complete, exitreason='',
    last-rc-change='Fri Jul 12 11:47:40 2019', queued=0ms, exec=0ms
* redis-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=73, status=complete, exitreason='',
    last-rc-change='Fri Jul 12 11:49:41 2019', queued=0ms, exec=0ms
* galera_monitor_10000 on galera-bundle-0 'not running' (7): call=139, status=complete, exitreason='',
    last-rc-change='Fri Jul 12 11:48:45 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 5 Jiri Stransky 2019-07-12 12:46:22 UTC

After running `pcs resource cleanup` pacemaker will start the services which "failed to pull image" without issues and it doesn't attempt to pull those images. That makes me think it is indeed some race condition with the Ansible tasks which re-tag the `pcmklatest` images.

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Fri Jul 12 12:43:59 2019
Last change: Fri Jul 12 12:43:45 2019 by hacluster via crmd on controller-0

5 nodes configured
19 resources configured

Online: [ controller-0 controller-1 ]
GuestOnline: [ galera-bundle-0@controller-0 rabbitmq-bundle-0@controller-0 redis-bundle-0@controller-0 ]

Full list of resources:

 podman container: galera-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
 podman container: rabbitmq-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Starting controller-0
 podman container: redis-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
 ip-192.168.24.7        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.115  (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.13 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.20 (ocf::heartbeat:IPaddr2):       Started controller-0
 podman container set: haproxy-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-0
   haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Stopped
   haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Stopped
 podman container: openstack-cinder-volume [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
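
As a small aid for spotting this condition before running the cleanup, the sketch below filters `pcs status` text for start actions that failed with "failed to pull image". failed_pull_resources is a hypothetical helper written for this comment, and it assumes each failed action is on a single unwrapped line, as in the output pasted in comment 4.

```shell
# Sketch only: list resources whose start failed with "failed to pull image".
# failed_pull_resources is a hypothetical helper; it reads `pcs status` text
# on stdin and prints "<resource> <node>" for each matching failed action.
failed_pull_resources() {
    sed -n "s/^\* \(.*\)_start_0 on \([^ ]*\) .*exitreason='failed to pull image.*/\1 \2/p"
}
```

Usage would be `pcs status | failed_pull_resources`. For the failed actions listed in comment 4 it prints rabbitmq-bundle-podman-0 and openstack-cinder-volume-podman-0, both on controller-0, which are the two resources that came back after `pcs resource cleanup`.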

Comment 6 Jiri Stransky 2019-07-12 12:53:28 UTC

Forgot to put a link to the image re-tagging tasks. They're included here:

https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L317-L320

and defined here:

https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L229-L256

Comment 7 Jiri Stransky 2019-08-05 11:52:26 UTC

I have a patch proposed here: https://review.opendev.org/#/c/673456/9

While the cluster status is not entirely correct when the upgrade is finished (pasted it into a commit message there), the patch does at least get us through the upgrade without crashing. Galera scaled up fine to all 3 nodes but other services only scaled up to 2. This is another bug to look at subsequently, but at least the patch should unblock the critical path in testing and we can focus further on individual issues without the whole upgrade being outright blocked.

Comment 8 Jiri Stransky 2019-08-13 12:06:27 UTC

Fix merged and backported to stable/stein.

Comment 10 Shelley Dunne 2019-09-19 18:29:49 UTC

Re-setting Target Milestone z1 to --- to begin the 15z1 Maintenance Release.

Comment 14 Lon Hohberger 2019-10-07 18:03:52 UTC

openstack-tripleo-heat-templates-10.6.2-0.20190923210442.7db107a.el8ost - which is newer than openstack-tripleo-heat-templates-10.6.1-0.20190815230440.9adae50.el8ost.noarch - is available in RHEL OSP 15.0 repositories

Comment 15 pkomarov 2019-10-23 23:33:36 UTC

Verified.

Via automation:

http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/upgrades/view/upgrade/job/DFG-upgrades-upgrade-upgrade-14-15_director-rhel-virthost-3cont_2comp-ipv4-vxlan-poc/87/

Deploy logs at:
http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/upgrades/view/upgrade/job/DFG-upgrades-upgrade-upgrade-14-15_director-rhel-virthost-3cont_2comp-ipv4-vxlan-poc/87/artifact/undercloud-0.tar.gz

#check openstack-tripleo-heat-templates version:
[r@r undercloud-0]$ grep openstack-tripleo-heat-templates-10 var/log/rpm.list
openstack-tripleo-heat-templates-10.6.2-0.20190923210442.7db107a.el8ost.noarch

#check resource-agents version:
[r@r undercloud-0]$ grep -q "Installed: resource-agents-4.1.1-17.el8_0.6.x86_64" home/stack/overcloud_upgrade_run_controller-2.log && echo "resource-agents-4.1.1-17.el8_0.6.x86_64 installed on overcloud "
resource-agents-4.1.1-17.el8_0.6.x86_64 installed on overcloud

Comment 17 errata-xmlrpc 2019-12-02 10:11:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4030
