Running the upgrade with all the latest workarounds [1], controller-0 upgraded OK, but upgrading controller-1 failed on cinder_volume_restart_bundle (the container itself being executed on controller-0). Attaching the full log of `openstack overcloud upgrade run --limit controller-0,controller-1`.

[1] https://gitlab.cee.redhat.com/osp15/osp-upgrade-el8/blob/master/README.md
After the upgrade, pcs status looks like this -- there are some failed actions from the past, but in the end everything is running.

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Thu Jul 11 14:26:49 2019
Last change: Thu Jul 11 13:59:37 2019 by root via cibadmin on controller-0

8 nodes configured
28 resources configured

Online: [ controller-0 controller-1 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 redis-bundle-0@controller-0 redis-bundle-1@controller-1 ]

Full list of resources:

 podman container set: galera-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
 podman container set: rabbitmq-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
 podman container set: redis-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1
 ip-192.168.24.7        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.115          (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.13         (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.17         (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.10         (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.20         (ocf::heartbeat:IPaddr2):       Started controller-0
 podman container set: haproxy-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-0
   haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Started controller-1
   haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Stopped
 podman container: openstack-cinder-volume [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Started controller-1

Failed Resource Actions:
* rabbitmq-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=22, status=complete, exitreason='', last-rc-change='Thu Jul 11 13:26:19 2019', queued=0ms, exec=0ms
* redis-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=9, status=Error, exitreason='', last-rc-change='Thu Jul 11 13:26:55 2019', queued=0ms, exec=0ms
* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=93, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest', last-rc-change='Thu Jul 11 13:23:45 2019', queued=1ms, exec=2069ms
* galera-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=11, status=complete, exitreason='', last-rc-change='Thu Jul 11 13:25:18 2019', queued=0ms, exec=0ms
* haproxy-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=84, status=complete, exitreason='', last-rc-change='Thu Jul 11 13:24:29 2019', queued=0ms, exec=0ms
* redis-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=33, status=complete, exitreason='', last-rc-change='Thu Jul 11 13:27:31 2019', queued=0ms, exec=0ms
* rabbitmq-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=6, status=Error, exitreason='', last-rc-change='Thu Jul 11 13:26:19 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
OK, so the *_restart_bundle containers fundamentally do this:

1) Get invoked when paunch detects a change.

2) Run something like the following:

if [ x"${TRIPLEO_MINOR_UPDATE,,}" != x"true" ] && /usr/sbin/pcs resource show openstack-cinder-volume; then /usr/sbin/pcs resource restart --wait=PCMKTIMEOUT openstack-cinder-volume; echo "openstack-cinder-volume restart invoked"; fi

I.e. if the resource exists, we restart it. In this case it fails with:

TASK [Debug output for task: Start containers for step 5] **********************
Thursday 11 July 2019  13:59:52 +0000 (0:00:55.891)       0:29:35.127 *********
fatal: [controller-0]: FAILED! => {
    "failed_when_result": true,
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "Error running ['podman', 'run', '--name', 'cinder_volume_restart_bundle', '--label', 'config_id=tripleo_step5', '--label', 'container_name=cinder_volume_restart_bundle', '--label', 'managed_by=paunch', '--label', 'config_data={\"command\": [\"/usr/bin/bootstrap_host_exec\", \"cinder_volume\", \"if [ x\\\\\"${TRIPLEO_MINOR_UPDATE,,}\\\\\" != x\\\\\"true\\\\\" ] && /usr/sbin/pcs resource show openstack-cinder-volume; then /usr/sbin/pcs resource restart --wait=600 openstack-cinder-volume; echo \\\\\"openstack-cinder-volume restart invoked\\\\\"; fi\"], \"config_volume\": \"cinder\", \"detach\": false, \"environment\": [\"TRIPLEO_MINOR_UPDATE\", \"TRIPLEO_CONFIG_HASH=a8f699fd80eb5a32ffa283b5229704c0\"], \"image\": \"brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:latest\", \"ipc\": \"host\", \"net\": \"host\", \"start_order\": 0, \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro\", \"/var/lib/config-data/puppet-generated/cinder/:/var/lib/kolla/config_files/src:ro\"]}', '--conmon-pidfile=/var/run/cinder_volume_restart_bundle.pid', '--log-driver', 'json-file', '--log-opt', 'path=/var/log/containers/stdouts/cinder_volume_restart_bundle.log', '--env=TRIPLEO_MINOR_UPDATE', '--env=TRIPLEO_CONFIG_HASH=a8f699fd80eb5a32ffa283b5229704c0', '--net=host', '--ipc=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro', '--volume=/var/lib/config-data/puppet-generated/cinder/:/var/lib/kolla/config_files/src:ro', 'brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:latest', '/usr/bin/bootstrap_host_exec', 'cinder_volume', 'if [ x\"${TRIPLEO_MINOR_UPDATE,,}\" != x\"true\" ] && /usr/sbin/pcs resource show openstack-cinder-volume; then /usr/sbin/pcs resource restart --wait=600 openstack-cinder-volume; echo \"openstack-cinder-volume restart invoked\"; fi']. [1]",
        "",
        "stdout: Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.",
        " Bundle: openstack-cinder-volume",
        " Podman: image=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest network=host options=\"--ipc=host --privileged=true --user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\" replicas=1 run-command=\"/bin/bash /usr/local/bin/kolla_start\"",
        " Storage Mapping:",
        "  options=ro source-dir=/etc/hosts target-dir=/etc/hosts (cinder-volume-etc-hosts)",
        "  options=ro source-dir=/etc/localtime target-dir=/etc/localtime (cinder-volume-etc-localtime)",
        "  options=ro source-dir=/etc/pki/ca-trust/extracted target-dir=/etc/pki/ca-trust/extracted (cinder-volume-etc-pki-ca-trust-extracted)",
        "  options=ro source-dir=/etc/pki/ca-trust/source/anchors target-dir=/etc/pki/ca-trust/source/anchors (cinder-volume-etc-pki-ca-trust-source-anchors)",
        "  options=ro source-dir=/etc/pki/tls/certs/ca-bundle.crt target-dir=/etc/pki/tls/certs/ca-bundle.crt (cinder-volume-etc-pki-tls-certs-ca-bundle.crt)",
        "  options=ro source-dir=/etc/pki/tls/certs/ca-bundle.trust.crt target-dir=/etc/pki/tls/certs/ca-bundle.trust.crt (cinder-volume-etc-pki-tls-certs-ca-bundle.trust.crt)",
        "  options=ro source-dir=/etc/pki/tls/cert.pem target-dir=/etc/pki/tls/cert.pem (cinder-volume-etc-pki-tls-cert.pem)",
        "  options=rw source-dir=/dev/log target-dir=/dev/log (cinder-volume-dev-log)",
        "  options=ro source-dir=/etc/ssh/ssh_known_hosts target-dir=/etc/ssh/ssh_known_hosts (cinder-volume-etc-ssh-ssh_known_hosts)",
        "  options=ro source-dir=/etc/puppet target-dir=/etc/puppet (cinder-volume-etc-puppet)",
        "  options=ro source-dir=/var/lib/kolla/config_files/cinder_volume.json target-dir=/var/lib/kolla/config_files/config.json (cinder-volume-var-lib-kolla-config_files-cinder_volume.json)",
        "  options=ro source-dir=/var/lib/config-data/puppet-generated/cinder/ target-dir=/var/lib/kolla/config_files/src (cinder-volume-var-lib-config-data-puppet-generated-cinder-)",
        "  options=ro source-dir=/etc/iscsi target-dir=/var/lib/kolla/config_files/src-iscsid (cinder-volume-etc-iscsi)",
        "  options=ro source-dir=/etc/ceph target-dir=/var/lib/kolla/config_files/src-ceph (cinder-volume-etc-ceph)",
        "  options=ro source-dir=/lib/modules target-dir=/lib/modules (cinder-volume-lib-modules)",
        "  options=rw source-dir=/dev/ target-dir=/dev/ (cinder-volume-dev-)",
        "  options=rw source-dir=/run/ target-dir=/run/ (cinder-volume-run-)",
        "  options=rw source-dir=/sys target-dir=/sys (cinder-volume-sys)",
        "  options=z source-dir=/var/lib/cinder target-dir=/var/lib/cinder (cinder-volume-var-lib-cinder)",
        "  options=z source-dir=/var/lib/iscsi target-dir=/var/lib/iscsi (cinder-volume-var-lib-iscsi)",
        "  options=z source-dir=/var/log/containers/cinder target-dir=/var/log/cinder (cinder-volume-var-log-containers-cinder)",
        "stderr: Error: Error performing operation: No such device or address",
        "openstack-cinder-volume is not running anywhere and so cannot be restarted",

That error comes from crm_resource as invoked by pcs, and we end up in this function in tools/crm_resource_runtime.c:

static bool
resource_is_running_on(resource_t *rsc, const char *host)
{
    bool found = TRUE;
    GListPtr hIter = NULL;
    GListPtr hosts = NULL;

    if (rsc == NULL) {
        return FALSE;
    }

    rsc->fns->location(rsc, &hosts, TRUE);
    for (hIter = hosts; host != NULL && hIter != NULL; hIter = hIter->next) {
        pe_node_t *node = (pe_node_t *) hIter->data;

        if (strcmp(host, node->details->uname) == 0) {
            crm_trace("Resource %s is running on %s\n", rsc->id, host);
            goto done;

        } else if (strcmp(host, node->details->id) == 0) {
            crm_trace("Resource %s is running on %s\n", rsc->id, host);
            goto done;
        }
    }

    if (host != NULL) {
        crm_trace("Resource %s is not running on: %s\n", rsc->id, host);
        found = FALSE;

    } else if (host == NULL && hosts == NULL) {
        crm_trace("Resource %s is not running\n", rsc->id);
        found = FALSE;
    }

done:
    g_list_free(hosts);
    return found;
}

So to me the most likely hypothesis is that:

A) pcs resource show openstack-cinder-volume did return 0, and

B) the openstack-cinder-volume resource was indeed not running anywhere, and pcs/pcmk refuse to restart something that is not running.

Testing this theory on OSP15:

[root@controller-0 ~]# pcs resource disable openstack-cinder-volume
[root@controller-0 ~]# pcs resource show openstack-cinder-volume > /dev/null && echo $?
0

So we know we get inside the if branch even when the resource is down, which is expected. What is not expected is that restarting a resource that is not running barfs:

[root@controller-0 ~]# pcs resource restart openstack-cinder-volume
Error: Error performing operation: No such device or address
openstack-cinder-volume is not running anywhere and so cannot be restarted

So I think the fix here should be to make the '/usr/sbin/pcs resource show openstack-cinder-volume' check also consider the case when the resource is stopped for whatever reason.
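A minimal sketch of what that extra guard could look like. This is illustrative only: `pcs_resource_exists` and `pcs_resource_is_running` are hypothetical stubs standing in for the real probes (`pcs resource show` and something like `crm_resource -r <resource> --locate`), so the control flow can be exercised without a cluster; the actual fix in the templates may differ.

```shell
#!/bin/bash
# Hypothetical stand-ins for the real cluster probes, so this runs standalone.
# A real script would shell out to pcs / crm_resource instead.
pcs_resource_exists()     { [ "$1" = "openstack-cinder-volume" ]; }
pcs_resource_is_running() { [ "${RESOURCE_RUNNING:-false}" = "true" ]; }

maybe_restart() {
    local resource="$1"
    # Same outer condition as the generated restart_bundle command ...
    if [ "x${TRIPLEO_MINOR_UPDATE,,}" != "xtrue" ] && pcs_resource_exists "$resource"; then
        # ... plus the extra guard: never ask pcs to restart a stopped
        # resource, since `pcs resource restart` errors out in that case.
        if pcs_resource_is_running "$resource"; then
            echo "$resource restart invoked"
        else
            echo "$resource exists but is stopped; skipping restart"
        fi
    fi
}

RESOURCE_RUNNING=false maybe_restart openstack-cinder-volume
RESOURCE_RUNNING=true  maybe_restart openstack-cinder-volume
```

With the guard in place, the disabled-resource case that currently makes the container exit non-zero becomes a logged no-op instead.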
Before upgrading controller-1, the cinder-volume bundle was running on controller-0, but it got stopped and then errored:

* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=81, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest', last-rc-change='Fri Jul 12 11:45:46 2019', queued=0ms, exec=3183ms

However, the image it can't pull is present on the node, so I'm not sure why it attempts to pull that name at all:

[root@controller-0 ~]# podman images | grep cinder-volume
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume   pcmklatest   16f3aca78029   12 hours ago   1.2 GB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume   latest       16f3aca78029   12 hours ago   1.2 GB

So I think we have two possibly connected issues here:

1) The cinder-volume resource is probably still trying to upgrade itself on controller-0, despite being already upgraded there, when we run with `--limit controller-0,controller-1`. We have a bunch of tasks which deal with the pcmklatest image tagging:

https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L260-L320

We probably need to improve the idempotency of those somehow.

1.A) We shouldn't stop and edit the resource when it in fact does not need to be stopped and edited. These tasks should be a complete no-op during `--limit controller-0,controller-1`; they only need to run once per cluster, and they already ran with `--limit controller-0`.

https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L298-L307

1.B) I'm not sure if this is an issue, but I wonder if we should also prevent the re-tagging from happening when it's not needed. Perhaps it's atomic enough that re-execution doesn't matter, but I'm not sure. During `--limit controller-0,controller-1`, these tasks should be a no-op on controller-0 (as they were already applied there during `--limit controller-0`), but they must run on controller-1, where they're being executed for the first time.

2) I'm puzzled why pacemaker tried to pull the image when it was already present. Perhaps it is some momentary interaction with the re-tagging tasks (problem 1.B above), and when we fix that, this issue will disappear.

More services are probably affected by these issues. Right now cluster state is:

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Fri Jul 12 12:40:17 2019
Last change: Fri Jul 12 12:23:04 2019 by hacluster via crmd on controller-0

5 nodes configured
19 resources configured

Online: [ controller-0 controller-1 ]
GuestOnline: [ galera-bundle-0@controller-0 redis-bundle-0@controller-0 ]

Full list of resources:

 podman container: galera-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
 podman container: rabbitmq-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
 podman container: redis-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
 ip-192.168.24.7        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.115          (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.13         (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.17         (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.10         (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.20         (ocf::heartbeat:IPaddr2):       Started controller-0
 podman container set: haproxy-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-0
   haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Stopped
   haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Stopped
 podman container: openstack-cinder-volume [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Stopped

Failed Resource Actions:
* rabbitmq-bundle-podman-0_start_0 on controller-0 'unknown error' (1): call=97, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest', last-rc-change='Fri Jul 12 11:49:07 2019', queued=0ms, exec=2019ms
* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=81, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest', last-rc-change='Fri Jul 12 11:45:46 2019', queued=0ms, exec=3183ms
* redis-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=9, status=Error, exitreason='', last-rc-change='Fri Jul 12 11:49:39 2019', queued=0ms, exec=0ms
* galera-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=12, status=complete, exitreason='', last-rc-change='Fri Jul 12 11:47:40 2019', queued=0ms, exec=0ms
* haproxy-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=86, status=complete, exitreason='', last-rc-change='Fri Jul 12 11:47:40 2019', queued=0ms, exec=0ms
* redis-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=73, status=complete, exitreason='', last-rc-change='Fri Jul 12 11:49:41 2019', queued=0ms, exec=0ms
* galera_monitor_10000 on galera-bundle-0 'not running' (7): call=139, status=complete, exitreason='', last-rc-change='Fri Jul 12 11:48:45 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
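Regarding the re-tag idempotency concern in 1.B above, one way to make the re-tag a no-op when nothing changed is to compare image ids before tagging. A sketch of that check, with the caveat that `image_id` here is a hypothetical stub standing in for `podman inspect --format '{{.Id}}' <image>` (so this runs without podman), and the actual tripleo-heat-templates tasks may take a different approach:

```shell
#!/bin/bash
# Hypothetical stub: both tags resolve to the same id unless PCMK_ID overrides
# the :pcmklatest side. A real check would query podman inspect instead.
image_id() {
    case "$1" in
        *:pcmklatest) echo "${PCMK_ID:-16f3aca78029}" ;;
        *:latest)     echo "16f3aca78029" ;;
    esac
}

retag_pcmklatest_if_needed() {
    local image="$1"
    if [ "$(image_id "$image:latest")" = "$(image_id "$image:pcmklatest")" ]; then
        # Ids already match (as in the podman images output above), so
        # re-running the task changes nothing and should be skipped.
        echo "pcmklatest already current for $image; no-op"
    else
        # A real task would do: podman tag "$image:latest" "$image:pcmklatest"
        echo "re-tagging $image:pcmklatest"
    fi
}

retag_pcmklatest_if_needed rhosp15/openstack-cinder-volume
```

An id comparison like this would also make repeated runs under `--limit controller-0,controller-1` harmless on controller-0 while still doing the work on controller-1.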
After running `pcs resource cleanup`, pacemaker starts the services which "failed to pull image" without issues, and it doesn't attempt to pull those images. That makes me think it is indeed some race condition with the Ansible tasks which re-tag the `pcmklatest` images.

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Fri Jul 12 12:43:59 2019
Last change: Fri Jul 12 12:43:45 2019 by hacluster via crmd on controller-0

5 nodes configured
19 resources configured

Online: [ controller-0 controller-1 ]
GuestOnline: [ galera-bundle-0@controller-0 rabbitmq-bundle-0@controller-0 redis-bundle-0@controller-0 ]

Full list of resources:

 podman container: galera-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
 podman container: rabbitmq-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Starting controller-0
 podman container: redis-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
 ip-192.168.24.7        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.115          (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.13         (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.17         (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.10         (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.20         (ocf::heartbeat:IPaddr2):       Started controller-0
 podman container set: haproxy-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-0
   haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Stopped
   haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Stopped
 podman container: openstack-cinder-volume [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
Forgot to put a link to the image re-tagging tasks. They're included here: https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L317-L320 and defined here: https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L229-L256
I have a patch proposed here: https://review.opendev.org/#/c/673456/9

While the cluster status is not entirely correct when the upgrade is finished (I pasted it into the commit message there), the patch does at least get us through the upgrade without crashing. Galera scaled up fine to all 3 nodes, but other services only scaled up to 2. That is another bug to look at subsequently, but at least the patch should unblock the critical path in testing, and we can then focus on individual issues without the whole upgrade being outright blocked.
Fix merged and backported to stable/stein.
Re-setting Target Milestone z1 to --- to begin the 15z1 Maintenance Release.
openstack-tripleo-heat-templates-10.6.2-0.20190923210442.7db107a.el8ost - which is newer than openstack-tripleo-heat-templates-10.6.1-0.20190815230440.9adae50.el8ost.noarch - is available in the RHEL OSP 15.0 repositories.
Verified via automation:
http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/upgrades/view/upgrade/job/DFG-upgrades-upgrade-upgrade-14-15_director-rhel-virthost-3cont_2comp-ipv4-vxlan-poc/87/

Deploy logs at:
http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/upgrades/view/upgrade/job/DFG-upgrades-upgrade-upgrade-14-15_director-rhel-virthost-3cont_2comp-ipv4-vxlan-poc/87/artifact/undercloud-0.tar.gz

# Check openstack-tripleo-heat-templates version:
[r@r undercloud-0]$ grep openstack-tripleo-heat-templates-10 var/log/rpm.list
openstack-tripleo-heat-templates-10.6.2-0.20190923210442.7db107a.el8ost.noarch

# Check resource-agents version:
[r@r undercloud-0]$ grep -q "Installed: resource-agents-4.1.1-17.el8_0.6.x86_64" home/stack/overcloud_upgrade_run_controller-2.log && echo "resource-agents-4.1.1-17.el8_0.6.x86_64 installed on overcloud"
resource-agents-4.1.1-17.el8_0.6.x86_64 installed on overcloud
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:4030