Bug 1729192
| Summary: | OSP 14->15: upgrade of controller-1 fails due to cinder_volume_restart_bundle | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jiri Stransky <jstransk> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Jiri Stransky <jstransk> |
| Status: | CLOSED ERRATA | QA Contact: | pkomarov |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 15.0 (Stein) | CC: | mburns, michele, pkomarov |
| Target Milestone: | --- | Keywords: | TestOnly, Triaged, ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-tripleo-heat-templates-10.6.1-0.20190815230440.9adae50.el8ost.noarch | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-12-02 10:11:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1623864, 1727807 | ||
Description
Jiri Stransky
2019-07-11 14:23:52 UTC
After the upgrade, pcs status looks like this -- there are some failed actions from the past, but in the end everything is running.
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Thu Jul 11 14:26:49 2019
Last change: Thu Jul 11 13:59:37 2019 by root via cibadmin on controller-0
8 nodes configured
28 resources configured
Online: [ controller-0 controller-1 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 redis-bundle-0@controller-0 redis-bundle-1@controller-1 ]
Full list of resources:
podman container set: galera-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:pcmklatest]
galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
podman container set: rabbitmq-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest]
rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
podman container set: redis-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:pcmklatest]
redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
ip-192.168.24.7 (ocf::heartbeat:IPaddr2): Started controller-0
ip-10.0.0.115 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.1.13 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.1.17 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.3.10 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.4.20 (ocf::heartbeat:IPaddr2): Started controller-0
podman container set: haproxy-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:pcmklatest]
haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-1
haproxy-bundle-podman-2 (ocf::heartbeat:podman): Stopped
podman container: openstack-cinder-volume [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest]
openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-1
Failed Resource Actions:
* rabbitmq-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=22, status=complete, exitreason='',
last-rc-change='Thu Jul 11 13:26:19 2019', queued=0ms, exec=0ms
* redis-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=9, status=Error, exitreason='',
last-rc-change='Thu Jul 11 13:26:55 2019', queued=0ms, exec=0ms
* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=93, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest',
last-rc-change='Thu Jul 11 13:23:45 2019', queued=1ms, exec=2069ms
* galera-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=11, status=complete, exitreason='',
last-rc-change='Thu Jul 11 13:25:18 2019', queued=0ms, exec=0ms
* haproxy-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=84, status=complete, exitreason='',
last-rc-change='Thu Jul 11 13:24:29 2019', queued=0ms, exec=0ms
* redis-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=33, status=complete, exitreason='',
last-rc-change='Thu Jul 11 13:27:31 2019', queued=0ms, exec=0ms
* rabbitmq-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=6, status=Error, exitreason='',
last-rc-change='Thu Jul 11 13:26:19 2019', queued=0ms, exec=0ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Ok so the *_restart_bundle containers fundamentally do this:
1) Get invoked when paunch detects a change
2) Do something like the following:
if [ x"${TRIPLEO_MINOR_UPDATE,,}" != x"true" ] && /usr/sbin/pcs resource show openstack-cinder-volume; then /usr/sbin/pcs resource restart --wait=PCMKTIMEOUT openstack-cinder-volume; echo "openstack-cinder-volume restart invoked"; fi
I.e. if a resource exists we restart it.
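Unrolled for readability, the guard amounts to something like the sketch below. The `pcs` stub only simulates "resource exists" so the sketch runs outside a cluster; the real container invokes the actual pcs binary.

```shell
# Stub standing in for the real pcs binary -- always reports that the
# openstack-cinder-volume resource exists in the CIB.
pcs() { echo "Bundle: openstack-cinder-volume"; return 0; }

# The guard: succeed when this is NOT a minor update AND the resource
# exists in the CIB. Note it never checks whether the resource is
# actually *running* anywhere.
should_restart() {
  [ x"${TRIPLEO_MINOR_UPDATE,,}" != x"true" ] && \
    pcs resource show "$1" >/dev/null
}

if should_restart openstack-cinder-volume; then
  echo "openstack-cinder-volume restart invoked"
fi
```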
In this case it fails with:
TASK [Debug output for task: Start containers for step 5] **********************
Thursday 11 July 2019 13:59:52 +0000 (0:00:55.891) 0:29:35.127 *********
fatal: [controller-0]: FAILED! => {
"failed_when_result": true,
"outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
"Error running ['podman', 'run', '--name', 'cinder_volume_restart_bundle', '--label', 'config_id=tripleo_step5', '--label', 'container_name=cinder_volume_restart_bundle', '--label', 'managed_by=paunch', '--label', 'config_data={\"command\": [\"/usr/bin/bootstrap_host_exec\", \"cinder_volume\", \"if [ x\\\\\"${TRIPLEO_MINOR_UPDATE,,}\\\\\" != x\\\\\"true\\\\\" ] && /usr/sbin/pcs resource show openstack-cinder-volume; then /usr/sbin/pcs resource restart --wait=600 openstack-cinder-volume; echo \\\\\"openstack-cinder-volume restart invoked\\\\\"; fi\"], \"config_volume\": \"cinder\", \"detach\": false, \"environment\": [\"TRIPLEO_MINOR_UPDATE\", \"TRIPLEO_CONFIG_HASH=a8f699fd80eb5a32ffa283b5229704c0\"], \"image\": \"brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:latest\", \"ipc\": \"host\", \"net\": \"host\", \"start_order\": 0, \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro\", \"/var/lib/config-data/puppet-generated/cinder/:/var/lib/kolla/config_files/src:ro\"]}', '--conmon-pidfile=/var/run/cinder_volume_restart_bundle.pid', '--log-driver', 'json-file', '--log-opt', 'path=/var/log/containers/stdouts/cinder_volume_restart_bundle.log', '--env=TRIPLEO_MINOR_UPDATE', '--env=TRIPLEO_CONFIG_HASH=a8f699fd80eb5a32ffa283b5229704c0', '--net=host', '--ipc=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', 
'--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro', '--volume=/var/lib/config-data/puppet-generated/cinder/:/var/lib/kolla/config_files/src:ro', 'brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:latest', '/usr/bin/bootstrap_host_exec', 'cinder_volume', 'if [ x\"${TRIPLEO_MINOR_UPDATE,,}\" != x\"true\" ] && /usr/sbin/pcs resource show openstack-cinder-volume; then /usr/sbin/pcs resource restart --wait=600 openstack-cinder-volume; echo \"openstack-cinder-volume restart invoked\"; fi']. [1]",
"",
"stdout: Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.",
" Bundle: openstack-cinder-volume",
" Podman: image=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest network=host options=\"--ipc=host --privileged=true --user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\" replicas=1 run-command=\"/bin/bash /usr/local/bin/kolla_start\"",
" Storage Mapping:",
" options=ro source-dir=/etc/hosts target-dir=/etc/hosts (cinder-volume-etc-hosts)",
" options=ro source-dir=/etc/localtime target-dir=/etc/localtime (cinder-volume-etc-localtime)",
" options=ro source-dir=/etc/pki/ca-trust/extracted target-dir=/etc/pki/ca-trust/extracted (cinder-volume-etc-pki-ca-trust-extracted)",
" options=ro source-dir=/etc/pki/ca-trust/source/anchors target-dir=/etc/pki/ca-trust/source/anchors (cinder-volume-etc-pki-ca-trust-source-anchors)",
" options=ro source-dir=/etc/pki/tls/certs/ca-bundle.crt target-dir=/etc/pki/tls/certs/ca-bundle.crt (cinder-volume-etc-pki-tls-certs-ca-bundle.crt)",
" options=ro source-dir=/etc/pki/tls/certs/ca-bundle.trust.crt target-dir=/etc/pki/tls/certs/ca-bundle.trust.crt (cinder-volume-etc-pki-tls-certs-ca-bundle.trust.crt)",
" options=ro source-dir=/etc/pki/tls/cert.pem target-dir=/etc/pki/tls/cert.pem (cinder-volume-etc-pki-tls-cert.pem)",
" options=rw source-dir=/dev/log target-dir=/dev/log (cinder-volume-dev-log)",
" options=ro source-dir=/etc/ssh/ssh_known_hosts target-dir=/etc/ssh/ssh_known_hosts (cinder-volume-etc-ssh-ssh_known_hosts)",
" options=ro source-dir=/etc/puppet target-dir=/etc/puppet (cinder-volume-etc-puppet)",
" options=ro source-dir=/var/lib/kolla/config_files/cinder_volume.json target-dir=/var/lib/kolla/config_files/config.json (cinder-volume-var-lib-kolla-config_files-cinder_volume.json)",
" options=ro source-dir=/var/lib/config-data/puppet-generated/cinder/ target-dir=/var/lib/kolla/config_files/src (cinder-volume-var-lib-config-data-puppet-generated-cinder-)",
" options=ro source-dir=/etc/iscsi target-dir=/var/lib/kolla/config_files/src-iscsid (cinder-volume-etc-iscsi)",
" options=ro source-dir=/etc/ceph target-dir=/var/lib/kolla/config_files/src-ceph (cinder-volume-etc-ceph)",
" options=ro source-dir=/lib/modules target-dir=/lib/modules (cinder-volume-lib-modules)",
" options=rw source-dir=/dev/ target-dir=/dev/ (cinder-volume-dev-)",
" options=rw source-dir=/run/ target-dir=/run/ (cinder-volume-run-)",
" options=rw source-dir=/sys target-dir=/sys (cinder-volume-sys)",
" options=z source-dir=/var/lib/cinder target-dir=/var/lib/cinder (cinder-volume-var-lib-cinder)",
" options=z source-dir=/var/lib/iscsi target-dir=/var/lib/iscsi (cinder-volume-var-lib-iscsi)",
" options=z source-dir=/var/log/containers/cinder target-dir=/var/log/cinder (cinder-volume-var-log-containers-cinder)",
"stderr: Error: Error performing operation: No such device or address",
"openstack-cinder-volume is not running anywhere and so cannot be restarted",
That error comes from crm_resource as invoked by pcs and we end up in this function in tools/crm_resource_runtime.c:
static bool resource_is_running_on(resource_t *rsc, const char *host)
{
    bool found = TRUE;
    GListPtr hIter = NULL;
    GListPtr hosts = NULL;

    if (rsc == NULL) {
        return FALSE;
    }

    rsc->fns->location(rsc, &hosts, TRUE);
    for (hIter = hosts; host != NULL && hIter != NULL; hIter = hIter->next) {
        pe_node_t *node = (pe_node_t *) hIter->data;

        if (strcmp(host, node->details->uname) == 0) {
            crm_trace("Resource %s is running on %s\n", rsc->id, host);
            goto done;
        } else if (strcmp(host, node->details->id) == 0) {
            crm_trace("Resource %s is running on %s\n", rsc->id, host);
            goto done;
        }
    }

    if (host != NULL) {
        crm_trace("Resource %s is not running on: %s\n", rsc->id, host);
        found = FALSE;
    } else if (host == NULL && hosts == NULL) {
        crm_trace("Resource %s is not running\n", rsc->id);
        found = FALSE;
    }

done:
    g_list_free(hosts);
    return found;
}
So to me the most likely hypothesis is that:
A) pcs resource show openstack-cinder-volume did return 0
B) The openstack-cinder-volume resource was indeed not running anywhere and pcs/pcmk refuse to restart something that is not running.
Testing this theory on OSP15:
pcs resource disable openstack-cinder-volume
pcs resource show openstack-cinder-volume > /dev/null && echo $?
0
So we know we get inside the if branch normally even when the resource is down, which is expected. What is not expected is that restarting a resource that is not running fails outright:
[root@controller-0 ~]# pcs resource restart openstack-cinder-volume
Error: Error performing operation: No such device or address
openstack-cinder-volume is not running anywhere and so cannot be restarted
So I think the fix here should be that the guard around '/usr/sbin/pcs resource show openstack-cinder-volume' also considers the case when the resource is stopped for whatever reason.
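A minimal sketch of that hardened guard, with the cluster queries injected as arguments so the decision logic can run without a live cluster. In the real guard, `exists` would come from the exit status of `pcs resource show` and `locate_out` from something like `crm_resource --resource <rsc> --locate`; the function name and shape here are hypothetical.

```shell
# restart_decision: decide whether a pacemaker resource restart is safe.
#   $1 -- exit status of `pcs resource show <rsc>` (0 = resource defined)
#   $2 -- locate output describing where the resource is running
restart_decision() {
  exists="$1"
  locate_out="$2"
  if [ "$exists" -ne 0 ]; then
    echo "skip: resource not defined"
  elif ! printf '%s\n' "$locate_out" | grep -q "is running on"; then
    # Resource exists but runs nowhere -- `pcs resource restart` would
    # fail with "cannot be restarted", so skip instead of erroring out.
    echo "skip: resource not running anywhere"
  else
    echo "restart"
  fi
}

restart_decision 0 "resource openstack-cinder-volume is running on: controller-1"
# -> prints "restart"
```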
Before upgrading controller-1, the cinder-volume bundle was running on controller-0 but it got stopped and then errored:
* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=81, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest',
last-rc-change='Fri Jul 12 11:45:46 2019', queued=0ms, exec=3183ms
However, the image it can't pull is present on the node, so I'm not sure why it attempts to pull that tag at all:
[root@controller-0 ~]# podman images | grep cinder-volume
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume pcmklatest 16f3aca78029 12 hours ago 1.2 GB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume latest 16f3aca78029 12 hours ago 1.2 GB
So I think we have two possibly connected issues here:
1) The cinder-volume resource is probably still trying to upgrade itself on controller-0, despite having already been upgraded there, when we run with `--limit controller-0,controller-1`. We have a bunch of tasks which deal with the pcmklatest image tagging:
https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L260-L320
We probably need to improve the idempotency of those somehow.
1.A) We shouldn't stop and edit the resource when it in fact does not need to be stopped and edited. These tasks should be a complete no-op during `--limit controller-0,controller-1`, they only need to run once per cluster, and they ran with `--limit controller-0`.
https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L298-L307
1.B) I'm not sure if this is an issue, but I wonder if we should also prevent the re-tagging from happening when it's not needed. Perhaps it's atomic enough that re-execution doesn't matter, but I'm not sure... During `--limit controller-0,controller-1`, these tasks should be a no-op on controller-0 (as they were already applied there during `--limit controller-0`), but they must run on controller-1 where they're being executed for the first time.
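For 1.B, one possible shape of an idempotent re-tag is to make the tag move conditional on the image IDs actually differing. This is only a sketch: the IDs are passed in as arguments to keep it self-contained, whereas in practice they would come from something like `podman inspect --format '{{.Id}}'`; the `retag_needed` helper is hypothetical, not part of tripleo-heat-templates.

```shell
# retag_needed: true only when :latest resolves to a different image ID
# than :pcmklatest (or when pcmklatest does not exist yet, i.e. $2 empty).
retag_needed() {
  latest_id="$1"
  pcmk_id="$2"
  [ -n "$latest_id" ] && [ "$latest_id" != "$pcmk_id" ]
}

# Example with the ID from the `podman images` listing above: both tags
# already point at 16f3aca78029, so re-running the task is a no-op.
if retag_needed 16f3aca78029 16f3aca78029; then
  echo "would run: podman tag <image>:latest <image>:pcmklatest"
else
  echo "no-op: pcmklatest already points at the current image"
fi
```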
2) I'm puzzled about why pacemaker tried to pull the image when it was already present. Perhaps it is some momentary interaction with the re-tagging tasks (problem 1.B above), and when we fix that, this issue will disappear...
More services are probably affected by these issues; the current cluster state is:
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Fri Jul 12 12:40:17 2019
Last change: Fri Jul 12 12:23:04 2019 by hacluster via crmd on controller-0
5 nodes configured
19 resources configured
Online: [ controller-0 controller-1 ]
GuestOnline: [ galera-bundle-0@controller-0 redis-bundle-0@controller-0 ]
Full list of resources:
podman container: galera-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:pcmklatest]
galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
podman container: rabbitmq-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest]
rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped
podman container: redis-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:pcmklatest]
redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
ip-192.168.24.7 (ocf::heartbeat:IPaddr2): Started controller-0
ip-10.0.0.115 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.1.13 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.1.17 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.3.10 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.4.20 (ocf::heartbeat:IPaddr2): Started controller-0
podman container set: haproxy-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:pcmklatest]
haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
haproxy-bundle-podman-1 (ocf::heartbeat:podman): Stopped
haproxy-bundle-podman-2 (ocf::heartbeat:podman): Stopped
podman container: openstack-cinder-volume [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest]
openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Stopped
Failed Resource Actions:
* rabbitmq-bundle-podman-0_start_0 on controller-0 'unknown error' (1): call=97, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest',
last-rc-change='Fri Jul 12 11:49:07 2019', queued=0ms, exec=2019ms
* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=81, status=complete, exitreason='failed to pull image brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest',
last-rc-change='Fri Jul 12 11:45:46 2019', queued=0ms, exec=3183ms
* redis-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=9, status=Error, exitreason='',
last-rc-change='Fri Jul 12 11:49:39 2019', queued=0ms, exec=0ms
* galera-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=12, status=complete, exitreason='',
last-rc-change='Fri Jul 12 11:47:40 2019', queued=0ms, exec=0ms
* haproxy-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=86, status=complete, exitreason='',
last-rc-change='Fri Jul 12 11:47:40 2019', queued=0ms, exec=0ms
* redis-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=73, status=complete, exitreason='',
last-rc-change='Fri Jul 12 11:49:41 2019', queued=0ms, exec=0ms
* galera_monitor_10000 on galera-bundle-0 'not running' (7): call=139, status=complete, exitreason='',
last-rc-change='Fri Jul 12 11:48:45 2019', queued=0ms, exec=0ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
After running `pcs resource cleanup` pacemaker will start the services which "failed to pull image" without issues, and it doesn't attempt to pull those images. That makes me think it is indeed some race condition with the Ansible tasks which re-tag the `pcmklatest` images.
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Fri Jul 12 12:43:59 2019
Last change: Fri Jul 12 12:43:45 2019 by hacluster via crmd on controller-0
5 nodes configured
19 resources configured
Online: [ controller-0 controller-1 ]
GuestOnline: [ galera-bundle-0@controller-0 rabbitmq-bundle-0@controller-0 redis-bundle-0@controller-0 ]
Full list of resources:
podman container: galera-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:pcmklatest]
galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
podman container: rabbitmq-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:pcmklatest]
rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Starting controller-0
podman container: redis-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:pcmklatest]
redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
ip-192.168.24.7 (ocf::heartbeat:IPaddr2): Started controller-0
ip-10.0.0.115 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.1.13 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.1.17 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.3.10 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.4.20 (ocf::heartbeat:IPaddr2): Started controller-0
podman container set: haproxy-bundle [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:pcmklatest]
haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
haproxy-bundle-podman-1 (ocf::heartbeat:podman): Stopped
haproxy-bundle-podman-2 (ocf::heartbeat:podman): Stopped
podman container: openstack-cinder-volume [brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-volume:pcmklatest]
openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-0
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Forgot to put a link to the image re-tagging tasks. They're included here:
https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L317-L320
and defined here:
https://github.com/openstack/tripleo-heat-templates/blob/9e90d875c7dcb91d367090d574b99229574ce369/deployment/cinder/cinder-volume-pacemaker-puppet.yaml#L229-L256
I have a patch proposed here: https://review.opendev.org/#/c/673456/9
While the cluster status is not entirely correct when the upgrade is finished (I pasted it into the commit message there), the patch at least gets us through the upgrade without crashing. Galera scaled up fine to all 3 nodes, but other services only scaled up to 2. This is another bug to look at subsequently, but at least the patch should unblock the critical path in testing, and we can focus further on individual issues without the whole upgrade being outright blocked.
Fix merged and backported to stable/stein.
Re-setting Target Milestone z1 to --- to begin the 15z1 Maintenance Release.
openstack-tripleo-heat-templates-10.6.2-0.20190923210442.7db107a.el8ost - which is newer than openstack-tripleo-heat-templates-10.6.1-0.20190815230440.9adae50.el8ost.noarch - is available in RHEL OSP 15.0 repositories.
Verified via automation:
http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/upgrades/view/upgrade/job/DFG-upgrades-upgrade-upgrade-14-15_director-rhel-virthost-3cont_2comp-ipv4-vxlan-poc/87/
Deploy logs at:
http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/upgrades/view/upgrade/job/DFG-upgrades-upgrade-upgrade-14-15_director-rhel-virthost-3cont_2comp-ipv4-vxlan-poc/87/artifact/undercloud-0.tar.gz
#check openstack-tripleo-heat-templates version:
undercloud-0]$ grep openstack-tripleo-heat-templates-10 var/log/rpm.list
openstack-tripleo-heat-templates-10.6.2-0.20190923210442.7db107a.el8ost.noarch
#check resource-agents version:
[r@r undercloud-0]$ grep -q "Installed: resource-agents-4.1.1-17.el8_0.6.x86_64" home/stack/overcloud_upgrade_run_controller-2.log && echo "resource-agents-4.1.1-17.el8_0.6.x86_64 installed on overcloud"
resource-agents-4.1.1-17.el8_0.6.x86_64 installed on overcloud
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:4030