Bug 1810119 - OSP13 update from GA to latest, image registry name change makes pacemaker fail to restart.
Summary: OSP13 update from GA to latest, image registry name change makes pacemaker fail to restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z12
Target Release: 13.0 (Queens)
Assignee: Sofer Athlan-Guyot
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-04 15:13 UTC by Sofer Athlan-Guyot
Modified: 2020-06-24 11:33 UTC
4 users

Fixed In Version: openstack-tripleo-heat-templates-8.4.1-53.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-24 11:33:20 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1854730 0 None None None 2020-03-04 15:40:00 UTC
OpenStack gerrit 713412 0 None MERGED HA: minor update of arbitrary container image name 2021-02-03 22:27:21 UTC
Red Hat Product Errata RHBA-2020:2718 0 None None None 2020-06-24 11:33:56 UTC

Description Sofer Athlan-Guyot 2020-03-04 15:13:32 UTC
Description of problem:

Update from OSP13 GA to latest on RHEL 7.8 fails with this error:

     2020-03-02 16:30:47 | TASK [Debug output for task: Start containers for step 2] **********************
     2020-03-02 16:30:47 | Monday 02 March 2020  16:29:58 -0500 (0:08:32.890)       1:45:39.032 **********
     2020-03-02 16:30:47 | fatal: [controller-0]: FAILED! => {
             ....
     2020-03-02 16:30:47 |         "Warning: Undefined variable 'deploy_config_name'; ",
     2020-03-02 16:30:47 |         "   (file & line not available)",
     2020-03-02 16:30:47 |         "Warning: ModuleLoader: module 'pacemaker' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules",
     2020-03-02 16:30:47 |         "error: Could not connect to cluster (is it running?)",
     2020-03-02 16:30:47 |         "Warning: ModuleLoader: module 'rabbitmq' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules",
     2020-03-02 16:30:47 |         "Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200302-12-32oa59 failed with code: 1 -> Error: unable to get cib",
     2020-03-02 16:30:47 |         "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-controller-0]/Pcmk_property[property-controller-0-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200302-12-1w3xng4 failed with code: 1 -> Error: unable to get cib",
     2020-03-02 16:30:47 |         "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-controller-1]/Pcmk_property[property-controller-1-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200302-12-1mcggh8 failed with code: 1 -> Error: unable to get cib",
     2020-03-02 16:30:47 |         "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-controller-2]/Pcmk_property[property-controller-2-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200302-12-11hlqc9 failed with code: 1 -> Error: unable to get cib",
     2020-03-02 16:30:47 |         "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Resource::Bundle[rabbitmq-bundle]/Pcmk_bundle[rabbitmq-bundle]: Skipping because of failed dependencies",
     2020-03-02 16:30:47 |         "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Resource::Ocf[rabbitmq]/Pcmk_resource[rabbitmq]: Skipping because of failed dependencies",
     2020-03-02 16:30:47 |         "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]: Skipping because of failed dependencies",
     2020-03-02 16:30:47 |         "Error: Failed to apply catalog: Command is still failing after 180 seconds expired!",
     2020-03-02 16:30:47 |         "+ rc=1",
     2020-03-02 16:30:47 |         "+ set -e",
     2020-03-02 16:30:47 |         "+ set +ux",
     2020-03-02 16:30:47 |         "Error running ['docker', 'run', '--name', 'rabbitmq_init_bundle', '--label', 'config_id=tripleo_step2', '--label', 'container_name=rabbitmq_init_bundle', '--label', 'managed_by=paunch', '--label', 'config_data={\"start_order\": 1, \"image\": \"192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:20200220.1\", \"environment\": [\"TRIPLEO_DEPLOY_IDENTIFIER=1583169594\"], \"command\": [\"/docker_puppet_apply.sh\", \"2\", \"file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation,rabbitmq_policy,rabbitmq_user,rabbitmq_ready\", \"include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::rabbitmq_bundle\", \"--debug\"], \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\",\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/var/lib/docker-config-scripts/docker_puppet_apply.sh:/docker_puppet_apply.sh:ro\", \"/etc/puppet:/tmp/puppet-etc:ro\", \"/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro\", \"/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro\", \"/dev/shm:/dev/shm:rw\", \"/bin/true:/bin/epmd\"], \"net\": \"host\", \"detach\": false}', '--env=TRIPLEO_DEPLOY_IDENTIFIER=1583169594', '--net=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/var/lib/docker-config-scripts/docker_puppet_apply.sh:/docker_puppet_apply.sh:ro', '--volume=/etc/puppet:/tmp/puppet-etc:ro', '--volume=/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro', '--volume=/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro', '--volume=/dev/shm:/dev/shm:rw', '--volume=/bin/true:/bin/epmd', '--cpuset-cpus=0,1,2,3,4,5,6,7', '192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:20200220.1', '/docker_puppet_apply.sh', '2', 'file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation,rabbitmq_policy,rabbitmq_user,rabbitmq_ready', 'include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::rabbitmq_bundle', '--debug']. [1]",

On the cluster, the status is bad:

Last updated: Wed Mar  4 14:53:15 2020
Last change: Mon Mar  2 20:59:00 2020 by hacluster via crmd on controller-2

12 nodes configured
38 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Stopped
rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped
Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
galera-bundle-0      (ocf::heartbeat:galera):        Stopped
galera-bundle-1      (ocf::heartbeat:galera):        Stopped
galera-bundle-2      (ocf::heartbeat:galera):        Stopped
Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
redis-bundle-0       (ocf::heartbeat:redis): Stopped
redis-bundle-1       (ocf::heartbeat:redis): Stopped
redis-bundle-2       (ocf::heartbeat:redis): Stopped
ip-192.168.24.11       (ocf::heartbeat:IPaddr2):       Stopped
ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
ip-172.17.1.18 (ocf::heartbeat:IPaddr2):       Stopped
ip-172.17.1.10 (ocf::heartbeat:IPaddr2):       Stopped
ip-172.17.3.19 (ocf::heartbeat:IPaddr2):       Stopped
ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Stopped
Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Stopped
Docker container: openstack-cinder-backup [192.168.24.1:8787/rhosp13/openstack-cinder-backup:pcmklatest]
openstack-cinder-backup-docker-0     (ocf::heartbeat:docker):        Stopped

Failed Resource Actions:
* rabbitmq-bundle-docker-0_start_0 on controller-0 'unknown error' (1): call=46, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest',
last-rc-change='Mon Mar  2 21:14:18 2020', queued=0ms, exec=348ms
* rabbitmq-bundle-docker-1_start_0 on controller-0 'unknown error' (1): call=100, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest',
last-rc-change='Mon Mar  2 21:14:19 2020', queued=0ms, exec=388ms
....

The issue is that the repository path in the registry changed: the pacemaker bundles expect images under 192.168.24.1:8787/rhosp13/openstack-<service>, while the new puddle ships them under 192.168.24.1:8787/rh-osbs/rhosp13-openstack-<service>.

There are multiple re-tag actions:

2020-03-02 16:02:18 | TASK [Pull latest Redis images] ************************************************
2020-03-02 16:02:18 | Monday 02 March 2020  16:01:32 -0500 (0:00:00.785)       1:17:12.592 **********
2020-03-02 16:02:18 | changed: [controller-0] => {"changed": true, "cmd": ["docker", "pull", "192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:20200220.1"], "delta": "0:00:04.504547", "end": "2020-03-02 21:01:36.861551", "rc": 0, "start": "2020-03-02 21:01:32.357004", "stderr": "", "stderr_lines": [], "stdout": "Trying to pull repository 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis ... \n20200220.1: Pulling from 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis\nc9ff3e9281bc: Already exists\nf897b9608c98: Already exists\n70081c7899d3: Already exists\n1c1b3adaaec4: Pulling fs layer\nc83e69f55d16: Pulling fs layer\n1c1b3adaaec4: Verifying Checksum\n1c1b3adaaec4: Download complete\n1c1b3adaaec4: Pull complete\nc83e69f55d16: Verifying Checksum\nc83e69f55d16: Download complete\nc83e69f55d16: Pull complete\nDigest: sha256:df9148da34f58bcddbc8ab4dc582653fe333306c0eb12b837836d67295c12888\nStatus: Downloaded newer image for 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:20200220.1", "stdout_lines": ["Trying to pull repository 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis ... ", "20200220.1: Pulling from 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis", "c9ff3e9281bc: Already exists", "f897b9608c98: Already exists", "70081c7899d3: Already exists", "1c1b3adaaec4: Pulling fs layer", "c83e69f55d16: Pulling fs layer", "1c1b3adaaec4: Verifying Checksum", "1c1b3adaaec4: Download complete", "1c1b3adaaec4: Pull complete", "c83e69f55d16: Verifying Checksum", "c83e69f55d16: Download complete", "c83e69f55d16: Pull complete", "Digest: sha256:df9148da34f58bcddbc8ab4dc582653fe333306c0eb12b837836d67295c12888", "Status: Downloaded newer image for 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:20200220.1"]}
2020-03-02 16:02:18 |
2020-03-02 16:02:18 | TASK [Retag pcmklatest to latest Redis image] **********************************
2020-03-02 16:02:18 | Monday 02 March 2020  16:01:36 -0500 (0:00:04.879)       1:17:17.471 **********
2020-03-02 16:02:18 | changed: [controller-0] => {"changed": true, "cmd": "docker tag 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:20200220.1 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:pcmklatest", "delta": "0:00:00.031311", "end": "2020-03-02 21:01:37.268249", "rc": 0, "start": "2020-03-02 21:01:37.236938", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

2020-03-02 16:21:21 | Monday 02 March 2020  16:20:53 -0500 (0:00:15.026)       1:36:34.339 **********
2020-03-02 16:21:21 | ok: [controller-0] => {
        ....

2020-03-02 16:21:21 |         "$ docker run --name mysql_image_tag --label config_id=tripleo_step1 --label container_name=mysql_image_tag --label managed_by=paunch --label config_data={\"start_order\": 2, \"command\": [\"/bin/bash\", \"-c\", \"/usr/bin/docker tag '192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:20200220.1' '192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:pcmklatest'\"], \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/dev/shm:/dev/shm:rw\", \"/etc/sysconfig/docker:/etc/sysconfig/docker:ro\", \"/usr/bin/docker:/usr/bin/docker:ro\", \"/usr/bin/docker-current:/usr/bin/docker-current:ro\", \"/var/run/docker.sock:/var/run/docker.sock:rw\"], \"image\": \"192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:20200220.1\", \"detach\": false, \"net\": \"host\"} --net=host --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/dev/shm:/dev/shm:rw --volume=/etc/sysconfig/docker:/etc/sysconfig/docker:ro --volume=/usr/bin/docker:/usr/bin/docker:ro --volume=/usr/bin/docker-current:/usr/bin/docker-current:ro --volume=/var/run/docker.sock:/var/run/docker.sock:rw --cpuset-cpus=0,1,2,3,4,5,6,7 192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:20200220.1 /bin/bash -c /usr/bin/docker tag '192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:20200220.1' '192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:pcmklatest'"

but they tag pcmklatest under the new rh-osbs/rhosp13-openstack-<service> path instead of 192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest, which is the name expected by the resource.

This doesn't happen when updating to RHEL 7.7, presumably because the repository path doesn't change there.
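
As a rough sketch of a manual workaround (not the actual tripleo-heat-templates fix; the grep pattern and the rabbitmq example are assumptions for illustration), the new image can be retagged to whatever name the existing pacemaker bundle references in the CIB, rather than to a name derived from the new registry path:

     # Minimal sketch, assuming the new image is already pulled and the
     # cluster is reachable by pcs.
     NEW_IMAGE=192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:20200220.1

     # The bundle definition in the CIB carries the image name the resource
     # expects, e.g. 192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest.
     OLD_IMAGE=$(pcs cluster cib | grep -o 'image="[^"]*rabbitmq[^"]*"' | head -n1 | cut -d'"' -f2)

     docker tag "$NEW_IMAGE" "$OLD_IMAGE"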

Version-Release number of selected component (if applicable):

openstack-tripleo-heat-templates-8.4.1-42.el7ost.noarch
Red Hat Enterprise Linux Server release 7.8 (Maipo)
Update from GA to 2020-02-24.2

How reproducible: all the time.

Comment 1 Sofer Athlan-Guyot 2020-03-04 15:40:00 UTC
Solved upstream for OSP16; it still needs to be backported.

Comment 2 Sofer Athlan-Guyot 2020-03-10 18:23:10 UTC
Hi,

so the last puddle we can update to is 2020-01-15.3 [1]. Starting with 2020-02-10.8 [2] there is a new path in the registry that breaks the update of the HA containers.

Note that CI can still be green, as pacemaker can recover during the update, and thus the breakage (which needs to be formally analysed) can stay unseen in CI. This means the following sequence is happening:


   1. stop pacemaker on ctl-0; ctl-1 and ctl-2 are still up and running;
   2. update the resource with the new pcmklatest image on ctl-0;
   3. the change is picked up right away by ctl-1 and ctl-2; they try to pull the new image and fail;
   4. at that point all HA services are down everywhere except on ctl-0.

So at step 3 we shouldn't have an API outage, as ctl-0 will take the load, but ctl-1 and ctl-2 will be down. They will recover when we get to updating those nodes, but we lose high availability during the update.
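
To confirm the mismatch on a given controller, one can compare the image names the bundles reference in the CIB against what the local pcmklatest tags actually point to (a diagnostic sketch; the grep patterns are assumptions):

     # Image names the pacemaker bundles expect.
     pcs cluster cib | grep -o 'image="[^"]*"' | sort -u

     # Repositories the local pcmklatest tags actually live under.
     docker images --format '{{.Repository}}:{{.Tag}}' | grep pcmklatest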

There may be other consequences that need further analysis.

Thanks,

[1] http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2020-01-15.3/overcloud_container_image_prepare.yaml
[2] http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2020-02-10.8/overcloud_container_image_prepare.yaml

Comment 18 errata-xmlrpc 2020-06-24 11:33:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2718

