Bug 1810119

Summary: OSP13 update from GA to latest: container image registry name change makes pacemaker fail to restart.
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Target Milestone: z12
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: TestBlocker, Triaged, ZStream
Reporter: Sofer Athlan-Guyot <sathlang>
Assignee: Sofer Athlan-Guyot <sathlang>
QA Contact: Sasha Smolyak <ssmolyak>
CC: chjones, jjoyce, lmiccini, mburns
Fixed In Version: openstack-tripleo-heat-templates-8.4.1-53.el7ost
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-06-24 11:33:20 UTC

Description Sofer Athlan-Guyot 2020-03-04 15:13:32 UTC
Description of problem:

Update from OSP13 GA to latest with RHEL 7.8 fails with this error:

     2020-03-02 16:30:47 | TASK [Debug output for task: Start containers for step 2] **********************
     2020-03-02 16:30:47 | Monday 02 March 2020  16:29:58 -0500 (0:08:32.890)       1:45:39.032 **********
     2020-03-02 16:30:47 | fatal: [controller-0]: FAILED! => {
             ....
     2020-03-02 16:30:47 |         "Warning: Undefined variable 'deploy_config_name'; ",
     2020-03-02 16:30:47 |         "   (file & line not available)",
     2020-03-02 16:30:47 |         "Warning: ModuleLoader: module 'pacemaker' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules",
     2020-03-02 16:30:47 |         "error: Could not connect to cluster (is it running?)",
     2020-03-02 16:30:47 |         "Warning: ModuleLoader: module 'rabbitmq' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules",
     2020-03-02 16:30:47 |         "Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200302-12-32oa59 failed with code: 1 -> Error: unable to get cib",
     2020-03-02 16:30:47 |         "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-controller-0]/Pcmk_property[property-controller-0-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200302-12-1w3xng4 failed with code: 1 -> Error: unable to get cib",
     2020-03-02 16:30:47 |         "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-controller-1]/Pcmk_property[property-controller-1-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200302-12-1mcggh8 failed with code: 1 -> Error: unable to get cib",
     2020-03-02 16:30:47 |         "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-controller-2]/Pcmk_property[property-controller-2-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200302-12-11hlqc9 failed with code: 1 -> Error: unable to get cib",
     2020-03-02 16:30:47 |         "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Resource::Bundle[rabbitmq-bundle]/Pcmk_bundle[rabbitmq-bundle]: Skipping because of failed dependencies",
     2020-03-02 16:30:47 |         "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Resource::Ocf[rabbitmq]/Pcmk_resource[rabbitmq]: Skipping because of failed dependencies",
     2020-03-02 16:30:47 |         "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]: Skipping because of failed dependencies",
     2020-03-02 16:30:47 |         "Error: Failed to apply catalog: Command is still failing after 180 seconds expired!",
     2020-03-02 16:30:47 |         "+ rc=1",
     2020-03-02 16:30:47 |         "+ set -e",
     2020-03-02 16:30:47 |         "+ set +ux",
     2020-03-02 16:30:47 |         "Error running ['docker', 'run', '--name', 'rabbitmq_init_bundle', '--label', 'config_id=tripleo_step2', '--label', 'container_name=rabbitmq_init_bundle', '--label', 'managed_by=paunch', '--label', 'config_data={\"start_order\": 1, \"image\": \"192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:20200220.1\", \"environment\": [\"TRIPLEO_DEPLOY_IDENTIFIER=1583169594\"], \"command\": [\"/docker_puppet_apply.sh\", \"2\", \"file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation,rabbitmq_policy,rabbitmq_user,rabbitmq_ready\", \"include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::rabbitmq_bundle\", \"--debug\"], \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\",\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/var/lib/docker-config-scripts/docker_puppet_apply.sh:/docker_puppet_apply.sh:ro\", \"/etc/puppet:/tmp/puppet-etc:ro\", \"/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro\", \"/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro\", \"/dev/shm:/dev/shm:rw\", \"/bin/true:/bin/epmd\"], \"net\": \"host\", \"detach\": false}', '--env=TRIPLEO_DEPLOY_IDENTIFIER=1583169594', '--net=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/var/lib/docker-config-scripts/docker_puppet_apply.sh:/docker_puppet_apply.sh:ro', '--volume=/etc/puppet:/tmp/puppet-etc:ro', '--volume=/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro', '--volume=/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro', '--volume=/dev/shm:/dev/shm:rw', '--volume=/bin/true:/bin/epmd', '--cpuset-cpus=0,1,2,3,4,5,6,7', '192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:20200220.1', '/docker_puppet_apply.sh', '2', 'file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation,rabbitmq_policy,rabbitmq_user,rabbitmq_ready', 'include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::rabbitmq_bundle', '--debug']. [1]",

On the cluster, the status is bad; every resource is stopped:

Last updated: Wed Mar  4 14:53:15 2020
Last change: Mon Mar  2 20:59:00 2020 by hacluster via crmd on controller-2

12 nodes configured
38 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Stopped
rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped
Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
galera-bundle-0      (ocf::heartbeat:galera):        Stopped
galera-bundle-1      (ocf::heartbeat:galera):        Stopped
galera-bundle-2      (ocf::heartbeat:galera):        Stopped
Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
redis-bundle-0       (ocf::heartbeat:redis): Stopped
redis-bundle-1       (ocf::heartbeat:redis): Stopped
redis-bundle-2       (ocf::heartbeat:redis): Stopped
ip-192.168.24.11       (ocf::heartbeat:IPaddr2):       Stopped
ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
ip-172.17.1.18 (ocf::heartbeat:IPaddr2):       Stopped
ip-172.17.1.10 (ocf::heartbeat:IPaddr2):       Stopped
ip-172.17.3.19 (ocf::heartbeat:IPaddr2):       Stopped
ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Stopped
Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Stopped
Docker container: openstack-cinder-backup [192.168.24.1:8787/rhosp13/openstack-cinder-backup:pcmklatest]
openstack-cinder-backup-docker-0     (ocf::heartbeat:docker):        Stopped

Failed Resource Actions:
* rabbitmq-bundle-docker-0_start_0 on controller-0 'unknown error' (1): call=46, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest',
last-rc-change='Mon Mar  2 21:14:18 2020', queued=0ms, exec=348ms
* rabbitmq-bundle-docker-1_start_0 on controller-0 'unknown error' (1): call=100, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest',
last-rc-change='Mon Mar  2 21:14:19 2020', queued=0ms, exec=388ms
....

The issue is that the repository path in the registry changed: images that previously lived under 192.168.24.1:8787/rhosp13/openstack-<service> are now published under 192.168.24.1:8787/rh-osbs/rhosp13-openstack-<service>.

There are multiple re-tag actions:

2020-03-02 16:02:18 | TASK [Pull latest Redis images] ************************************************
2020-03-02 16:02:18 | Monday 02 March 2020  16:01:32 -0500 (0:00:00.785)       1:17:12.592 **********
2020-03-02 16:02:18 | changed: [controller-0] => {"changed": true, "cmd": ["docker", "pull", "192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:20200220.1"], "delta": "0:00:04.504547", "end": "2020-03-02 21:01:36.861551", "rc": 0, "start": "2020-03-02 21:01:32.357004", "stderr": "", "stderr_lines": [], "stdout": "Trying to pull repository 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis ... \n20200220.1: Pulling from 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis\nc9ff3e9281bc: Already exists\nf897b9608c98: Already exists\n70081c7899d3: Already exists\n1c1b3adaaec4: Pulling fs layer\nc83e69f55d16: Pulling fs layer\n1c1b3adaaec4: Verifying Checksum\n1c1b3adaaec4: Download complete\n1c1b3adaaec4: Pull complete\nc83e69f55d16: Verifying Checksum\nc83e69f55d16: Download complete\nc83e69f55d16: Pull complete\nDigest: sha256:df9148da34f58bcddbc8ab4dc582653fe333306c0eb12b837836d67295c12888\nStatus: Downloaded newer image for 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:20200220.1", "stdout_lines": ["Trying to pull repository 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis ... ", "20200220.1: Pulling from 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis", "c9ff3e9281bc: Already exists", "f897b9608c98: Already exists", "70081c7899d3: Already exists", "1c1b3adaaec4: Pulling fs layer", "c83e69f55d16: Pulling fs layer", "1c1b3adaaec4: Verifying Checksum", "1c1b3adaaec4: Download complete", "1c1b3adaaec4: Pull complete", "c83e69f55d16: Verifying Checksum", "c83e69f55d16: Download complete", "c83e69f55d16: Pull complete", "Digest: sha256:df9148da34f58bcddbc8ab4dc582653fe333306c0eb12b837836d67295c12888", "Status: Downloaded newer image for 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:20200220.1"]}
2020-03-02 16:02:18 |
2020-03-02 16:02:18 | TASK [Retag pcmklatest to latest Redis image] **********************************
2020-03-02 16:02:18 | Monday 02 March 2020  16:01:36 -0500 (0:00:04.879)       1:17:17.471 **********
2020-03-02 16:02:18 | changed: [controller-0] => {"changed": true, "cmd": "docker tag 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:20200220.1 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:pcmklatest", "delta": "0:00:00.031311", "end": "2020-03-02 21:01:37.268249", "rc": 0, "start": "2020-03-02 21:01:37.236938", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

2020-03-02 16:21:21 | Monday 02 March 2020  16:20:53 -0500 (0:00:15.026)       1:36:34.339 **********
2020-03-02 16:21:21 | ok: [controller-0] => {
        ....

2020-03-02 16:21:21 |         "$ docker run --name mysql_image_tag --label config_id=tripleo_step1 --label container_name=mysql_image_tag --label managed_by=paunch --label config_data={\"start_order\": 2, \"command\": [\"/bin/bash\", \"-c\", \"/usr/bin/docker tag '192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:20200220.1' '192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:pcmklatest'\"], \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/dev/shm:/dev/shm:rw\", \"/etc/sysconfig/docker:/etc/sysconfig/docker:ro\", \"/usr/bin/docker:/usr/bin/docker:ro\", \"/usr/bin/docker-current:/usr/bin/docker-current:ro\", \"/var/run/docker.sock:/var/run/docker.sock:rw\"], \"image\": \"192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:20200220.1\", \"detach\": false, \"net\": \"host\"} --net=host --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/dev/shm:/dev/shm:rw --volume=/etc/sysconfig/docker:/etc/sysconfig/docker:ro --volume=/usr/bin/docker:/usr/bin/docker:ro --volume=/usr/bin/docker-current:/usr/bin/docker-current:ro --volume=/var/run/docker.sock:/var/run/docker.sock:rw --cpuset-cpus=0,1,2,3,4,5,6,7 192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:20200220.1 /bin/bash -c /usr/bin/docker tag '192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:20200220.1' '192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:pcmklatest'"

but they tag pcmklatest under the new path (192.168.24.1:8787/rh-osbs/rhosp13-openstack-*) instead of 192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest, which is what the pacemaker resource still expects.
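
The mismatch, and a possible manual workaround, can be sketched as follows. This is only an illustration using the image names and tag (20200220.1) from the logs above, not the shipped fix (which landed in openstack-tripleo-heat-templates):

     # What the pacemaker bundle still references (pcs 0.9 syntax on RHEL 7):
     pcs resource show rabbitmq-bundle | grep image
     #   image=192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest
     # What the update actually retagged locally:
     docker images | grep pcmklatest
     #   192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq   pcmklatest   ...
     # Retagging the new image under the old path would reconcile the two:
     docker tag 192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:20200220.1 \
                192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest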

This does not happen when updating to RHEL 7.7, presumably because the repository path does not change there.

Version-Release number of selected component (if applicable):

openstack-tripleo-heat-templates-8.4.1-42.el7ost.noarch
Red Hat Enterprise Linux Server release 7.8 (Maipo)
Update from GA to 2020-02-24.2

How reproducible: all the time.

Comment 1 Sofer Athlan-Guyot 2020-03-04 15:40:00 UTC
Solved upstream for OSP16; it needs to be backported somehow.

Comment 2 Sofer Athlan-Guyot 2020-03-10 18:23:10 UTC
Hi,

so the last puddle we can update to is 2020-01-15.3 [1]. Starting with 2020-02-10.8 [2], there is a new path in the registry that breaks the update of the HA containers.
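
The path change can be eyeballed by comparing the two files from [1] and [2]; a quick sketch (the grep pattern and expected contrast are illustrative, any HA image name works):

    curl -s http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2020-01-15.3/overcloud_container_image_prepare.yaml | grep -i rabbitmq
    # expected (illustrative): a path under rhosp13/openstack-rabbitmq
    curl -s http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2020-02-10.8/overcloud_container_image_prepare.yaml | grep -i rabbitmq
    # expected (illustrative): a path under rh-osbs/rhosp13-openstack-rabbitmq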

Note that CI can still be green, because pacemaker can recover during the update, and thus the "breakage" (which needs to be formally analysed) can go unseen in
CI. This means the following sequence happens:


   1. stop pacemaker on ctl-0; ctl-1 and ctl-2 are still up and running;
   2. update the resource with the new pcmklatest image on ctl-0;
   3. the change is picked up right away by ctl-1 and ctl-2; they try to pull the new image and fail;
   4. at that point, all HA services are down except on ctl-0.

So at step 3 we should not see a cut in the API, as ctl-0 will take the load, but ctl-1 and ctl-2 will be down.  They will recover when we get to updating those nodes, but we lose high availability during the update.
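
During that window, the degraded state would look something like this (hypothetical pcs status excerpt, modeled on the output format earlier in this report):

    pcs status | grep -E 'rabbitmq-bundle-[0-9]'
    #  rabbitmq-bundle-0   (ocf::heartbeat:rabbitmq-cluster):   Started controller-0
    #  rabbitmq-bundle-1   (ocf::heartbeat:rabbitmq-cluster):   Stopped
    #  rabbitmq-bundle-2   (ocf::heartbeat:rabbitmq-cluster):   Stopped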

There may be other consequences that need further analysis.

Thanks,

[1] http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2020-01-15.3/overcloud_container_image_prepare.yaml
[2] http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2020-02-10.8/overcloud_container_image_prepare.yaml

Comment 18 errata-xmlrpc 2020-06-24 11:33:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2718