Description of problem:
-----------------------
After the minor update finishes on nodes running pcs services (galera/redis/haproxy), those containers are not restarted with the latest images.

[root@controller-2 ~]# docker images | grep haproxy
192.168.24.1:8787/rhosp12/openstack-haproxy    12.0-20171201.1   3ad3a1214956   6 days ago   781 MB
192.168.24.1:8787/rhosp12/openstack-haproxy    pcmklatest        3ad3a1214956   6 days ago   781 MB
192.168.24.1:8787/rhosp12/openstack-haproxy    12.0-20171129.1   2a9767dddf79   8 days ago   774.4 MB

[root@controller-2 ~]# docker ps | grep -v 12.0-20171201.1
CONTAINER ID   IMAGE          COMMAND                  CREATED          STATUS                    PORTS   NAMES
a8f4bf26dba1   2a9767dddf79   "/bin/bash /usr/local"   36 minutes ago   Up 36 minutes                     haproxy-bundle-docker-2
2fc167ecda48   03bca6ccbf7f   "/bin/bash /usr/local"   36 minutes ago   Up 36 minutes                     redis-bundle-docker-2
7097c4bcc0d8   6d6f0bc78831   "/bin/bash /usr/local"   36 minutes ago   Up 36 minutes                     galera-bundle-docker-2
2eceb3d172cc   716b358a3921   "/bin/bash /usr/local"   36 minutes ago   Up 36 minutes (healthy)           rabbitmq-bundle-docker-2
ed6200f0c6a7   docker-registry.engineering.redhat.com/ceph/rhceph-2-rhel7:2.4-4   "/entrypoint.sh"   About an hour ago   Up About an hour   ceph-mon-controller-2

[root@controller-2 ~]# docker images | grep redis
192.168.24.1:8787/rhosp12/openstack-redis      12.0-20171201.1   d0f7ca7536a3   6 days ago   780.7 MB
192.168.24.1:8787/rhosp12/openstack-redis      pcmklatest        d0f7ca7536a3   6 days ago   780.7 MB
192.168.24.1:8787/rhosp12/openstack-redis      12.0-20171129.1   03bca6ccbf7f   8 days ago   774.1 MB

[root@controller-2 ~]# docker images | grep rabbit
192.168.24.1:8787/rhosp12/openstack-rabbitmq   12.0-20171201.1   1c544a8ea1af   6 days ago   815.4 MB
192.168.24.1:8787/rhosp12/openstack-rabbitmq   pcmklatest        1c544a8ea1af   6 days ago   815.4 MB
192.168.24.1:8787/rhosp12/openstack-rabbitmq   12.0-20171129.1   716b358a3921   8 days ago   808.8 MB

[root@controller-2 ~]# docker images | grep mari
192.168.24.1:8787/rhosp12/openstack-mariadb    12.0-20171201.1   491a0fe8d922   6 days ago   911.5 MB
192.168.24.1:8787/rhosp12/openstack-mariadb    pcmklatest        491a0fe8d922   6 days ago   911.5 MB
192.168.24.1:8787/rhosp12/openstack-mariadb    12.0-20171129.1   6d6f0bc78831   8 days ago   904.9 MB

From update log we can see that docker run uses correct image:
--------------------------------------------------------------
u' "Digest: sha256:cc2420e3dd8d989d0f86dd7dd3912d37d921fb4e0b1376889fbfb42b1b2b66c7", ',
u' "2017-12-08 11:10:17,607 DEBUG: 393285 -- NET_HOST enabled", ',
u' "2017-12-08 11:10:17,608 DEBUG: 393285 -- Running docker command: /usr/bin/docker run --user root --name docker-puppet-haproxy --health-cmd /bin/true --env PUPPET_TAGS=file,file_line,concat,augeas,cron,haproxy_config --env NAME=haproxy --env HOSTNAME=controller-0 --env NO_ARCHIVE= --env STEP=6 --volume /tmp/tmpRcXUdO:/etc/config.pp:ro --volume /etc/puppet/:/tmp/puppet-etc/:ro --volume /usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro --volume /var/lib/config-data:/var/lib/config-data/:rw --volume tripleo_logs:/var/log/tripleo/ --volume /dev/log:/dev/log --volume /etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume /etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume /etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume /etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume /var/lib/docker-puppet/docker-puppet.sh:/var/lib/docker-puppet/docker-puppet.sh:rw --volume /etc/ipa/ca.crt:/etc/ipa/ca.crt:ro --volume /etc/pki/tls/private/haproxy:/etc/pki/tls/private/haproxy:ro --volume /etc/pki/tls/certs/haproxy:/etc/pki/tls/certs/haproxy:ro --volume /etc/pki/tls/private/overcloud_endpoint.pem:/etc/pki/tls/private/overcloud_endpoint.pem:ro --entrypoint /var/lib/docker-puppet/docker-puppet.sh --net host --volume /etc/hosts:/etc/hosts:ro 192.168.24.1:8787/rhosp12/openstack-haproxy:12.0-20171201.1", ',
u' "2017-12-08 11:10:18,032 DEBUG: 393286 -- Trying to pull repository 192.168.24.1:8787/rhosp12/openstack-ceilometer-central ... ", ',

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-common-7.6.3-8.el7ost.noarch
openstack-tripleo-image-elements-7.0.1-1.el7ost.noarch
openstack-tripleo-common-containers-7.6.3-8.el7ost.noarch
openstack-tripleo-ui-7.4.3-4.el7ost.noarch
openstack-tripleo-validations-7.4.2-1.el7ost.noarch
puppet-tripleo-7.4.3-11.el7ost.noarch
openstack-tripleo-heat-templates-7.0.3-18.el7ost.noarch
openstack-tripleo-puppet-elements-7.0.1-2.el7ost.noarch
python-tripleoclient-7.3.3-7.el7ost.noarch
pcs-0.9.158-6.el7_4.1.x86_64
puppet-pacemaker-0.6.0-2.el7ost.noarch
pacemaker-cli-1.1.16-12.el7_4.5.x86_64
ansible-pacemaker-1.0.3-2.el7ost.noarch
pacemaker-1.1.16-12.el7_4.5.x86_64
userspace-rcu-0.7.16-1.el7cp.x86_64
pacemaker-remote-1.1.16-12.el7_4.5.x86_64
pacemaker-libs-1.1.16-12.el7_4.5.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.5.x86_64
docker-rhel-push-plugin-1.12.6-68.gitec8512b.el7.x86_64
python-docker-pycreds-1.10.6-3.el7.noarch
docker-common-1.12.6-68.gitec8512b.el7.x86_64
python-heat-agent-docker-cmd-1.4.0-1.el7ost.noarch
docker-client-1.12.6-68.gitec8512b.el7.x86_64
docker-1.12.6-68.gitec8512b.el7.x86_64
python-docker-py-1.10.6-3.el7.noarch

Steps to Reproduce:
-------------------
1. Update uc to 2017-12-01.4
2. Set up latest repos on oc
3. Run init-minor-update to set up heat output
4. Run minor update of oc nodes
5. Upload latest docker images to uc registry
6. Generate file with images
7. Run init-minor-update to set up heat output
8. Run minor update against nodes hosting pcs services

Actual results:
---------------
The latest images are downloaded to the nodes and correctly retagged for pcs, but the pcs-managed containers are started with the previous images.

Expected results:
-----------------
The pcs-managed services are restarted with the latest images.

Additional info:
----------------
Virtual setup: 3 controllers + 2 computes + 3 ceph
Re-running the update command restarts the services with the correct images.
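
The mismatch shown in the description (a running bundle container whose image ID differs from the one tagged pcmklatest) can be checked per container. The following is only a rough sketch: the function name runs_pcmklatest is hypothetical and not part of any tripleo tooling, and it assumes that "docker inspect" reports full image IDs for both the container and the tag.

```shell
# Hypothetical helper, not part of tripleo: succeeds (exit 0) when the
# container was created from the image currently tagged <repo>:pcmklatest,
# fails with 1 on a mismatch and with 2 if either inspect call fails.
runs_pcmklatest() {
    container="$1"
    repo="$2"
    # Full ID of the image the container was created from.
    running=$(docker inspect --format '{{.Image}}' "$container") || return 2
    # Full ID of the image the pcmklatest tag currently points to.
    wanted=$(docker inspect --format '{{.Id}}' "${repo}:pcmklatest") || return 2
    [ "$running" = "$wanted" ]
}
```

For example, on controller-2 one might run: runs_pcmklatest haproxy-bundle-docker-2 192.168.24.1:8787/rhosp12/openstack-haproxy || echo "haproxy still runs an old image"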
Hi, so in the ansible run we can see (for haproxy for instance):

u'TASK [Get a list of container using Haproxy image] *****************************',
u'skipping: [192.168.24.20]',
u'',
u'TASK [Remove any container using the same Haproxy image] ***********************',
u'skipping: [192.168.24.20]',
u'',
u'TASK [Remove previous Haproxy images] ******************************************',
u'skipping: [192.168.24.20]',
u'',
u'TASK [Pull latest Haproxy images] **********************************************',
u'skipping: [192.168.24.20]',
u'',
u'TASK [Retag pcmklatest to latest Haproxy image] ********************************',
u'skipping: [192.168.24.20]',
u'',

The crucial tasks are skipped.
The previous comment should be ignored; those tasks are run later on.
So the problem seems to be that we stop the pcs cluster at step 1 and search for pcs-managed containers at step 2. By then the containers are stopped, and we run 'docker ps -q -f ancestor=<image_id>', which by default shows only running containers.
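
To illustrate the docker behaviour this theory relies on: "docker ps" lists only running containers unless -a is given. A minimal sketch, assuming a wrapper is wanted; the function name containers_using_image is made up for illustration and this is not the patch that was eventually merged.

```shell
# Hypothetical helper: print the IDs of all containers, running or stopped,
# that were created from the given image. The -a flag is what makes stopped
# containers visible; without it only running ones are listed.
containers_using_image() {
    image_id="$1"
    docker ps -a -q -f "ancestor=${image_id}"
}
```

Called as "containers_using_image 2a9767dddf79", this would also return the haproxy bundle container after the pcs cluster has been stopped.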
So Damien, Yurii and I spent some more time on this. We started from a clean environment and could not reproduce the problem:
- Each controller did exactly what we expected and updated to the latest pacemaker image.

Tomorrow we will run some more tests. Right now the theory is that some additional steps need to happen for us to see the issue (maybe rerunning some steps, like init-minor-update or the config download, multiple times). I think we need to fully understand the root cause before we start throwing patches at the problem.
So the issue here is that the config container is updated before the heat stack update has finished; that's why the config doesn't get all the latest docker images. For GA, the workaround would be to run --init-minor-update twice, and only when we want to update the docker registry file. For 0-day or a Z release, I have something that fixes this wrong behavior.
LP and master review attached
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:0602