Description of problem: The nova_hybrid_state tasks[1] fail while performing a framework upgrade on the latest OSP bits.

~~~
[1] openstack overcloud upgrade run --stack msufiyan --playbook upgrade_steps_playbook.yaml --tags nova_hybrid_state --limit all --yes
~~~

- The tasks below updated the file "/var/lib/tripleo-config/docker-container-startup-config-step_4.json" with the new nova-compute container image (per OSP 16.1):

~~~
TASK [Check if we need to update the paunch config] ****************************
Wednesday 11 November 2020 17:27:41 -0500 (0:00:00.336) 0:00:08.321 ****
changed: [msufiyan-novacomputeiha-1] => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": true, "cmd": "set -o pipefail\njq .\"nova_compute\".\"image\" /var/lib/tripleo-config/docker-container-startup-config-step_4.json\n", "delta": "0:00:00.043347", "end": "2020-11-11 22:27:41.687854", "rc": 0, "start": "2020-11-11 22:27:41.644507", "stderr": "", "stderr_lines": [], "stdout": "\"192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155\"", "stdout_lines": ["\"192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155\""]}
changed: [msufiyan-novacomputeiha-0] => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": true, "cmd": "set -o pipefail\njq .\"nova_compute\".\"image\" /var/lib/tripleo-config/docker-container-startup-config-step_4.json\n", "delta": "0:00:00.062543", "end": "2020-11-11 22:27:41.720983", "rc": 0, "start": "2020-11-11 22:27:41.658440", "stderr": "", "stderr_lines": [], "stdout": "\"192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155\"", "stdout_lines": ["\"192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155\""]}

TASK [Update the nova_compute paunch image in config] **************************
Wednesday 11 November 2020 17:27:41 -0500 (0:00:00.602) 0:00:08.923 ****
changed: [msufiyan-novacomputeiha-0] => {"changed": true, "cmd": "set -o pipefail\ncat <<< $(jq '.nova_compute.image = \"192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1\"' /var/lib/tripleo-config/docker-container-startup-config-step_4.json) >\\\n/var/lib/tripleo-config/docker-container-startup-config-step_4.json\n", "delta": "0:00:00.009826", "end": "2020-11-11 22:27:41.995720", "rc": 0, "start": "2020-11-11 22:27:41.985894", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
changed: [msufiyan-novacomputeiha-1] => {"changed": true, "cmd": "set -o pipefail\ncat <<< $(jq '.nova_compute.image = \"192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1\"' /var/lib/tripleo-config/docker-container-startup-config-step_4.json) >\\\n/var/lib/tripleo-config/docker-container-startup-config-step_4.json\n", "delta": "0:00:00.013412", "end": "2020-11-11 22:27:42.017576", "rc": 0, "start": "2020-11-11 22:27:42.004164", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
~~~

~~~
[root@msufiyan-novacomputeiha-1 ~]# cat /var/lib/tripleo-config/docker-container-startup-config-step_4.json | jq . | grep -i 16.1
"image": "192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1",
~~~

- But the subsequent paunch task failed, because it was still trying to pull the old container images that remained in "/var/lib/tripleo-config/docker-container-startup-config-step_4.json":

~~~
TASK [Apply paunch config] *****************************************************
Wednesday 11 November 2020 17:27:57 -0500 (0:00:14.272) 0:00:25.081 ****
fatal: [msufiyan-novacomputeiha-0]: FAILED! => {"changed": true, "cmd": "paunch apply --file /var/lib/tripleo-config/docker-container-startup-config-step_4.json --config-id tripleo_step4", "delta": "0:02:14.699104", "end": "2020-11-11 22:30:12.860224", "msg": "non-zero return code", "rc": 1, "start": "2020-11-11 22:27:58.161120", "stderr": "", "stderr_lines": [], "stdout": "Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1\nError executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1\nError executing
~~~

~~~
[root@msufiyan-novacomputeiha-1 ~]# cat /var/lib/tripleo-config/docker-container-startup-config-step_4.json | jq . | grep -i 13
"image": "192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155",
"image": "192.168.24.1:8787/rhosp13/openstack-cron:13.0-141"
"nofile=131072",
"image": "192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155",
"image": "192.168.24.1:8787/rhosp13/openstack-neutron-openvswitch-agent:13.0-136"
"image": "192.168.24.1:8787/rhosp13/openstack-nova-libvirt:13.0-164"
"image": "192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128"
"image": "192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155",
~~~

- When running the command manually on the failed node, we noticed that paunch tries to pull the image directly from the local repository (undercloud), which fails because no rhosp13 base images exist on the undercloud any more: all container images were already moved from rhosp13-based to rhosp16.1-based images while upgrading the director from 13 to 16.1.

~~~
[root@msufiyan-novacomputeiha-1 ~]# paunch apply --file /var/lib/tripleo-config/docker-container-startup-config-step_4.json --config-id tripleo_step4
Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1
Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1
Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1
Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1
~~~

- Ideally, paunch should not pull the image directly from the local registry; it should first inspect the availability of the image and container on the node (in our case the node was msufiyan-novacomputeiha-1) before trying to pull from the local registry. If the image/container is already present on the node, it should simply skip the pull and move on to pulling the new 16.1-based image instead of the rhosp13 one.

~~~
[root@msufiyan-novacomputeiha-1 ~]# docker images | grep -i rhosp13
192.168.24.1:8787/rhosp13/openstack-nova-compute 13.0-155 458c9eb4f662 3 weeks ago 1.78 GB
192.168.24.1:8787/rhosp13/openstack-nova-libvirt 13.0-164 f2e7f1f17ab4 3 weeks ago 1.67 GB
192.168.24.1:8787/rhosp13/openstack-neutron-server 13.0-136 6cca3e5fb200 5 weeks ago 875 MB
192.168.24.1:8787/rhosp13/openstack-ceilometer-central 13.0-126 bd765ae40f0f 5 weeks ago 724 MB
192.168.24.1:8787/rhosp13/openstack-neutron-openvswitch-agent 13.0-136 e939b58d5d42 5 weeks ago 824 MB
192.168.24.1:8787/rhosp13/openstack-ceilometer-compute 13.0-128 8b24514ed1cb 5 weeks ago 724 MB
192.168.24.1:8787/rhosp13/openstack-iscsid 13.0-133 c1274026df23 5 weeks ago 510 MB
192.168.24.1:8787/rhosp13/openstack-cron 13.0-141 759f813adb01 5 weeks ago 505 MB
[root@msufiyan-novacomputeiha-1 ~]#
~~~

~~~
[root@msufiyan-novacomputeiha-1 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
339c946b66d9 192.168.24.1:8787/rhosp13/openstack-neutron-openvswitch-agent:13.0-136 "dumb-init --singl..." 2 hours ago Up 2 hours (healthy) neutron_ovs_agent
5971ee647064 192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1 "dumb-init --singl..." 2 hours ago Restarting (1) 42 minutes ago nova_compute
61db54bb00b6 192.168.24.1:8787/rhosp13/openstack-cron:13.0-141 "dumb-init --singl..." 2 hours ago Up 2 hours logrotate_crond
ed7557410b64 192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155 "dumb-init --singl..." 2 hours ago Up 2 hours (healthy) nova_migration_target
0705b9f2faca 192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128 "dumb-init --singl..." 2 hours ago Up 2 hours ceilometer_agent_compute
ecaba2e27c9f 192.168.24.1:8787/rhosp13/openstack-iscsid:13.0-133 "dumb-init --singl..." 30 hours ago Up 19 hours (healthy) iscsid
25a1ddedd551 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:13.0-164 "dumb-init --singl..." 30 hours ago Up 19 hours nova_libvirt
ac19f2cbd698 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:13.0-164 "dumb-init --singl..." 30 hours ago Up 19 hours nova_virtlogd
[root@msufiyan-novacomputeiha-1 ~]#
~~~

Workaround: This issue seems to be in the latest "python-paunch" version, which is not inspecting the containers/images at all. After downgrading the python-paunch package from "2.5.3-9" to "2.5.3-7" we were able to bypass[2] this problem.

~~~
[root@msufiyan-novacomputeiha-0 ~]# yum downgrade python-paunch
Loaded plugins: product-id, search-disabled-repos, subscription-manager
Resolving Dependencies
--> Running transaction check
---> Package python-paunch.noarch 0:2.5.3-7.el7ost will be a downgrade
---> Package python-paunch.noarch 0:2.5.3-9.el7ost will be erased
--> Finished Dependency Resolution
.
.
.
Removed:
  python-paunch.noarch 0:2.5.3-9.el7ost
Installed:
  python-paunch.noarch 0:2.5.3-7.el7ost
Complete!
~~~

~~~
openstack overcloud upgrade run --stack msufiyan --playbook upgrade_steps_playbook.yaml --tags nova_hybrid_state --limit all --yes
.
.
.
PLAY RECAP *********************************************************************
msufiyan-ceph-0 : ok=19 changed=0 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
msufiyan-ceph-1 : ok=18 changed=0 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
msufiyan-ceph-2 : ok=18 changed=0 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
msufiyan-controller-0 : ok=17 changed=0 unreachable=0 failed=0 skipped=16 rescued=0 ignored=0
msufiyan-controller-1 : ok=17 changed=0 unreachable=0 failed=0 skipped=16 rescued=0 ignored=0
msufiyan-controller-2 : ok=17 changed=0 unreachable=0 failed=0 skipped=16 rescued=0 ignored=0
msufiyan-novacomputeiha-0 : ok=26 changed=7 unreachable=0 failed=0 skipped=16 rescued=0 ignored=0
msufiyan-novacomputeiha-1 : ok=26 changed=7 unreachable=0 failed=0 skipped=16 rescued=0 ignored=0
undercloud : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
~~~

~~~
[root@msufiyan-novacomputeiha-1 ~]# cat /etc/rhosp-release
Red Hat OpenStack Platform release 13.0.14 (Queens)
python-paunch.noarch 0:2.5.3-9.el7ost
~~~

How reproducible: Only when using the latest bits.
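The inspect-before-pull behaviour argued for above (skip the registry when the image already exists on the node) can be sketched in shell. This is a hypothetical illustration, not paunch's actual code: `have_locally` stands in for `docker image inspect`, and the `echo` stands in for `docker pull`.

```shell
#!/bin/sh
# Hypothetical sketch of inspect-before-pull.
# have_locally stands in for: docker image inspect "$1" >/dev/null 2>&1
# Here it pretends only the rhosp13 nova-compute image is cached on the node.
have_locally() {
    case "$1" in
        *openstack-nova-compute:13.0-155) return 0 ;;  # cached locally
        *) return 1 ;;                                 # not cached
    esac
}

# ensure_image skips the registry entirely when the image is already
# present on the node, and only falls back to a pull when it is not.
ensure_image() {
    if have_locally "$1"; then
        echo "skip pull: $1 already present"
    else
        echo "pull: $1"   # real code would run: docker pull "$1"
    fi
}

ensure_image "192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155"
ensure_image "192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1"
```

With this logic, the rhosp13 image still present on the compute would be skipped instead of triggering a failing pull from the undercloud registry, while the 16.1 image would still be pulled.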
There is no good workaround, as customers might not be able to downgrade python-paunch on the OSP 13 computes. Proposing this as a blocker for 16.1.3.
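When triaging affected computes, it helps to list which image every container entry in the startup config points at, not just nova_compute's. A sketch of such a check using only python3 (no jq); the heredoc file is a throwaway stand-in for the real /var/lib/tripleo-config/docker-container-startup-config-step_4.json:

```shell
#!/bin/sh
# Sketch: print "<container> <image>" for each entry in a paunch startup
# config. The sample file below is a minimal stand-in for the real
# docker-container-startup-config-step_4.json on a compute node.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
{
  "nova_compute": {"image": "192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1"},
  "logrotate_crond": {"image": "192.168.24.1:8787/rhosp13/openstack-cron:13.0-141"}
}
EOF

# Uses python3's json module instead of jq.
images=$(python3 - "$cfg" <<'PY'
import json, sys
config = json.load(open(sys.argv[1]))
for name, spec in sorted(config.items()):
    print(name, spec.get("image", "<no image>"))
PY
)
echo "$images"
rm -f "$cfg"
```

On an affected node this would show at a glance which entries still reference rhosp13 images after the nova_compute one was rewritten to 16.1.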
The fix for this patch isn't working as expected. It is causing a malfunction on the compute nodes, bringing all the neutron containers down and spawning some unexpected containers instead:

[root@compute-0 heat-admin]# docker ps --all
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4c4ba7f4c24e undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:16.1_20201111.1 "dumb-init --singl..." 21 hours ago Up 21 hours (unhealthy) nova_compute
a6e83ea4791c 192.168.24.1:8787/rh-osbs/rhosp13-openstack-iscsid:20201103.1 "dumb-init --singl..." 24 hours ago Up 24 hours (healthy) iscsid
1389e2b9427e 192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-libvirt:20201103.1 "dumb-init --singl..." 24 hours ago Up 24 hours nova_libvirt
0eab4414985b 192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-libvirt:20201103.1 "dumb-init --singl..." 24 hours ago Up 24 hours nova_virtlogd
d1ee59c1c055 192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-compute:20201103.1 "dumb-init --singl..." 24 hours ago Exited (0) 24 hours ago nova_statedir_owner
bd534d0d5a3d 192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-server:20201103.1 "dumb-init --singl..." 24 hours ago Exited (0) 24 hours ago neutron_ovs_bridge
fb8bcceb1019 192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-compute:20201103.1 "dumb-init --singl..." 24 hours ago Exited (0) 24 hours ago nova_compute_init_log

Until we find out where the problem is, I wouldn't merge this patch.
(In reply to Jose Luis Franco from comment #5)
> The fix for this patch isn't working as expected. [...]
> Until we'll find out where the problem is, I wouldn't merge this patch.

Bugzilla covering this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1898503
*** Bug 1898503 has been marked as a duplicate of this bug. ***
Validated in CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph-from-passed_phase2/34/

2020-11-26 23:06:19 | changed: [compute-0] => {"changed": true}
2020-11-26 23:06:19 |
2020-11-26 23:06:19 | TASK [Apply paunch config for nova_compute] ************************************
2020-11-26 23:06:19 | Thursday 26 November 2020 23:05:26 -0500 (0:00:11.079) 0:00:25.588 *****
2020-11-26 23:06:19 | changed: [compute-0] => {"changed": true, "cmd": "paunch apply --file /var/lib/tripleo-config/docker-container-hybrid_nova_compute.json --config-id hybrid_nova_compute\n", "delta": "0:00:48.043659", "end": "2020-11-27 04:06:14.515175", "rc": 0, "start": "2020-11-27 04:05:26.471516", "stderr": "", "stderr_lines": [], "stdout": "Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--filter', 'label=config_id=hybrid_nova_compute', '--format', '{{.Names}}']\" - retrying without config_id\nDid not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--format', '{{.Names}}']\"", "stdout_lines": ["Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--filter', 'label=config_id=hybrid_nova_compute', '--format', '{{.Names}}']\" - retrying without config_id", "Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--format', '{{.Names}}']\""]}
2020-11-26 23:06:19 |
2020-11-26 23:06:19 | changed: [compute-1] => {"changed": true, "cmd": "paunch apply --file /var/lib/tripleo-config/docker-container-hybrid_nova_compute.json --config-id hybrid_nova_compute\n", "delta": "0:00:48.740666", "end": "2020-11-27 04:06:15.250868", "rc": 0, "start": "2020-11-27 04:05:26.510202", "stderr": "", "stderr_lines": [], "stdout": "Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--filter', 'label=config_id=hybrid_nova_compute', '--format', '{{.Names}}']\" - retrying without config_id\nDid not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--format', '{{.Names}}']\"", "stdout_lines": ["Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--filter', 'label=config_id=hybrid_nova_compute', '--format', '{{.Names}}']\" - retrying without config_id", "Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--format', '{{.Names}}']\""]}

2020-11-26 23:06:19 | Packages: openstack-tripleo-heat-templates-11.3.2-1.20200914170175.el8ost.noarch
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph-from-passed_phase2/34/undercloud-0/var/log/dnf.rpm.log.gz
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:5413