Bug 1897169 - [RHOSP 13 to 16.1 Upgrades] nova_hybrid_state task failed with latest python-paunch package
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: All
Priority: urgent
Severity: urgent
Target Milestone: z3
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Lukas Bezdicka
QA Contact: Jose Luis Franco
URL:
Whiteboard:
Duplicates: 1898503
Depends On: 1879531
Blocks:
Reported: 2020-11-12 13:56 UTC by MD Sufiyan
Modified: 2020-12-15 18:37 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170175.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 18:37:35 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 762649 0 None MERGED [train-only][ffwd] Create specific paunch config for hybrid state 2021-02-18 14:59:10 UTC
OpenStack gerrit 763140 0 None MERGED [train-only][ffwd] Dont reuse tripleo_step4 for hybrid state 2021-02-18 14:59:10 UTC
Red Hat Product Errata RHEA-2020:5413 0 None None None 2020-12-15 18:37:58 UTC

Description MD Sufiyan 2020-11-12 13:56:07 UTC
Description of problem:

The nova_hybrid_state task [1] fails while running the upgrade framework with the latest OSP bits.

~~~
[1] openstack overcloud upgrade run --stack msufiyan --playbook upgrade_steps_playbook.yaml --tags nova_hybrid_state --limit all --yes 
~~~

- The tasks below updated the file "/var/lib/tripleo-config/docker-container-startup-config-step_4.json" with the new nova-compute container image (per OSP 16.1):

~~~
TASK [Check if we need to update the paunch config] ****************************
Wednesday 11 November 2020  17:27:41 -0500 (0:00:00.336)       0:00:08.321 ****
changed: [msufiyan-novacomputeiha-1] => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": true, "cmd": "set -o pipefail\njq .\"nova_compute\".\"image\" /var/lib/tripleo-config/docker-container-startup-config-step_4.json\n", "delta": "0:00:00.043347", "end": "2020-11-11 22:27:41.687854", "rc": 0, "start": "2020-11-11 22:27:41.644507", "stderr": "", "stderr_lines": [], "stdout": "\"192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155\"", "stdout_lines": ["\"192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155\""]}
changed: [msufiyan-novacomputeiha-0] => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": true, "cmd": "set -o pipefail\njq .\"nova_compute\".\"image\" /var/lib/tripleo-config/docker-container-startup-config-step_4.json\n", "delta": "0:00:00.062543", "end": "2020-11-11 22:27:41.720983", "rc": 0, "start": "2020-11-11 22:27:41.658440", "stderr": "", "stderr_lines": [], "stdout": "\"192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155\"", "stdout_lines": ["\"192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155\""]}
 
TASK [Update the nova_compute paunch image in config] **************************
Wednesday 11 November 2020  17:27:41 -0500 (0:00:00.602)       0:00:08.923 ****
changed: [msufiyan-novacomputeiha-0] => {"changed": true, "cmd": "set -o pipefail\ncat <<< $(jq '.nova_compute.image = \"192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1\"' /var/lib/tripleo-config/docker-container-startup-config-step_4.json) >\\\n/var/lib/tripleo-config/docker-container-startup-config-step_4.json\n", "delta": "0:00:00.009826", "end": "2020-11-11 22:27:41.995720", "rc": 0, "start": "2020-11-11 22:27:41.985894", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
changed: [msufiyan-novacomputeiha-1] => {"changed": true, "cmd": "set -o pipefail\ncat <<< $(jq '.nova_compute.image = \"192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1\"' /var/lib/tripleo-config/docker-container-startup-config-step_4.json) >\\\n/var/lib/tripleo-config/docker-container-startup-config-step_4.json\n", "delta": "0:00:00.013412", "end": "2020-11-11 22:27:42.017576", "rc": 0, "start": "2020-11-11 22:27:42.004164", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
~~~

~~~
[root@msufiyan-novacomputeiha-1 ~]# cat /var/lib/tripleo-config/docker-container-startup-config-step_4.json | jq . | grep -i 16.1
    "image": "192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1",
~~~

- However, the subsequent paunch task fails because it tries to pull the old container images that are still referenced in "/var/lib/tripleo-config/docker-container-startup-config-step_4.json":

~~~
TASK [Apply paunch config] *****************************************************
Wednesday 11 November 2020  17:27:57 -0500 (0:00:14.272)       0:00:25.081 **** 
fatal: [msufiyan-novacomputeiha-0]: FAILED! => {"changed": true, "cmd": "paunch apply --file /var/lib/tripleo-config/docker-container-startup-config-step_4.json --config-id tripleo_step4", "delta": "0:02:14.699104", "end": "2020-11-11 22:30:12.860224", "msg": "non-zero return code", "rc": 1, "start": "2020-11-11 22:27:58.161120", "stderr": "", "stderr_lines": [], "stdout": "Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1\nError executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1\nError executing
~~~

~~~
[root@msufiyan-novacomputeiha-1 ~]# cat /var/lib/tripleo-config/docker-container-startup-config-step_4.json | jq . | grep -i 13
    "image": "192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155",
    "image": "192.168.24.1:8787/rhosp13/openstack-cron:13.0-141"
      "nofile=131072",
    "image": "192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155",
    "image": "192.168.24.1:8787/rhosp13/openstack-neutron-openvswitch-agent:13.0-136"
    "image": "192.168.24.1:8787/rhosp13/openstack-nova-libvirt:13.0-164"
    "image": "192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128"
    "image": "192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155",
~~~
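
The grep output above shows that only the nova_compute entry was switched to the 16.1 image, while every other entry in the step 4 file still references rhosp13. As a rough illustration (not part of paunch or the upgrade playbooks), the following Python sketch lists the entries that still point outside the new namespace. The helper name `stale_images` and the sample data are hypothetical; the only assumption about the file layout is the one implied by the jq commands above, namely a mapping of container name to a dict with an "image" key.

```python
import json


def stale_images(config, new_namespace):
    """Return {container_name: image} for entries not in new_namespace.

    `config` is assumed to map container names to dicts with an "image" key,
    as implied by the `jq .nova_compute.image` calls in the task output.
    """
    return {
        name: spec["image"]
        for name, spec in config.items()
        if isinstance(spec, dict)
        and "image" in spec
        and f"/{new_namespace}/" not in spec["image"]
    }


# Illustrative sample only; the real step 4 file contains more keys per entry.
sample = json.loads("""
{
  "nova_compute": {"image": "192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1"},
  "logrotate_crond": {"image": "192.168.24.1:8787/rhosp13/openstack-cron:13.0-141"},
  "ceilometer_agent_compute": {"image": "192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128"}
}
""")

for name, image in sorted(stale_images(sample, "rhosp-rhel8").items()):
    print(f"{name}: {image}")
```

Every image reported this way is one that `paunch apply` on the whole step 4 file may try to pull from the undercloud registry.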

- When we ran the command manually on the failed node, we noticed that paunch tries to pull the image directly from the local registry (undercloud). The pull fails because no rhosp13-based images exist on the undercloud any more: all container images were moved from rhosp13 to rhosp16.1 when the director was upgraded from 13 to 16.1.

~~~
[root@msufiyan-novacomputeiha-1 ~]# paunch apply --file /var/lib/tripleo-config/docker-container-startup-config-step_4.json --config-id tripleo_step4
Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1
Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1
Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1
Error executing ['docker', 'pull', '192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128']: returned 1
~~~

- Ideally, paunch should not pull the image directly from the local registry. It should first check whether the image and container are already available on the node (in our case msufiyan-novacomputeiha-1). If they are already present, it should skip the pull of the rhosp13 image and move on to pulling the new 16.1-based image.

~~~
[root@msufiyan-novacomputeiha-1 ~]# docker images | grep -i rhosp13
192.168.24.1:8787/rhosp13/openstack-nova-compute                13.0-155            458c9eb4f662        3 weeks ago         1.78 GB
192.168.24.1:8787/rhosp13/openstack-nova-libvirt                13.0-164            f2e7f1f17ab4        3 weeks ago         1.67 GB
192.168.24.1:8787/rhosp13/openstack-neutron-server              13.0-136            6cca3e5fb200        5 weeks ago         875 MB
192.168.24.1:8787/rhosp13/openstack-ceilometer-central          13.0-126            bd765ae40f0f        5 weeks ago         724 MB
192.168.24.1:8787/rhosp13/openstack-neutron-openvswitch-agent   13.0-136            e939b58d5d42        5 weeks ago         824 MB
192.168.24.1:8787/rhosp13/openstack-ceilometer-compute          13.0-128            8b24514ed1cb        5 weeks ago         724 MB
192.168.24.1:8787/rhosp13/openstack-iscsid                      13.0-133            c1274026df23        5 weeks ago         510 MB
192.168.24.1:8787/rhosp13/openstack-cron                        13.0-141            759f813adb01        5 weeks ago         505 MB
[root@msufiyan-novacomputeiha-1 ~]# 
~~~

~~~
[root@msufiyan-novacomputeiha-1 ~]# docker ps 
CONTAINER ID        IMAGE                                                                    COMMAND                  CREATED             STATUS                          PORTS               NAMES
339c946b66d9        192.168.24.1:8787/rhosp13/openstack-neutron-openvswitch-agent:13.0-136   "dumb-init --singl..."   2 hours ago         Up 2 hours (healthy)                                neutron_ovs_agent
5971ee647064        192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1                "dumb-init --singl..."   2 hours ago         Restarting (1) 42 minutes ago                       nova_compute
61db54bb00b6        192.168.24.1:8787/rhosp13/openstack-cron:13.0-141                        "dumb-init --singl..."   2 hours ago         Up 2 hours                                          logrotate_crond
ed7557410b64        192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155                "dumb-init --singl..."   2 hours ago         Up 2 hours (healthy)                                nova_migration_target
0705b9f2faca        192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128          "dumb-init --singl..."   2 hours ago         Up 2 hours                                          ceilometer_agent_compute
ecaba2e27c9f        192.168.24.1:8787/rhosp13/openstack-iscsid:13.0-133                      "dumb-init --singl..."   30 hours ago        Up 19 hours (healthy)                               iscsid
25a1ddedd551        192.168.24.1:8787/rhosp13/openstack-nova-libvirt:13.0-164                "dumb-init --singl..."   30 hours ago        Up 19 hours                                         nova_libvirt
ac19f2cbd698        192.168.24.1:8787/rhosp13/openstack-nova-libvirt:13.0-164                "dumb-init --singl..."   30 hours ago        Up 19 hours                                         nova_virtlogd
[root@msufiyan-novacomputeiha-1 ~
~~~
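
The behaviour requested above (skip the pull when the image is already on the node) can be sketched in a few lines of Python. This is not paunch's actual code; `images_to_pull` is a hypothetical helper, and the two lists stand in for the images referenced by the config and the output of `docker images` on the node.

```python
def images_to_pull(wanted, local):
    """Return the wanted images that are not present locally.

    Order is preserved and duplicates are dropped, so each missing image
    would be pulled at most once.
    """
    present = set(local)
    seen = set()
    missing = []
    for image in wanted:
        if image not in present and image not in seen:
            seen.add(image)
            missing.append(image)
    return missing


# Images already on the compute node (subset of the `docker images` listing).
local = [
    "192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128",
    "192.168.24.1:8787/rhosp13/openstack-cron:13.0-141",
]

# Images referenced by the config that paunch is asked to apply.
wanted = [
    "192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:13.0-128",
    "192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1",
    "192.168.24.1:8787/rhosp13/openstack-cron:13.0-141",
]

print(images_to_pull(wanted, local))
```

With a check like this, only the 16.1 nova-compute image would be requested from the undercloud registry, and the rhosp13 images that no longer exist there would never be pulled.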

Workaround:

The issue seems to be in the latest "python-paunch" version, which does not inspect the containers/images at all. After downgrading the python-paunch package from "2.5.3-9" to "2.5.3-7", we were able to bypass [2] this problem.

~~~
[root@msufiyan-novacomputeiha-0 ~]# yum downgrade python-paunch
Loaded plugins: product-id, search-disabled-repos, subscription-manager
Resolving Dependencies
--> Running transaction check
---> Package python-paunch.noarch 0:2.5.3-7.el7ost will be a downgrade
---> Package python-paunch.noarch 0:2.5.3-9.el7ost will be erased
--> Finished Dependency Resolution
.
.
.
Removed:
  python-paunch.noarch 0:2.5.3-9.el7ost                                                                                                                                                                           

Installed:
  python-paunch.noarch 0:2.5.3-7.el7ost                                                                                                                                                                           

Complete!
~~~


~~~
openstack overcloud upgrade run --stack msufiyan --playbook upgrade_steps_playbook.yaml --tags nova_hybrid_state --limit all --yes 
.
.
.
PLAY RECAP *********************************************************************
msufiyan-ceph-0            : ok=19   changed=0    unreachable=0    failed=0    skipped=15   rescued=0    ignored=0
msufiyan-ceph-1            : ok=18   changed=0    unreachable=0    failed=0    skipped=15   rescued=0    ignored=0
msufiyan-ceph-2            : ok=18   changed=0    unreachable=0    failed=0    skipped=15   rescued=0    ignored=0
msufiyan-controller-0      : ok=17   changed=0    unreachable=0    failed=0    skipped=16   rescued=0    ignored=0
msufiyan-controller-1      : ok=17   changed=0    unreachable=0    failed=0    skipped=16   rescued=0    ignored=0
msufiyan-controller-2      : ok=17   changed=0    unreachable=0    failed=0    skipped=16   rescued=0    ignored=0
msufiyan-novacomputeiha-0  : ok=26   changed=7    unreachable=0    failed=0    skipped=16   rescued=0    ignored=0
msufiyan-novacomputeiha-1  : ok=26   changed=7    unreachable=0    failed=0    skipped=16   rescued=0    ignored=0
undercloud                 : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
~~~

~~~
[root@msufiyan-novacomputeiha-1 ~]# cat /etc/rhosp-release 
Red Hat OpenStack Platform release 13.0.14 (Queens)
python-paunch.noarch 0:2.5.3-9.el7ost
~~~

How reproducible:
Only when using the latest bits.

Comment 1 Lukas Bezdicka 2020-11-12 15:47:43 UTC
There is no good workaround as customers might not be able to downgrade python-paunch on the OSP13 computes. Proposing this as blocker for 16.1.3.

Comment 5 Jose Luis Franco 2020-11-17 10:23:06 UTC
The patch fixing this bug isn't working as expected. It causes a malfunction on the compute nodes, bringing all the neutron containers down and spawning some unexpected containers instead:

[root@compute-0 heat-admin]# docker ps --all
CONTAINER ID        IMAGE                                                                                            COMMAND                  CREATED             STATUS                    PORTS               NAMES
4c4ba7f4c24e        undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:16.1_20201111.1   "dumb-init --singl..."   21 hours ago        Up 21 hours (unhealthy)                       nova_compute
a6e83ea4791c        192.168.24.1:8787/rh-osbs/rhosp13-openstack-iscsid:20201103.1                                    "dumb-init --singl..."   24 hours ago        Up 24 hours (healthy)                         iscsid
1389e2b9427e        192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-libvirt:20201103.1                              "dumb-init --singl..."   24 hours ago        Up 24 hours                                   nova_libvirt
0eab4414985b        192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-libvirt:20201103.1                              "dumb-init --singl..."   24 hours ago        Up 24 hours                                   nova_virtlogd
d1ee59c1c055        192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-compute:20201103.1                              "dumb-init --singl..."   24 hours ago        Exited (0) 24 hours ago                       nova_statedir_owner
bd534d0d5a3d        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-server:20201103.1                            "dumb-init --singl..."   24 hours ago        Exited (0) 24 hours ago                       neutron_ovs_bridge
fb8bcceb1019        192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-compute:20201103.1                              "dumb-init --singl..."   24 hours ago        Exited (0) 24 hours ago                       nova_compute_init_log

Until we find out where the problem is, I wouldn't merge this patch.

Comment 6 Jose Luis Franco 2020-11-17 11:36:09 UTC
(In reply to Jose Luis Franco from comment #5)

Bugzilla covering this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1898503

Comment 8 Lukas Bezdicka 2020-11-19 10:41:56 UTC
*** Bug 1898503 has been marked as a duplicate of this bug. ***

Comment 12 Jose Luis Franco 2020-11-30 12:03:08 UTC
Validated in CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph-from-passed_phase2/34/


2020-11-26 23:06:19 | changed: [compute-0] => {"changed": true}
2020-11-26 23:06:19 | 
2020-11-26 23:06:19 | TASK [Apply paunch config for nova_compute] ************************************
2020-11-26 23:06:19 | Thursday 26 November 2020  23:05:26 -0500 (0:00:11.079)       0:00:25.588 ***** 
2020-11-26 23:06:19 | changed: [compute-0] => {"changed": true, "cmd": "paunch apply --file /var/lib/tripleo-config/docker-container-hybrid_nova_compute.json --config-id hybrid_nova_compute\n", "delta": "0:00:48.043659", "end": "2020-11-27 04:06:14.515175", "rc": 0, "start": "2020-11-27 04:05:26.471516", "stderr": "", "stderr_lines": [], "stdout": "Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--filter', 'label=config_id=hybrid_nova_compute', '--format', '{{.Names}}']\" - retrying without config_id\nDid not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--format', '{{.Names}}']\"", "stdout_lines": ["Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--filter', 'label=config_id=hybrid_nova_compute', '--format', '{{.Names}}']\" - retrying without config_id", "Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--format', '{{.Names}}']\""]}
2020-11-26 23:06:19 | 
2020-11-26 23:06:19 | changed: [compute-1] => {"changed": true, "cmd": "paunch apply --file /var/lib/tripleo-config/docker-container-hybrid_nova_compute.json --config-id hybrid_nova_compute\n", "delta": "0:00:48.740666", "end": "2020-11-27 04:06:15.250868", "rc": 0, "start": "2020-11-27 04:05:26.510202", "stderr": "", "stderr_lines": [], "stdout": "Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--filter', 'label=config_id=hybrid_nova_compute', '--format', '{{.Names}}']\" - retrying without config_id\nDid not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--format', '{{.Names}}']\"", "stdout_lines": ["Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--filter', 'label=config_id=hybrid_nova_compute', '--format', '{{.Names}}']\" - retrying without config_id", "Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_compute', '--format', '{{.Names}}']\""]}
2020-11-26 23:06:19 | 


Packages:
openstack-tripleo-heat-templates-11.3.2-1.20200914170175.el8ost.noarch

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph-from-passed_phase2/34/undercloud-0/var/log/dnf.rpm.log.gz
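
In the validated run above, paunch is applied to a dedicated file (docker-container-hybrid_nova_compute.json) with its own config ID (hybrid_nova_compute) instead of the shared step 4 file and tripleo_step4. The Python sketch below only illustrates the idea of deriving such a single-container config; the function name `hybrid_config` and the sample data are hypothetical, and the merged templates change may build the file differently.

```python
import json


def hybrid_config(step_config, new_image, name="nova_compute"):
    """Build a config holding only `name`, with its image replaced.

    The input mapping is left untouched, so the other containers keep the
    images they are currently running with.
    """
    entry = dict(step_config[name])  # shallow copy of the original entry
    entry["image"] = new_image
    return {name: entry}


# Illustrative sample; "restart" is a placeholder for the remaining keys.
step4 = {
    "nova_compute": {
        "image": "192.168.24.1:8787/rhosp13/openstack-nova-compute:13.0-155",
        "restart": "always",
    },
    "logrotate_crond": {
        "image": "192.168.24.1:8787/rhosp13/openstack-cron:13.0-141",
    },
}

NEW_IMAGE = "192.168.24.1:8787/rhosp-rhel8/openstack-nova-compute:16.1"

print(json.dumps(hybrid_config(step4, NEW_IMAGE), indent=2))
```

Because the resulting file references a single image, applying it cannot trigger pulls of the rhosp13 images used by the other containers.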

Comment 20 errata-xmlrpc 2020-12-15 18:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5413

