Bug 1749443

Summary: [OSP15] Overcloud deployment fails when pulling images from a remote registry (not the undercloud) because nova_wait_for_compute_service container on compute nodes exits with rc 1
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates Assignee: Martin Schuppert <mschuppe>
Status: CLOSED ERRATA QA Contact: Archit Modi <amodi>
Severity: high Docs Contact:
Priority: high    
Version: 15.0 (Stein) CC: amcleod, amodi, ccopello, dbecker, emacchi, gregraka, igallagh, jamsmith, lyarwood, mburns, morazi, owalsh, pbabbar
Target Milestone: z2 Keywords: Triaged, ZStream
Target Release: 15.0 (Stein)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-10.6.2-0.20191029010436.5c36542.el8ost Doc Type: Known Issue
Doc Text:
The Compute services (nova) can fail to deploy because the nova_wait_for_compute_service script is unable to query the Nova API. If you use a remote container image registry outside of the undercloud, the Nova API service might not finish deploying in time. The workaround is to rerun the deployment command, or to use a local container image registry on the undercloud.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-05 12:00:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Marius Cornea 2019-09-05 15:50:14 UTC
Description of problem:

Overcloud deployment fails when pulling images from a remote registry (not the undercloud) because the nova_wait_for_compute_service container on compute nodes exits with rc 1.
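The actual wait script lives in tripleo-heat-templates; its shape is essentially a bounded poll against the Nova API. A minimal sketch of that pattern (function and parameter names here are illustrative, not the real script's):

```python
import time

def wait_for_service(check, retries=10, interval=10):
    """Poll `check` until it returns True or the retries run out."""
    for _ in range(retries):
        if check():
            return 0  # service is registered; container exits cleanly
        time.sleep(interval)
    return 1  # timed out; container exits with rc 1

# With a remote registry, slow image pulls on the controllers delay nova_api,
# so every poll fails and the loop gives up before nova_api ever starts:
print(wait_for_service(lambda: False, retries=3, interval=0))  # prints 1
```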

From compute:

[root@compute-0 heat-admin]# podman ps -a | grep nova_wait
6b562095fc46  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-compute:20190904.1                dumb-init --singl...  13 hours ago  Exited (1) 13 hours ago         nova_wait_for_compute_service
41a6bb77e1c6  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-compute:20190904.1                dumb-init --singl...  13 hours ago  Exited (1) 13 hours ago         nova_wait_for_placement_service


[root@compute-0 heat-admin]# podman inspect nova_wait_for_compute_service | grep StartedAt
            "StartedAt": "2019-09-05T02:47:35.14452906Z",
[root@compute-0 heat-admin]# podman inspect nova_wait_for_compute_service | grep FinishedAt
            "FinishedAt": "2019-09-05T02:57:37.880221163Z"



From controller:

[root@controller-0 heat-admin]# podman inspect nova_api | grep StartedAt
            "StartedAt": "2019-09-05T02:58:46.315255257Z",
[root@controller-1 heat-admin]# podman inspect nova_api | grep StartedAt
            "StartedAt": "2019-09-05T02:58:45.849809526Z",
[root@controller-2 heat-admin]# podman inspect nova_api | grep StartedAt
            "StartedAt": "2019-09-05T02:58:56.508654288Z"

So the nova_api containers on the controllers started roughly a minute after the nova_wait_for_compute_service container on the compute node had already exited.
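The ordering is easy to confirm from the timestamps above; a short sketch comparing the FinishedAt of the wait container with the earliest controller's nova_api StartedAt (fractional seconds dropped for simple parsing):

```python
from datetime import datetime

def parse_podman_ts(ts: str) -> datetime:
    # Strip the fractional seconds and trailing 'Z' reported by podman inspect.
    return datetime.strptime(ts.split(".")[0], "%Y-%m-%dT%H:%M:%S")

wait_finished = parse_podman_ts("2019-09-05T02:57:37.880221163Z")  # compute-0
api_started = parse_podman_ts("2019-09-05T02:58:45.849809526Z")    # controller-1

# nova_api came up after the wait container had already given up:
print(api_started > wait_finished)  # prints True
```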

This appears to be a race condition that reproduces only when using a remote registry. A workaround is to upload the container images to the undercloud registry and have the overcloud nodes pull from it.
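In TripleO the workaround is typically expressed by setting push_destination in the ContainerImagePrepare parameter, so images are pushed to and served from the undercloud registry. A minimal sketch (file name and namespace/tag values are taken from this report's environment and are illustrative):

```yaml
# containers-prepare-parameter.yaml (illustrative)
parameter_defaults:
  ContainerImagePrepare:
  - push_destination: true    # mirror images to the undercloud registry
    set:
      namespace: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15
      tag: '20190904.1'
```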

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-10.6.1-0.20190904124632.4e2dddb.el8ost.noarch
python3-paunch-4.5.1-0.20190829080435.f9349e0.el8ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with images pulled from a remote registry, not from the undercloud registry

Actual results:
Deployment fails because the nova_wait_for_compute_service container on the compute nodes exits with rc 1 before the nova_api container has started on the controller nodes.

Expected results:
No failure.

Additional info:
Adding links to the failed jobs.

Comment 16 errata-xmlrpc 2020-03-05 12:00:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0643