Bug 1749443 - [OSP15] Overcloud deployment fails when pulling images from a remote registry (not the undercloud) because nova_wait_for_compute_service container on compute nodes exits with rc 1
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 15.0 (Stein)
Assignee: Martin Schuppert
QA Contact: Archit Modi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-09-05 15:50 UTC by Marius Cornea
Modified: 2020-12-21 19:33 UTC

Fixed In Version: openstack-tripleo-heat-templates-10.6.2-0.20191029010436.5c36542.el8ost
Doc Type: Known Issue
Doc Text:
The Compute services (nova) can fail to deploy because the nova_wait_for_compute_service script is unable to query the Nova API. If you use a remote container image registry outside of the undercloud, the Nova API service might not finish deploying in time. The workaround is to rerun the deployment command, or to use a local container image registry on the undercloud.
Clone Of:
Environment:
Last Closed: 2020-03-05 12:00:13 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1842948 | 0 | None | None | None | 2019-09-05 16:20:06 UTC
OpenStack gerrit | 688349 | 0 | 'None' | MERGED | Ensure nova-api is running before starting nova-compute containers | 2020-06-09 20:38:17 UTC
Red Hat Product Errata | RHBA-2020:0643 | 0 | None | None | None | 2020-03-05 12:00:34 UTC

Description Marius Cornea 2019-09-05 15:50:14 UTC
Description of problem:

Overcloud deployment fails when pulling images from a remote registry (not the undercloud) because the nova_wait_for_compute_service container on the compute nodes exits with rc 1.

From compute:

[root@compute-0 heat-admin]# podman ps -a | grep nova_wait
6b562095fc46  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-compute:20190904.1                dumb-init --singl...  13 hours ago  Exited (1) 13 hours ago         nova_wait_for_compute_service
41a6bb77e1c6  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-compute:20190904.1                dumb-init --singl...  13 hours ago  Exited (1) 13 hours ago         nova_wait_for_placement_service


[root@compute-0 heat-admin]# podman inspect nova_wait_for_compute_service | grep StartedAt
            "StartedAt": "2019-09-05T02:47:35.14452906Z",
[root@compute-0 heat-admin]# podman inspect nova_wait_for_compute_service | grep FinishedAt
            "FinishedAt": "2019-09-05T02:57:37.880221163Z"



From controller:

[root@controller-0 heat-admin]# podman inspect nova_api | grep StartedAt
            "StartedAt": "2019-09-05T02:58:46.315255257Z",
[root@controller-1 heat-admin]# podman inspect nova_api | grep StartedAt
            "StartedAt": "2019-09-05T02:58:45.849809526Z",
[root@controller-2 heat-admin]# podman inspect nova_api | grep StartedAt
            "StartedAt": "2019-09-05T02:58:56.508654288Z"

So the nova_api containers started (02:58:45 UTC at the earliest) only after the nova_wait_for_compute_service container had already exited (02:57:37 UTC), about ten minutes after it started.
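To put numbers on that ordering, here is a small sketch that computes the intervals from the timestamps pasted above. The helper name parse_podman_ts is made up for this illustration and is not part of podman or TripleO; the only assumption is that the podman inspect values are UTC timestamps with a variable number of fractional digits, which are truncated to microseconds here.

```python
from datetime import datetime, timezone


def parse_podman_ts(ts: str) -> datetime:
    """Parse a timestamp such as 2019-09-05T02:47:35.14452906Z.

    Hypothetical helper: the fractional part is padded or truncated to
    six digits because datetime only stores microseconds.
    """
    base, _, frac = ts.rstrip("Z").partition(".")
    micro = (frac + "000000")[:6]
    parsed = datetime.strptime(f"{base}.{micro}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


# Values copied from the podman inspect output in this report.
wait_started = parse_podman_ts("2019-09-05T02:47:35.14452906Z")
wait_finished = parse_podman_ts("2019-09-05T02:57:37.880221163Z")
nova_api_started = {
    "controller-0": parse_podman_ts("2019-09-05T02:58:46.315255257Z"),
    "controller-1": parse_podman_ts("2019-09-05T02:58:45.849809526Z"),
    "controller-2": parse_podman_ts("2019-09-05T02:58:56.508654288Z"),
}

# How long the wait container ran before giving up.
wait_duration = (wait_finished - wait_started).total_seconds()

# How long after the wait container exited the first nova_api came up.
earliest_api = min(nova_api_started.values())
gap = (earliest_api - wait_finished).total_seconds()

print(f"wait container ran for {wait_duration:.1f}s")
print(f"first nova_api started {gap:.1f}s after it exited")
```

With these values the wait container ran for a little over 10 minutes, and the first nova_api container came up roughly 68 seconds after it had exited.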

This appears to be a race condition that reproduces only when using a remote registry. A workaround is to upload the container images to the undercloud registry and have the overcloud nodes pull from it.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-10.6.1-0.20190904124632.4e2dddb.el8ost.noarch
python3-paunch-4.5.1-0.20190829080435.f9349e0.el8ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud with images pulled from a remote registry rather than from the undercloud registry

Actual results:
Deployment fails because the nova_wait_for_compute_service container on the compute nodes exits with an error before the nova_api container has started on the controller nodes.
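The failure mode can be illustrated with a generic poll-until-deadline loop. This is only a sketch under stated assumptions, not the actual nova_wait_for_compute_service script: wait_for, FakeClock and simulate are hypothetical names, the 600 second budget is inferred from the roughly ten minute run time observed above, and 671 seconds is the approximate delay between the wait container starting and the first nova_api container starting. A simulated clock is used so the example runs instantly.

```python
from typing import Callable


def wait_for(check: Callable[[], bool], timeout: float, interval: float,
             clock: Callable[[], float], sleep: Callable[[float], None]) -> bool:
    """Poll check() until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


class FakeClock:
    """Simulated clock so the example does not really sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate(timeout: float, api_ready_at: float) -> bool:
    """Return True if a waiter with `timeout` sees the API become ready."""
    clock = FakeClock()
    return wait_for(lambda: clock.time() >= api_ready_at,
                    timeout, 10.0, clock.time, clock.sleep)


# Assumed numbers: about 600s of waiting, API ready about 671s after the wait began.
short_budget = simulate(timeout=600.0, api_ready_at=671.0)
long_budget = simulate(timeout=900.0, api_ready_at=671.0)
print(short_budget, long_budget)
```

In this simulation the waiter with the shorter budget gives up before the API is ready, while the longer one succeeds. Note that the linked upstream change is titled "Ensure nova-api is running before starting nova-compute containers", which suggests the merged fix addresses ordering rather than simply lengthening a timeout.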

Expected results:
No failure.

Additional info:
Adding links to the failed jobs.

Comment 16 errata-xmlrpc 2020-03-05 12:00:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0643

