1749443 – [OSP15] Overcloud deployment fails when pulling images from a remote registry (not the undercloud) because nova_wait_for_compute_service container on compute nodes exits with rc 1

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	15.0 (Stein)
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	z2
Target Release:	15.0 (Stein)
Assignee:	Martin Schuppert
QA Contact:	Archit Modi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-09-05 15:50 UTC by Marius Cornea
Modified:	2020-12-21 19:33 UTC
CC List:	13 users
Fixed In Version:	openstack-tripleo-heat-templates-10.6.2-0.20191029010436.5c36542.el8ost
Doc Type:	Known Issue
Doc Text:	The Compute services (nova) can fail to deploy because the nova_wait_for_compute_service script is unable to query the Nova API. If you use a remote container image registry outside of the undercloud, the Nova API service might not finish deploying in time. The workaround is to rerun the deployment command, or to use a local container image registry on the undercloud.
Clone Of:
Environment:
Last Closed:	2020-03-05 12:00:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1842948	None	None	None	2019-09-05 16:20:06 UTC
OpenStack gerrit	688349	None	MERGED	Ensure nova-api is running before starting nova-compute containers	2020-06-09 20:38:17 UTC
Red Hat Product Errata	RHBA-2020:0643	None	None	None	2020-03-05 12:00:34 UTC

Description Marius Cornea 2019-09-05 15:50:14 UTC

Description of problem:

Overcloud deployment fails when pulling images from a remote registry (not the undercloud) because the nova_wait_for_compute_service container on the compute nodes exits with rc 1.

From compute:

[root@compute-0 heat-admin]# podman ps -a | grep nova_wait
6b562095fc46  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-compute:20190904.1                dumb-init --singl...  13 hours ago  Exited (1) 13 hours ago         nova_wait_for_compute_service
41a6bb77e1c6  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-compute:20190904.1                dumb-init --singl...  13 hours ago  Exited (1) 13 hours ago         nova_wait_for_placement_service


[root@compute-0 heat-admin]# podman inspect nova_wait_for_compute_service | grep StartedAt
            "StartedAt": "2019-09-05T02:47:35.14452906Z",
[root@compute-0 heat-admin]# podman inspect nova_wait_for_compute_service | grep FinishedAt
            "FinishedAt": "2019-09-05T02:57:37.880221163Z"



From controller:

[root@controller-0 heat-admin]# podman inspect nova_api | grep StartedAt
            "StartedAt": "2019-09-05T02:58:46.315255257Z",
[root@controller-1 heat-admin]# podman inspect nova_api | grep StartedAt
            "StartedAt": "2019-09-05T02:58:45.849809526Z",
[root@controller-2 heat-admin]# podman inspect nova_api | grep StartedAt
            "StartedAt": "2019-09-05T02:58:56.508654288Z"

The nova_api containers on all three controllers started at 02:58:45 or later, which is more than a minute after the nova_wait_for_compute_service container exited at 02:57:37.
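The ordering can be checked mechanically from the StartedAt and FinishedAt values quoted above. The following is a minimal sketch, not part of the original report: the helper name parse_podman_ts and the variable names are made up for illustration, and the timestamps are copied from the podman inspect output. It suggests the wait container ran for roughly 603 seconds and that the earliest nova_api container started roughly 68 seconds after it exited.

```python
from datetime import datetime, timezone


def parse_podman_ts(ts: str) -> datetime:
    """Parse a UTC timestamp as printed by `podman inspect`.

    podman prints up to nanosecond precision, while datetime only keeps
    microseconds, so the fractional part is truncated to six digits.
    (Hypothetical helper, introduced only for this illustration.)
    """
    base, _, frac = ts.rstrip("Z").partition(".")
    micro = (frac + "000000")[:6]
    parsed = datetime.strptime(f"{base}.{micro}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


# Values copied from the compute-0 output in this report.
wait_started = parse_podman_ts("2019-09-05T02:47:35.14452906Z")
wait_finished = parse_podman_ts("2019-09-05T02:57:37.880221163Z")

# nova_api StartedAt values copied from the three controllers.
nova_api_started = {
    "controller-0": parse_podman_ts("2019-09-05T02:58:46.315255257Z"),
    "controller-1": parse_podman_ts("2019-09-05T02:58:45.849809526Z"),
    "controller-2": parse_podman_ts("2019-09-05T02:58:56.508654288Z"),
}

# How long the wait container ran before giving up.
wait_ran = wait_finished - wait_started

# Gap between the wait container exiting and the first nova_api start.
first_api_start = min(nova_api_started.values())
gap = first_api_start - wait_finished

print(f"wait container ran for {wait_ran.total_seconds():.1f} s")
print(f"first nova_api started {gap.total_seconds():.1f} s after it exited")
```

A positive gap means no nova_api container was running at any point while the wait container was polling, which matches the failure described here.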

This appears to be a race condition which only reproduces when using a remote registry. A workaround for this issue is to upload the container images to the undercloud registry and pull images from it on overcloud nodes.
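To illustrate the race in general terms, here is a small sketch of a poll-with-deadline loop. It is not the nova_wait_for_compute_service script and not the change that was merged for this bug. Every name in it (wait_until_ready, FakeClock, api_is_up) is hypothetical, and the numbers are assumptions loosely based on the timestamps above: the wait container ran for about 10 minutes, and the API came up about a minute after it exited.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout: float,
    interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or `timeout` seconds have passed.

    Returns True if the dependency became ready in time, False otherwise.
    The clock and sleep functions are injectable so the loop can be
    exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


class FakeClock:
    """Simulated clock so the example finishes instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate(timeout: float, api_ready_at: float = 670.0) -> bool:
    """Run the wait loop against an API that comes up at `api_ready_at`.

    670 s is an assumed figure: about 10 minutes of waiting plus the
    roughly one-minute gap seen in this report.
    """
    fake = FakeClock()

    def api_is_up() -> bool:
        return fake.now >= api_ready_at

    return wait_until_ready(
        api_is_up, timeout=timeout, interval=10.0,
        clock=fake.time, sleep=fake.sleep,
    )


print("600 s timeout:", simulate(600.0))
print("900 s timeout:", simulate(900.0))
```

Under these assumptions the shorter deadline expires before the API is reachable, while the longer one succeeds. A longer timeout only widens the window, though. Ordering the steps so that the API is started before the wait begins removes the dependency on image pull time altogether.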

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-10.6.1-0.20190904124632.4e2dddb.el8ost.noarch
python3-paunch-4.5.1-0.20190829080435.f9349e0.el8ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with images pulled from a remote registry, not from the undercloud registry

Actual results:
Deployment fails because the nova_wait_for_compute_service container on the compute nodes exits before the nova_api container has started on the controller nodes.

Expected results:
No failure.

Additional info:
Adding links to the failed jobs.

Comment 16 errata-xmlrpc 2020-03-05 12:00:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0643
