Bug 1940889

Summary:	Installation failures in OpenStack release jobs
Product:	OpenShift Container Platform	Reporter:	Petr Muller <pmuller>
Component:	Installer	Assignee:	Pierre Prinetti <pprinett>
Installer sub component:	OpenShift on OpenStack	QA Contact:	Jon Uriarte <juriarte>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	urgent	CC:	m.andre, pprinett
Version:	4.8	Keywords:	Triaged
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:	test: operator.Run template e2e-openstack - e2e-openstack container setup
Last Closed:	2021-07-27 22:54:33 UTC	Type:	Bug

Description Petr Muller 2021-03-19 13:34:18 UTC

On March 18 the release-gating jobs for OpenStack started to fail during cluster installation:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-4.8
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.8

Example:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.8/1372850516870041600

level=info msg=Waiting up to 20m0s for the Kubernetes API at https://api.k8iyygc3-d8ea2.shiftstack.devcluster.openshift.com:6443...
...
level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.k8iyygc3-d8ea2.shiftstack.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 38.102.83.74:6443: connect: no route to host 

Martin André:
> bootstrap fails to get its ignition file: A start job is running for Ignition (fetch) (23min 18s / no limit)
> not sure why yet

I'm sorry for the very vague subject; I'm not able to diagnose installation failures.

Comment 1 Martin André 2021-03-22 09:10:36 UTC

The initial investigation shows that the bootstrap node is unable to fetch its ignition file:

    A start job is running for Ignition (fetch) (23min 49s / no limit)

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.8/1373877230773473280/artifacts/e2e-openstack-serial/bootstrap/nova.log

It affects all jobs running on Vexxhost, not just the 4.8 periodics: the 4.6 and 4.7 periodics and the pre-submit jobs are affected as well. Setting the priority to urgent, as this means we're currently navigating blind without CI.

Also, deploying the master installer with the latest RHCOS and the latest nightly release image in a different environment works fine, confirming that the breakage is limited to Vexxhost.

Comment 2 Martin André 2021-03-22 09:56:40 UTC

Vexxhost seems to have networking issues: the nova-metadata service is down. This has already been reported.

Comment 3 Martin André 2021-03-22 10:13:23 UTC

The console of a bootstrap node shows that it can't talk to nova-metadata:

    A start job is running for Ignition (fetch) (49s / no limit)[   54.004386] ignition[720]: GET http://169.254.169.254/openstack/latest/user_data: attempt #8
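
For anyone wanting to reproduce the symptom from an instance on the affected network, here is a minimal sketch of a probe against the same metadata URL that appears in the console log. It is not how Ignition itself is implemented; the function name `fetch_user_data`, the retry count, and the timeout are all assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Link-local metadata URL copied from the console log above.
USER_DATA_URL = "http://169.254.169.254/openstack/latest/user_data"


def fetch_user_data(url=USER_DATA_URL, attempts=3, timeout=5.0,
                    opener=urllib.request.urlopen):
    """Try to GET the user data and return (body, errors).

    body is None if every attempt failed; errors holds one message per
    failed attempt. `opener` is injectable (hypothetical convenience) so
    the retry logic can be exercised without network access.
    """
    errors = []
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read(), errors
        except (urllib.error.URLError, OSError) as exc:
            # Covers connection refused, no route to host, and timeouts.
            errors.append(f"attempt #{attempt}: {exc}")
    return None, errors
```

Calling `fetch_user_data()` with the defaults on a node that cannot reach nova-metadata should come back with `None` as the body and one error message per attempt, which matches the repeated `attempt #N` lines in the console.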

Comment 4 Martin André 2021-03-23 15:44:05 UTC

Vexxhost made some networking changes and no longer automatically serves DNS from the DHCP server. We now need to specify a DNS server on the subnets we create, via the `externalDNS` parameter of install-config.yaml. We're working on a fix for our CI jobs.
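
As a rough illustration of the change, the sketch below sets `externalDNS` on an install-config that has already been parsed into a Python dict. It assumes the parameter lives under `platform.openstack`; the helper name `with_external_dns`, the sample config, and the `192.0.2.53` address (a documentation-only placeholder) are hypothetical and are not the values used in our CI jobs.

```python
import copy


def with_external_dns(install_config, dns_servers):
    """Return a copy of install_config with platform.openstack.externalDNS set.

    install_config is the parsed install-config.yaml as a dict (loading and
    dumping YAML is left to the caller); dns_servers is a list of resolver
    IP addresses.
    """
    if not dns_servers:
        raise ValueError("at least one DNS server address is required")
    updated = copy.deepcopy(install_config)  # leave the caller's dict untouched
    openstack = updated.setdefault("platform", {}).setdefault("openstack", {})
    openstack["externalDNS"] = list(dns_servers)
    return updated


# Hypothetical minimal config and placeholder resolver address.
config = {"platform": {"openstack": {"cloud": "example-cloud"}}}
print(with_external_dns(config, ["192.0.2.53"]))
```

Returning a copy rather than mutating in place keeps the original config available for comparison when debugging a job.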

Comment 5 Martin André 2021-03-23 15:46:00 UTC

Vexxhost also fixed the failing nova-metadata service yesterday, so configuring our jobs to use an `externalDNS` resolver should be enough to fix them.

Comment 6 Martin André 2021-03-24 10:13:05 UTC

This seems to be fixed now, at least for the pre-submit jobs:
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-openstack

Let's wait a bit longer to see whether this also fixed the periodic jobs.

Comment 7 Martin André 2021-03-24 13:23:45 UTC

Periodic jobs work too: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.7/1374662306641743872

Moving to VERIFIED.

Comment 10 errata-xmlrpc 2021-07-27 22:54:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438