Bug 1940889

Summary: Installation failures in OpenStack release jobs
Product: OpenShift Container Platform
Reporter: Petr Muller <pmuller>
Component: Installer
Assignee: Pierre Prinetti <pprinett>
Installer sub component: OpenShift on OpenStack
QA Contact: Jon Uriarte <juriarte>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: urgent
CC: m.andre, pprinett
Version: 4.8
Keywords: Triaged
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment: test: operator.Run template e2e-openstack - e2e-openstack container setup
Last Closed: 2021-07-27 22:54:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Petr Muller 2021-03-19 13:34:18 UTC
On March 18 the release-gating jobs for OpenStack started to fail during cluster installation:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-4.8
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.8

Example:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.8/1372850516870041600

level=info msg=Waiting up to 20m0s for the Kubernetes API at https://api.k8iyygc3-d8ea2.shiftstack.devcluster.openshift.com:6443...
...
level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.k8iyygc3-d8ea2.shiftstack.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 38.102.83.74:6443: connect: no route to host 

Martin André:
> bootstrap fails to get its ignition file: A start job is running for Ignition (fetch) (23min 18s / no limit)
> not sure why yet

I'm sorry for the very vague subject; I'm not able to diagnose installation failures myself.

Comment 1 Martin André 2021-03-22 09:10:36 UTC
The initial investigation shows that the bootstrap node is unable to fetch its ignition file:

    A start job is running for Ignition (fetch) (23min 49s / no limit)

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.8/1373877230773473280/artifacts/e2e-openstack-serial/bootstrap/nova.log

It affects all jobs running on Vexxhost, not just the 4.8 periodics: the 4.6 and 4.7 periodics and the pre-submit jobs are affected as well. Setting the priority to urgent, as this means we're currently navigating blind without CI.

Also, deploying the master installer + latest RHCOS + latest nightly release image in a different environment works fine, confirming that the breakage is limited to Vexxhost.

Comment 2 Martin André 2021-03-22 09:56:40 UTC
Vexxhost seems to have networking issues: the nova-metadata service is down. It has already been reported.

Comment 3 Martin André 2021-03-22 10:13:23 UTC
The console of a bootstrap node shows that it can't talk to nova-metadata:

A start job is running for Ignition (fetch) (49s / no limit)[   54.004386] ignition[720]: GET http://169.254.169.254/openstack/latest/user_data: attempt #8
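
To check from another instance on the same network whether the metadata endpoint is reachable, a retrying probe along the following lines could be used. This is only a sketch: it mimics the repeated GET attempts visible in the console log, it is not Ignition's actual implementation, and the names `http_get` and `fetch_with_retries` as well as the attempt count and delay are made up for illustration.

```python
import time
import urllib.request

# URL taken from the console log above.
USER_DATA_URL = "http://169.254.169.254/openstack/latest/user_data"


def http_get(url, timeout=5.0):
    """Fetch url and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retries(url, attempts=5, delay=1.0, fetch=http_get, sleep=time.sleep):
    """Try to fetch url up to `attempts` times; return the body, or None on failure."""
    for attempt in range(1, attempts + 1):
        print(f"GET {url}: attempt #{attempt}")
        try:
            return fetch(url)
        except OSError as error:
            # urllib's URLError/HTTPError and socket timeouts are all OSError subclasses.
            print(f"  failed: {error}")
            if attempt < attempts:
                sleep(delay)
    return None
```

Calling `fetch_with_retries(USER_DATA_URL)` on an affected instance would return `None` for as long as nova-metadata is unreachable; the `fetch` and `sleep` parameters exist only so the loop can be exercised without a network.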

Comment 4 Martin André 2021-03-23 15:44:05 UTC
Vexxhost made some networking changes and no longer automatically serves DNS from the DHCP server. We now need to specify a DNS server on the subnets we create, via the `externalDNS` parameter of install-config.yaml. We're working on a fix for our CI jobs.
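
For illustration, the change amounts to setting a list of resolver IP addresses under `platform.openstack.externalDNS` in the install-config. The Python sketch below applies that to an install-config already parsed into nested dicts. The helper name `with_external_dns`, the cloud name, and the resolver addresses are hypothetical placeholders; the actual change made to the CI jobs is not shown in this report.

```python
import copy
import ipaddress


def with_external_dns(install_config, resolvers):
    """Return a copy of install_config with platform.openstack.externalDNS set.

    install_config: parsed content of install-config.yaml as nested dicts.
    resolvers: list of DNS server IP addresses (placeholders in this sketch).
    """
    # Fail early (ValueError) on anything that is not a valid IPv4/IPv6 address.
    for address in resolvers:
        ipaddress.ip_address(address)

    updated = copy.deepcopy(install_config)  # leave the caller's dict untouched
    openstack = updated.setdefault("platform", {}).setdefault("openstack", {})
    openstack["externalDNS"] = list(resolvers)
    return updated


# Hypothetical fragment of an install-config; real files contain many more fields.
base = {"platform": {"openstack": {"cloud": "example-cloud"}}}
patched = with_external_dns(base, ["1.1.1.1", "8.8.8.8"])
print(patched["platform"]["openstack"]["externalDNS"])
```

The resolvers listed this way are the ones the installer configures on the subnets it creates, which is what replaces the DNS previously handed out by the provider's DHCP server.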

Comment 5 Martin André 2021-03-23 15:46:00 UTC
Vexxhost also fixed the failing nova-metadata service yesterday, so once we configure our jobs to use an externalDNS resolver, the jobs should pass again.

Comment 6 Martin André 2021-03-24 10:13:05 UTC
Seems to be fixed now, at least for the pre-submit jobs
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-openstack

Let's wait a bit longer to see if this also fixed the periodic jobs.

Comment 10 errata-xmlrpc 2021-07-27 22:54:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438