Bug 1940889 - Installation failures in OpenStack release jobs
Summary: Installation failures in OpenStack release jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Pierre Prinetti
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-19 13:34 UTC by Petr Muller
Modified: 2021-07-27 22:55 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
test: operator.Run template e2e-openstack - e2e-openstack container setup
Last Closed: 2021-07-27 22:54:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift release pull 17066 0 None open Bug 1940889: openstack: Add IPv4 external DNS to install-config 2021-03-23 16:25:13 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:55:01 UTC

Description Petr Muller 2021-03-19 13:34:18 UTC
On March 18 the release-gating jobs for OpenStack started to fail during cluster installation:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-4.8
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.8

Example:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.8/1372850516870041600

level=info msg=Waiting up to 20m0s for the Kubernetes API at https://api.k8iyygc3-d8ea2.shiftstack.devcluster.openshift.com:6443...
...
level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.k8iyygc3-d8ea2.shiftstack.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 38.102.83.74:6443: connect: no route to host 

Martin André:
> bootstrap fails to get its ignition file: A start job is running for Ignition (fetch) (23min 18s / no limit)
> not sure why yet

I'm sorry for the very vague subject; I'm not able to diagnose installation failures.

Comment 1 Martin André 2021-03-22 09:10:36 UTC
The initial investigation shows that the bootstrap node is unable to fetch its ignition file:

    A start job is running for Ignition (fetch) (23min 49s / no limit)

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.8/1373877230773473280/artifacts/e2e-openstack-serial/bootstrap/nova.log

It affects all jobs running on Vexxhost, not just the 4.8 periodics: the 4.6 and 4.7 periodics are also affected, as are the pre-submit jobs. Setting the priority to urgent, as this means we're currently navigating blind without CI.

Also, deploying the master installer + latest RHCOS + latest nightly release image in a different environment works fine, confirming that the breakage is limited to Vexxhost.

Comment 2 Martin André 2021-03-22 09:56:40 UTC
Vexxhost seems to have networking issues: the nova-metadata service is down. It has already been reported.

Comment 3 Martin André 2021-03-22 10:13:23 UTC
Looking at the console of a bootstrap node shows it can't talk to nova-metadata:

A start job is running for Ignition (fetch) (49s / no limit)[   54.004386] ignition[720]: GET http://169.254.169.254/openstack/latest/user_data: attempt #8
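
A minimal sketch of how the same check could be reproduced by hand from an affected instance, assuming Python 3 is available there. The URL is the one in the console message above; the function name `metadata_reachable`, the retry count, and the timeout are illustrative choices and are not Ignition's actual retry logic.

```python
import urllib.request

# Metadata URL copied from the Ignition console message above.
METADATA_URL = "http://169.254.169.254/openstack/latest/user_data"


def metadata_reachable(url=METADATA_URL, attempts=3, timeout=5.0,
                       opener=urllib.request.urlopen):
    """Return True if any attempt gets an HTTP 200 from the metadata service.

    `opener` is injectable so the function can be exercised without a network;
    this hook is an assumption of the sketch, not part of any OpenStack tool.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except OSError as exc:
            # urllib.error.URLError and socket timeouts are both OSError subclasses.
            print(f"attempt #{attempt}: {exc}")
    return False
```

Calling `metadata_reachable()` on a node where nova-metadata is down would be expected to print one line per failed attempt and return `False`.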

Comment 4 Martin André 2021-03-23 15:44:05 UTC
Vexxhost made some networking changes and no longer automatically serves DNS from the DHCP server. We now need to specify a DNS server for the subnets we create, via the `externalDNS` parameter of install-config.yaml. We're working on a fix for our CI jobs.
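
As a rough illustration of where that parameter sits, the sketch below sets `platform.openstack.externalDNS` on an already-parsed install-config, represented as a plain dict. The helper name `with_external_dns`, the trimmed example config, and the resolver address `192.0.2.53` (a documentation-range placeholder) are assumptions; the actual change made to the CI jobs is in the linked release pull request.

```python
import copy
import json


def with_external_dns(install_config, resolvers):
    """Return a copy of install_config with platform.openstack.externalDNS set."""
    updated = copy.deepcopy(install_config)
    openstack = updated.setdefault("platform", {}).setdefault("openstack", {})
    # externalDNS is a list of resolver IP addresses.
    openstack["externalDNS"] = list(resolvers)
    return updated


# Hypothetical, heavily trimmed install-config; real files carry many more fields.
example = {"platform": {"openstack": {"cloud": "example-cloud"}}}

# 192.0.2.53 is a placeholder, not the resolver used in CI.
print(json.dumps(with_external_dns(example, ["192.0.2.53"]), indent=2))
```

The helper copies the input rather than mutating it, so the original configuration stays available for comparison.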

Comment 5 Martin André 2021-03-23 15:46:00 UTC
Vexxhost also fixed the failing nova-metadata service yesterday, so after we configure our jobs to use an externalDNS resolver, that should fix the jobs.

Comment 6 Martin André 2021-03-24 10:13:05 UTC
Seems to be fixed now, at least for the pre-submit jobs
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-openstack

Let's wait a bit more to see if this also fixed the periodic jobs.

Comment 10 errata-xmlrpc 2021-07-27 22:54:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

