On March 18 the release-gating jobs for OpenStack started to fail during cluster installation:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-4.8
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.8

Example:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.8/1372850516870041600

level=info msg=Waiting up to 20m0s for the Kubernetes API at https://api.k8iyygc3-d8ea2.shiftstack.devcluster.openshift.com:6443...
...
level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.k8iyygc3-d8ea2.shiftstack.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 38.102.83.74:6443: connect: no route to host

Martin André:
> bootstrap fails to get its ignition file: A start job is running for Ignition (fetch) (23min 18s / no limit)
> not sure why yet

I'm sorry for the very vague subject; I'm not able to diagnose installation failures.
The initial investigation shows that the bootstrap node is unable to fetch its ignition file:

A start job is running for Ignition (fetch) (23min 49s / no limit)

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.8/1373877230773473280/artifacts/e2e-openstack-serial/bootstrap/nova.log

It affects all jobs running on Vexxhost, not just the 4.8 periodics: the 4.6 and 4.7 periodics are also affected, as are the pre-submit jobs. Setting the priority to urgent, as this means we're currently navigating blind without CI.

Also, deploying the master installer + latest RHCOS + latest nightly release image in a different environment works fine, confirming that the breakage is limited to Vexxhost.
Vexxhost seems to have networking issues: the nova-metadata service is down. This has already been reported to them.
Looking at the console of a bootstrap node shows that it can't talk to nova-metadata:

A start job is running for Ignition (fetch) (49s / no limit)
[ 54.004386] ignition[720]: GET http://169.254.169.254/openstack/latest/user_data: attempt #8
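For anyone who wants to reproduce the check from inside an instance, here is a minimal Python sketch that probes the same user_data URL Ignition is retrying in the console log above. The function name, the retry count, the timeout, and the injectable `opener` argument are illustrative assumptions, not part of Ignition or of our CI tooling.

```python
import urllib.error
import urllib.request

# URL taken from the Ignition console output above.
METADATA_URL = "http://169.254.169.254/openstack/latest/user_data"


def metadata_reachable(url=METADATA_URL, attempts=3, timeout=5.0,
                       opener=urllib.request.urlopen):
    """Return True if the metadata endpoint answers with a 2xx status.

    `opener` is injectable so the function can be exercised without a
    network; by default it is urllib.request.urlopen.
    """
    for _ in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return 200 <= response.status < 300
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, no route to host, HTTP error...
            # Try again until the attempts are used up.
            continue
    return False
```

On an affected instance this would be expected to return False; a real probe would probably also want a delay between attempts.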
Vexxhost made some networking changes and no longer automatically serves DNS from the DHCP server. We now need to specify a DNS server on the subnets we create, via the `externalDNS` parameter of install-config.yaml. We're working on a fix for our CI jobs.
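As a rough illustration of the change to the install config, the sketch below sets `externalDNS` under `platform.openstack` on an install-config that has already been loaded into a Python dict. The helper name, the sample `cloud` value, and the resolver addresses are placeholders chosen for the example; they are not the values used in the CI jobs.

```python
import copy


def with_external_dns(install_config, resolvers):
    """Return a copy of install_config with platform.openstack.externalDNS set.

    `resolvers` is a list of DNS server IP addresses given as strings.
    The input dict is left unmodified.
    """
    if not resolvers:
        raise ValueError("at least one DNS resolver is required")
    patched = copy.deepcopy(install_config)
    openstack = patched.setdefault("platform", {}).setdefault("openstack", {})
    openstack["externalDNS"] = list(resolvers)
    return patched


# Hypothetical, heavily trimmed install-config; real files have more fields.
config = {"platform": {"openstack": {"cloud": "example-cloud"}}}

# Placeholder resolver addresses, for illustration only.
patched = with_external_dns(config, ["1.1.1.1", "1.0.0.1"])
```

The patched dict can then be serialized back to install-config.yaml with whatever YAML library the job already uses.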
Vexxhost also fixed the failing nova-metadata service yesterday, so after we configure our jobs to use an externalDNS resolver, that should fix the jobs.
Seems to be fixed now, at least for the pre-submit jobs:
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-openstack

Let's wait a bit longer to see whether this also fixed the periodic jobs.
Periodic jobs work too:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.7/1374662306641743872

Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438