Bug 1963932 - Installation failures in bootstrap in OpenStack release jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Martin André
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-24 12:26 UTC by Petr Muller
Modified: 2021-07-27 23:10 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
test: operator.Run template e2e-openstack - e2e-openstack container setup [sig-sippy] infrastructure should work
Last Closed: 2021-07-27 23:10:05 UTC
Target Upstream Version:
Embargoed:


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:10:27 UTC

Description Petr Muller 2021-05-24 12:26:18 UTC
On May 22, OpenStack CI jobs started failing during installation:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.8
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-openstack-4.8

https://search.ci.openshift.org/?search=failed+to+lookup+masters%3A+resource+not+found+&maxAge=48h&context=0&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Example jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.8/1396768057287774208
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.8/1396647260615348224

 INFO[2021-05-24T02:46:06Z] level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.nhmqbc00-d8ea2.shiftstack.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 38.102.83.104:6443: connect: no route to host 
INFO[2021-05-24T02:46:06Z] level=debug msg=Fetching Bootstrap SSH Key Pair... 
INFO[2021-05-24T02:46:06Z] level=debug msg=Loading Bootstrap SSH Key Pair... 
INFO[2021-05-24T02:46:06Z] level=debug msg=Using Bootstrap SSH Key Pair loaded from state file 
INFO[2021-05-24T02:46:06Z] level=debug msg=Reusing previously-fetched Bootstrap SSH Key Pair 
INFO[2021-05-24T02:46:06Z] level=debug msg=Fetching Install Config...   
INFO[2021-05-24T02:46:06Z] level=debug msg=    Loading Platform...      
INFO[2021-05-24T02:46:06Z] level=debug msg=  Loading Pull Secret...     
INFO[2021-05-24T02:46:06Z] level=debug msg=  Loading Platform...        
INFO[2021-05-24T02:46:06Z] level=debug msg=Using Install Config loaded from state file 
INFO[2021-05-24T02:46:06Z] level=debug msg=Reusing previously-fetched Install Config 
INFO[2021-05-24T02:46:06Z] level=error msg=failed to lookup masters: resource not found 
INFO[2021-05-24T02:46:06Z] level=info msg=Pulling debug logs from the bootstrap machine 
INFO[2021-05-24T02:46:06Z] level=debug msg=Added /tmp/bootstrap-ssh437206137 to installer's internal agent 
INFO[2021-05-24T02:46:06Z] level=debug msg=Added /tmp/.ssh/id_rsa to installer's internal agent 
INFO[2021-05-24T02:46:06Z] level=error msg=Attempted to gather debug logs after installation failure: failed to create SSH client: dial tcp 38.102.83.11:22: connect: connection timed out 
INFO[2021-05-24T02:46:06Z] level=error msg=Bootstrap failed to complete: Get "https://api.nhmqbc00-d8ea2.shiftstack.devcluster.openshift.com:6443/version?timeout=32s": dial tcp 38.102.83.104:6443: connect: no route to host 
INFO[2021-05-24T02:46:06Z] level=error msg=Failed waiting for Kubernetes API. This error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane. 
INFO[2021-05-24T02:46:06Z] level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory 
INFO[2021-05-24T02:46:06Z] level=fatal msg=Bootstrap failed to complete
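When triaging a log like the one above, it can help to pull out just the error and fatal messages. The following is a minimal Python sketch, assuming every line has the `INFO[<timestamp>] level=<level> msg=<message>` shape seen in this job output. The names `parse_line` and `failures` are made up for illustration and are not part of the installer or the CI tooling.

```python
import re

# Matches the wrapper format seen in the job output above:
#   INFO[<timestamp>] level=<level> msg=<message>
# Assumption: the message runs to the end of the line.
LINE_RE = re.compile(
    r"^\s*INFO\[(?P<ts>[^\]]+)\]\s+level=(?P<level>\w+)\s+msg=(?P<msg>.*?)\s*$"
)


def parse_line(line):
    """Return a dict with ts, level and msg, or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


def failures(log_text):
    """Collect the messages logged at level=error or level=fatal, in order."""
    messages = []
    for line in log_text.splitlines():
        record = parse_line(line)
        if record and record["level"] in ("error", "fatal"):
            messages.append(record["msg"])
    return messages
```

Run against the excerpt above, this would surface "failed to lookup masters: resource not found" together with the SSH and API connection errors, without the surrounding debug lines.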

Comment 2 Martin André 2021-05-26 06:47:19 UTC
VMs can no longer reach the metadata service.

Booting a cirros VM shows the following in the logs:

Starting network: udhcpc: started, v1.29.3                                                                                                                                
udhcpc: sending discover
udhcpc: sending select for 172.16.0.125
udhcpc: lease of 172.16.0.125 obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "172.16.0.1"
OK
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 1.59. request failed
failed 2/20: up 14.74. request failed
failed 3/20: up 27.83. request failed
[snip]
failed 19/20: up 237.21. request failed
failed 20/20: up 250.30. request failed
failed to read iid from metadata. tried 20 
failed to get instance-id of datasource

I'm unable to SSH to the VM; however, I can connect to it via the noVNC client from the web interface. A request to the nova metadata service returns a 500 error. We've opened a support ticket with our cloud provider.
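For reference, the same check can be scripted from inside a VM to tell an unreachable metadata service apart from one that answers with a server error. This is only a sketch, assuming Python 3 is available on the guest and reusing the URL that cirros probes in the log above. The `probe` and `diagnose` helpers are hypothetical names and were not part of the actual debugging session.

```python
import urllib.error
import urllib.request

# URL taken from the cirros boot log above.
METADATA_URL = "http://169.254.169.254/2009-04-04/instance-id"


def probe(url=METADATA_URL, timeout=5, opener=urllib.request.urlopen):
    """Return the HTTP status code, or None if the service cannot be reached."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The service answered, but with an error status (e.g. 500).
        return err.code
    except (urllib.error.URLError, OSError):
        # No route, connection refused, timeout, and similar failures.
        return None


def diagnose(status):
    """Map a probe result to a short human-readable verdict."""
    if status is None:
        return "unreachable"
    if status >= 500:
        return "service error"
    if status >= 400:
        return "request rejected"
    return "ok"


if __name__ == "__main__":
    print(diagnose(probe()))
```

The `opener` parameter exists only so the function can be exercised without a real metadata endpoint. A "service error" verdict would correspond to the 500 response described above.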

Comment 4 Martin André 2021-05-28 06:32:04 UTC
Vexxhost has fixed the issue with their metadata service, and jobs are now passing again.

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.8

Comment 7 errata-xmlrpc 2021-07-27 23:10:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

