Description of problem:

Periodically https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-metal-serial-4.7 fails to install. Approximately 20% of the jobs in the past week failed like this. The job build log complains about operators not functioning properly, routes not becoming ready, IngressControllerDegraded, etc.

Example build log: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/build-log.txt
from this job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072

One thing that stands out is that no worker nodes appear to have been provisioned. The install log shows their IPs, for example 139.178.89.213, here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/container-logs/setup.log

But the must-gather has only a single reference to a worker IP, in namespaces/openshift-machine-config-operator/pods/machine-config-server-czpvn/machine-config-server/machine-config-server/logs/current.log, which is:

2021-02-02T04:05:48.165144180Z I0202 04:05:48.165086 1 api.go:117] Pool worker requested by address:"139.178.88.147:37738" User-Agent:"Ignition/2.9.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.2.0, */*;q=0.1"

The must-gather is here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/must-gather.tar

Version-Release number of selected component (if applicable):

This is happening in 4.7 but not in 4.6, according to these searches:
4.6: https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+image-registry+is+reporting+a+failure%3A+Degraded&maxAge=168h&context=1&type=junit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job
4.7: https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+image-registry+is+reporting+a+failure%3A+Degraded&maxAge=168h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

How reproducible:

Frequently enough on these metal jobs to need a solid understanding and a root cause; roughly 20% of runs of the job above in the last week.
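For anyone repeating this analysis: the per-machine ignition fetches can be counted directly from the must-gather's machine-config-server log. A minimal sketch, using the log-line format quoted above (the local filename here is illustrative; in a real must-gather, point it at .../machine-config-server/logs/current.log):

```shell
# Extract the distinct client IPs that requested the worker ignition config
# from a machine-config-server log. The sample line is copied from the report.
log=mcs-current.log
cat > "$log" <<'EOF'
2021-02-02T04:05:48.165144180Z I0202 04:05:48.165086 1 api.go:117] Pool worker requested by address:"139.178.88.147:37738" User-Agent:"Ignition/2.9.0"
EOF
# One line of output per distinct worker IP that fetched its ignition config.
grep 'Pool worker requested' "$log" \
  | grep -o 'address:"[0-9.]*' | cut -d'"' -f2 | sort -u
```

If the cluster was supposed to provision three workers and this prints fewer than three IPs, the missing machines never even asked the MCO for their configs.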
BMO is IPI; AFAIK UPI metal bugs just go to the main installer subcomponent.
I looked at one of the failures [1] and noticed two things.

1) As mentioned in the description, not all of the worker nodes appear to be pulling their ignition configs from the MCO. The logs only show requests from 2 of the worker machines. The openshift-cluster-machine-approver logs likewise only show CSRs for 2 of the worker machines.

2) None of the CSRs for the worker machines were approved. The CSRs for the masters were approved, but those would all have been approved by the bootstrap auto-approver. Once bootstrapping is done, the test job is supposed to auto-approve CSRs every 15 seconds. However, there is no output in the build log about approving CSRs.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1357010120940195840
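For context, the post-bootstrap auto-approval the job is expected to perform can be sketched roughly like this. This is illustrative only, not the actual template code: the real logic is the approve_csr function in the release template, and the function name and oc invocations here are my own:

```shell
# Rough sketch of a periodic CSR auto-approver (illustrative; the real
# template additionally filters for Pending CSRs before approving).
approve_all_csrs() {
  local csr
  for csr in $(oc get csr -o name); do
    oc adm certificate approve "$csr"
  done
}
# The job is supposed to run something like this every 15 seconds:
#   while true; do approve_all_csrs; sleep 15; done
```

Worker nodes cannot join the cluster until their kubelet CSRs are approved, which is why a silent approval loop leaves the workers stuck.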
The failures I looked at where the CSRs are not getting approved all have this error:

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get certificatesigningrequests.certificates.k8s.io)

It looks like the approve_csr function [1] will abort on that error.

[1] https://github.com/openshift/release/blob/a519efd531a5f7acf6771255b28b6616ebf64275/ci-operator/templates/openshift/installer/cluster-launch-installer-metal-e2e.yaml#L526
The installer-logged operator conditions in [1] and the gathered assets in [2] suggest the Kube API is alive and well in this cluster. But with no timestamps and no logging of successful approvals, it is hard to tell how the CSR approver is doing when it is not logging errors. I guess we can add logging to the template function so we have more to look at the next time an install fails?

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/build-log.txt
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/
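One possible shape for that extra logging, sketched below. This is hypothetical: the function name and oc invocations are illustrative rather than the template's real code, and the real Pending-CSR filtering is omitted for brevity. It timestamps every outcome and treats a transient API timeout as retryable instead of aborting the whole loop:

```shell
# Hypothetical logging-enhanced approval pass: log a timestamped line for
# every outcome, and tolerate a transient API error (e.g. the Timeout error
# seen in the failed runs) so the next pass can retry.
approve_pending_csrs() {
  local pending
  if ! pending=$(oc get csr -o name 2>&1); then
    echo "$(date -u +%FT%TZ) WARN listing CSRs failed (${pending}); retrying next pass"
    return 0
  fi
  if [ -n "$pending" ]; then
    echo "$(date -u +%FT%TZ) INFO approving: ${pending}"
    # Intentionally unquoted so each CSR name is passed as its own argument.
    oc adm certificate approve ${pending} || true
  else
    echo "$(date -u +%FT%TZ) INFO no CSRs to approve"
  fi
}
```

With output like this in the build log, the next failed run would show whether the approver was running, saw no CSRs, or was hitting API errors.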
There is no QA needed for this BZ as it is only relevant to CI.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438