Bug 1924358

Summary: metal UPI setup fails, no worker nodes
Product: OpenShift Container Platform
Component: Installer
Sub component: openshift-installer
Reporter: jamo luhrsen <jluhrsen>
Assignee: Aditya Narayanaswamy <anarayan>
QA Contact: Gaoyun Pei <gpei>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: bleanhar, mstaeble, stbenjam, wking
Version: 4.7
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Doc Text: This bug only affects CI.
Last Closed: 2021-07-27 22:37:56 UTC
Type: Bug

Description jamo luhrsen 2021-02-03 00:54:23 UTC
Description of problem:

periodically https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-metal-serial-4.7 is failing to
install. Approximately 20% of the jobs in the past week failed like this. The job build log complains about operators not functioning properly,
routes not becoming ready, IngressControllerDegraded, etc.

example build log here:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/build-log.txt
from this job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072

one thing that stands out is that no worker nodes appear to have been provisioned. The install log shows their IPs; for example,
139.178.89.213 from here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/container-logs/setup.log

but the must-gather contains only a single reference to this IP, in namespaces/openshift-machine-config-operator/pods/machine-config-server-czpvn/machine-config-server/machine-config-server/logs/current.log:

2021-02-02T04:05:48.165144180Z I0202 04:05:48.165086       1 api.go:117] Pool worker requested by address:"139.178.88.147:37738" User-Agent:"Ignition/2.9.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.2.0, */*;q=0.1"

must-gather is here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/must-gather.tar




Version-Release number of selected component (if applicable):

this is happening in 4.7 but not in 4.6, according to these searches:
4.6  https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+image-registry+is+reporting+a+failure%3A+Degraded&maxAge=168h&context=1&type=junit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job

4.7  https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+image-registry+is+reporting+a+failure%3A+Degraded&maxAge=168h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job


How reproducible:

this hits these metal jobs frequently enough to warrant a solid understanding and a root cause: roughly 20% of runs of the job above in the
last week.

Comment 1 Stephen Benjamin 2021-02-03 20:36:11 UTC
BMO (the baremetal-operator) is IPI-only; UPI metal bugs, AFAIK, just go to the main Installer subcomponent.

Comment 2 Matthew Staebler 2021-02-04 02:20:49 UTC
I looked at one of the failures [1] and noticed two things.

1) As mentioned in the description, not all of the worker nodes appear to be pulling their ignition configs from the MCO. The logs only show requests made from 2 of the worker machines. The openshift-cluster-machine-approver logs also only show CSRs for 2 of the worker machines.

2) None of the CSRs for the worker machines were approved. The CSRs for the masters were approved, but they would all have been approved by the bootstrap auto-approver. Once bootstrapping is done, the test job is supposed to auto-approve CSRs every 15 seconds. However, there is no output in the build log regarding approving CSRs.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1357010120940195840

Comment 3 Matthew Staebler 2021-02-04 02:32:24 UTC
In the failures that I looked at where the CSRs are not getting approved, they all have this error.

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get certificatesigningrequests.certificates.k8s.io)

It looks like the approve_csr function [1] will abort on that error.

[1] https://github.com/openshift/release/blob/a519efd531a5f7acf6771255b28b6616ebf64275/ci-operator/templates/openshift/installer/cluster-launch-installer-metal-e2e.yaml#L526

Comment 4 W. Trevor King 2021-02-04 02:55:37 UTC
The presence of installer-logged operator conditions in [1] and the gathered assets in [2] suggest the Kube API is alive and well in this cluster.  The lack of timestamps or successful-approval logging makes it hard to know how happy the CSR-approver is when it is not logging errors.  I guess we can add logging to the template function, and have more to look at the next time it fails an install?

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/build-log.txt
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/
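Comment 4's suggestion (add logging to the template's approve_csr function so the next failure leaves evidence) could look roughly like the sketch below. This is an illustration, not the actual template code; the function name `approve_pending_csrs` is hypothetical. It timestamps every action and treats transient API errors, such as the server timeout quoted in comment 3, as retryable rather than aborting.

```shell
# Hypothetical hardened replacement for the template's CSR-approval step.
# Assumptions: `oc` is on PATH with a valid kubeconfig; the caller invokes
# this function on the job's 15-second cadence.
approve_pending_csrs() {
  local pending csr
  # List CSRs that have no status yet (neither approved nor denied).
  if ! pending=$(oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' 2>&1); then
    # A transient error (e.g. "the server was unable to return a response
    # in the time allotted") is logged and retried on the next iteration
    # instead of aborting the whole loop.
    echo "$(date -u +%FT%TZ) WARN: listing CSRs failed, will retry: ${pending}"
    return 0
  fi
  for csr in ${pending}; do
    if oc adm certificate approve "${csr}" >/dev/null 2>&1; then
      echo "$(date -u +%FT%TZ) INFO: approved CSR ${csr}"
    else
      echo "$(date -u +%FT%TZ) WARN: failed to approve CSR ${csr}, will retry"
    fi
  done
}
```

A driver like `while true; do approve_pending_csrs; sleep 15; done` would reproduce the template's 15-second cadence while leaving timestamped breadcrumbs in the build log for the next failed install.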

Comment 6 Matthew Staebler 2021-02-07 17:38:16 UTC
There is no QA needed for this BZ as it is only relevant to CI.

Comment 9 errata-xmlrpc 2021-07-27 22:37:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438