Bug 1924358
Summary: | metal UPI setup fails, no worker nodes | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | jamo luhrsen <jluhrsen> |
Component: | Installer | Assignee: | Aditya Narayanaswamy <anarayan> |
Installer sub component: | openshift-installer | QA Contact: | Gaoyun Pei <gpei> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | | |
Priority: | medium | CC: | bleanhar, mstaeble, stbenjam, wking |
Version: | 4.7 | Target Release: | 4.8.0 |
Hardware: | Unspecified | OS: | Unspecified |
Doc Type: | No Doc Update | Doc Text: | This bug only affects CI. |
Last Closed: | 2021-07-27 22:37:56 UTC | Type: | Bug |
Description
jamo luhrsen
2021-02-03 00:54:23 UTC
BMO is IPI; UPI metal bugs AFAIK just go to the main installer subcomponent.

I looked at one of the failures [1] and noticed two things.

1) As mentioned in the description, not all of the worker nodes appear to be pulling their ignition configs from the MCO. The logs only show requests made from 2 of the worker machines. The openshift-cluster-machine-approver logs also only show CSRs for 2 of the worker machines.

2) None of the CSRs for the worker machines were approved. The CSRs for the masters were approved, but they would have all been approved by the bootstrap auto-approver. Once the bootstrapping is done, the test job is supposed to auto-approve CSRs every 15 seconds. However, there is no output in the build log regarding approving CSRs.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1357010120940195840

In the failures that I looked at where the CSRs are not getting approved, they all have this error.

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get certificatesigningrequests.certificates.k8s.io)

It looks like the approve_csr function [1] will abort on that error.

[1] https://github.com/openshift/release/blob/a519efd531a5f7acf6771255b28b6616ebf64275/ci-operator/templates/openshift/installer/cluster-launch-installer-metal-e2e.yaml#L526

The presence of installer-logged operator conditions in [1] and the gathered assets in [2] suggests the Kube API is alive and well in this cluster. The lack of timestamps or successful-approval logging makes it hard to know how happy the CSR-approver is when it is not logging errors. I guess we can add logging to the template function, and have more to look at the next time it fails an install?

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/build-log.txt
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/

There is no QA needed for this BZ as it is only relevant to CI.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
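
For readers following the CSR discussion above, here is a rough sketch of an approval loop that keeps running after a transient API timeout and logs a timestamp for every pass. It is written in Python for illustration only and is not the template's actual approve_csr function or the fix that was merged. The function names (log, run_oc, pending_csr_names, approve_once, approve_loop) are hypothetical, and it assumes that a pending CSR is one whose status object is empty. The only external commands used are `oc get csr -o json` and `oc adm certificate approve`.

```python
import json
import subprocess
import time
from datetime import datetime, timezone


def log(msg):
    # Timestamped output so the build log shows when each pass happened.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    print(f"{stamp} approve_csr: {msg}", flush=True)


def run_oc(args):
    """Run `oc` with the given arguments; return (exit code, stdout, stderr)."""
    proc = subprocess.run(["oc", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def pending_csr_names(csr_list_json):
    """Names of CSRs with an empty status (assumed here to mean pending)."""
    items = json.loads(csr_list_json).get("items", [])
    return [item["metadata"]["name"] for item in items if not item.get("status")]


def approve_once(run=run_oc):
    """One pass: list CSRs, approve the pending ones, return approved names.

    A failed list call (for example a server timeout) is logged and
    reported as an empty result instead of raising, so a caller's loop
    can simply try again on the next pass.
    """
    code, out, err = run(["get", "csr", "-o", "json"])
    if code != 0:
        log(f"listing CSRs failed (exit {code}), will retry: {err.strip()}")
        return []
    names = pending_csr_names(out)
    if not names:
        log("no pending CSRs")
    approved = []
    for name in names:
        code, _, err = run(["adm", "certificate", "approve", name])
        if code == 0:
            log(f"approved {name}")
            approved.append(name)
        else:
            log(f"approving {name} failed (exit {code}): {err.strip()}")
    return approved


def approve_loop(interval=15, run=run_oc, should_stop=lambda: False):
    """Repeat approve_once every `interval` seconds until should_stop() is true."""
    while not should_stop():
        approve_once(run)
        time.sleep(interval)
```

Calling approve_loop() with its defaults would poll every 15 seconds, matching the interval mentioned above. The run parameter exists so the logic can be exercised without a cluster by passing a stand-in for run_oc.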