Description of problem:

Periodically https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-metal-serial-4.7 fails to install. Approximately 20% of the jobs in the past week failed like this. The job build log complains about operators not functioning properly, routes not becoming ready, IngressControllerDegraded, etc.

Example build log: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/build-log.txt
from this job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072

One thing that stands out is that no worker nodes appear to have been provisioned. The install log shows their IPs, for example 139.178.89.213, here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/container-logs/setup.log

But the must-gather has only a single reference to a worker IP, in namespaces/openshift-machine-config-operator/pods/machine-config-server-czpvn/machine-config-server/machine-config-server/logs/current.log, which is:

2021-02-02T04:05:48.165144180Z I0202 04:05:48.165086 1 api.go:117] Pool worker requested by address:"139.178.88.147:37738" User-Agent:"Ignition/2.9.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.2.0, */*;q=0.1"

The must-gather is here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/must-gather.tar

Version-Release number of selected component (if applicable):

This is happening in 4.7 but not in 4.6, according to these searches:
4.6: https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+image-registry+is+reporting+a+failure%3A+Degraded&maxAge=168h&context=1&type=junit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job
4.7: https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+image-registry+is+reporting+a+failure%3A+Degraded&maxAge=168h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

How reproducible:

Frequently enough on these metal jobs to need a solid understanding and a root cause; roughly 20% of runs of the job above in the last week.
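For anyone repeating this analysis: the per-machine ignition fetches can be counted directly from the must-gather's machine-config-server log. A minimal sketch, using the log-line format quoted above (the local filename here is illustrative; in a real must-gather, point it at .../machine-config-server/logs/current.log):

```shell
# Extract the distinct client IPs that requested the worker ignition config
# from a machine-config-server log. The sample line is copied from the report.
log=mcs-current.log
cat > "$log" <<'EOF'
2021-02-02T04:05:48.165144180Z I0202 04:05:48.165086 1 api.go:117] Pool worker requested by address:"139.178.88.147:37738" User-Agent:"Ignition/2.9.0"
EOF
# One line of output per distinct worker IP that fetched its ignition config.
grep 'Pool worker requested' "$log" \
  | grep -o 'address:"[0-9.]*' | cut -d'"' -f2 | sort -u
```

If the cluster was supposed to provision three workers and this prints fewer than three IPs, the missing machines never even asked the MCO for their configs.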
BMO is IPI; AFAIK UPI metal bugs just go to the main installer subcomponent.
I looked at one of the failures [1] and noticed two things.

1) As mentioned in the description, not all of the worker nodes appear to be pulling their ignition configs from the MCO. The logs only show requests from 2 of the worker machines. The openshift-cluster-machine-approver logs likewise only show CSRs for 2 of the worker machines.

2) None of the CSRs for the worker machines were approved. The CSRs for the masters were approved, but those would all have been approved by the bootstrap auto-approver. Once bootstrapping is done, the test job is supposed to auto-approve CSRs every 15 seconds. However, there is no output in the build log about approving CSRs.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1357010120940195840
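For context, the post-bootstrap auto-approval the job is expected to perform can be sketched roughly like this. This is illustrative only, not the actual template code: the real logic is the approve_csr function in the release template, and the function name and oc invocations here are my own:

```shell
# Rough sketch of a periodic CSR auto-approver (illustrative; the real
# template additionally filters for Pending CSRs before approving).
approve_all_csrs() {
  local csr
  for csr in $(oc get csr -o name); do
    oc adm certificate approve "$csr"
  done
}
# The job is supposed to run something like this every 15 seconds:
#   while true; do approve_all_csrs; sleep 15; done
```

Worker nodes cannot join the cluster until their kubelet CSRs are approved, which is why a silent approval loop leaves the workers stuck.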
The failures I looked at where the CSRs are not getting approved all have this error:

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get certificatesigningrequests.certificates.k8s.io)

It looks like the approve_csr function [1] will abort on that error.

[1] https://github.com/openshift/release/blob/a519efd531a5f7acf6771255b28b6616ebf64275/ci-operator/templates/openshift/installer/cluster-launch-installer-metal-e2e.yaml#L526
The installer-logged operator conditions in [1] and the gathered assets in [2] suggest the Kube API is alive and well in this cluster. But with no timestamps and no logging of successful approvals, it is hard to tell how the CSR approver is doing when it is not logging errors. I guess we can add logging to the template function so we have more to look at the next time an install fails?

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/build-log.txt
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.7/1356446127385219072/artifacts/e2e-metal-serial/
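One possible shape for that extra logging, sketched below. This is hypothetical: the function name and oc invocations are illustrative rather than the template's real code, and the real Pending-CSR filtering is omitted for brevity. It timestamps every outcome and treats a transient API timeout as retryable instead of aborting the whole loop:

```shell
# Hypothetical logging-enhanced approval pass: log a timestamped line for
# every outcome, and tolerate a transient API error (e.g. the Timeout error
# seen in the failed runs) so the next pass can retry.
approve_pending_csrs() {
  local pending
  if ! pending=$(oc get csr -o name 2>&1); then
    echo "$(date -u +%FT%TZ) WARN listing CSRs failed (${pending}); retrying next pass"
    return 0
  fi
  if [ -n "$pending" ]; then
    echo "$(date -u +%FT%TZ) INFO approving: ${pending}"
    # Intentionally unquoted so each CSR name is passed as its own argument.
    oc adm certificate approve ${pending} || true
  else
    echo "$(date -u +%FT%TZ) INFO no CSRs to approve"
  fi
}
```

With output like this in the build log, the next failed run would show whether the approver was running, saw no CSRs, or was hitting API errors.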
There is no QA needed for this BZ as it is only relevant to CI.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438