Description of problem: The last three 4.11 nightly payloads have all failed, each with a metal-ipi-ovn-ipv6 job failure. In each, the cluster fails to install, complaining:

level=info msg=Waiting up to 1h0m0s (until 3:54AM) for the cluster at https://api.ostest.test.metalkube.org:6443 to initialize...
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, insights, kube-apiserver, monitoring, openshift-samples
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, insights, kube-apiserver, monitoring, openshift-samples
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 746 of 777 done (96% complete)
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 746 of 777 done (96% complete)
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 746 of 777 done (96% complete)
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, insights, monitoring
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 768 of 777 done (98% complete)
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, insights
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, insights
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, insights
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 773 of 777 done (99% complete)
level=debug msg=Still waiting for the cluster to initialize: Cluster operator insights is not available
bash: line 52: 16671 Killed timeout -s 9 105m make

Examples:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-ovn-ipv6/1502214830407290880
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-ovn-ipv6/1502099122117677056
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-ovn-ipv6/1501915647955701760

Using the last of the three links above for analysis.

The insights operator in its log reports:

I0310 15:45:44.183372 1 operator.go:201] The last pod state is unhealthy

This code originates from:
https://github.com/openshift/insights-operator/blob/b67157fc871fe846850235334b7eaa6ca3a74547/pkg/controller/operator.go#L201

This seems to indicate it is unhappy about some container status on the insights pod itself. The status for that pod can be seen here, but bear in mind it was captured after the install failed, and the state could have changed since:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-ovn-ipv6/1501915647955701760/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods.json

We have no e2e intervals graph as this cluster did not make it past install.
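For context on what that log line implies, here is a minimal sketch (not the operator's actual code, and written only from the behavior described above) of the kind of self-check suggested by operator.go:201: inspect the operator's own pod and treat it as unhealthy if a container has restarted or last terminated with a non-zero exit code. The label selector and the exact unhealthy criteria are assumptions for illustration; only the openshift-insights namespace is taken from the product.

// sketch.go - hypothetical illustration, NOT the insights-operator implementation
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// lastPodStateUnhealthy reports whether any insights-operator pod container
// looks unhealthy: a prior termination with a non-zero exit code, or restarts.
func lastPodStateUnhealthy(ctx context.Context, client kubernetes.Interface) (bool, error) {
	pods, err := client.CoreV1().Pods("openshift-insights").List(ctx, metav1.ListOptions{
		LabelSelector: "app=insights-operator", // assumed label, for illustration only
	})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if term := cs.LastTerminationState.Terminated; term != nil && term.ExitCode != 0 {
				return true, nil
			}
			if cs.RestartCount > 0 {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	unhealthy, err := lastPodStateUnhealthy(context.Background(), client)
	if err != nil {
		panic(err)
	}
	if unhealthy {
		// Mirrors the message seen in the failed runs.
		fmt.Println("The last pod state is unhealthy")
	}
}

If a check along these lines feeds the operator's Available/Degraded conditions, a transient container restart on the insights pod would be enough to keep the operator reporting unavailable and hold the install at 99%, which matches what the installer log shows.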
This problem only affects metal-ipi; insights install success is very close to 100% on all other variants: https://sippy.ci.openshift.org/sippy-ng/install/4.11/operators

As this blocks all nightly payloads for every part of the org that depends on them, TRT believes this bug should be marked urgent; we need to get the payloads flowing again as soon as possible. I am unsure whether this is a metal problem or an insights problem. Starting with insights, but I will notify the metal teams as well.
We believe we've identified the problematic PR and opened a revert: https://github.com/openshift/insights-operator/pull/592

This is the second time this PR has caused a regression and had to be reverted; we need to make sure it works on metal-ipi before it goes in again.
"cluster bootstrap should succeed" test shows this problem resolving as of around Mar 11. I think this is safe to verify.
Verified on 4.11.0-0.ci-2022-03-15-032841. The test works as expected.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069