The upgrade jobs frequently fail the test case "Cluster should remain functional during upgrade" with:

  "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is not available"

Example run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1428340402481532928

That test case can fail for many different reasons, but this particular monitoring operator failure appears to be one of the most frequent. Briefly digging into the artifacts:

'oc get pods' shows one node-exporter pod stuck on its init container:

  openshift-monitoring   node-exporter-5bvsr   2/2   Running    0   3h6m   10.0.205.43   ip-10-0-205-43.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-75w29   0/2   Init:0/1   0   175m   10.0.177.20   ip-10-0-177-20.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-gzc8z   2/2   Running    0   173m   10.0.157.65   ip-10-0-157-65.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-vk4zx   2/2   Running    0   178m   10.0.217.91   ip-10-0-217-91.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-wnxpv   2/2   Running    0   3h6m   10.0.136.24   ip-10-0-136-24.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-x7n47   2/2   Running    0   3h6m   10.0.138.18   ip-10-0-138-18.ec2.internal   <none>   <none>

'oc get operators' shows the monitoring operator as Available=False, Progressing=True, Degraded=True:

  monitoring   4.9.0-0.nightly-2021-08-18-144658   False   True   True   157m   Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.

In the cluster-monitoring-operator pod log [0], this line shows up a few times:

  W0819 16:28:40.593676   1 tasks.go:71] task 7 of 15: Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 6 ready pods for "node-exporter" daemonset, got 5

[0] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1428340402481532928/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-dfbf6d699-lzklv_cluster-monitoring-operator.log
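For reference, the "expected 6 ready pods ... got 5" message is just the operator comparing DaemonSet status counters while it waits for the rollout. Below is a minimal client-go sketch of that kind of check, not the operator's actual code; the kubeconfig path is a placeholder.

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; not taken from the CI run.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	ds, err := client.AppsV1().DaemonSets("openshift-monitoring").Get(context.TODO(), "node-exporter", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// The operator's message corresponds to a mismatch like this:
	// DesiredNumberScheduled ("expected") vs NumberReady ("got").
	if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled {
		fmt.Printf("daemonset not rolled out: expected %d ready pods, got %d\n",
			ds.Status.DesiredNumberScheduled, ds.Status.NumberReady)
	} else {
		fmt.Println("daemonset rollout complete")
	}
}

A check along these lines, retried until a timeout, would report exactly the "expected 6 ready pods, got 5" failure above when one pod never gets past its init container.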
Hint from @simon:

> hey! looking quickly at the CI logs, I've found that one pod instance couldn't be started on node ip-10-0-177-20.ec2.internal. The events say:
>
>   {
>     "apiVersion": "v1",
>     "count": 807,
>     "eventTime": null,
>     "firstTimestamp": "2021-08-19T13:37:58Z",
>     "involvedObject": {
>       "apiVersion": "v1",
>       "kind": "Pod",
>       "name": "node-exporter-75w29",
>       "namespace": "openshift-monitoring",
>       "resourceVersion": "22157",
>       "uid": "6af24349-66a6-4c67-a759-062e2fa67242"
>     },
>     "kind": "Event",
>     "lastTimestamp": "2021-08-19T16:32:44Z",
>     "message": "unable to ensure pod container exists: failed to create container for [kubepods burstable pod6af24349-66a6-4c67-a759-062e2fa67242] : Unit kubepods-burstable-pod6af24349_66a6_4c67_a759_062e2fa67242.slice already exists.",
>     "metadata": {
>       "creationTimestamp": "2021-08-19T13:38:00Z",
>       "name": "node-exporter-75w29.169cb8bae3eb3430",
>       "namespace": "openshift-monitoring",
>       "resourceVersion": "94761",
>       "uid": "c2bc0ab7-5ca9-404c-9c1c-8f61b36ecdf4"
>     },
>     "reason": "FailedCreatePodContainer",
>     "reportingComponent": "",
>     "reportingInstance": "",
>     "source": {
>       "component": "kubelet",
>       "host": "ip-10-0-177-20.ec2.internal"
>     },
>     "type": "Warning"
>   }
>
> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1428340402481532928/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json

Assigning it to the node team for their evaluation.
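Side note: instead of scanning the whole events.json, the same kubelet events can be pulled per pod with a field selector. A rough client-go sketch, with the kubeconfig path as a placeholder and the namespace/pod name taken from the run above:

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; not taken from the CI run.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Select only events whose involvedObject is the stuck node-exporter pod.
	events, err := client.CoreV1().Events("openshift-monitoring").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.name=node-exporter-75w29",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Print type, reason, repeat count, and message for each matching event,
	// e.g. Warning / FailedCreatePodContainer / count=807 / "unable to ensure pod container exists: ..."
	for _, e := range events.Items {
		fmt.Printf("%s\t%s\tcount=%d\t%s\n", e.Type, e.Reason, e.Count, e.Message)
	}
}

On a live cluster the equivalent is 'oc get events -n openshift-monitoring --field-selector involvedObject.name=node-exporter-75w29'.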
*** This bug has been marked as a duplicate of bug 1993980 ***