When pod expectations are not met, status for workloads can wedge. When status for workloads wedges, operators wait indefinitely. When operators wait indefinitely, status is wrong. When status is wrong, upgrades can fail. Picking https://github.com/kubernetes/kubernetes/pull/91008 seems like a fix.
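For anyone trying to spot the wedge on a live cluster, a rough sketch of what it tends to look like (the deployment name and the app=<name> label below are placeholders, not taken from this bug): the owning ReplicaSet stops updating its status, so the counts it reports disagree with what "oc get po" shows and never converge, even after the pods themselves settle.

$ oc get deploy/<name> -o jsonpath='{.spec.replicas} {.status.replicas} {.status.readyReplicas} {.status.observedGeneration}{"\n"}'
$ oc get rs -l app=<name> -o jsonpath='{range .items[*]}{.metadata.name} {.spec.replicas} {.status.replicas} {.status.readyReplicas}{"\n"}{end}'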
*** Bug 1843319 has been marked as a duplicate of this bug. ***
This has been identified as at least one cause of the failures seen in "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]", where the upgrade completely fails (see discussion in https://bugzilla.redhat.com/show_bug.cgi?id=1843319).
Moving this back to 4.5: we are seeing upgrade failures in 4.5, and this bug was duped against the 4.5 bug that was filed for them.
Poking at [1] from [2], here's the CVO being apparently happy with the monitoring Deployment:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/pods/openshift-cluster-version_cluster-version-operator-cb7cf5b8c-5xhm6_cluster-version-operator.log | grep 'Running sync .*in state\|Result of work\|deployment.*openshift-monitoring' | tail
I0601 16:25:43.966583 1 task_graph.go:596] Result of work: [Cluster operator monitoring is still updating]
I0601 16:28:58.732465 1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-05-29-203954 (force=true) on generation 2 in state Updating at attempt 7
I0601 16:28:58.866011 1 task_graph.go:596] Result of work: []
I0601 16:29:16.488091 1 sync_worker.go:653] Running sync for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)
I0601 16:29:18.889119 1 sync_worker.go:666] Done syncing for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)
I0601 16:34:43.784268 1 task_graph.go:596] Result of work: [Cluster operator monitoring is still updating]
I0601 16:37:51.316881 1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-05-29-203954 (force=true) on generation 2 in state Updating at attempt 8
I0601 16:37:51.459038 1 task_graph.go:596] Result of work: []
I0601 16:38:08.974398 1 sync_worker.go:653] Running sync for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)
I0601 16:38:11.373663 1 sync_worker.go:666] Done syncing for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)

despite that Deployment being obviously sad:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/daemonsets.json | zcat | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring").status'
{
  "currentNumberScheduled": 6,
  "desiredNumberScheduled": 6,
  "numberMisscheduled": 0,
  "numberReady": 0,
  "numberUnavailable": 6,
  "observedGeneration": 1,
  "updatedNumberScheduled": 6
}

So there is room for improving the CVO's Deployment logic to account for numberUnavailable and complain about the stuck Deployment. Although the current docs [3] claim "There are no unavailable replicas"? I'll poke around. But this reporting would just improve the current error message:

  Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is still updating

to talk about the actual problem; it would not unstick the Deployment.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56
[3]: https://github.com/openshift/cluster-version-operator/blob/cb8241a33846546ab63ec8c01112f633b6980182/docs/user/reconciliation.md#deployment
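(Aside, for future log-divers: assuming the run also gathered a deployments.json next to the daemonsets.json above, and that it is gzipped the same way, something like the following should show whether any Deployment in the namespace is actually reporting unavailable replicas; unavailableReplicas is the standard DeploymentStatus field such a CVO check would need to consult.)

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/deployments.json | zcat | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring") | .metadata.name + " unavailable=" + ((.status.unavailableReplicas // 0) | tostring)'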
So I was just mixing up Deployment vs. DaemonSet. The CVO does not manage DaemonSet openshift-monitoring/node-exporter, and it's up to the monitoring operator to explain what's going on with those unavailable pods:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring" and (.metadata.name | startswith("node-exporter-"))).status | select(.phase != "Running").conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort | uniq
2020-06-01T15:36:49Z PodScheduled=False Unschedulable: 0/6 nodes are available: 6 node(s) didn't have free ports for the requested pod ports.

Which it is doing (ish):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "monitoring").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-06-01T15:41:33Z Available=False :
2020-06-01T15:46:42Z Degraded=True UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 0, unavailable: 6)
2020-06-01T16:38:19Z Progressing=True RollOutInProgress: Rolling out the stack.
2020-06-01T16:38:19Z Upgradeable=True RollOutInProgress: Rollout of the monitoring stack is in progress. Please wait until it finishes.

So I think the only bug here is the underlying "DaemonSet controller should not get hung up" from bug 1843319, bug 1790989, and possibly other places, which might be fixed by the upstream PR David linked in comment 0. Sorry for the noise.
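(To see which pods were actually holding those host ports, a quick query against the same pods.json could work; hostPort is the standard container-port field, and the "<unscheduled>" fallback is just to keep jq happy for pending pods.)

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/pods.json | jq -r '.items[] | . as $pod | .spec.containers[].ports[]? | select(.hostPort) | ($pod.spec.nodeName // "<unscheduled>") + " " + $pod.metadata.namespace + "/" + $pod.metadata.name + " hostPort=" + (.hostPort | tostring)' | sort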
Confirmed with payload 4.5.0-0.nightly-2020-06-11-183238; the issue has been fixed.

Open two terminals: on the first, delete the pod; on the second, scale down the deployment. Then check the deployment: no new pod is created, and the status updates correctly.

[root@dhcp-140-138 ~]# oc delete po/ruby-ex-557d5865d4-vccl7
pod "ruby-ex-557d5865d4-vccl7" deleted
[root@dhcp-140-138 ~]# oc scale deploy/ruby-ex --replicas=1
deployment.apps/ruby-ex scaled
[root@dhcp-140-138 ~]# oc get po
NAME                       READY   STATUS        RESTARTS   AGE
ruby-ex-1-build            0/1     Completed     0          21m
ruby-ex-557d5865d4-vccl7   1/1     Terminating   0          20m
ruby-ex-557d5865d4-x7jgk   1/1     Running       0          9m5s
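(Not part of the verification above, but as an extra sanity check the deployment's reported counts should converge to 1 across the board once the Terminating pod goes away; these are the standard Deployment status fields.)

$ oc get deploy/ruby-ex -o jsonpath='{.spec.replicas} {.status.replicas} {.status.readyReplicas} {.status.availableReplicas}{"\n"}'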
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475