Bug 1843187
| Summary: | daemonset, deployment, and replicaset status can permafail | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Eads <deads> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.5 | CC: | agarcial, aos-bugs, bparees, mfojtik, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |

Doc Text:

Cause: In certain cases, a NotFound error was swallowed by the controller logic.
Consequence: The missing NotFound event left the controller unaware of missing pods.
Fix: Properly react to NotFound events, which indicate that the pod was already removed by a different actor.
Result: Controllers (deployment, daemonset, replicaset, and others) now react properly to pod NotFound events.

| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | | |
| : | 1843462 (view as bug list) | Environment: | [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] |
| Last Closed: | 2020-10-27 16:04:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1843462 | | |
Description
David Eads
2020-06-02 18:51:47 UTC
*** Bug 1843319 has been marked as a duplicate of this bug. ***

This has been identified as at least one cause of failures seen in "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]", where the upgrade completely fails (see the discussion in https://bugzilla.redhat.com/show_bug.cgi?id=1843319).

Moving this back to 4.5: we are seeing upgrade failures in 4.5, and this bug was duped against the 4.5 bug that was filed for them.

Poking at [1] from [2], here's the CVO being apparently happy with the monitoring Deployment:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/pods/openshift-cluster-version_cluster-version-operator-cb7cf5b8c-5xhm6_cluster-version-operator.log | grep 'Running sync .*in state\|Result of work\|deployment.*openshift-monitoring' | tail
I0601 16:25:43.966583 1 task_graph.go:596] Result of work: [Cluster operator monitoring is still updating]
I0601 16:28:58.732465 1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-05-29-203954 (force=true) on generation 2 in state Updating at attempt 7
I0601 16:28:58.866011 1 task_graph.go:596] Result of work: []
I0601 16:29:16.488091 1 sync_worker.go:653] Running sync for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)
I0601 16:29:18.889119 1 sync_worker.go:666] Done syncing for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)
I0601 16:34:43.784268 1 task_graph.go:596] Result of work: [Cluster operator monitoring is still updating]
I0601 16:37:51.316881 1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-05-29-203954 (force=true) on generation 2 in state Updating at attempt 8
I0601 16:37:51.459038 1 task_graph.go:596] Result of work: []
I0601 16:38:08.974398 1 sync_worker.go:653] Running sync for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)
I0601 16:38:11.373663 1 sync_worker.go:666] Done syncing for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)

despite that Deployment being obviously sad:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/daemonsets.json | zcat | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring").status'
{
  "currentNumberScheduled": 6,
  "desiredNumberScheduled": 6,
  "numberMisscheduled": 0,
  "numberReady": 0,
  "numberUnavailable": 6,
  "observedGeneration": 1,
  "updatedNumberScheduled": 6
}

So there is room for improving the CVO's Deployment logic to account for numberUnavailable and complain about the stuck Deployment. Although the current docs [3] claim "There are no unavailable replicas"? I'll poke around. But improved reporting would only sharpen the current error message:

    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is still updating

it would not unstick the Deployment.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56
[3]: https://github.com/openshift/cluster-version-operator/blob/cb8241a33846546ab63ec8c01112f633b6980182/docs/user/reconciliation.md#deployment
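As an aside, the kind of unavailability check being discussed could look something like the following client-go sketch, applied to the same DaemonSet status fields shown in the jq output above. This is a hypothetical illustration, not the CVO's actual reconciliation code; the kubeconfig handling and the "looks stuck" condition are assumptions.

```go
// Hypothetical illustration only: flag a workload as stuck using the same
// DaemonSet status fields shown in the jq output above. Not the CVO's code.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The DaemonSet from the artifacts above; the check itself is generic.
	ds, err := client.AppsV1().DaemonSets("openshift-monitoring").Get(
		context.TODO(), "node-exporter", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	st := ds.Status
	if st.NumberUnavailable > 0 || st.NumberReady < st.DesiredNumberScheduled {
		fmt.Printf("daemonset %s/%s looks stuck: desired=%d updated=%d ready=%d unavailable=%d\n",
			ds.Namespace, ds.Name,
			st.DesiredNumberScheduled, st.UpdatedNumberScheduled,
			st.NumberReady, st.NumberUnavailable)
		return
	}
	fmt.Printf("daemonset %s/%s is available\n", ds.Namespace, ds.Name)
}
```

Against the status captured above it would report desired=6, updated=6, ready=0, unavailable=6.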
So I was just mixing up Deployment vs. DaemonSet. The CVO does not manage the DaemonSet openshift-monitoring/node-exporter, and it's up to the monitoring operator to explain what's going on with those unavailable pods:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring" and (.metadata.name | startswith("node-exporter-"))).status | select(.phase != "Running").conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort | uniq
2020-06-01T15:36:49Z PodScheduled=False Unschedulable: 0/6 nodes are available: 6 node(s) didn't have free ports for the requested pod ports.

Which it is doing (ish):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "monitoring").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-06-01T15:41:33Z Available=False :
2020-06-01T15:46:42Z Degraded=True UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 0, unavailable: 6)
2020-06-01T16:38:19Z Progressing=True RollOutInProgress: Rolling out the stack.
2020-06-01T16:38:19Z Upgradeable=True RollOutInProgress: Rollout of the monitoring stack is in progress. Please wait until it finishes.

So I think the only bug here is the underlying "DaemonSet controller should not get hung up" issue from bug 1843319, bug 1790989, and possibly other places, which might be fixed by the upstream PR David linked in comment 0. Sorry for the noise.

Confirmed with payload 4.5.0-0.nightly-2020-06-11-183238 that the issue has been fixed. Open two terminals; in the first, delete the pod, and in the second, scale down the deployment. Checking the deployment afterwards, no new pod is created and the status is updated:

[root@dhcp-140-138 ~]# oc delete po/ruby-ex-557d5865d4-vccl7
pod "ruby-ex-557d5865d4-vccl7" deleted
[root@dhcp-140-138 ~]# oc scale deploy/ruby-ex --replicas=1
deployment.apps/ruby-ex scaled
[root@dhcp-140-138 ~]# oc get po
NAME                       READY   STATUS        RESTARTS   AGE
ruby-ex-1-build            0/1     Completed     0          21m
ruby-ex-557d5865d4-vccl7   1/1     Terminating   0          20m
ruby-ex-557d5865d4-x7jgk   1/1     Running       0          9m5s
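The Doc Text at the top summarizes the fix as reacting properly to NotFound, which indicates that the pod was already removed by a different actor; the verification above exercises exactly that race by deleting the pod by hand just before scaling down. Below is a rough Go sketch of that pattern, assuming client-go. It is not the actual kube-controller-manager patch, and the deletePod helper and its onDeleted hook are hypothetical names used only for illustration.

```go
// Rough sketch of the pattern described in the Doc Text, not the actual
// kube-controller-manager patch: treat NotFound on pod deletion as "already
// removed by a different actor" instead of swallowing it.
package controllerutil

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePod is a hypothetical helper a workload controller might call when
// scaling down. onDeleted stands in for whatever bookkeeping the controller
// needs (for example, lowering its delete expectations).
func deletePod(ctx context.Context, client kubernetes.Interface, namespace, name string, onDeleted func()) error {
	err := client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{})
	switch {
	case err == nil:
		// Normal case: the deletion succeeded and the watch will report it.
	case apierrors.IsNotFound(err):
		// The pod was already removed by a different actor (node, user, or
		// another controller). Per the Doc Text, swallowing this case left the
		// controller unaware of the missing pod, so treat it as a successful
		// deletion instead.
	default:
		return fmt.Errorf("deleting pod %s/%s: %w", namespace, name, err)
	}
	// Record the deletion either way so the rollout status can make progress.
	onDeleted()
	return nil
}
```

The important part is that the NotFound branch is accounted for rather than dropped, so the controller's status keeps converging even when another actor wins the race.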
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475