Bug 1843187

Summary: daemonset, deployment, and replicaset status can permafail
Product: OpenShift Container Platform
Reporter: David Eads <deads>
Component: kube-controller-manager
Assignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA
QA Contact: zhou ying <yinzhou>
Severity: high
Priority: high
Docs Contact:
Version: 4.5
CC: agarcial, aos-bugs, bparees, mfojtik, wking
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In certain cases a NotFound error was swallowed by controller logic. Consequence: The missing NotFound event left the controller unaware that pods had already been removed. Fix: Properly react to NotFound events, which indicate that the pod was already removed by a different actor. Result: Controllers (deployment, daemonset, replicaset, and others) now react properly to pod NotFound events.
Story Points: ---
Clone Of:
Cloned As: 1843462 (view as bug list)
Environment: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
Last Closed: 2020-10-27 16:04:12 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1843462    

Description David Eads 2020-06-02 18:51:47 UTC
When pod expectations are not met, status for workloads can wedge. When status for workloads wedges, operators wait indefinitely. When operators wait indefinitely, status is wrong. When status is wrong, upgrades can fail.

Cherry-picking https://github.com/kubernetes/kubernetes/pull/91008 seems like a fix.
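
For illustration only, here is a minimal, self-contained Go sketch of the "pod expectations" pattern described above. The expectations type and its expectDeletions/deletionObserved/satisfied helpers are simplified stand-ins, not the real kube-controller-manager code or the upstream PR; the point is only to show why a swallowed NotFound/delete signal wedges workload status.

// Minimal sketch, not actual kube-controller-manager code: a simplified
// stand-in for the controller "expectations" bookkeeping.
package main

import "fmt"

// expectations tracks pod deletions the controller has issued but has not yet
// seen confirmed by an informer event.
type expectations struct {
    pendingDeletions int
}

// expectDeletions records that the controller just issued n pod deletions.
func (e *expectations) expectDeletions(n int) { e.pendingDeletions += n }

// deletionObserved must be called for every signal that a pod is gone,
// including a NotFound response (the pod was already removed by another
// actor). Dropping that signal is the failure mode this bug describes.
func (e *expectations) deletionObserved() {
    if e.pendingDeletions > 0 {
        e.pendingDeletions--
    }
}

// satisfied gates syncs: while it is false, the controller skips status
// recomputation, so ReplicaSet/Deployment/DaemonSet status never updates.
func (e *expectations) satisfied() bool { return e.pendingDeletions == 0 }

func main() {
    e := &expectations{}
    e.expectDeletions(2)

    e.deletionObserved() // one deletion confirmed by a normal delete event
    e.deletionObserved() // the other pod was already gone; treating NotFound
                         // as an observed deletion is what keeps this from
                         // wedging forever

    fmt.Println("expectations satisfied:", e.satisfied()) // true
}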

Comment 1 Maciej Szulik 2020-06-03 10:54:58 UTC
*** Bug 1843319 has been marked as a duplicate of this bug. ***

Comment 2 Ben Parees 2020-06-03 17:33:06 UTC
This has been identified as at least one cause of failures seen in "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]", where the upgrade completely fails.

(see discussion in https://bugzilla.redhat.com/show_bug.cgi?id=1843319)

Comment 3 Ben Parees 2020-06-03 17:33:46 UTC
Moving this back to 4.5: we are seeing upgrade failures in 4.5, and this bug was duped against the 4.5 bug that was filed for them.

Comment 4 W. Trevor King 2020-06-03 18:37:05 UTC
Poking at [1] from [2], here's the CVO being apparently happy with the monitoring Deployment:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/pods/openshift-cluster-version_cluster-version-operator-cb7cf5b8c-5xhm6_cluster-version-operator.log | grep 'Running sync .*in state\|Result of work\|deployment.*openshift-monitoring' | tail 
I0601 16:25:43.966583       1 task_graph.go:596] Result of work: [Cluster operator monitoring is still updating]
I0601 16:28:58.732465       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-05-29-203954 (force=true) on generation 2 in state Updating at attempt 7
I0601 16:28:58.866011       1 task_graph.go:596] Result of work: []
I0601 16:29:16.488091       1 sync_worker.go:653] Running sync for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)
I0601 16:29:18.889119       1 sync_worker.go:666] Done syncing for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)
I0601 16:34:43.784268       1 task_graph.go:596] Result of work: [Cluster operator monitoring is still updating]
I0601 16:37:51.316881       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-05-29-203954 (force=true) on generation 2 in state Updating at attempt 8
I0601 16:37:51.459038       1 task_graph.go:596] Result of work: []
I0601 16:38:08.974398       1 sync_worker.go:653] Running sync for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)
I0601 16:38:11.373663       1 sync_worker.go:666] Done syncing for deployment "openshift-monitoring/cluster-monitoring-operator" (349 of 584)

despite that Deployment being obviously sad:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/daemonsets.json | zcat | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring").status'
{
  "currentNumberScheduled": 6,
  "desiredNumberScheduled": 6,
  "numberMisscheduled": 0,
  "numberReady": 0,
  "numberUnavailable": 6,
  "observedGeneration": 1,
  "updatedNumberScheduled": 6
}

So there is room for improving the CVO's Deployment logic to account for numberUnavailable and complain about the stuck Deployment, although the current docs [3] already claim "There are no unavailable replicas" is checked; I'll poke around. But this reporting would just improve the current:

  Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is still updating

to talk about the error message; it would not unstick the Deployment.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56
[3]: https://github.com/openshift/cluster-version-operator/blob/cb8241a33846546ab63ec8c01112f633b6980182/docs/user/reconciliation.md#deployment
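
As a rough illustration of the extra check suggested above (not the CVO's actual reconciliation logic), here is a minimal Go sketch assuming k8s.io/api/apps/v1 is available as a dependency. Since the jq output above actually comes from daemonsets.json, the sketch checks DaemonSetStatus fields; daemonSetLooksHealthy is a hypothetical helper, and the numbers in main mirror the status dumped earlier in this comment.

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
)

// daemonSetLooksHealthy is a hypothetical helper sketching the kind of check
// discussed above; the CVO's real logic is described in the docs linked in [3].
func daemonSetLooksHealthy(ds *appsv1.DaemonSet) error {
    s := ds.Status
    if s.ObservedGeneration < ds.Generation {
        return fmt.Errorf("status is stale: observedGeneration %d < generation %d", s.ObservedGeneration, ds.Generation)
    }
    if s.UpdatedNumberScheduled < s.DesiredNumberScheduled {
        return fmt.Errorf("rollout incomplete: %d/%d pods updated", s.UpdatedNumberScheduled, s.DesiredNumberScheduled)
    }
    if s.NumberUnavailable > 0 {
        return fmt.Errorf("%d pods unavailable", s.NumberUnavailable)
    }
    return nil
}

func main() {
    ds := &appsv1.DaemonSet{}
    ds.Generation = 1
    ds.Status = appsv1.DaemonSetStatus{
        CurrentNumberScheduled: 6,
        DesiredNumberScheduled: 6,
        NumberReady:            0,
        NumberUnavailable:      6,
        ObservedGeneration:     1,
        UpdatedNumberScheduled: 6,
    }
    fmt.Println(daemonSetLooksHealthy(ds)) // reports 6 pods unavailable
}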

Comment 5 W. Trevor King 2020-06-03 22:46:15 UTC
So I was just mixing up Deployment vs. DaemonSet.  The CVO does not manage DaemonSet openshift-monitoring/node-exporter, and it's up to the monitoring operator to explain what's going on with those unavailable pods:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring" and (.metadata.name | startswith("node-exporter-"))).status | select(.phase != "Running").conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort | uniq
2020-06-01T15:36:49Z PodScheduled=False Unschedulable: 0/6 nodes are available: 6 node(s) didn't have free ports for the requested pod ports.

Which it is doing (ish):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56/artifacts/e2e-azure-upgrade/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "monitoring").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-06-01T15:41:33Z Available=False : 
2020-06-01T15:46:42Z Degraded=True UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 0, unavailable: 6)
2020-06-01T16:38:19Z Progressing=True RollOutInProgress: Rolling out the stack.
2020-06-01T16:38:19Z Upgradeable=True RollOutInProgress: Rollout of the monitoring stack is in progress. Please wait until it finishes.

So I think the only bug here is the underlying "DaemonSet controller should not get hung up" from bug 1843319, bug 1790989, and possibly other places, which might be fixed by the upstream PR David linked in comment 0.  Sorry for the noise.

Comment 10 zhou ying 2020-06-15 07:29:47 UTC
Confirmed with payload 4.5.0-0.nightly-2020-06-11-183238; the issue has been fixed:

Opened two terminals; in the first, deleted the pod, and in the second, scaled down the deployment. Checking the deployment afterwards, no replacement pod was created and the status updated correctly.
[root@dhcp-140-138 ~]# oc delete po/ruby-ex-557d5865d4-vccl7 
pod "ruby-ex-557d5865d4-vccl7" deleted


[root@dhcp-140-138 ~]# oc scale deploy/ruby-ex --replicas=1
deployment.apps/ruby-ex scaled
[root@dhcp-140-138 ~]# oc get po 
NAME                       READY   STATUS        RESTARTS   AGE
ruby-ex-1-build            0/1     Completed     0          21m
ruby-ex-557d5865d4-vccl7   1/1     Terminating   0          20m
ruby-ex-557d5865d4-x7jgk   1/1     Running       0          9m5s

Comment 12 errata-xmlrpc 2020-10-27 16:04:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 13 W. Trevor King 2021-04-05 17:46:57 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475