Bug 1888041

Summary:	non-terminating pods are going from running to pending
Product:	OpenShift Container Platform	Reporter:	David Eads <deads>
Component:	Node	Assignee:	Ryan Phillips <rphillips>
Node sub component:	Kubelet	QA Contact:	Sunil Choudhary <schoudha>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	aos-bugs, jokerman
Version:	4.6
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The Kubelet did not handle transitions properly when statuses were missing. Consequence: This caused terminated pods to sometimes not get restarted. Fix: Adds a ContainerStatus of failed to allow the container to be restarted (if need be). Result: Kubelet pod handling does not result in an invalid state transition.	Story Points:	---
Clone Of:
Clones:	1960291		Environment:
Last Closed:	2021-02-24 15:25:41 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1888847, 1960291

Description David Eads 2020-10-13 20:54:25 UTC

Terminated pods are no longer going from Running to Pending.
Static pods with the same UID are going from Running to Pending, and we're hopeful the UID workaround will help with that.

We are now seeing another class of Running-to-Pending failure: some non-terminated pods are moving back to Pending. If it would help to have a separate test that fails specifically on these, I can show you where to add it so they are easy to spot.

14 pods illegally transitioned to Pending

ns/openshift-cluster-node-tuning-operator pod/tuned-fvppk node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-image-registry pod/node-ca-8stv8 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-sdn pod/sdn-jqr2l node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-dns pod/dns-default-z5pj9 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-multus pod/multus-jkrv4 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-multus pod/network-metrics-daemon-t5kx5 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-sdn pod/ovs-h9n6p node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-machine-config-operator pod/machine-config-daemon-6fkpx node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-monitoring pod/node-exporter-b6m9s node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-statefulset-144 pod/ss-0 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-projected-8536 pod/pod-projected-secrets-2618d264-db5d-4371-9c4b-761e9618305a node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-csi-mock-volumes-4609-8746 pod/csi-mockplugin-resizer-0 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-csi-mock-volumes-4609-8746 pod/csi-mockplugin-0 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-csi-mock-volumes-4609-8746 pod/csi-mockplugin-attacher-0 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending


from https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1314781668476719104
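A minimal sketch of the kind of check that produces the list above, assuming the test sees pod observations as (namespace, name, UID, phase) tuples in time order. The function names and the input shape are hypothetical and are not taken from the actual origin test; the phase strings are the standard Kubernetes pod phases.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Standard Kubernetes pod phase names (status.phase).
PENDING = "Pending"

# Hypothetical observation shape: (namespace, pod name, pod UID, phase).
Observation = Tuple[str, str, str, str]


def illegal_transition(old_phase: str, new_phase: str) -> Optional[str]:
    """Return a message if a pod left Pending earlier and is now Pending again."""
    if new_phase == PENDING and old_phase != PENDING:
        return "pod moved back to Pending"
    return None


def check_observations(observations: Iterable[Observation]) -> List[str]:
    """Collect one failure line per illegal transition, in observation order."""
    last_phase: Dict[Tuple[str, str, str], str] = {}
    failures: List[str] = []
    for ns, name, uid, phase in observations:
        # Key on the UID so a recreated pod with the same name is not flagged.
        key = (ns, name, uid)
        old = last_phase.get(key)
        if old is not None:
            msg = illegal_transition(old, phase)
            if msg:
                failures.append(f"ns/{ns} pod/{name} - {msg}")
        last_phase[key] = phase
    return failures
```

Keying on the UID is a design choice in this sketch: a pod that is deleted and recreated under the same name gets a new UID and legitimately starts in Pending, so only transitions within one UID are treated as illegal.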

Comment 1 Ryan Phillips 2020-10-14 16:16:51 UTC

The Node team had a Slack conversation with David, and he suggested this patch: https://github.com/kubernetes/kubernetes/pull/95561

We will test this. If the patch works, we will target a 4.6.z backport.

The issue stems from reboots and the way CRI-O wipes the container statuses upon reboot.
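To illustrate the idea described in the Doc Text field (adding a failed ContainerStatus when the status is missing), here is a rough sketch. It is not the kubelet code or the linked patch: the `ContainerStatus` class, the `merge_statuses` function, and the reason and exit code values are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ContainerStatus:
    """Hypothetical, simplified stand-in for a container status record."""
    name: str
    state: str                       # "waiting", "running", or "terminated"
    reason: str = ""
    exit_code: Optional[int] = None


def merge_statuses(
    previous: List[ContainerStatus],
    runtime: Dict[str, ContainerStatus],
) -> List[ContainerStatus]:
    """Combine the last reported statuses with what the runtime reports now."""
    merged: List[ContainerStatus] = []
    for old in previous:
        current = runtime.get(old.name)
        if current is not None:
            # The runtime still knows the container: trust its status.
            merged.append(current)
        elif old.state == "running":
            # The runtime lost the record (for example after a reboot).
            # Report the container as terminated with a failure so it can be
            # restarted, instead of letting it fall back to a waiting state.
            # The reason and exit code here are illustrative assumptions.
            merged.append(
                ContainerStatus(
                    name=old.name,
                    state="terminated",
                    reason="ContainerStatusUnknown",
                    exit_code=137,
                )
            )
        else:
            # Nothing new is known: keep the previously reported status.
            merged.append(old)
    return merged
```

The point of the sketch is the middle branch: a container that was last seen running and then disappears from the runtime is recorded as failed rather than as never started, which avoids reporting a state that looks like the pod went back to Pending.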

Comment 3 David Eads 2020-10-27 12:58:53 UTC

The fix was in master when https://bugzilla.redhat.com/show_bug.cgi?id=1884035#c15 was verified and the same check applies to both.  There were three fixes involved and the test verified them together.

Comment 6 errata-xmlrpc 2021-02-24 15:25:41 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633