1888041 – non-terminating pods are going from running to pending

Summary: non-terminating pods are going from running to pending

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Ryan Phillips
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1888847 1960291

Reported:	2020-10-13 20:54 UTC by David Eads
Modified:	2021-05-13 14:36 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The Kubelet did not handle transitions properly when statuses were missing. Consequence: This caused terminated pods to sometimes not get restarted. Fix: Adds a ContainerStatus of failed to allow the container to be restarted (if need be). Result: Kubelet pod handling does not result in an invalid state transition.
Clone Of:
Clones:	1960291
Environment:
Last Closed:	2021-02-24 15:25:41 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift kubernetes pull 411	0	None	closed	Bug 1888041: UPSTREAM: 95561: kubelet container status calculation doesn't handle suddenly missing data properly	2021-01-05 17:35:13 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:26:09 UTC

Description David Eads 2020-10-13 20:54:25 UTC

Terminated pods are no longer going from running to pending.
Static pods with the same UID are going from running to pending, and we're hopeful the UID workaround will help with that.

We are now seeing another class of running-to-pending failure. Some non-terminated pods are failing. If it would help to have a separate test that fails specifically on these, I can show you where to add one so they are easy to spot.

14 pods illegally transitioned to Pending

ns/openshift-cluster-node-tuning-operator pod/tuned-fvppk node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-image-registry pod/node-ca-8stv8 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-sdn pod/sdn-jqr2l node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-dns pod/dns-default-z5pj9 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-multus pod/multus-jkrv4 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-multus pod/network-metrics-daemon-t5kx5 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-sdn pod/ovs-h9n6p node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-machine-config-operator pod/machine-config-daemon-6fkpx node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/openshift-monitoring pod/node-exporter-b6m9s node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-statefulset-144 pod/ss-0 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-projected-8536 pod/pod-projected-secrets-2618d264-db5d-4371-9c4b-761e9618305a node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-csi-mock-volumes-4609-8746 pod/csi-mockplugin-resizer-0 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-csi-mock-volumes-4609-8746 pod/csi-mockplugin-0 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending
ns/e2e-csi-mock-volumes-4609-8746 pod/csi-mockplugin-attacher-0 node/ci-op-ym39gmpg-9c4c5-v4td6-worker-westus-fzjx9 - pod moved back to Pending


from https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1314781668476719104
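
For anyone who wants to reproduce the check behind the list above, here is a minimal sketch in Go of how such a test could flag these pods. It is not the actual CI monitor code: the `Observation` type and the function names are hypothetical, and pods are keyed by namespace/name for simplicity, whereas a real check would probably key on pod UID.

```go
package main

import "fmt"

// Phase mirrors the pod phase strings used by Kubernetes.
type Phase string

const (
	PhasePending Phase = "Pending"
	PhaseRunning Phase = "Running"
)

// Observation is a hypothetical record of one observed pod status update.
type Observation struct {
	Namespace string
	Pod       string
	Node      string
	Phase     Phase
}

// movedBackToPending reports whether a pod went from Running to Pending,
// which is the transition this bug treats as illegal.
func movedBackToPending(prev, next Phase) bool {
	return prev == PhaseRunning && next == PhasePending
}

// findIllegalTransitions walks the observations in order and returns one
// message per Running -> Pending transition it sees.
func findIllegalTransitions(obs []Observation) []string {
	last := map[string]Phase{} // last phase seen per namespace/name (assumption: no UID tracking)
	var out []string
	for _, o := range obs {
		key := o.Namespace + "/" + o.Pod
		if prev, ok := last[key]; ok && movedBackToPending(prev, o.Phase) {
			out = append(out, fmt.Sprintf("ns/%s pod/%s node/%s - pod moved back to Pending",
				o.Namespace, o.Pod, o.Node))
		}
		last[key] = o.Phase
	}
	return out
}

func main() {
	// Made-up observations for illustration only.
	obs := []Observation{
		{Namespace: "example-ns", Pod: "example-pod", Node: "worker-0", Phase: PhasePending},
		{Namespace: "example-ns", Pod: "example-pod", Node: "worker-0", Phase: PhaseRunning},
		{Namespace: "example-ns", Pod: "example-pod", Node: "worker-0", Phase: PhasePending},
	}
	for _, msg := range findIllegalTransitions(obs) {
		fmt.Println(msg)
	}
}
```

The first Pending to Running step is legal and produces nothing; only the later step back to Pending is reported.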

Comment 1 Ryan Phillips 2020-10-14 16:16:51 UTC

The Node team had a Slack conversation with David, and he suggested this patch: https://github.com/kubernetes/kubernetes/pull/95561

We will test it. If the patch works, we will target a 4.6.z backport.

The issue stems from reboots and the way CRI-O wipes the container statuses upon reboot.
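
To make the failure mode concrete, here is a rough sketch in Go of the idea described in the Doc Text for this bug: when a container that was previously running has no status from the runtime, report it as terminated so it can be restarted, rather than letting it appear as if it had never started. This is not the kubelet code from the upstream patch. The types, the `reconcile` function, and the reason string are simplified, hypothetical stand-ins.

```go
package main

import "fmt"

// State is a simplified container state.
type State string

const (
	StateWaiting    State = "Waiting"
	StateRunning    State = "Running"
	StateTerminated State = "Terminated"
)

// ContainerStatus is a hypothetical, cut-down stand-in for an API container status.
type ContainerStatus struct {
	Name   string
	State  State
	Reason string
}

// reconcile computes the next reported status for each previously known
// container, given what the runtime currently reports (keyed by container name).
func reconcile(previous []ContainerStatus, runtime map[string]ContainerStatus) []ContainerStatus {
	result := make([]ContainerStatus, 0, len(previous))
	for _, old := range previous {
		if cur, ok := runtime[old.Name]; ok {
			// The runtime still knows this container: trust its status.
			result = append(result, cur)
			continue
		}
		if old.State == StateRunning {
			// The runtime lost the record (for example, statuses wiped on reboot).
			// Report a terminated status instead of falling back to Waiting.
			// The reason string here is illustrative only.
			result = append(result, ContainerStatus{
				Name:   old.Name,
				State:  StateTerminated,
				Reason: "ContainerStatusUnknown",
			})
			continue
		}
		// Nothing new is known: keep the previous status unchanged.
		result = append(result, old)
	}
	return result
}

func main() {
	previous := []ContainerStatus{{Name: "app", State: StateRunning}}
	afterReboot := map[string]ContainerStatus{} // runtime reports nothing
	for _, s := range reconcile(previous, afterReboot) {
		fmt.Printf("%s: %s (%s)\n", s.Name, s.State, s.Reason)
	}
}
```

In this sketch the previously running container comes back as Terminated after the simulated reboot, which lets the restart policy decide what happens next.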

Comment 3 David Eads 2020-10-27 12:58:53 UTC

The fix was already in master when https://bugzilla.redhat.com/show_bug.cgi?id=1884035#c15 was verified, and the same check applies to both bugs. Three fixes were involved, and the test verified them together.

Comment 6 errata-xmlrpc 2021-02-24 15:25:41 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
