Bug 1952224 - Some quickly deleted pods are never cleaned up by kubelet after 20m
Summary: Some quickly deleted pods are never cleaned up by kubelet after 20m
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.9.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
Padraig O'Grady
: 1928946 1929463 1940916 1948920 1952655 1959495 1960772 1965059 2005347 2010157 2033278 (view as bug list)
Depends On:
Blocks: 1723620 1932097 1946306
TreeView+ depends on / blocked
Reported: 2021-04-21 19:09 UTC by Clayton Coleman
Modified: 2022-07-29 11:37 UTC (History)
30 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Pod lifecycle logic within the Kubelet has been improved. Consequence: Fix: The Kubelet behaves better when static and regular pods are quickly added and deleted. Result:
Clone Of:
[sig-node] Pods Extended Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]
Last Closed: 2021-10-18 17:30:04 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 6331221 0 None None None 2021-09-15 07:42:35 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:30:35 UTC

Description Clayton Coleman 2021-04-21 19:09:15 UTC

The "create and then quickly delete" e2e test is triggering what looks like a failure to cleanup in the kubelet.  About 10-20% of the time this test fails to delete a pod within 1000-1300s (20m or so). This could impact availability of upgrades across the fleet and is also a source of flakes.  Setting to high due to potential impact to users.


[36mINFO[0m[2021-04-21T16:13:36Z] Apr 21 15:49:33.766 - 1300s W ns/e2e-pods-2306 pod/pod-submit-status-1-5 node/ip-10-0-148-7.ec2.internal pod has been pending longer than a minute

Similar in symptom to previous errors we have fixed around cleanup.

Comment 1 Ryan Phillips 2021-04-24 01:08:52 UTC
Openshift PR: https://github.com/openshift/kubernetes/pull/703
Upstream PR: https://github.com/kubernetes/kubernetes/pull/101376

Testing on Openshift CI looks extremely promising.

Comment 2 Ryan Phillips 2021-04-24 01:12:34 UTC
*** Bug 1952655 has been marked as a duplicate of this bug. ***

Comment 4 Ryan Phillips 2021-04-30 15:38:09 UTC
*** Bug 1940916 has been marked as a duplicate of this bug. ***

Comment 7 Lucas López Montero 2021-05-04 06:48:21 UTC
Is this the same problem as https://github.com/kubernetes/kubernetes/issues/97497? Otherwise, I would need to create a new bug for that one.

Comment 8 Elana Hashman 2021-05-04 22:58:29 UTC
Lucas, I believe these are the same issue.

Comment 9 Elana Hashman 2021-05-04 22:59:55 UTC
*** Bug 1929463 has been marked as a duplicate of this bug. ***

Comment 10 Lucas López Montero 2021-05-05 06:11:45 UTC
Thank you very much, Elana.

Comment 12 Sunil Choudhary 2021-05-07 07:46:40 UTC
Hi Ryan, I still see failures in CI for 4.8, could you take a look?


periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp (all) - 17 runs, 76% failed, 23% of failures match = 18% impact
#1390544939795877888	build-log.txt.gz	37 minutes ago	

[36mINFO[0m[2021-05-07T06:52:42Z] May 07 06:33:45.393 - 1018s W ns/e2e-pods-7774 pod/pod-submit-status-1-14 node/ci-op-7g0bhm8t-4bd89-7q7ps-worker-c-lb2sw pod has been pending longer than a minute

#1390489133008293888	build-log.txt.gz	4 hours ago	

[36mINFO[0m[2021-05-07T03:20:04Z] May 07 03:07:01.931 - 408s  W ns/e2e-pods-6056 pod/pod-submit-status-2-1 node/ci-op-gfg20tr9-4bd89-czp5j-worker-d-2t9qn pod has been pending longer than a minute
[36mINFO[0m[2021-05-07T03:20:04Z] May 07 03:07:08.931 - 401s  W ns/e2e-pods-6056 pod/pod-submit-status-0-3 node/ci-op-gfg20tr9-4bd89-czp5j-worker-b-k977w pod has been pending longer than a minute

#1389875576445079552	build-log.txt.gz	44 hours ago	

[36mINFO[0m[2021-05-05T11:01:10Z] May 05 10:29:49.926 - 1503s W ns/e2e-pods-700 pod/pod-submit-status-2-6 node/ci-op-1lw327fx-4bd89-fc9ft-worker-b-bh55f pod has been pending longer than a minute

Comment 17 Ryan Phillips 2021-06-01 18:21:04 UTC
*** Bug 1959495 has been marked as a duplicate of this bug. ***

Comment 18 Elana Hashman 2021-06-02 22:35:48 UTC
*** Bug 1960772 has been marked as a duplicate of this bug. ***

Comment 20 Elana Hashman 2021-06-09 17:11:37 UTC
Clayton is working on a refactor of the pod lifecycle in https://github.com/kubernetes/kubernetes/pull/102344

A public overview doc is available here: https://docs.google.com/document/d/1DvAmqp9CV8i4_zYvNdDWZW-V0FMk1NbZ8BAOvLRD3Ds/edit#

Comment 21 W. Trevor King 2021-06-12 20:39:39 UTC
Cropped up again in 4.8->4.8 update CI [1]:

  : operator conditions kube-apiserver	0s
    Operator degraded (InstallerPodContainerWaiting_ContainerCreating): InstallerPodContainerWaitingDegraded: Pod "installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1" on node "ci-op-cc8dg34d-17f95-gnzhj-master-1" container "installer" is waiting since 2021-06-12 02:52:06 +0000 UTC because ContainerCreating

  : [sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]	1s
    fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun 12 03:03:20.612: Some cluster operators are not ready: kube-apiserver (Degraded=True InstallerPodContainerWaiting_ContainerCreating: InstallerPodContainerWaitingDegraded: Pod "installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1" on node "ci-op-cc8dg34d-17f95-gnzhj-master-1" container "installer" is waiting since 2021-06-12 02:52:06 +0000 UTC because ContainerCreating)

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	2h33m49s
    Jun 12 05:37:29.528: Unexpected alerts fired or pending during the upgrade:

    alert ClusterOperatorDegraded fired for 6180 seconds with labels: {condition="Degraded", endpoint="metrics", instance="", job="cluster-version-operator", name="kube-apiserver", namespace="openshift-cluster-version", pod="cluster-version-operator-66d7977489-rt7v4", reason="InstallerPodContainerWaiting_ContainerCreating", service="cluster-version-operator", severity="warning"}
    alert KubeContainerWaiting fired for 6313 seconds with labels: {container="installer", namespace="openshift-kube-apiserver", pod="installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1", severity="warning"}
alert KubePodNotReady fired for 9013 seconds with labels: {namespace="openshift-kube-apiserver", pod="installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1", severity="warning"}


  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1403539606200127488/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/nodes/ci-op-cc8dg34d-17f95-gnzhj-master-1/journal | gunzip | grep installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1 | tail -n2
  Jun 12 05:41:30.689984 ci-op-cc8dg34d-17f95-gnzhj-master-1 hyperkube[1583]: E0612 05:41:30.689927    1583 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/openshift-kube-apiserver_installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1_e88fc0d3-90bd-4927-a65d-e3b6aceb3264: no such file or directory" pod="openshift-kube-apiserver/installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1"
  Jun 12 05:41:38.197389 ci-op-cc8dg34d-17f95-gnzhj-master-1 hyperkube[1583]: E0612 05:41:38.197338    1583 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/openshift-kube-apiserver_installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1_e88fc0d3-90bd-4927-a65d-e3b6aceb3264: no such file or directory" pod="openshift-kube-apiserver/installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1"
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1403539606200127488/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/nodes/ci-op-cc8dg34d-17f95-gnzhj-master-1/journal | gunzip | grep -c 'Unable to fetch pod log stats.*installer-8-ci-op-cc8dg34d-17f95-gnzhj-master-1'          

While we're working on a fix to avoid wedging on this, are there mitigation recommendations?  Is it just "delete the stuck pod, let the controller who created that pod create a replacement, and hope the replacement doesn't get stuck"?  Do we expect a need to force the deletion?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1403539606200127488

Comment 26 Harshal Patil 2021-06-28 07:13:33 UTC
*** Bug 1948920 has been marked as a duplicate of this bug. ***

Comment 31 Alexander Chuzhoy 2021-07-19 21:51:06 UTC
Reported https://bugzilla.redhat.com/show_bug.cgi?id=1920700 which was closed as duplicate of this bug.

Deployed ~1000 snos and counted the appearance of the issue mentioned in the bug - it's approximately 17% reproduction rate.

Comment 35 Ryan Phillips 2021-08-06 13:11:55 UTC
*** Bug 1965059 has been marked as a duplicate of this bug. ***

Comment 37 Ryan Phillips 2021-08-12 19:35:21 UTC
rebased into 4.9 : https://github.com/kubernetes/kubernetes/pull/102344

QE: Please use a release after 4.9.0-0.nightly-2021-08-12-185332 to test

Comment 38 Ryan Phillips 2021-08-12 19:35:44 UTC
*** Bug 1928946 has been marked as a duplicate of this bug. ***

Comment 48 Michael Burke 2021-09-30 21:42:05 UTC
Clayton --

Can you take a look at my proposed release note for this BZ? I saw your doc text and made a few changes to match our style. I want to make sure i didn't change the meaning. Thank you in advance.


* Previously, if pods were deleted quickly after creation, the kubelet might not clean up the pods properly. This resulted in pods being stuck in terminating state could impact availability of upgrades. This fix improves the pod lifecycle logic to avoid this problem.

Comment 50 Jan Chaloupka 2021-10-14 10:14:06 UTC
*** Bug 2005347 has been marked as a duplicate of this bug. ***

Comment 51 Gabe Montero 2021-10-14 18:30:45 UTC
*** Bug 2010157 has been marked as a duplicate of this bug. ***

Comment 52 Masaki Hatada 2021-10-15 01:15:04 UTC
Dear Red Hat,

It seems that the target version of this fix is 4.9, but isn't there a plan to backport this fix to previous versions?
Our customer(4.6) have faced this issue multiple times.

Quickly creation/deletion of pod is caused by automatic rolling update of openshift-controller-manager-operator as we reported at Bug 2010157.
So any user can face this issue.

Best Regards,
Masaki Hatada

Comment 54 errata-xmlrpc 2021-10-18 17:30:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 56 W. Trevor King 2021-10-19 22:57:22 UTC
This nominally shipped with 4.9.0, but it links no PRs.  My reading of the comments, especially comment 37, is that [1] is the relevant upstream PR.  That made it into OpenShift 4.9 and later via [2] and bug 1986003.  Associated 4.8.z backport is bug 1994457, which shipped with 4.8.13, but that does not include a "Keep pod worker running until pod is truly complete" commit.  And I don't see any upstream attempts to pick [1] back into 1.21.z.

[1]: https://github.com/kubernetes/kubernetes/pull/102344
[2]: https://github.com/openshift/kubernetes/pull/862/commits/3eadd1a9ead7a009a9abfbd603a5efd0560473cc

Comment 57 Elana Hashman 2021-11-01 20:35:32 UTC
Michael, your release note looks good to me.

This fix cannot be backported to earlier versions of OpenShift, as it is a large refactor rather than a small, targeted fix. It is not eligible for backport in upstream Kubernetes as it does not meet upstream cherry-pick criteria---while it does fix an urgent bug, the fix is too large and there were too many required follow-up patches for it to be an appropriate backport to any earlier versions.

This merged into OpenShift as part of the rebase PR https://github.com/openshift/kubernetes/pull/862 as Trevor correctly identified.

Comment 58 Harshal Patil 2022-01-07 04:11:06 UTC
*** Bug 2033278 has been marked as a duplicate of this bug. ***

Comment 59 Emmanuel Kasper 2022-03-21 09:28:24 UTC
On OCP 4.7 / 4.8 how do I know if a problematic pod is hit by this bug ? (unfortunately the linked KB article does not help a lot)

I have a pod which has the status `Terminating` for a very long time (150 minutes)

$  oc get po tekton-pipelines-webhook-5ff9f7c9f7-jvj8c -o wide
NAME                                        READY   STATUS        RESTARTS   AGE    IP       NODE                                    NOMINATED NODE   READINESS GATES
tekton-pipelines-webhook-5ff9f7c9f7-jvj8c   0/1     Terminating   0          157m   <none>   example.com   <none>           <none

if I look at the pod details, I see the following  containerStatus:
        reason: ContainerCreating

looking at the cri-o log I see the container was scheduled for deletion just after being created:

Mar 04 13:34:03.826802  crio[1885]: time="2022-03-04 13:34:03.814217038Z" level=info msg="Running pod sandbox: openshift-pipelines/tekton-pipelines-webhook-5ff9f7c9f7-jvj8c/POD" id=9f0f1794-7d59-4f04-9209-ec066cd31fa2 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox 

and 6 seconds later:
Mar 04 13:34:09.533891 crio[1885]: time="2022-03-04 13:34:09.533835428Z" level=info msg="Deleting pod openshift-pipelines_tekton-pipelines-webhook-5ff9f7c9f7-jvj8c from CNI network \"multus-cni-network\" (type=multus)"
the tekton-pipelines-webhook pods have been created by the HPA on behalf of the pipeline operator

am I facing the bug which was closed by the errata or I am hitting something else ?

Comment 60 Thomas Jungblut 2022-04-07 11:24:22 UTC
Happened again on a GCP upgrade in 4.8 https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.8.0-0.ci/release/4.8.0-0.ci-2022-04-07-030644

[machine-config-operator-7688f6d6f7-dz4rs] 2022-04-07T06:06:55.367414879Z E0407 06:06:55.367274       1 sync.go:605] Error syncing Required MachineConfigPools: "error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)"
[machine-config-daemon-fwdqk machine-config-daemon] 2022-04-07T06:24:54.687800813Z I0407 06:24:54.687733  165813 drain.go:92] Draining failed with: error when waiting for pod "csi-snapshot-controller-64d9d77f8d-8p47c" terminating: global timeout reached: 1m30s, retrying

st_service_logs/masters/crio_service.log:Apr 07 04:29:22.573456 ci-op-t0cyizv6-3b3f8-vhql7-master-2 crio[1536]: time="2022-04-07 04:29:22.573381172Z" level=info msg="Got pod network &{Name:csi-snapshot-controller-64d9d77f8d-8p47c Namespace:openshift-cluster-storage-operator ID:6483bfcf39000349c6d00ec5ec3b750097fbf1449c7c00825885f9b33e177047 UID:fb45ce03-539a-42a0-8661-fedfa62091a2 NetNS:/var/run/netns/c8a48f34-a4a1-4de0-9afa-56b32c5e39a8 Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
host_service_logs/masters/crio_service.log:Apr 07 04:29:22.573749 ci-op-t0cyizv6-3b3f8-vhql7-master-2 crio[1536]: time="2022-04-07 04:29:22.573705732Z" level=info msg="Deleting pod openshift-cluster-storage-operator_csi-snapshot-controller-64d9d77f8d-8p47c from CNI network \"multus-cni-network\" (type=multus)"
host_service_logs/masters/crio_service.log:Apr 07 04:29:22.842850 ci-op-t0cyizv6-3b3f8-vhql7-master-2 crio[1536]: 2022-04-07T04:29:22Z [verbose] Del: openshift-cluster-storage-operator:csi-snapshot-controller-64d9d77f8d-8p47c:fb45ce03-539a-42a0-8661-fedfa62091a2:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
host_service_logs/masters/kubelet_service.log:Apr 07 04:29:18.451037 ci-op-t0cyizv6-3b3f8-vhql7-master-2 hyperkube[1568]: I0407 04:29:18.450980    1568 kubelet.go:1919] SyncLoop (ADD, "api"): "csi-snapshot-controller-64d9d77f8d-8p47c_openshift-cluster-storage-operator(fb45ce03-539a-42a0-8661-fedfa62091a2)"
host_service_logs/masters/kubelet_service.log:Apr 07 04:29:18.635177 ci-op-t0cyizv6-3b3f8-vhql7-master-2 hyperkube[1568]: I0407 04:29:18.635131    1568 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-64d9d77f8d-8p47c_openshift-cluster-storage-operator(fb45ce03-539a-42a0-8661-fedfa62091a2)"
host_service_logs/masters/kubelet_service.log:Apr 07 04:29:18.674528 ci-op-t0cyizv6-3b3f8-vhql7-master-2 hyperkube[1568]: I0407 04:29:18.674493    1568 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kube-api-access-9b9vg" (UniqueName: "kubernetes.io/projected/fb45ce03-539a-42a0-8661-fedfa62091a2-kube-api-access-9b9vg") pod "csi-snapshot-controller-64d9d77f8d-8p47c" (UID: "fb45ce03-539a-42a0-8661-fedfa62091a2")
host_service_logs/masters/kubelet_service.log:Apr 07 04:29:18.775573 ci-op-t0cyizv6-3b3f8-vhql7-master-2 hyperkube[1568]: I0407 04:29:18.775528    1568 reconciler.go:269] operationExecutor.MountVolume started for volume "kube-api-access-9b9vg" (UniqueName: "kubernetes.io/projected/fb45ce03-539a-42a0-8661-fedfa62091a2-kube-api-access-9b9vg") pod "csi-snapshot-controller-64d9d77f8d-8p47c" (UID: "fb45ce03-539a-42a0-8661-fedfa62091a2")
host_service_logs/masters/kubelet_service.log:Apr 07 04:29:18.881857 ci-op-t0cyizv6-3b3f8-vhql7-master-2 hyperkube[1568]: I0407 04:29:18.881773    1568 operation_generator.go:672] MountVolume.SetUp succeeded for volume "kube-api-access-9b9vg" (UniqueName: "kubernetes.io/projected/fb45ce03-539a-42a0-8661-fedfa62091a2-kube-api-access-9b9vg") pod "csi-snapshot-controller-64d9d77f8d-8p47c" (UID: "fb45ce03-539a-42a0-8661-fedfa62091a2")
host_service_logs/masters/kubelet_service.log:Apr 07 04:29:18.916674 ci-op-t0cyizv6-3b3f8-vhql7-master-2 hyperkube[1568]: I0407 04:29:18.916150    1568 kuberuntime_manager.go:443] No sandbox for pod "csi-snapshot-controller-64d9d77f8d-8p47c_openshift-cluster-storage-operator(fb45ce03-539a-42a0-8661-fedfa62091a2)" can be found. Need to start a new one
host_service_logs/masters/kubelet_service.log:Apr 07 04:29:20.854591 ci-op-t0cyizv6-3b3f8-vhql7-master-2 hyperkube[1568]: I0407 04:29:20.854528    1568 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-64d9d77f8d-8p47c_openshift-cluster-storage-operator(fb45ce03-539a-42a0-8661-fedfa62091a2)"
host_service_logs/masters/kubelet_service.log:Apr 07 04:29:22.924957 ci-op-t0cyizv6-3b3f8-vhql7-master-2 hyperkube[1568]: E0407 04:29:22.924778    1568 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "csi-snapshot-controller-64d9d77f8d-8p47c_openshift-cluster-storage-operator(fb45ce03-539a-42a0-8661-fedfa62091a2)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
host_service_logs/masters/kubelet_service.log:Apr 07 04:30:21.197225 ci-op-t0cyizv6-3b3f8-vhql7-master-2 hyperkube[1568]: E0407 04:30:21.197159    1568 fsHandler.go:114] failed to collect filesystem stats - rootDiskErr: could not stat "/var/lib/containers/storage/overlay/82bc8bd3a226caa35815f1890560e1873122b5f31e7a05f8fded2444c5bf153c/diff" to get inode usage: stat /var/lib/containers/storage/overlay/82bc8bd3a226caa35815f1890560e1873122b5f31e7a05f8fded2444c5bf153c/diff: no such file or directory, extraDiskErr: could not stat "/var/log/pods/openshift-cluster-storage-operator_csi-snapshot-controller-64d9d77f8d-8p47c_fb45ce03-539a-42a0-8661-fedfa62091a2/6483bfcf39000349c6d00ec5ec3b750097fbf1449c7c00825885f9b33e177047.log" to get inode usage: stat /var/log/pods/openshift-cluster-storage-operator_csi-snapshot-controller-64d9d77f8d-8p47c_fb45ce03-539a-42a0-8661-fedfa62091a2/6483bfcf39000349c6d00ec5ec3b750097fbf1449c7c00825885f9b33e177047.log: no such file or directory

Note You need to log in before you can comment on or make changes to this bug.