Bug 1948920
| Summary: | when deleting pod successfully the kubelet log show "Failed to delete pod" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | MinLi <minmli> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, ccoleman, harpatil, wking |
| Version: | 4.9 | Keywords: | Reopened |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | x86_64 | OS: | Linux |
| Last Closed: | 2021-06-28 07:13:35 UTC | Type: | Bug |
Description
MinLi
2021-04-13 02:56:47 UTC
Harshal, are you sure about NOTABUG? I agree that the kubelet's understanding of which pods are running locally is eventually consistent, but there are other side effects. For example, the pod discussed in bug 1971080 got removed at some point (possibly by CRI-O after the kubelet had already timed out its request). But because of this bug, we get no pod-deletion events:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-controller-manager" and (tostring | contains("revision-pruner-11-ip-10-0-150-164.ec2.internal"))) | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort
...
2021-06-11T18:58:38Z revision-pruner-11-ip-10-0-150-164.ec2.internal Created: Created container pruner
2021-06-11T18:58:38Z revision-pruner-11-ip-10-0-150-164.ec2.internal Started: Started container pruner
2021-06-11T19:08:16Z revision-pruner-11-ip-10-0-150-164.ec2.internal AddedInterface: Add eth0 [10.128.0.12/23] from openshift-sdn
2021-06-11T19:08:24Z revision-pruner-11-ip-10-0-150-164.ec2.internal Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-06-11-170229@sha256:a9f57c74b85a7cb361740a9c163b192303f94d89f36af847c27fd3a03a0df23a"
2021-06-11T19:08:52Z revision-pruner-11-ip-10-0-150-164.ec2.internal Pulled: Successfully pulled image "registry.ci.openshift.org/ocp/4.8-2021-06-11-170229@sha256:a9f57c74b85a7cb361740a9c163b192303f94d89f36af847c27fd3a03a0df23a" in 27.873680715s
2021-06-11T19:08:54Z revision-pruner-11-ip-10-0-150-164.ec2.internal Created: Created container pruner
2021-06-11T19:08:55Z revision-pruner-11-ip-10-0-150-164.ec2.internal Started: Started container pruner

Comparing with the kubelet logs mentioned in bug 1971080:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-150-164.ec2.internal/journal | gunzip | grep '^Jun 11 \(18:5[89]\|19:0[1-4]\):.*revision-pruner-' | grep openshift-kube-controller-manager | tail -n4
Jun 11 18:59:09.145453 ip-10-0-150-164 crio[1429]: time="2021-06-11 18:59:09.145380970Z" level=info msg="Removed container 8f7d498e1eacac92c086bf46996d8098ce072f2442137d7ef8be5271cb4e921c: openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal/pruner" id=a25faefd-9673-4fda-bd6e-20dd156a8069 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Jun 11 19:04:44.382551 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.382521 1465 kubelet.go:1931] SyncLoop (DELETE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411394 1465 kubelet.go:1925] SyncLoop (REMOVE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411454 1465 kubelet.go:2115] Failed to delete pod "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)", err: pod not found
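For context on the "pod not found" line above: the failure comes from the REMOVE handler discovering that the container runtime no longer reports the pod as running, so there is nothing left to kill, and the error is only logged. The following is a minimal, self-contained Go sketch of that flow; the type and function names are illustrative stand-ins, not the actual kubelet code.

```go
package main

import (
	"errors"
	"fmt"
)

// runningPod is a stand-in for the runtime's view of a pod (hypothetical type,
// loosely modeled on the kubelet's runtime cache).
type runningPod struct {
	uid  string
	name string
}

// runtimeCache mimics the kubelet asking the runtime which pods still exist.
type runtimeCache struct {
	pods []runningPod
}

func (rc *runtimeCache) findPod(uid string) (runningPod, bool) {
	for _, p := range rc.pods {
		if p.uid == uid {
			return p, true
		}
	}
	return runningPod{}, false
}

var errPodNotFound = errors.New("pod not found")

// deletePod only succeeds if the runtime still reports the pod; if the runtime
// has already removed it (for example after an earlier timed-out request),
// there is nothing to kill and "pod not found" is returned.
func deletePod(rc *runtimeCache, uid string) error {
	if _, ok := rc.findPod(uid); !ok {
		return errPodNotFound
	}
	// A real kubelet would hand the running pod to its pod-killing worker here.
	return nil
}

// handlePodRemove models the SyncLoop REMOVE branch seen in the journal:
// the failure is logged, and no pod-deletion event is emitted.
func handlePodRemove(rc *runtimeCache, podKey, uid string) {
	if err := deletePod(rc, uid); err != nil {
		fmt.Printf("Failed to delete pod %q, err: %v\n", podKey, err)
	}
}

func main() {
	// The runtime has already forgotten the pod, so REMOVE logs the failure.
	rc := &runtimeCache{}
	handlePodRemove(rc,
		"revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)",
		"6eae1b36-09f5-4a04-bcff-f61c40f2d92e")
}
```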
Having a pod-deleted event somewhere between 18:59:09 and 19:04:44 would have been nice. Maybe CRI-O can hold a cache of zombie pod exit data for five or ten minutes, so it can return their state to the kubelet the next time the kubelet asks about them?
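One way to picture that suggestion is a small TTL cache in the runtime that remembers exit data for pods it has already removed, so a late kubelet query can still be answered instead of returning "pod not found". The sketch below is only an illustration of the idea under that assumption, not CRI-O code; every name in it is made up.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// exitRecord is a hypothetical snapshot of a removed ("zombie") pod's final state.
type exitRecord struct {
	podID     string
	exitCode  int
	removedAt time.Time
}

// zombieCache keeps exit records for a bounded time window.
type zombieCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	records map[string]exitRecord
}

func newZombieCache(ttl time.Duration) *zombieCache {
	return &zombieCache{ttl: ttl, records: make(map[string]exitRecord)}
}

// Remember is called when the runtime removes a pod.
func (c *zombieCache) Remember(podID string, exitCode int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records[podID] = exitRecord{podID: podID, exitCode: exitCode, removedAt: time.Now()}
}

// Lookup returns the record if it is still within the TTL; expired entries
// are dropped lazily.
func (c *zombieCache) Lookup(podID string) (exitRecord, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	rec, ok := c.records[podID]
	if !ok {
		return exitRecord{}, false
	}
	if time.Since(rec.removedAt) > c.ttl {
		delete(c.records, podID)
		return exitRecord{}, false
	}
	return rec, true
}

func main() {
	// Keep exit data around for ten minutes, per the "five or ten minutes" suggestion.
	cache := newZombieCache(10 * time.Minute)
	cache.Remember("6eae1b36-09f5-4a04-bcff-f61c40f2d92e", 0)

	if rec, ok := cache.Lookup("6eae1b36-09f5-4a04-bcff-f61c40f2d92e"); ok {
		fmt.Printf("pod %s exited with code %d at %s\n", rec.podID, rec.exitCode, rec.removedAt.Format(time.RFC3339))
	} else {
		fmt.Println("pod not found")
	}
}
```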