Description of problem:

The kubelet log shows "Failed to delete pod", but the pod is in fact deleted successfully. The log also shows the error: container with ID starting with 1f263b16389bd57dd814616e34a53fc3676bef2e1d7c4598b7ffde6cfc0d7e79 not found: ID does not exist"

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-09-000946

How reproducible:
always

hello-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  labels:
    name: hello-pod
  name: hello-pod-1
spec:
  containers:
  - image: "docker.io/ocpqe/hello-pod:latest"
    imagePullPolicy: IfNotPresent
    name: hello-pod
    ports:
    - containerPort: 8080
      protocol: TCP
    resources: {}
    securityContext:
      capabilities: {}
      privileged: false
    terminationMessagePath: /dev/termination-log
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  serviceAccount: ""
status: {}

Steps to Reproduce:
1. Create a pod:
$ oc create -f hello-pod.yaml
pod/hello-pod created

2. $ oc get pod -o wide
NAME          READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
hello-pod-1   1/1     Running   0          9s    10.131.0.48   ip-10-0-131-96.us-east-2.compute.internal   <none>           <none>

3. $ oc delete pod hello-pod-1
pod "hello-pod-1" deleted

4. Log in to node ip-10-0-131-96.us-east-2.compute.internal and check the kubelet log.

Actual results:
4. The log shows:

Failed killing the pod "hello-pod-1": failed to "KillContainer" for "hello-pod" with KillContainerError: "rpc error: code = NotFound desc = could not find container \"1f263b16389bd57dd814616e34a53fc3676bef2e1d7c4598b7ffde6cfc0d7e79\": container with ID starting with 1f263b16389bd57dd814616e34a53fc3676bef2e1d7c4598b7ffde6cfc0d7e79 not found: ID does not exist"
Failed to delete pod "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)", err: pod not found

Expected results:
4. The log shows pod "hello-pod-1" is deleted successfully.

Additional info:

sh-4.4# journalctl -u kubelet --since="20 minutes ago" | grep hello-pod-1
Apr 09 09:34:31 ip-10-0-131-96 hyperkube[1452]: I0409 09:34:31.968785 1452 kubelet.go:1920] SyncLoop (ADD, "api"): "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)"
Apr 09 09:34:32 ip-10-0-131-96 hyperkube[1452]: I0409 09:34:32.095108 1452 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-5f8hz" (UniqueName: "kubernetes.io/secret/c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48-default-token-5f8hz") pod "hello-pod-1" (UID: "c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48")
Apr 09 09:34:32 ip-10-0-131-96 hyperkube[1452]: I0409 09:34:32.196095 1452 reconciler.go:269] operationExecutor.MountVolume started for volume "default-token-5f8hz" (UniqueName: "kubernetes.io/secret/c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48-default-token-5f8hz") pod "hello-pod-1" (UID: "c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48")
Apr 09 09:34:32 ip-10-0-131-96 hyperkube[1452]: I0409 09:34:32.199193 1452 operation_generator.go:672] MountVolume.SetUp succeeded for volume "default-token-5f8hz" (UniqueName: "kubernetes.io/secret/c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48-default-token-5f8hz") pod "hello-pod-1" (UID: "c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48")
Apr 09 09:34:32 ip-10-0-131-96 hyperkube[1452]: I0409 09:34:32.286503 1452 kuberuntime_manager.go:443] No sandbox for pod "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)" can be found. Need to start a new one
Apr 09 09:34:33 ip-10-0-131-96 hyperkube[1452]: I0409 09:34:33.968246 1452 kubelet.go:1927] SyncLoop (UPDATE, "api"): "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)"
Apr 09 09:34:35 ip-10-0-131-96 hyperkube[1452]: I0409 09:34:35.150708 1452 kubelet.go:1958] SyncLoop (PLEG): "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)", event: &pleg.PodLifecycleEvent{ID:"c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48", Type:"ContainerStarted", Data:"1f263b16389bd57dd814616e34a53fc3676bef2e1d7c4598b7ffde6cfc0d7e79"}
Apr 09 09:34:35 ip-10-0-131-96 hyperkube[1452]: I0409 09:34:35.150798 1452 kubelet.go:1958] SyncLoop (PLEG): "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)", event: &pleg.PodLifecycleEvent{ID:"c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48", Type:"ContainerStarted", Data:"7d0178ef972d48dedad1eeb53ecbcaebeda44316010bfaaaaef2438edad5865f"}
Apr 09 09:35:03 ip-10-0-131-96 hyperkube[1452]: I0409 09:35:03.252534 1452 kubelet.go:1936] SyncLoop (DELETE, "api"): "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)"
Apr 09 09:35:04 ip-10-0-131-96 hyperkube[1452]: I0409 09:35:04.236972 1452 kubelet.go:1958] SyncLoop (PLEG): "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)", event: &pleg.PodLifecycleEvent{ID:"c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48", Type:"ContainerDied", Data:"1f263b16389bd57dd814616e34a53fc3676bef2e1d7c4598b7ffde6cfc0d7e79"}
Apr 09 09:35:04 ip-10-0-131-96 hyperkube[1452]: I0409 09:35:04.237130 1452 kubelet.go:1958] SyncLoop (PLEG): "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)", event: &pleg.PodLifecycleEvent{ID:"c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48", Type:"ContainerDied", Data:"7d0178ef972d48dedad1eeb53ecbcaebeda44316010bfaaaaef2438edad5865f"}
Apr 09 09:35:04 ip-10-0-131-96 hyperkube[1452]: I0409 09:35:04.287641 1452 kubelet_pods.go:1270] Killing unwanted pod "hello-pod-1"
Apr 09 09:35:04 ip-10-0-131-96 hyperkube[1452]: E0409 09:35:04.288568 1452 kubelet_pods.go:1273] Failed killing the pod "hello-pod-1": failed to "KillContainer" for "hello-pod" with KillContainerError: "rpc error: code = NotFound desc = could not find container \"1f263b16389bd57dd814616e34a53fc3676bef2e1d7c4598b7ffde6cfc0d7e79\": container with ID starting with 1f263b16389bd57dd814616e34a53fc3676bef2e1d7c4598b7ffde6cfc0d7e79 not found: ID does not exist"
Apr 09 09:35:16 ip-10-0-131-96 hyperkube[1452]: I0409 09:35:16.298474 1452 kubelet.go:1936] SyncLoop (DELETE, "api"): "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)"
Apr 09 09:35:16 ip-10-0-131-96 hyperkube[1452]: I0409 09:35:16.301770 1452 kubelet.go:1930] SyncLoop (REMOVE, "api"): "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)"
Apr 09 09:35:16 ip-10-0-131-96 hyperkube[1452]: I0409 09:35:16.303441 1452 kubelet.go:2120] Failed to delete pod "hello-pod-1_default(c61e2a7e-aefd-4ec2-a581-f7d6ec0e6f48)", err: pod not found
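
For reference, the two error lines come from different code paths that both appear to race CRI-O's own cleanup: the housekeeping loop (kubelet_pods.go, "Killing unwanted pod") issues a KillContainer for a container CRI-O has already removed, and the REMOVE handler (kubelet.go) looks the pod up after it is already gone from the kubelet's local pod cache. Below is a minimal, self-contained Go sketch of that sequence; criRuntime and podManager are invented stand-ins for illustration, not the real kubelet/CRI types:

// Toy model of the race, not kubelet source.
package main

import (
	"errors"
	"fmt"
)

// criRuntime stands in for CRI-O. By the time housekeeping runs, it has
// already removed the container the kubelet still has cached.
type criRuntime struct{ containers map[string]bool }

func (r *criRuntime) killContainer(id string) error {
	if !r.containers[id] {
		return errors.New("container with ID starting with " + id + " not found: ID does not exist")
	}
	delete(r.containers, id)
	return nil
}

// podManager stands in for the kubelet's local pod cache; the API DELETE
// has already dropped the pod from it as well.
type podManager struct{ pods map[string]bool }

func (m *podManager) deletePod(name string) error {
	if !m.pods[name] {
		return errors.New("pod not found")
	}
	delete(m.pods, name)
	return nil
}

func main() {
	rt := &criRuntime{containers: map[string]bool{}} // container already reaped by the runtime
	pm := &podManager{pods: map[string]bool{}}       // pod already gone from the local cache

	// Housekeeping ("Killing unwanted pod"): acts on a stale cached
	// container ID, so the runtime call comes back NotFound.
	if err := rt.killContainer("1f263b16389b"); err != nil {
		fmt.Println(`Failed killing the pod "hello-pod-1":`, err)
	}

	// SyncLoop (REMOVE, "api"): the pod lookup also misses, producing the
	// misleading "Failed to delete pod ... pod not found" line even though
	// everything is already cleaned up.
	if err := pm.deletePod("hello-pod-1"); err != nil {
		fmt.Println(`Failed to delete pod "hello-pod-1":`, err)
	}
}

In other words, both messages are cosmetic here: cleanup already happened, and the kubelet is logging errors only because its caches lag the runtime.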
Harshal, are you sure about NOTABUG? I agree that the kubelet's understanding of which pods are running locally is eventually consistent, but there are other side effects. For example, the pod discussed in bug 1971080 got removed at some point (possibly by CRI-O after the kubelet had already timed out its request). But because of this bug, we get no pod-deletion events:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-controller-manager" and (tostring | contains("revision-pruner-11-ip-10-0-150-164.ec2.internal"))) | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort
...
2021-06-11T18:58:38Z revision-pruner-11-ip-10-0-150-164.ec2.internal Created: Created container pruner
2021-06-11T18:58:38Z revision-pruner-11-ip-10-0-150-164.ec2.internal Started: Started container pruner
2021-06-11T19:08:16Z revision-pruner-11-ip-10-0-150-164.ec2.internal AddedInterface: Add eth0 [10.128.0.12/23] from openshift-sdn
2021-06-11T19:08:24Z revision-pruner-11-ip-10-0-150-164.ec2.internal Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-06-11-170229@sha256:a9f57c74b85a7cb361740a9c163b192303f94d89f36af847c27fd3a03a0df23a"
2021-06-11T19:08:52Z revision-pruner-11-ip-10-0-150-164.ec2.internal Pulled: Successfully pulled image "registry.ci.openshift.org/ocp/4.8-2021-06-11-170229@sha256:a9f57c74b85a7cb361740a9c163b192303f94d89f36af847c27fd3a03a0df23a" in 27.873680715s
2021-06-11T19:08:54Z revision-pruner-11-ip-10-0-150-164.ec2.internal Created: Created container pruner
2021-06-11T19:08:55Z revision-pruner-11-ip-10-0-150-164.ec2.internal Started: Started container pruner

Comparing with the kubelet logs mentioned in bug 1971080:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-150-164.ec2.internal/journal | gunzip | grep '^Jun 11 \(18:5[89]\|19:0[1-4]\):.*revision-pruner-' | grep openshift-kube-controller-manager | tail -n4
Jun 11 18:59:09.145453 ip-10-0-150-164 crio[1429]: time="2021-06-11 18:59:09.145380970Z" level=info msg="Removed container 8f7d498e1eacac92c086bf46996d8098ce072f2442137d7ef8be5271cb4e921c: openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal/pruner" id=a25faefd-9673-4fda-bd6e-20dd156a8069 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Jun 11 19:04:44.382551 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.382521 1465 kubelet.go:1931] SyncLoop (DELETE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411394 1465 kubelet.go:1925] SyncLoop (REMOVE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411454 1465 kubelet.go:2115] Failed to delete pod "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)", err: pod not found

Having a pod-deleted event somewhere between 18:59:09 and 19:04:44 would have been nice. Maybe CRI-O could hold a cache of zombie pod exit data for five or ten minutes, so it can return their state to the kubelet the next time the kubelet asks about them?
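
To make that suggestion concrete, here is a hypothetical sketch of such a cache (invented names; nothing here is CRI-O's actual code). On container removal, the runtime would remember the exit state for a bounded TTL instead of dropping it, so a late kubelet query could be answered with "already exited" rather than a NotFound RPC error:

// Hypothetical TTL cache for exit data of removed ("zombie") containers.
package main

import (
	"fmt"
	"sync"
	"time"
)

// exitRecord is the minimal state worth keeping after a container is
// removed, so the runtime could still answer a late kubelet query.
type exitRecord struct {
	exitCode   int
	finishedAt time.Time
}

// exitCache holds exit records for removed containers for a bounded time.
type exitCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	records map[string]exitRecord // keyed by container ID
	expiry  map[string]time.Time
}

func newExitCache(ttl time.Duration) *exitCache {
	return &exitCache{
		ttl:     ttl,
		records: map[string]exitRecord{},
		expiry:  map[string]time.Time{},
	}
}

// remember would be called at the point where the runtime removes a
// container today, instead of forgetting it entirely.
func (c *exitCache) remember(id string, rec exitRecord) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records[id] = rec
	c.expiry[id] = time.Now().Add(c.ttl)
}

// lookup serves a query for a container that no longer exists. A hit lets
// the runtime report the final state instead of NotFound.
func (c *exitCache) lookup(id string) (exitRecord, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	exp, ok := c.expiry[id]
	if !ok || time.Now().After(exp) {
		delete(c.records, id)
		delete(c.expiry, id)
		return exitRecord{}, false
	}
	return c.records[id], true
}

func main() {
	cache := newExitCache(5 * time.Minute)
	// At RemoveContainer time, stash the exit state instead of dropping it.
	cache.remember("1f263b16389b", exitRecord{exitCode: 0, finishedAt: time.Now()})
	// A later status/kill request within the TTL can be answered gracefully.
	if rec, ok := cache.lookup("1f263b16389b"); ok {
		fmt.Printf("container already exited with code %d at %s\n", rec.exitCode, rec.finishedAt.Format(time.RFC3339))
	}
}

A fixed TTL keeps the cache bounded without reference counting; the trade-off is that a kubelet that asks after the window still gets NotFound, so the window would need to comfortably cover the kubelet's retry/housekeeping intervals.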