Bug 1948920
| Summary: | when deleting pod successfully the kubelet log show "Failed to delete pod" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | MinLi <minmli> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, ccoleman, harpatil, wking |
| Version: | 4.9 | Keywords: | Reopened |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | x86_64 | OS: | Linux |
| Last Closed: | 2021-06-28 07:13:35 UTC | Type: | Bug |
Description
MinLi
2021-04-13 02:56:47 UTC
Harshal, are you sure about NOTABUG? I agree that the kubelet's understanding of which pods are running locally is eventually consistent, but there are other side effects. For example, the pod discussed in bug 1971080 got removed at some point (possibly by CRI-O after the kubelet had already timed out its request). But because of this bug, we get no pod-deletion events:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-controller-manager" and (tostring | contains("revision-pruner-11-ip-10-0-150-164.ec2.internal"))) | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort
...
2021-06-11T18:58:38Z revision-pruner-11-ip-10-0-150-164.ec2.internal Created: Created container pruner
2021-06-11T18:58:38Z revision-pruner-11-ip-10-0-150-164.ec2.internal Started: Started container pruner
2021-06-11T19:08:16Z revision-pruner-11-ip-10-0-150-164.ec2.internal AddedInterface: Add eth0 [10.128.0.12/23] from openshift-sdn
2021-06-11T19:08:24Z revision-pruner-11-ip-10-0-150-164.ec2.internal Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-06-11-170229@sha256:a9f57c74b85a7cb361740a9c163b192303f94d89f36af847c27fd3a03a0df23a"
2021-06-11T19:08:52Z revision-pruner-11-ip-10-0-150-164.ec2.internal Pulled: Successfully pulled image "registry.ci.openshift.org/ocp/4.8-2021-06-11-170229@sha256:a9f57c74b85a7cb361740a9c163b192303f94d89f36af847c27fd3a03a0df23a" in 27.873680715s
2021-06-11T19:08:54Z revision-pruner-11-ip-10-0-150-164.ec2.internal Created: Created container pruner
2021-06-11T19:08:55Z revision-pruner-11-ip-10-0-150-164.ec2.internal Started: Started container pruner

Comparing with the kubelet logs mentioned in bug 1971080:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-150-164.ec2.internal/journal | gunzip | grep '^Jun 11 \(18:5[89]\|19:0[1-4]\):.*revision-pruner-' | grep openshift-kube-controller-manager | tail -n4
Jun 11 18:59:09.145453 ip-10-0-150-164 crio[1429]: time="2021-06-11 18:59:09.145380970Z" level=info msg="Removed container 8f7d498e1eacac92c086bf46996d8098ce072f2442137d7ef8be5271cb4e921c: openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal/pruner" id=a25faefd-9673-4fda-bd6e-20dd156a8069 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Jun 11 19:04:44.382551 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.382521 1465 kubelet.go:1931] SyncLoop (DELETE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411394 1465 kubelet.go:1925] SyncLoop (REMOVE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411454 1465 kubelet.go:2115] Failed to delete pod "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)", err: pod not found
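For context on the "pod not found" line above: the failure comes from the REMOVE handler discovering that the container runtime no longer reports the pod as running, so there is nothing left to kill, and the error is only logged. The following is a minimal, self-contained Go sketch of that flow; the type and function names are illustrative stand-ins, not the actual kubelet code.

```go
package main

import (
	"errors"
	"fmt"
)

// runningPod is a stand-in for the runtime's view of a pod (hypothetical type,
// loosely modeled on the kubelet's runtime cache).
type runningPod struct {
	uid  string
	name string
}

// runtimeCache mimics the kubelet asking the runtime which pods still exist.
type runtimeCache struct {
	pods []runningPod
}

func (rc *runtimeCache) findPod(uid string) (runningPod, bool) {
	for _, p := range rc.pods {
		if p.uid == uid {
			return p, true
		}
	}
	return runningPod{}, false
}

var errPodNotFound = errors.New("pod not found")

// deletePod only succeeds if the runtime still reports the pod; if the runtime
// has already removed it (for example after an earlier timed-out request),
// there is nothing to kill and "pod not found" is returned.
func deletePod(rc *runtimeCache, uid string) error {
	if _, ok := rc.findPod(uid); !ok {
		return errPodNotFound
	}
	// A real kubelet would hand the running pod to its pod-killing worker here.
	return nil
}

// handlePodRemove models the SyncLoop REMOVE branch seen in the journal:
// the failure is logged, and no pod-deletion event is emitted.
func handlePodRemove(rc *runtimeCache, podKey, uid string) {
	if err := deletePod(rc, uid); err != nil {
		fmt.Printf("Failed to delete pod %q, err: %v\n", podKey, err)
	}
}

func main() {
	// The runtime has already forgotten the pod, so REMOVE logs the failure.
	rc := &runtimeCache{}
	handlePodRemove(rc,
		"revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)",
		"6eae1b36-09f5-4a04-bcff-f61c40f2d92e")
}
```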
Having a pod-deleted event somewhere between 18:59:09 and 19:04:44 would have been nice. Maybe CRI-O can hold a cache of zombie pod exit data for five or ten minutes, so it can return their state to the kubelet the next time the kubelet asks about them?
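One way to picture that suggestion is a small TTL cache in the runtime that remembers exit data for pods it has already removed, so a late kubelet query can still be answered instead of returning "pod not found". The sketch below is only an illustration of the idea under that assumption, not CRI-O code; every name in it is made up.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// exitRecord is a hypothetical snapshot of a removed ("zombie") pod's final state.
type exitRecord struct {
	podID     string
	exitCode  int
	removedAt time.Time
}

// zombieCache keeps exit records for a bounded time window.
type zombieCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	records map[string]exitRecord
}

func newZombieCache(ttl time.Duration) *zombieCache {
	return &zombieCache{ttl: ttl, records: make(map[string]exitRecord)}
}

// Remember is called when the runtime removes a pod.
func (c *zombieCache) Remember(podID string, exitCode int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records[podID] = exitRecord{podID: podID, exitCode: exitCode, removedAt: time.Now()}
}

// Lookup returns the record if it is still within the TTL; expired entries
// are dropped lazily.
func (c *zombieCache) Lookup(podID string) (exitRecord, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	rec, ok := c.records[podID]
	if !ok {
		return exitRecord{}, false
	}
	if time.Since(rec.removedAt) > c.ttl {
		delete(c.records, podID)
		return exitRecord{}, false
	}
	return rec, true
}

func main() {
	// Keep exit data around for ten minutes, per the "five or ten minutes" suggestion.
	cache := newZombieCache(10 * time.Minute)
	cache.Remember("6eae1b36-09f5-4a04-bcff-f61c40f2d92e", 0)

	if rec, ok := cache.Lookup("6eae1b36-09f5-4a04-bcff-f61c40f2d92e"); ok {
		fmt.Printf("pod %s exited with code %d at %s\n", rec.podID, rec.exitCode, rec.removedAt.Format(time.RFC3339))
	} else {
		fmt.Println("pod not found")
	}
}
```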