Bug 1912521
| Field | Value |
|---|---|
| Summary: | Deleted pods stuck in terminating after 7 days in a long running cluster |
| Product: | OpenShift Container Platform |
| Component: | Node |
| Node sub component: | Kubelet |
| Status: | CLOSED UPSTREAM |
| Severity: | urgent |
| Priority: | high |
| Version: | 4.7 |
| Target Release: | 4.7.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Mike Fiedler <mifiedle> |
| Assignee: | Elana Hashman <ehashman> |
| QA Contact: | Weinan Liu <weinliu> |
| CC: | aos-bugs, nagrawal, rphillips, tsweeney |
| Keywords: | Reopened, TestBlocker, UpcomingSprint |
| Type: | Bug |
| Last Closed: | 2021-02-09 13:00:02 UTC |
| Bug Depends On: | 1898612, 1898614, 1915085 |
Description
Mike Fiedler, 2021-01-04 16:13:19 UTC

Created attachment 1744349 [details]
Excerpt of the kubelet log around the time of deletion of the namespace rails-pgsql-persistent-263 pod.
This revert [1], which checks for sandbox deletions, went into the kubelet three days ago. Could you update your cluster to at least the version containing that patch and see if you can reproduce the issue? Does it happen in CI?

1. https://github.com/openshift/kubernetes/pull/523

From a quick look through the logs, this is a different failure mode than bug #1912880. Per the attachment, there is just a series of attempted DELETE/REMOVE operations followed by "pod not found" errors:

```
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.571429 1643 kubelet.go:2097] Failed to delete pod "rails-pgsql-persistent-8-build_rails-pgsql-persistent-263(e8dd01e7-36b4-4b92-9068-73d7a7f61e8a)", err: pod not found
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.592284 1643 kubelet.go:1908] SyncLoop (DELETE, "api"): "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)"
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.615702 1643 kubelet.go:1902] SyncLoop (REMOVE, "api"): "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)"
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.627932 1643 kubelet.go:2097] Failed to delete pod "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)", err: pod not found
Jan 03 07:38:55 ip-10-0-214-247 crio[1609]: time="2021-01-03 07:38:55.775816829Z" level=info msg="Stopped container 3c3200981bcc340aba47b0ff058568fb896eec18f78ea7b374277075b7bd7cc6: rails-pgsql-persistent-263/postgresql-1-tbg5r/postgresql" id=58dea3b5-a63f-46ef-b7f6-54e540c2c5a7 n>
Jan 03 07:38:55 ip-10-0-214-247 crio[1609]: time="2021-01-03 07:38:55.789091982Z" level=info msg="Got pod network &{Name:postgresql-1-tbg5r Namespace:rails-pgsql-persistent-263 ID:82f3d0ed2434cd1b5ec63e3898c2f8e9fd973bf0075b9284dec2d793803519b2 NetNS:/var/run/netns/352e5966-1d74-4>
```

The attached logs only cover a period of ~1 second, and the must-gathers don't include kubelet logs. To debug this I will need more logs from the kubelet (host logs for any affected hosts) to get a better idea of why the pod is not deleting; I can't determine how the pod got into this state from the context that has been provided.

After 2 days running on 4.7.0-0.nightly-2021-01-19-095812 I have not seen this yet. Going to try the scenario in bug 1912880 to see if that happens on recent builds. Will continue to monitor.

Tested on 4.7.0-0.nightly-2021-02-08-164120, which contains the fix for bug 1915085, and the problem still occurs. After running the user workload overnight, several projects and pods are stuck in Terminating. I will attach `oc adm must-gather` output and the journal of a node where a pod is stuck Terminating.

Comment 10 may have been premature: the Terminating pods eventually terminated, and the namespaces terminated as well. Moving this back to CLOSED pending further investigation. If an additional issue is found, a new bz will be used to track it.
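The comments above revolve around pods whose deletion was requested but which linger in Terminating. In Kubernetes, such a pod has `metadata.deletionTimestamp` set but has not yet been removed from the API server. As a minimal sketch (not part of the original report, and assuming only the standard pod metadata fields emitted by `oc get pods -A -o json`), the following stdlib-only Python script lists pods that have been Terminating longer than a threshold, which is one way to spot the symptom described here:

```python
# Hypothetical triage helper: find pods stuck in Terminating longer than
# `max_age`, given the JSON pod list from `oc get pods -A -o json`.
import json
from datetime import datetime, timedelta, timezone

def stuck_terminating(pod_list_json, max_age=timedelta(minutes=5), now=None):
    """Return (namespace, name, age) tuples for pods whose deletion was
    requested more than `max_age` ago but which still exist."""
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        ts = meta.get("deletionTimestamp")  # set when deletion is requested
        if ts is None:
            continue  # pod is not being deleted
        deleted_at = datetime.strptime(
            ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        age = now - deleted_at
        if age > max_age:
            stuck.append((meta.get("namespace"), meta.get("name"), age))
    return stuck

# Small fabricated sample mimicking the shape of a real pod list.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "rails-pgsql-persistent-263",
                  "name": "rails-pgsql-persistent-9-build",
                  "deletionTimestamp": "2021-01-03T07:38:55Z"}},
    {"metadata": {"namespace": "default", "name": "healthy-pod"}},
]})
now = datetime(2021, 1, 10, 0, 0, 0, tzinfo=timezone.utc)
for ns, name, age in stuck_terminating(sample, now=now):
    print(f"{ns}/{name} terminating for {age}")
```

On the sample above this reports the build pod as Terminating for roughly a week, matching the "stuck after 7 days" symptom in the summary; against a live cluster one would feed it the real `oc get pods -A -o json` output instead.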