Description of problem:
Starting this with Kubelet; please redirect as appropriate.

In a 4.7 reliability cluster, pods and namespaces are stuck in Terminating after the cluster had been running fine for ~7 days. The cluster started 16-December, and on 23-December at 04:51 UTC we hit the first instance of a pod stuck in Terminating during namespace deletion. After that it began happening for what appears to be all namespace deletions. All compute nodes are affected - all have pods stuck in Terminating.

First pod/node affected: nodejs-postgresql-persistent-2-sj2cb in namespace nodejs-postgresql-persistent-36 on node ip-10-0-142-65.us-west-2.compute.internal

I have two must-gathers. #1 is from 28-Dec and I am not sure what it shows. #2 is from 4-January and should contain the SyncLoop (REMOVE, "api"): "postgresql-1-deploy_rails-pgsql-persistent-263(3c55656f-e9b8-4b62-bcd0-1c6c2957c7be)" event on deletion of the rails-pgsql-persistent-263 namespace at 07:38:55 UTC, as well as the subsequent failures removing that pod.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-16-162836

How reproducible:
Unknown. Setting up a new environment on 4-January, but it may be 7 days before the issue occurs.

Steps to Reproduce:
1. Standard 3 master/3 node environment on AWS.
2. Start a steady workload of project/app creation/deletion with visits to the applications. This is the system test reliability workload; more details available upon request.

Actual results:
After 7 days, project deletion started resulting in projects getting stuck in Terminating. All projects contain pods stuck in Terminating.

Expected results:
Cluster runs reliably beyond 7 days.

Additional info:
I will add the location of the two must-gathers and a kubeconfig in a private comment. The cluster is available for investigation.
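For anyone triaging a cluster in this state, the stuck pods can be spotted programmatically: a pod that is Terminating has `metadata.deletionTimestamp` set but still exists in the API. A minimal sketch (the helper name and the sample data are illustrative, not from the cluster; in practice the input would come from `oc get pods -A -o json`):

```python
import json

def stuck_terminating(pods_json):
    """Return (namespace, name) for pods that have a deletionTimestamp
    set but are still present, i.e. pods stuck in Terminating."""
    stuck = []
    for pod in pods_json.get("items", []):
        meta = pod.get("metadata", {})
        if meta.get("deletionTimestamp"):
            stuck.append((meta.get("namespace"), meta.get("name")))
    return stuck

# Input shaped like `oc get pods -A -o json` (illustrative data only):
sample = {
    "items": [
        {"metadata": {"namespace": "rails-pgsql-persistent-263",
                      "name": "postgresql-1-tbg5r",
                      "deletionTimestamp": "2021-01-03T07:38:55Z"}},
        {"metadata": {"namespace": "default", "name": "healthy-pod"}},
    ]
}
print(stuck_terminating(sample))
# -> [('rails-pgsql-persistent-263', 'postgresql-1-tbg5r')]
```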
Created attachment 1744349 [details] excerpt of kubelet log around time of deletion of namespace rails-pgsql-persistent-263 pod
This revert [1] of the sandbox deletion check went into the Kubelet three days ago. Could you update your cluster to at least a version containing that patch and see if you can reproduce it again? Does it happen in CI?

1. https://github.com/openshift/kubernetes/pull/523
From a quick look through the logs, this is a different failure mode than bug #1912880 - per the attachment, there's just a bunch of attempted DELETE/REMOVEs followed by "pod not found" errors:

Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.571429 1643 kubelet.go:2097] Failed to delete pod "rails-pgsql-persistent-8-build_rails-pgsql-persistent-263(e8dd01e7-36b4-4b92-9068-73d7a7f61e8a)", err: pod not found
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.592284 1643 kubelet.go:1908] SyncLoop (DELETE, "api"): "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)"
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.615702 1643 kubelet.go:1902] SyncLoop (REMOVE, "api"): "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)"
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.627932 1643 kubelet.go:2097] Failed to delete pod "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)", err: pod not found
Jan 03 07:38:55 ip-10-0-214-247 crio[1609]: time="2021-01-03 07:38:55.775816829Z" level=info msg="Stopped container 3c3200981bcc340aba47b0ff058568fb896eec18f78ea7b374277075b7bd7cc6: rails-pgsql-persistent-263/postgresql-1-tbg5r/postgresql" id=58dea3b5-a63f-46ef-b7f6-54e540c2c5a7 n>
Jan 03 07:38:55 ip-10-0-214-247 crio[1609]: time="2021-01-03 07:38:55.789091982Z" level=info msg="Got pod network &{Name:postgresql-1-tbg5r Namespace:rails-pgsql-persistent-263 ID:82f3d0ed2434cd1b5ec63e3898c2f8e9fd973bf0075b9284dec2d793803519b2 NetNS:/var/run/netns/352e5966-1d74-4>

The attached logs only cover a period of ~1s, and the must-gathers don't include kubelet logs. To debug this I will need more logs from the kubelet (host logs for any affected hosts) to get a better idea of why the pod is not deleting; I can't figure out how the pod got into this state from the context that's been provided.
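Once fuller kubelet journals are available, the "pod not found" retry loop can be surfaced by counting those failures per pod UID. A small sketch (the regex is derived from the log format in the excerpt above; the function name and sample lines are illustrative):

```python
import re
from collections import Counter

# Matches kubelet "Failed to delete pod" lines as seen in the excerpt above.
FAILED_DELETE = re.compile(
    r'Failed to delete pod "(?P<pod>[^"]+)", err: pod not found')

def count_failed_deletes(journal_lines):
    """Count 'pod not found' deletion failures per pod name(UID)."""
    counts = Counter()
    for line in journal_lines:
        m = FAILED_DELETE.search(line)
        if m:
            counts[m.group("pod")] += 1
    return counts

# Illustrative journal lines, shaped like the excerpt above:
sample = [
    'Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.627932 1643 kubelet.go:2097] Failed to delete pod "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)", err: pod not found',
    'Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.592284 1643 kubelet.go:1908] SyncLoop (DELETE, "api"): "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)"',
]
print(count_failed_deletes(sample).most_common())
```

A pod whose count keeps climbing across a long journal window is the kubelet retrying the same delete, which is exactly the signature to look for in the requested host logs.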
After 2 days running on 4.7.0-0.nightly-2021-01-19-095812 I have not seen this yet. Going to try the scenario in bug 1912880 to see if that happens on recent builds. Will continue to monitor.
Tested on 4.7.0-0.nightly-2021-02-08-164120, which contains the fix for bug 1915085, and the problem still occurs. After running the user workload overnight, several projects and pods are stuck in Terminating. I will attach the oc adm must-gather output and the journal of a node where a pod is stuck Terminating.
Comment 10 may have been premature - the Terminating pods eventually terminated, and the namespaces terminated as well. Moving this back to CLOSED pending further investigation. If an additional issue is found, a new bz will be used to track it.