Bug 1912521 - Deleted pods stuck in terminating after 7 days in a long running cluster
Summary: Deleted pods stuck in terminating after 7 days in a long running cluster
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Elana Hashman
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On: 1898612 1898614 1915085
Blocks:
 
Reported: 2021-01-04 16:13 UTC by Mike Fiedler
Modified: 2021-02-09 13:00 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-09 13:00:02 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
excerpt of kubelet log around time of deletion of namespace rails-pgsql-persistent-263 pod (14.80 KB, text/plain)
2021-01-04 17:09 UTC, Mike Fiedler

Description Mike Fiedler 2021-01-04 16:13:19 UTC
Description of problem:

Starting this with the Kubelet component; please redirect as appropriate.

In a 4.7 reliability cluster, pods and namespaces are stuck in terminating after the cluster was running fine for ~7 days.

The cluster started on 16-December, and on 23-December at 04:51 UTC we hit the first instance of a pod stuck in Terminating during namespace deletion. After that it began happening for what appears to be all namespace deletions.

All compute nodes are affected - all have pods stuck in Terminating.

First pod/node affected:   nodejs-postgresql-persistent-2-sj2cb in namespace nodejs-postgresql-persistent-36 on node ip-10-0-142-65.us-west-2.compute.internal
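For anyone triaging a similar cluster, the stuck pods can be listed with a short filter. This is a hypothetical helper, shown here against a captured sample of `oc get pods -A` output so the filter is illustrated without needing a live cluster; the pod and namespace names are the ones from this report, and the AGE values are made up.

```shell
#!/bin/sh
# Hypothetical filter for pods stuck in Terminating, run against a sample
# of `oc get pods -A` output. Against a live cluster, the equivalent is:
#   oc get pods -A --no-headers | awk '$4 == "Terminating"'
sample='NAMESPACE                        NAME                                   READY   STATUS        RESTARTS   AGE
nodejs-postgresql-persistent-36  nodejs-postgresql-persistent-2-sj2cb   0/1     Terminating   0          7d
rails-pgsql-persistent-263       postgresql-1-tbg5r                     1/1     Running       0          2h'

# Column 4 is STATUS; print namespace/name for each stuck pod.
stuck=$(printf '%s\n' "$sample" | awk '$4 == "Terminating" {print $1 "/" $2}')
echo "$stuck"
```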

I have two must-gathers. #1 is from 28-Dec, and I am not sure what it shows. #2 is from 4-January and should contain the (REMOVE, "api"): "postgresql-1-deploy_rails-pgsql-persistent-263(3c55656f-e9b8-4b62-bcd0-1c6c2957c7be)" event from deletion of the rails-pgsql-persistent-263 namespace at 07:38:55 UTC, as well as the subsequent failures removing that pod.


Version-Release number of selected component (if applicable):  4.7.0-0.nightly-2020-12-16-162836


How reproducible: Unknown. I am setting up a new environment on 4-January, but it may take 7 days before the issue occurs.


Steps to Reproduce:
1. standard 3 master/3 node environment on AWS
2. start a steady workload of project/app creation/deletion with visits to the applications. This is the system test reliability workload; more details are available upon request
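For context, a minimal dry-run sketch of what such a project churn loop might look like. The template name, project names, and iteration count are illustrative assumptions, not the actual reliability workload; OC defaults to `echo` so the sketch prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical create/visit/delete project churn loop.
# OC defaults to `echo` for a dry run; set OC=oc to run against a real cluster.
OC=${OC:-echo}
ITERATIONS=${ITERATIONS:-3}

i=1
while [ "$i" -le "$ITERATIONS" ]; do
    ns="reliability-test-$i"
    $OC new-project "$ns"
    # Illustrative app template; the real workload deploys several app types.
    $OC new-app --template=rails-pgsql-persistent -n "$ns"
    # ... visit the app's route here, then tear the project down ...
    $OC delete project "$ns" --wait=false
    i=$((i + 1))
done
```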


Actual results: 

After 7 days, project deletion started resulting in projects stuck in Terminating. All such projects contain pods stuck in Terminating.


Expected results:

Cluster runs reliably beyond 7 days


Additional info:

I will add the location of the two must-gathers and a kubeconfig in a private comment. The cluster is available for investigation.

Comment 2 Mike Fiedler 2021-01-04 17:09:54 UTC
Created attachment 1744349 [details]
excerpt of kubelet log around time of deletion of namespace rails-pgsql-persistent-263 pod

Comment 5 Ryan Phillips 2021-01-18 19:56:42 UTC
This revert [1], which changed how the Kubelet checks for sandbox deletions, went in three days ago. Could you update your cluster to at least a version containing this patch and see if you can reproduce the issue again?

Does it happen in CI?

1. https://github.com/openshift/kubernetes/pull/523

Comment 6 Elana Hashman 2021-01-18 22:49:14 UTC
From a quick look through the logs, this is a different failure mode than #1912880 - per the attachment, there's just a bunch of attempted DELETE/REMOVEs followed by "pod not found" errors:

Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.571429    1643 kubelet.go:2097] Failed to delete pod "rails-pgsql-persistent-8-build_rails-pgsql-persistent-263(e8dd01e7-36b4-4b92-9068-73d7a7f61e8a)", err: pod not found
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.592284    1643 kubelet.go:1908] SyncLoop (DELETE, "api"): "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)"
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.615702    1643 kubelet.go:1902] SyncLoop (REMOVE, "api"): "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)"
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.627932    1643 kubelet.go:2097] Failed to delete pod "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)", err: pod not found
Jan 03 07:38:55 ip-10-0-214-247 crio[1609]: time="2021-01-03 07:38:55.775816829Z" level=info msg="Stopped container 3c3200981bcc340aba47b0ff058568fb896eec18f78ea7b374277075b7bd7cc6: rails-pgsql-persistent-263/postgresql-1-tbg5r/postgresql" id=58dea3b5-a63f-46ef-b7f6-54e540c2c5a7 n>
Jan 03 07:38:55 ip-10-0-214-247 crio[1609]: time="2021-01-03 07:38:55.789091982Z" level=info msg="Got pod network &{Name:postgresql-1-tbg5r Namespace:rails-pgsql-persistent-263 ID:82f3d0ed2434cd1b5ec63e3898c2f8e9fd973bf0075b9284dec2d793803519b2 NetNS:/var/run/netns/352e5966-1d74-4>


The attached logs only cover a period of ~1s and the must-gathers don't include kubelet logs. To debug this I will need more logs from the kubelet (host logs for any affected hosts) to get a better idea of why the pod is not deleting; I can't figure out how the pod got into this state from the context that's been provided.
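A sketch of collecting the requested kubelet host logs. The node name is taken from the log excerpt above and the filter string from the attachment; both are assumptions about what is relevant, and the block only invokes `oc` if it is on PATH, so it is safe to run as a dry check elsewhere.

```shell
#!/bin/sh
# Hypothetical log-collection helper; NODE is the host from the excerpt above.
NODE=${NODE:-ip-10-0-214-247}
OUT="kubelet-$NODE.log"

if command -v oc >/dev/null 2>&1; then
    # Pull the kubelet unit's journal from the affected node (needs cluster-admin).
    oc adm node-logs "$NODE" -u kubelet > "$OUT"
    # Narrow to the deletion failures seen in the attachment.
    grep 'Failed to delete pod' "$OUT" || true
else
    echo "oc not found; run this on a host with access to the cluster" >&2
fi
```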

Comment 7 Mike Fiedler 2021-01-21 15:25:28 UTC
After 2 days running on 4.7.0-0.nightly-2021-01-19-095812 I have not seen this yet.   Going to try the scenario in bug 1912880 to see if that happens on recent builds.   Will continue to monitor.

Comment 10 Mike Fiedler 2021-02-09 12:53:17 UTC
Tested on 4.7.0-0.nightly-2021-02-08-164120 which contains the fix for bug 1915085 and the problem still occurs.  After running the user workload overnight, several projects and pods are stuck in Terminating.

I will attach oc adm must-gather and the journal of a node where a pod is stuck Terminating

Comment 11 Mike Fiedler 2021-02-09 13:00:02 UTC
Comment 10 may have been premature - the Terminating pods eventually terminated and the namespaces terminated as well.   Moving this back to CLOSED pending further investigation.   If an additional issue is found, a new bz will be used to track it.

