Bug 1912521

Summary: Deleted pods stuck in terminating after 7 days in a long running cluster
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Node
Assignee: Elana Hashman <ehashman>
Node sub component: Kubelet
QA Contact: Weinan Liu <weinliu>
Status: CLOSED UPSTREAM
Docs Contact:
Severity: urgent
Priority: high
CC: aos-bugs, nagrawal, rphillips, tsweeney
Version: 4.7
Keywords: Reopened, TestBlocker, UpcomingSprint
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-09 13:00:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1898612, 1898614, 1915085    
Bug Blocks:    
Attachments:
excerpt of kubelet log around time of deletion of namespace rails-pgsql-persistent-263 pod (flags: none)

Description Mike Fiedler 2021-01-04 16:13:19 UTC
Description of problem:

Starting this with the Kubelet component; please redirect as appropriate.

In a 4.7 reliability cluster, pods and namespaces are stuck in Terminating after the cluster had been running fine for ~7 days.

The cluster started on 16-December, and on 23-December at 04:51 UTC we hit the first instance of a pod stuck in Terminating during namespace deletion. After that it began happening for what appears to be all namespace deletions.

All compute nodes are affected - all have pods stuck in Terminating.

First pod/node affected:   nodejs-postgresql-persistent-2-sj2cb in namespace nodejs-postgresql-persistent-36 on node ip-10-0-142-65.us-west-2.compute.internal
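
For reference, the stuck pods and the nodes they are scheduled on can be listed with something like:

  # list every pod stuck in Terminating, with its node, across all namespaces
  oc get pods -A -o wide | grep Terminating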

I have 2 must-gathers. #1 is from 28-Dec and I am not sure what it shows. #2 is from 4-January and should contain the (REMOVE, "api"): "postgresql-1-deploy_rails-pgsql-persistent-263(3c55656f-e9b8-4b62-bcd0-1c6c2957c7be)" event from the deletion of the rails-pgsql-persistent-263 namespace at 07:38:55 UTC, along with the subsequent failures removing that pod.


Version-Release number of selected component (if applicable):  4.7.0-0.nightly-2020-12-16-162836


How reproducible: Unknown. Setting up a new environment on 4-January, but it may be 7 days before the issue occurs.


Steps to Reproduce:
1. standard 3 master/3 node environment on AWS
2. start a steady workload of project/app creation/deletion with visits to the applications. This is the system test reliability workload; more details are available upon request. A rough sketch of such a loop is shown below.
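
A rough sketch of such a create/visit/delete loop, using the rails-pgsql-persistent template as a stand-in for the real workload (the actual reliability workload differs in templates, pacing, and readiness handling):

  #!/bin/bash
  # Repeatedly create a project, deploy a templated app, visit its route, then delete the project.
  for i in $(seq 1 1000); do
    ns="reliability-${i}"
    oc new-project "${ns}"
    oc new-app --template=rails-pgsql-persistent -n "${ns}"
    sleep 300    # crude stand-in for waiting on builds/deployments to finish
    host=$(oc get route -n "${ns}" -o jsonpath='{.items[0].spec.host}')
    curl -s "http://${host}/" > /dev/null || true
    oc delete project "${ns}" --wait=false
  done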


Actual results: 

After ~7 days, project deletion started resulting in projects stuck in Terminating. All affected projects contain pods stuck in Terminating.
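
When a project hangs like this, the namespace status conditions usually show which resources are still pending deletion; a quick check against one of the stuck namespaces would be something like:

  # show why the namespace is still Terminating (remaining content, failed deletions)
  oc get namespace rails-pgsql-persistent-263 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.message}{"\n"}{end}'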


Expected results:

Cluster runs reliably beyond 7 days


Additional info:

I will add the location of the 2 must-gathers and a kubeconfig in a private comment. The cluster is available for investigation.

Comment 2 Mike Fiedler 2021-01-04 17:09:54 UTC
Created attachment 1744349 [details]
excerpt of kubelet log around time of deletion of namespace rails-pgsql-persistent-263 pod

Comment 5 Ryan Phillips 2021-01-18 19:56:42 UTC
This revert [1], which changed how the kubelet checks for sandbox deletions, went in three days ago. Could you update your cluster to at least a build containing this patch and see whether you can reproduce it again?

Does it happen in CI?

1. https://github.com/openshift/kubernetes/pull/523
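
A rough way to check whether a cluster already carries that change is to look at the commit levels in its release payload (which nightly first includes the PR isn't listed here):

  # current cluster version
  oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
  # commit levels baked into the release image, including openshift/kubernetes
  oc adm release info --commits "$(oc get clusterversion version -o jsonpath='{.status.desired.image}')" | grep -i kubernetes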

Comment 6 Elana Hashman 2021-01-18 22:49:14 UTC
From a quick look through the logs, this is a different failure mode from bug 1912880 - per the attachment, there is just a series of attempted DELETE/REMOVE events followed by "pod not found" errors:

Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.571429    1643 kubelet.go:2097] Failed to delete pod "rails-pgsql-persistent-8-build_rails-pgsql-persistent-263(e8dd01e7-36b4-4b92-9068-73d7a7f61e8a)", err: pod not found
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.592284    1643 kubelet.go:1908] SyncLoop (DELETE, "api"): "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)"
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.615702    1643 kubelet.go:1902] SyncLoop (REMOVE, "api"): "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)"
Jan 03 07:38:55 ip-10-0-214-247 hyperkube[1643]: I0103 07:38:55.627932    1643 kubelet.go:2097] Failed to delete pod "rails-pgsql-persistent-9-build_rails-pgsql-persistent-263(6ec3d5a1-fc1b-4a3f-9c48-30990c4beb5c)", err: pod not found
Jan 03 07:38:55 ip-10-0-214-247 crio[1609]: time="2021-01-03 07:38:55.775816829Z" level=info msg="Stopped container 3c3200981bcc340aba47b0ff058568fb896eec18f78ea7b374277075b7bd7cc6: rails-pgsql-persistent-263/postgresql-1-tbg5r/postgresql" id=58dea3b5-a63f-46ef-b7f6-54e540c2c5a7 n>
Jan 03 07:38:55 ip-10-0-214-247 crio[1609]: time="2021-01-03 07:38:55.789091982Z" level=info msg="Got pod network &{Name:postgresql-1-tbg5r Namespace:rails-pgsql-persistent-263 ID:82f3d0ed2434cd1b5ec63e3898c2f8e9fd973bf0075b9284dec2d793803519b2 NetNS:/var/run/netns/352e5966-1d74-4>


The attached logs cover only a period of ~1s, and the must-gathers don't include kubelet logs. To debug this I will need more logs from the kubelet (host logs for any affected hosts) to get a better idea of why the pod is not deleting; I can't determine how the pod got into this state from the context that's been provided.
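
The kubelet host logs can be pulled with something like the following (node name taken from the attached excerpt; repeat for each affected node):

  # pull the kubelet journal from a node via the API server
  oc adm node-logs ip-10-0-214-247.us-west-2.compute.internal -u kubelet > kubelet-ip-10-0-214-247.log
  # or read it directly on the node through a debug pod
  oc debug node/ip-10-0-214-247.us-west-2.compute.internal -- chroot /host journalctl -u kubelet --no-pager > kubelet-journal.log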

Comment 7 Mike Fiedler 2021-01-21 15:25:28 UTC
After 2 days running on 4.7.0-0.nightly-2021-01-19-095812 I have not seen this yet. I am going to try the scenario in bug 1912880 to see if it happens on recent builds, and will continue to monitor.

Comment 10 Mike Fiedler 2021-02-09 12:53:17 UTC
Tested on 4.7.0-0.nightly-2021-02-08-164120, which contains the fix for bug 1915085, and the problem still occurs. After running the user workload overnight, several projects and pods are stuck in Terminating.

I will attach the oc adm must-gather output and the journal of a node where a pod is stuck in Terminating.

Comment 11 Mike Fiedler 2021-02-09 13:00:02 UTC
Comment 10 may have been premature - the Terminating pods eventually terminated, and the namespaces terminated as well. Moving this back to CLOSED pending further investigation. If an additional issue is found, a new bz will be used to track it.