Bug 1912880 - Pod stuck in Terminating after scale up and scale down
Summary: Pod stuck in Terminating after scale up and scale down
Keywords:
Status: CLOSED DUPLICATE of bug 1915085
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Elana Hashman
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On: 1898612 1898614 1915085
Blocks:
 
Reported: 2021-01-05 13:57 UTC by Mike Fiedler
Modified: 2021-02-01 19:05 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-27 16:35:50 UTC
Target Upstream Version:
Embargoed:



Description Mike Fiedler 2021-01-05 13:57:54 UTC
Description of problem:

This may be the root cause of bug 1912521; both have pods stuck in Terminating.

In a system test reliability cluster that periodically scales application deployments up and down (1 replica -> 2 -> back to 1), some pods are getting stuck in Terminating on scale down and never clean up.

I have a must-gather and node journal showing this for two pods in our current cluster:

pod django-psql-persistent-1-bpngc in namespace django-psql-persistent-2
pod rails-pgsql-persistent-2-xkt9p in namespace rails-pgsql-persistent-19

Both pods were running on node ip-10-0-162-121.us-west-2.compute.internal

pod django-psql-persistent-1-bpngc is the more recent one to get stuck.  The scale down where it wedged happened at 2021-01-05 05:55:21 UTC.  The first errors I see for it after that are:

Jan 05 05:55:22 ip-10-0-162-121 crio[1609]: time="2021-01-05 05:55:22.863160622Z" level=info msg="RunSandbox: releasing container name: k8s_POD_django-psql-persistent-1-bpngc_django-psql-persistent-2_03626057-68f2-4cac-9080-8c02dbc42666_0" id=4907b207-70b0-4f48-959c-e3603824c707 n>
Jan 05 05:55:22 ip-10-0-162-121 crio[1609]: time="2021-01-05 05:55:22.863223642Z" level=info msg="RunSandbox: releasing pod sandbox name: k8s_django-psql-persistent-1-bpngc_django-psql-persistent-2_03626057-68f2-4cac-9080-8c02dbc42666_0" id=4907b207-70b0-4f48-959c-e3603824c707 nam>
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864591    1643 remote_runtime.go:116] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864652    1643 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "django-psql-persistent-1-bpngc_django-psql-persistent-2(03626057-68f2-4cac-9080-8c02dbc42666)" failed: rpc error: code = Unknown desc = error reading>
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864671    1643 kuberuntime_manager.go:755] createPodSandbox for pod "django-psql-persistent-1-bpngc_django-psql-persistent-2(03626057-68f2-4cac-9080-8c02dbc42666)" failed: rpc error: code = Unknown desc = error readin>
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864724    1643 pod_workers.go:191] Error syncing pod 03626057-68f2-4cac-9080-8c02dbc42666 ("django-psql-persistent-1-bpngc_django-psql-persistent-2(03626057-68f2-4cac-9080-8c02dbc42666)"), skipping: failed to "CreateP>



Version-Release number of selected component (if applicable): 4.7.0-0.nightly-2021-01-04-153716


How reproducible: If it is the same as bug 1912521, then it is reliably reproducible.  I'll start a new cluster for this particular bug and try it again with just scale up/down testing.


Steps to Reproduce:
1.  OOTB AWS 3 master/3 worker cluster
2.  Run a workload that periodically creates/deletes projects and deployments and scales the deployments up/down (a rough sketch of the scaling loop follows below).
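
For illustration only, here is a minimal sketch of the scaling portion of that workload, written with the Python kubernetes client rather than the actual test harness; the deployment name, namespace, and 5-minute interval are placeholder assumptions, not values from the reliability test:

import time
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() when run in-cluster
apps = client.AppsV1Api()

def set_replicas(name, namespace, replicas):
    # Patch only the scale subresource of the Deployment.
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Scale 1 -> 2 -> back to 1 in a loop, mimicking the reliability workload.
while True:
    set_replicas("example-app", "example-ns", 2)   # placeholder names
    time.sleep(300)
    set_replicas("example-app", "example-ns", 1)
    time.sleep(300)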


Actual results:

After about 12 hours, we started seeing pods get stuck in Terminating on scale down. They never clean up.
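
For anyone triaging similar reports: a pod stuck in Terminating still exists but has metadata.deletionTimestamp set, so the wedged pods can be surfaced by listing pods whose deletion timestamp is older than some cutoff. Below is a minimal sketch of that check using the Python kubernetes client (not tooling from this bug); the 30-minute threshold is an arbitrary assumption:

from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

threshold = timedelta(minutes=30)   # arbitrary cutoff for "stuck"
now = datetime.now(timezone.utc)

for pod in v1.list_pod_for_all_namespaces().items:
    ts = pod.metadata.deletion_timestamp   # set once deletion has been requested
    if ts is not None and now - ts > threshold:
        print(f"{pod.metadata.namespace}/{pod.metadata.name} terminating since {ts.isoformat()}")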



Additional info:

I will add the full must-gather location and the journal of the node where the two pods described above were running before becoming wedged.

Comment 2 Mike Fiedler 2021-01-05 14:03:23 UTC
Marking this as a blocker for long-running reliability testing for 4.7.0.

Comment 5 Neelesh Agrawal 2021-01-22 18:22:04 UTC

*** This bug has been marked as a duplicate of bug 1915085 ***

