Bug 1912880 - Pod stuck in Terminating after scale up and scale down
Summary: Pod stuck in Terminating after scale up and scale down
Keywords:
Status: CLOSED DUPLICATE of bug 1915085
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Elana Hashman
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On: 1898612 1898614 1915085
Blocks:
 
Reported: 2021-01-05 13:57 UTC by Mike Fiedler
Modified: 2021-02-01 19:05 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-27 16:35:50 UTC
Target Upstream Version:
Embargoed:



Description Mike Fiedler 2021-01-05 13:57:54 UTC
Description of problem:

This may be the root cause of bug 1912521; both have pods stuck in Terminating.

In a system test reliability cluster that periodically scales application deployments up and down (1 replica -> 2 -> back to 1), some pods are getting stuck in Terminating on scale down and never clean up.

I have a must-gather and node journal showing this for two pods in our current cluster:

pod django-psql-persistent-1-bpngc in namespace django-psql-persistent-2
pod rails-pgsql-persistent-2-xkt9p in namespace rails-pgsql-persistent-19

Both pods were running on node ip-10-0-162-121.us-west-2.compute.internal

pod django-psql-persistent-1-bpngc is the more recent one to get stuck.  The scale down where it wedged happened at 2021-01-05 05:55:21 UTC.  The first errors I see for it after that are:

Jan 05 05:55:22 ip-10-0-162-121 crio[1609]: time="2021-01-05 05:55:22.863160622Z" level=info msg="RunSandbox: releasing container name: k8s_POD_django-psql-persistent-1-bpngc_django-psql-persistent-2_03626057-68f2-4cac-9080-8c02dbc42666_0" id=4907b207-70b0-4f48-959c-e3603824c707 n>
Jan 05 05:55:22 ip-10-0-162-121 crio[1609]: time="2021-01-05 05:55:22.863223642Z" level=info msg="RunSandbox: releasing pod sandbox name: k8s_django-psql-persistent-1-bpngc_django-psql-persistent-2_03626057-68f2-4cac-9080-8c02dbc42666_0" id=4907b207-70b0-4f48-959c-e3603824c707 nam>
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864591    1643 remote_runtime.go:116] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864652    1643 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "django-psql-persistent-1-bpngc_django-psql-persistent-2(03626057-68f2-4cac-9080-8c02dbc42666)" failed: rpc error: code = Unknown desc = error reading>
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864671    1643 kuberuntime_manager.go:755] createPodSandbox for pod "django-psql-persistent-1-bpngc_django-psql-persistent-2(03626057-68f2-4cac-9080-8c02dbc42666)" failed: rpc error: code = Unknown desc = error readin>
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864724    1643 pod_workers.go:191] Error syncing pod 03626057-68f2-4cac-9080-8c02dbc42666 ("django-psql-persistent-1-bpngc_django-psql-persistent-2(03626057-68f2-4cac-9080-8c02dbc42666)"), skipping: failed to "CreateP>



Version-Release number of selected component (if applicable): 4.7.0-0.nightly-2021-01-04-153716


How reproducible: If it is the same as bug 1912521, then it is reliably reproducible.  I'll start a new cluster for this particular bug and try it again with just scale up/down testing.


Steps to Reproduce:
1.  OOTB AWS 3 master/3 worker cluster
2.  Run a workload that periodically creates/deletes projects and deployments and scales the deployments up/down (a rough sketch of the scaling loop follows below).
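
For illustration only, here is a minimal sketch of the scaling portion of that workload, written with the Python kubernetes client rather than the actual test harness; the deployment name, namespace, and 5-minute interval are placeholder assumptions, not values from the reliability test:

import time
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() when run in-cluster
apps = client.AppsV1Api()

def set_replicas(name, namespace, replicas):
    # Patch only the scale subresource of the Deployment.
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Scale 1 -> 2 -> back to 1 in a loop, mimicking the reliability workload.
while True:
    set_replicas("example-app", "example-ns", 2)   # placeholder names
    time.sleep(300)
    set_replicas("example-app", "example-ns", 1)
    time.sleep(300)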


Actual results:

After about 12 hours, we started seeing pods get stuck in Terminating on scale down. They never clean up.
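
For anyone triaging similar reports: a pod stuck in Terminating still exists but has metadata.deletionTimestamp set, so the wedged pods can be surfaced by listing pods whose deletion timestamp is older than some cutoff. Below is a minimal sketch of that check using the Python kubernetes client (not tooling from this bug); the 30-minute threshold is an arbitrary assumption:

from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

threshold = timedelta(minutes=30)   # arbitrary cutoff for "stuck"
now = datetime.now(timezone.utc)

for pod in v1.list_pod_for_all_namespaces().items:
    ts = pod.metadata.deletion_timestamp   # set once deletion has been requested
    if ts is not None and now - ts > threshold:
        print(f"{pod.metadata.namespace}/{pod.metadata.name} terminating since {ts.isoformat()}")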



Additional info:

I will add the full must-gather location and the journal of the node where the two pods described above were running before becoming wedged.

Comment 2 Mike Fiedler 2021-01-05 14:03:23 UTC
Marking this as a blocker for long-running reliability testing for 4.7.0.

Comment 5 Neelesh Agrawal 2021-01-22 18:22:04 UTC

*** This bug has been marked as a duplicate of bug 1915085 ***

