Description of problem:

This may be the root cause of bug 1912521; both have pods stuck in Terminating.

In a system test reliability cluster that periodically scales application deployments from 1 replica up to 2 and back to 1, some pods are getting stuck in Terminating on scale down and never clean up. I have a must-gather and a node journal showing this for two pods in our current cluster:

pod django-psql-persistent-1-bpngc in namespace django-psql-persistent-2
pod rails-pgsql-persistent-2-xkt9p in namespace rails-pgsql-persistent-19

Both pods were running on node ip-10-0-162-121.us-west-2.compute.internal.

pod django-psql-persistent-1-bpngc is the more recent one to get stuck. The scale down where it wedged was at 2021-01-05 05:55:21 UTC. The first errors I see for it after that are:

Jan 05 05:55:22 ip-10-0-162-121 crio[1609]: time="2021-01-05 05:55:22.863160622Z" level=info msg="RunSandbox: releasing container name: k8s_POD_django-psql-persistent-1-bpngc_django-psql-persistent-2_03626057-68f2-4cac-9080-8c02dbc42666_0" id=4907b207-70b0-4f48-959c-e3603824c707 n>
Jan 05 05:55:22 ip-10-0-162-121 crio[1609]: time="2021-01-05 05:55:22.863223642Z" level=info msg="RunSandbox: releasing pod sandbox name: k8s_django-psql-persistent-1-bpngc_django-psql-persistent-2_03626057-68f2-4cac-9080-8c02dbc42666_0" id=4907b207-70b0-4f48-959c-e3603824c707 nam>
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864591 1643 remote_runtime.go:116] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864652 1643 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "django-psql-persistent-1-bpngc_django-psql-persistent-2(03626057-68f2-4cac-9080-8c02dbc42666)" failed: rpc error: code = Unknown desc = error reading>
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864671 1643 kuberuntime_manager.go:755] createPodSandbox for pod "django-psql-persistent-1-bpngc_django-psql-persistent-2(03626057-68f2-4cac-9080-8c02dbc42666)" failed: rpc error: code = Unknown desc = error readin>
Jan 05 05:55:22 ip-10-0-162-121 hyperkube[1643]: E0105 05:55:22.864724 1643 pod_workers.go:191] Error syncing pod 03626057-68f2-4cac-9080-8c02dbc42666 ("django-psql-persistent-1-bpngc_django-psql-persistent-2(03626057-68f2-4cac-9080-8c02dbc42666)"), skipping: failed to "CreateP>

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-04-153716

How reproducible:
If it is the same as bug 1912521, then it is reliably reproducible. I'll start a new cluster for this particular bug and try it again with just scale up/down testing.

Steps to Reproduce:
1. OOTB AWS 3 master/3 worker cluster
2. Run a workload that periodically creates/deletes projects and deployments, and scales the deployments up/down.

Actual results:
After about 12 hours, pods started getting stuck in Terminating on scale down. They never clean up.

Additional info:
I will add the full must-gather location and the journal of a node where the two pods described above were running before becoming wedged.
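For reference, the scale up/down portion of the workload can be sketched as a small shell loop. This is a minimal sketch only: the deploymentconfig name, namespace, and cycle count below are assumptions for illustration (the real test also creates and deletes whole projects, and ran for roughly 12 hours before pods wedged). With DRYRUN=1 (the default here) it only prints the oc commands it would run.

```shell
#!/bin/sh
# Sketch of the scale up/down reproduction loop. Namespace and
# deploymentconfig names are taken from the pods in this report but are
# otherwise assumptions; adjust for your cluster.

scale_cycle() {
    ns="$1"      # namespace, e.g. django-psql-persistent-2
    dc="$2"      # deploymentconfig name, e.g. django-psql-persistent
    n="$3"       # number of up/down cycles

    # DRYRUN=1 prints commands instead of executing them
    if [ "${DRYRUN:-1}" = "1" ]; then
        oc="echo oc"
    else
        oc="oc"
    fi

    i=0
    while [ "$i" -lt "$n" ]; do
        # scale 1 -> 2
        $oc -n "$ns" scale dc "$dc" --replicas=2
        # scale 2 -> 1: the step where pods wedged in Terminating
        $oc -n "$ns" scale dc "$dc" --replicas=1
        i=$((i + 1))
    done
}

# Dry-run example: print one cycle's worth of commands
scale_cycle django-psql-persistent-2 django-psql-persistent 1
```

After each scale down, `oc get pods -n <namespace>` should show whether any pod is lingering in Terminating.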
Marking this as a blocker for long-running reliability testing for 4.7.0.
*** This bug has been marked as a duplicate of bug 1915085 ***