Description of problem:
During an upgrade from 3.5 to 3.6.126.1, the oadm drain command hung with several pods stuck in the Terminating state.

Version-Release number of the following components:
OCP 3.6.126.1

How reproducible:
Intermittent. Not all pods hung in the Terminating state during the upgrade.

Steps to Reproduce:
1. Large-scale cluster upgrade with openshift-ansible.

Actual results:
Pods hung in the Terminating state and openshift-ansible hung indefinitely until the ssh sessions broke due to timeout.
http://file.rdu.redhat.com/~jupierce/share/hung.pod.master-controllers.log

The condition could be fixed with:
oc patch pod <pod-name> --type=json --patch='[ { "op":"remove", "path": "/metadata/finalizers" }]'

This may be fixed by the following pull requests, but confirmation is requested before another cluster upgrade is attempted:
https://github.com/openshift/origin/pull/15112
https://github.com/openshift/origin/pull/14988
https://github.com/openshift/origin/pull/14918
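For clusters where many pods are affected, the same finalizer-removal workaround can be applied in a loop to every pod that is stuck terminating (deletionTimestamp set). This is only a sketch based on the oc patch command above, not part of the original report; the namespace "myproject" is an example value and jq is assumed to be available.

# List pods in the namespace whose deletionTimestamp is set (i.e. stuck in Terminating)
# and strip their finalizers so the API server can finish the deletion.
for pod in $(oc get pods -n myproject -o json \
    | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'); do
  oc patch pod "$pod" -n myproject --type=json \
    --patch='[ { "op":"remove", "path": "/metadata/finalizers" }]'
done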
Yes, https://github.com/openshift/origin/pull/15112 resolves this issue.
PR 15112 fixed bug 1462067. The reproduction step is: stop the node, then delete a pod running on it; the pod stays in Terminating, and we then try to delete it via oc or the web console. On openshift/oc v3.6.133, oc could delete it successfully with [1], but the console could not delete it and even left it stuck, as detailed in [2]. On v3.6.144, the stuck Terminating pod could be deleted both with [1] and from the console; not sure if this bug can be verified as well.
[1] oc delete pod mypod --grace-period=0 --force
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c17
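For reference, a rough reproduction of the stuck-Terminating state described above. This is a sketch under assumptions, not quoted from comment 12; the node service name atomic-openshift-node and the pod name mypod are example values.

# On the node hosting the pod, stop the node service so the kubelet can no
# longer confirm the deletion (assumed service name for OCP 3.x nodes):
systemctl stop atomic-openshift-node

# From a master, delete the pod; it remains in Terminating:
oc delete pod mypod
oc get pod mypod

# Force removal of the pod object without waiting for kubelet confirmation ([1] above):
oc delete pod mypod --grace-period=0 --force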
Verified and passed. The steps are below:
1. Set up an environment with OCP v3.6.133.
2. Create a pod stuck in the Terminating status as in https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c12.
3. Reproduce the drain node hang:
   oadm drain nodename --force --delete-local-data --ignore-daemonsets
4. Upgrade to v3.6.144 with the upgrade playbook (a rough invocation is sketched after this comment).
Result: the upgrade succeeded and no drain node hang appeared.
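For completeness, the drain and upgrade invocations in the verification roughly look like the following. The playbook path is an assumption based on the openshift-ansible layout for the 3.6 stream, and /path/to/inventory is an example; neither is quoted from this report.

# Drain the node to check for the hang (step 3 above):
oadm drain nodename --force --delete-local-data --ignore-daemonsets

# Run the 3.6 upgrade playbook from the openshift-ansible checkout (assumed path):
ansible-playbook -i /path/to/inventory \
  playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml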