Description of problem:
We have some pods that we cannot delete; they are stuck in Terminating status. When we tried to upgrade our cluster, the node upgrade playbook got stuck at evacuation and could not complete the upgrade.

Version-Release number of the following components:
Upgrade from 3.6 to 3.7, OCP containerized

How reproducible:

Steps to Reproduce:
1. Have pods that are stuck in the Terminating state.
2. Upgrade the node where those pods are scheduled.

Actual results:
The node upgrade playbook hangs at evacuation and cannot continue, since the evacuation cannot complete because of the stuck pods.

Workaround (see the sketch below):
1. Evacuate the node manually.
2. Comment out the evacuation section in the node upgrade role.
3. Replay the playbook while the Terminating pods are still there.

Expected results:
A forced evacuation should not wait for verification, so that Terminating pods have no effect on the upgrade process.

Additional info:
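For illustration, a minimal sketch of the manual workaround, assuming OCP 3.6/3.7; the node and inventory names in angle brackets are placeholders, and the playbook path shown is the 3.7 BYO layout, which varies by release:

  # 1) Drain the node by hand. Note that drain's --force only covers pods
  #    not managed by a controller; it does not force-delete pods that are
  #    already stuck in Terminating.
  oc adm drain <node> --ignore-daemonsets --delete-local-data --force

  # 2) With the evacuation task commented out in the node upgrade role,
  #    replay the node upgrade playbook:
  ansible-playbook -i <inventory> \
    playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml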
Yes, unfortunately, oc adm drain has no equivalent of the --force behavior of oc delete, where the resource is deleted immediately if it does not terminate within its grace period. There is no better solution here of which I am aware. Sending back to Upgrade for a backport of the timeout to 3.7.
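For contrast, a hedged example of the oc delete behavior referenced above; the pod and namespace names are placeholders:

  # Force-delete a pod that never finishes terminating: skip waiting for
  # graceful termination and remove the API object immediately.
  oc delete pod <pod> -n <namespace> --grace-period=0 --force

  # Per the comment above, oc adm drain offers no equivalent "give up and
  # delete anyway" option, which is why the drain step can hang on stuck pods.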
Bahaddin, https://github.com/openshift/openshift-ansible/pull/5080 implemented this for 3.9. We'll try to backport it to 3.7 in the future, but a backport PR or a support case would raise the priority. -- Scott
We have already upgraded our environments, so this is no longer critical for us. It might help other customers who face the same issue, though.
Node drain timeouts were added in openshift-ansible-3.7.49-1 via https://github.com/openshift/openshift-ansible/pull/8428. Moving to ON_QA for QE to verify, but if you can confirm it as well, that would be helpful.
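For reference, a sketch of how a customer could opt into the new timeout; the variable name comes from the PR above, while the value, inventory name, and playbook path are illustrative:

  # Cap how long each node drain may take (value in seconds). The variable
  # can also be set persistently in the inventory under [OSEv3:vars].
  ansible-playbook -i <inventory> \
    playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml \
    -e openshift_upgrade_nodes_drain_timeout=300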
openshift-ansible-3.7.52-1.git.0.3fddee4.el7 is the latest version in the 3.7 channel as of yesterday, so it should be available to customers.
Cannot reproduce. Could you give more detailed info for step 1? Steps to Reproduce: 1. Have pods that are stuck in the Terminating state. 2. Upgrade the node where the pods are scheduled. Thanks.
Hello, I cannot reproduce step 1 myself either. Step 1 is another problem we have, for which we opened a separate issue. The reason the pods get stuck is currently unknown to us. Thanks.
Weihua, the reasons for non-terminating pods are quite varied. Let's just test that setting openshift_upgrade_nodes_drain_timeout=10 causes the node drain to wait only 10 seconds and then move on with the upgrade process; a sketch of such a test follows.
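One way to stage that test is to pin a pod with a finalizer before deleting it, so it stays in Terminating; this is a generic Kubernetes technique rather than something from this bug, and all names in angle brackets are placeholders:

  # Add a finalizer so the pod object cannot be removed, then delete it;
  # the pod will sit in Terminating until the finalizer is cleared.
  oc patch pod <pod> -n <namespace> \
    -p '{"metadata":{"finalizers":["example.com/block-deletion"]}}'
  oc delete pod <pod> -n <namespace>

  # Run the node upgrade with the 10-second timeout and confirm the drain
  # task gives up after ~10 seconds instead of hanging:
  ansible-playbook -i <inventory> \
    playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml \
    -e openshift_upgrade_nodes_drain_timeout=10

  # Cleanup: clear the finalizer so the stuck pod can actually go away.
  oc patch pod <pod> -n <namespace> -p '{"metadata":{"finalizers":null}}'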