Bug 1565211 - Terminating pods cause node upgrade playbook to hang
Summary: Terminating pods cause node upgrade playbook to hang
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 3.7.z
Assignee: Scott Dodson
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-04-09 15:54 UTC by Bahaddin
Modified: 2018-08-20 03:08 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
During an upgrade, you may now specify the time, in seconds, to allow for pods to drain from a node. When the timeout is reached, the node and container runtime services are stopped, immediately terminating all remaining pods. For example, to configure a 10-minute timeout, set openshift_upgrade_nodes_drain_timeout=600.
Clone Of:
Environment:
Last Closed: 2018-08-20 03:08:16 UTC
Target Upstream Version:


Description Bahaddin 2018-04-09 15:54:30 UTC
Description of problem:
We have some pods that we cannot delete. The pods are stuck in terminating status.

When we tried to upgrade our cluster, the node upgrade playbook got stuck at evacuation and could not complete the upgrade.

Version-Release number of the following components:
Upgrade from 3.6 to 3.7 
OCP Containerized

How reproducible:
Steps to Reproduce:
1. Have pods that are stuck in Terminating state.
2. Upgrade the node on which those pods are scheduled.

Actual results:
The node upgrade playbook hangs at evacuation and cannot continue, because the evacuation cannot complete while the pods are stuck.

Workaround:
Evacuate the node manually.
Comment out the evacuation section in the node upgrade role.
Re-run the playbook while the terminating pods are still there.

Expected results:
Forced evacuation should not await verification, so that terminating pods won't have an effect on the upgrade process.


Additional info:

Comment 2 Seth Jennings 2018-04-09 20:31:30 UTC
Yes, unfortunately, there is no --force flag for oadm drain like there is for oc delete, where the resource is immediately deleted if it doesn't terminate within a grace period. I am not aware of a better solution here.

Sending back to Upgrade for backport of the timeout for 3.7.

Comment 3 Scott Dodson 2018-04-10 14:43:17 UTC
Bahaddin,

https://github.com/openshift/openshift-ansible/pull/5080 implemented this for 3.9

We'll try to backport this to 3.7 in the future but a backport PR or support case would raise priority.

--
Scott

Comment 4 Bahaddin 2018-04-18 08:44:21 UTC
We already upgraded our environments, therefore it is not critical for us anymore. It might help other customers who could face the same issue, though.

Comment 6 Scott Dodson 2018-06-07 20:31:24 UTC
Node drain timeouts were added in openshift-ansible-3.7.49-1 via this PR
https://github.com/openshift/openshift-ansible/pull/8428

Moving to ON_QA for QE to verify but if you can confirm it that'd be helpful too.

Comment 7 Scott Dodson 2018-06-07 20:32:29 UTC
openshift-ansible-3.7.52-1.git.0.3fddee4.el7 is the latest version in the 3.7 channel as of yesterday so should be available to customers

Comment 8 Weihua Meng 2018-06-09 09:09:58 UTC
Cannot reproduce.

Could you give detailed info for step 1?
Steps to Reproduce:
1. Have pods that are stuck in Terminating state.
2. Upgrade the node on which those pods are scheduled.

thanks.

Comment 9 Bahaddin 2018-06-11 08:58:24 UTC
Hello,

I cannot reproduce step 1 myself either. Step 1 is a separate problem we have, for which we opened another issue.

The reason why the pods get stuck is currently unknown to us.

Thanks.

Comment 10 Scott Dodson 2018-06-11 12:25:39 UTC
Weihua,

The reasons for non-terminating pods are quite varied.

Let's just test that setting openshift_upgrade_nodes_drain_timeout=10 causes the node drain to wait only 10 seconds and then move on with the upgrade process.
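
For illustration only, the behaviour under test (wait a bounded time for the drain, then move on) can be sketched in Python. This is not the openshift-ansible implementation. The helper name drain_with_timeout and the sleeping child process that stands in for a hung drain are assumptions made for this example. The real playbook is controlled by openshift_upgrade_nodes_drain_timeout.

```python
import subprocess
import sys
import time


def drain_with_timeout(command, timeout_seconds):
    """Run a drain-like command, giving up after timeout_seconds.

    Hypothetical helper for illustration. Returns True if the command
    finished in time, False if it was killed because the timeout expired.
    """
    try:
        subprocess.run(command, timeout=timeout_seconds, check=False)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the caller
        # is free to continue with the next step.
        return False


# Stand-in for a drain that never returns because a pod is stuck in
# Terminating: a child process that sleeps for 60 seconds.
stuck_drain = [sys.executable, "-c", "import time; time.sleep(60)"]

start = time.monotonic()
finished = drain_with_timeout(stuck_drain, timeout_seconds=1)
elapsed = time.monotonic() - start

print("drain finished:", finished)
print("waited about %.1f seconds" % elapsed)
```

A verification along these lines would check that the wait is bounded by the configured timeout rather than by how long the pods stay in Terminating.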

