1565211 – Terminating pods cause node upgrade playbook to hang

Bug 1565211 - Terminating pods cause node upgrade playbook to hang

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.7.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	3.7.z
Assignee:	Scott Dodson
QA Contact:	Weihua Meng
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-04-09 15:54 UTC by Bahaddin
Modified:	2021-09-09 13:39 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	During an upgrade, you may now specify the time, in seconds, that you wish to allow for pods to drain from a node. When the timeout is reached, the node and container runtime services are stopped, immediately terminating all remaining pods. For example, to configure a 10-minute timeout, set openshift_upgrade_nodes_drain_timeout=600.
Clone Of:
Environment:
Last Closed:	2018-08-20 03:08:16 UTC
Target Upstream Version:
Embargoed:

Description Bahaddin 2018-04-09 15:54:30 UTC

Description of problem:
We have some pods that we cannot delete. The pods are stuck in terminating status.

When we tried to upgrade our cluster, the node upgrade playbook got stuck at evacuation and could not complete the upgrade.

Version-Release number of the following components:
Upgrade from 3.6 to 3.7 
OCP Containerized

How reproducible:
Steps to Reproduce:
1. Have pods that are stuck in Terminating state.
2. Upgrade the node on which the pods are scheduled.

Actual results:
The node-upgrade playbook hangs at evacuation and cannot continue, because the evacuation cannot complete while the pods are stuck.

Workaround:
Evacuate the node manually. 
Comment out the section in the node upgrade role.
Replay the playbook while the terminating pods are still there.
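
The manual evacuation step in this workaround could be scripted roughly as below. This is only a sketch: the reporter did not state which command was used, so the flag set shown (--force, --delete-local-data, --ignore-daemonsets, --timeout) is an assumption based on the options of `oc adm drain`, and the helper names build_drain_command and drain_node are hypothetical.

```python
import subprocess


def build_drain_command(node, timeout_seconds=0):
    """Build the argv for draining one node.

    Assumption: these are the flags wanted for a manual evacuation.
    A timeout of 0 tells drain to wait indefinitely.
    """
    return [
        "oc", "adm", "drain", node,
        "--force",               # also evict pods not managed by a controller
        "--delete-local-data",   # allow eviction of pods using emptyDir volumes
        "--ignore-daemonsets",   # do not block on DaemonSet-managed pods
        "--timeout={}s".format(timeout_seconds),
    ]


def drain_node(node, timeout_seconds=0):
    """Run the drain and report whether it exited successfully."""
    result = subprocess.run(build_drain_command(node, timeout_seconds))
    return result.returncode == 0
```

Calling drain_node("node1.example.com", 600) would give up after ten minutes instead of waiting forever on pods stuck in Terminating; the node name here is a placeholder.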

Expected results:
Forced evacuation should not await verification, so that terminating pods won't have an effect on the upgrade process.


Additional info:

Comment 2 Seth Jennings 2018-04-09 20:31:30 UTC

Yes, unfortunately, there is no --force flag for oc adm drain like there is for oc delete, where the resource is immediately deleted if it doesn't terminate within a grace period. I am not aware of a better solution here.

Sending back to Upgrade for backport of the timeout for 3.7.

Comment 3 Scott Dodson 2018-04-10 14:43:17 UTC

Bahaddin,

https://github.com/openshift/openshift-ansible/pull/5080 implemented this for 3.9

We'll try to backport this to 3.7 in the future, but a backport PR or support case would raise the priority.

--
Scott

Comment 4 Bahaddin 2018-04-18 08:44:21 UTC

We already upgraded our environments, therefore it is not critical for us anymore. It might help other customers who could face the same issue, though.

Comment 6 Scott Dodson 2018-06-07 20:31:24 UTC

Node drain timeouts were added in openshift-ansible-3.7.49-1 via this PR:
https://github.com/openshift/openshift-ansible/pull/8428

Moving to ON_QA for QE to verify, but if you can confirm it, that would be helpful too.

Comment 7 Scott Dodson 2018-06-07 20:32:29 UTC

openshift-ansible-3.7.52-1.git.0.3fddee4.el7 is the latest version in the 3.7 channel as of yesterday, so it should be available to customers.

Comment 8 Weihua Meng 2018-06-09 09:09:58 UTC

Cannot reproduce.

Could you give detailed info for step 1?
Steps to Reproduce:
1. Have pods that are stuck in Terminating state.
2. Upgrade the node where the pods are scheduled to

Thanks.

Comment 9 Bahaddin 2018-06-11 08:58:24 UTC

Hello,

I cannot reproduce step 1 myself either. Step 1 is a separate problem we have, for which we opened another issue.

We do not currently know why the pods get stuck.

Thanks.

Comment 10 Scott Dodson 2018-06-11 12:25:39 UTC

Weihua,

The reasons for non-terminating pods are quite varied.

Let's just test that setting openshift_upgrade_nodes_drain_timeout=10 causes the node drain to wait only 10 seconds and then move on with the upgrade process.
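
The behavior to verify can be modeled as a bounded wait: keep checking for remaining pods, and once the timeout elapses, stop waiting and carry on. The sketch below is not the openshift-ansible implementation; wait_for_drain and its parameters are hypothetical names used only to illustrate the expected semantics, with the clock and sleep functions injectable so the timing can be checked without real delays.

```python
import time


def wait_for_drain(list_remaining_pods, timeout_seconds, poll_interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until no pods remain or the timeout expires.

    Returns the pods still present when waiting stopped
    (an empty list means the node drained cleanly).
    """
    deadline = clock() + timeout_seconds
    while True:
        remaining = list_remaining_pods()
        if not remaining:
            return []
        if clock() >= deadline:
            # Timeout reached: stop waiting so the upgrade can proceed.
            return remaining
        sleep(poll_interval)
```

With a pod that never terminates and timeout_seconds=10, the function returns that pod after roughly ten seconds rather than blocking indefinitely, which mirrors what the test above is meant to confirm.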
