1470373 – Pods hung in terminating state during cluster upgrade

Summary: Pods hung in terminating state during cluster upgrade

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.6.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Jordan Liggitt
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:	1460729
Blocks:

Reported:	2017-07-12 20:35 UTC by Justin Pierce
Modified:	2017-08-16 20:47 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-16 20:43:15 UTC
Target Upstream Version:
Embargoed:

Description Justin Pierce 2017-07-12 20:35:49 UTC

Description of problem:
During an upgrade from 3.5 to 3.6.126.1, the oadm drain command hung while several pods were stuck in the Terminating state.

Version-Release number of the following components:
OCP 3.6.126.1

How reproducible:
Intermittent. Not all pods hung in the terminating state during the upgrade.

Steps to Reproduce:
1. Run a large-scale cluster upgrade with openshift-ansible.

Actual results:
Pods hung in the Terminating state, and openshift-ansible hung indefinitely until its SSH sessions timed out.

http://file.rdu.redhat.com/~jupierce/share/hung.pod.master-controllers.log

The stuck pods could be cleared by removing their finalizers:
oc patch pod <pod-name> --type=json --patch='[ { "op":"remove", "path": "/metadata/finalizers" }]'
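
When many pods are stuck, the workaround above can be scripted. The following is a minimal sketch, not a tested procedure: the helper names terminating_pods and print_finalizer_patches are hypothetical, and it assumes the default table output of oc get pods --all-namespaces, with STATUS in the fourth column. It only prints the patch commands, because removing finalizers skips whatever cleanup they were guarding and each command should be reviewed first.

```shell
#!/bin/sh
# Print "namespace name" for every pod whose STATUS column reads Terminating.
# Expects the table output of `oc get pods --all-namespaces` on stdin
# (assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE).
terminating_pods() {
  awk 'NR > 1 && $4 == "Terminating" { print $1, $2 }'
}

# Read "namespace name" pairs on stdin and print (do not run) the
# finalizer-removal command from this report for each pod.
print_finalizer_patches() {
  while read -r ns name; do
    printf "oc patch pod %s -n %s --type=json --patch='%s'\n" \
      "$name" "$ns" '[ { "op":"remove", "path": "/metadata/finalizers" }]'
  done
}

# Usage on a live cluster (review the printed commands before running them):
#   oc get pods --all-namespaces | terminating_pods | print_finalizer_patches
```

Printing rather than executing keeps the step reviewable; the output can be piped to sh once the list of pods looks right.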

This may be fixed by the following pull requests, but I am looking for confirmation before attempting another cluster upgrade:
https://github.com/openshift/origin/pull/15112
https://github.com/openshift/origin/pull/14988
https://github.com/openshift/origin/pull/14918

Comment 2 Jordan Liggitt 2017-07-13 14:40:24 UTC

Yes, https://github.com/openshift/origin/pull/15112 resolves this issue.

Comment 4 XiaochuanWang 2017-07-14 09:14:31 UTC

PR 15112 fixed bug 1462067. The steps there were: stop the node, delete a pod so that it stays in Terminating, then try to delete it with oc or the web console.
On openshift/oc v3.6.133, oc could delete it successfully using [1], but the console could not delete it and even left it stuck, as detailed in [2].

On v3.6.144, the stuck Terminating pod could be deleted both with [1] and from the console. I am not sure whether that is enough to verify this bug as well.

[1] oc delete pod mypod --grace-period=0 --force
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c17

Comment 5 Anping Li 2017-07-14 13:16:32 UTC

Verified; the test passed. The steps were:

1. Set up an environment with OCP v3.6.133.
2. Create a pod stuck in the Terminating state, as in https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c12
3. Confirm that the drain hang reproduces:
   oadm drain nodename --force --delete-local-data --ignore-daemonsets
4. Upgrade to v3.6.144 with the upgrade playbook.

Result: the upgrade succeeded, and the node drain did not hang.
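
For anyone repeating this check, the drain from step 3 can be wrapped in a time limit so that a hang is reported instead of blocking the session. This is a minimal sketch, assuming the GNU coreutils timeout command is available on the host; run_with_limit is a hypothetical helper name and the 600-second limit in the usage line is arbitrary.

```shell
#!/bin/sh
# Run a command under a time limit and report how it ended.
# Relies on coreutils `timeout`, which exits with status 124 when the
# limit expires before the command finishes.
run_with_limit() {
  limit="$1"; shift
  if timeout "$limit" "$@"; then
    echo "completed"
  else
    status=$?
    if [ "$status" -eq 124 ]; then
      echo "timed out after $limit"
    else
      echo "failed with exit status $status"
    fi
    return "$status"
  fi
}

# Usage (nodename is a placeholder, as in step 3):
#   run_with_limit 600 oadm drain nodename --force --delete-local-data --ignore-daemonsets
```

A "timed out" result on v3.6.133 and a "completed" result after the upgrade would match the behavior described in the steps above.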
