Bug 1470373

Summary: Pods hung in terminating state during cluster upgrade
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Installer
Assignee: Jordan Liggitt <jliggitt>
Status: CLOSED CURRENTRELEASE
QA Contact: Johnny Liu <jialiu>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: anli, aos-bugs, erich, jliggitt, jokerman, mmccomas, sdodson, xiaocwan
Target Milestone: ---
Keywords: DeliveryBlocker
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-16 20:43:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1460729
Bug Blocks:

Description Justin Pierce 2017-07-12 20:35:49 UTC
Description of problem:
During an upgrade from 3.5 to 3.6.126.1, the oadm drain command hung with several pods stuck in the terminating state. 

Version-Release number of the following components:
OCP 3.6.126.1

How reproducible:
Intermittent. Not all pods hung in the terminating state during the upgrade.

Steps to Reproduce:
1. Large scale cluster upgrade with openshift-ansible.

Actual results:
Pods hung in the Terminating state, and openshift-ansible hung indefinitely until the SSH sessions timed out.

http://file.rdu.redhat.com/~jupierce/share/hung.pod.master-controllers.log

The condition could be fixed with:
oc patch pod <pod-name> --type=json --patch='[ { "op":"remove", "path": "/metadata/finalizers" }]'
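
To apply this workaround to many pods at once, the affected pods first have to be identified. The following is a minimal sketch, not part of the original report: it assumes the pod list has been exported with `oc get pods -o json` and treats a pod as stuck when it has a `deletionTimestamp` and still carries finalizers. The function names `stuck_pods` and `patch_commands` are hypothetical. It only builds the command lines so they can be reviewed before anything is run.

```python
import shlex

# JSON patch taken verbatim from the workaround in this report.
REMOVE_FINALIZERS_PATCH = '[ { "op":"remove", "path": "/metadata/finalizers" }]'


def stuck_pods(pod_list):
    """Return (namespace, name) for pods marked for deletion that still have finalizers.

    pod_list is the parsed output of `oc get pods -o json` (a dict with "items").
    """
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # Assumption: deletionTimestamp set plus non-empty finalizers means "stuck".
        if meta.get("deletionTimestamp") and meta.get("finalizers"):
            found.append((meta.get("namespace", ""), meta["name"]))
    return found


def patch_commands(pod_list):
    """Build one `oc patch` command line per stuck pod, shell-quoted for review."""
    commands = []
    for namespace, name in stuck_pods(pod_list):
        args = ["oc", "patch", "pod", name]
        if namespace:
            args += ["-n", namespace]
        args += ["--type=json", "--patch=" + REMOVE_FINALIZERS_PATCH]
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands
```

A possible use is to load the JSON with `json.load`, pass it to `patch_commands`, and print each resulting line. Removing finalizers bypasses whatever cleanup they were guarding, so the list should be checked by hand first.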

This may be fixed by the following pull requests, but I am looking for confirmation before another cluster upgrade is attempted:
https://github.com/openshift/origin/pull/15112
https://github.com/openshift/origin/pull/14988
https://github.com/openshift/origin/pull/14918

Comment 2 Jordan Liggitt 2017-07-13 14:40:24 UTC
Yes, https://github.com/openshift/origin/pull/15112 resolves this issue.

Comment 4 XiaochuanWang 2017-07-14 09:14:31 UTC
PR 15112 fixed bug 1462067. The scenario there: stop the node and delete a pod so that it stays in Terminating, then try to delete it with oc or the web console.
On openshift/oc v3.6.133, oc could delete it successfully with [1], but the console could not delete it and even left it stuck, as detailed in [2].

On v3.6.144, the stuck Terminating pod could be deleted both with [1] and from the console. I am not sure whether this bug can be considered verified as well.

[1] oc delete pod mypod --grace-period=0 --force
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c17

Comment 5 Anping Li 2017-07-14 13:16:32 UTC
Verified; passed. The steps are below:

1. Set up an OCP v3.6.133 environment.
2. Create a pod stuck in the Terminating state as described in https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c12
3. Reproduce the node drain hang:
   oadm drain nodename --force --delete-local-data --ignore-daemonsets
4. Upgrade to v3.6.144 using the upgrade playbook.

Result: the upgrade succeeded. No node drain hang occurred.
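
When reproducing the drain hang from step 3, it can help to bound the command with a timeout so that a hang is reported instead of blocking the session. The sketch below is an illustration only and not part of the original verification. It assumes `oadm` is on the PATH, the helper names `drain_argv` and `run_with_timeout` are hypothetical, and the drain flags are copied from step 3.

```python
import subprocess


def drain_argv(node_name):
    """Build the drain command from step 3 for the given node."""
    return ["oadm", "drain", node_name,
            "--force", "--delete-local-data", "--ignore-daemonsets"]


def run_with_timeout(argv, timeout_seconds):
    """Run argv and return its exit code, or None if it did not finish in time.

    subprocess.run kills the child process when the timeout expires,
    so a hung drain does not linger after this function returns.
    """
    try:
        return subprocess.run(argv, timeout=timeout_seconds).returncode
    except subprocess.TimeoutExpired:
        return None
```

For example, `run_with_timeout(drain_argv("nodename"), 600)` returning `None` would indicate that the drain was still blocked after ten minutes. The 600-second limit is an arbitrary choice for illustration.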