Bug 1470373 - Pods hung in terminating state during cluster upgrade
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Jordan Liggitt
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On: 1460729
Blocks:
Reported: 2017-07-12 20:35 UTC by Justin Pierce
Modified: 2017-08-16 20:47 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-16 20:43:15 UTC
Target Upstream Version:
Embargoed:



Description Justin Pierce 2017-07-12 20:35:49 UTC
Description of problem:
During an upgrade from 3.5 to 3.6.126.1, the oadm drain command hung with several pods stuck in the terminating state. 

Version-Release number of the following components:
OCP 3.6.126.1

How reproducible:
Intermittent. Not all pods hung in the terminating state during the upgrade.

Steps to Reproduce:
1. Large scale cluster upgrade with openshift-ansible.

Actual results:
Pods hung in the Terminating state, and openshift-ansible hung indefinitely until the SSH sessions timed out.

http://file.rdu.redhat.com/~jupierce/share/hung.pod.master-controllers.log

The condition could be fixed with:
oc patch pod <pod-name> --type=json --patch='[ { "op":"remove", "path": "/metadata/finalizers" }]'
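
To apply the workaround above to several pods, the command can be generated per pod and reviewed before it is run. The following is a minimal sketch, not part of oc or openshift-ansible: print_finalizer_patches is a hypothetical helper that only prints one oc patch command per pod name read from stdin, and the pod names in the example are placeholders. Removing finalizers skips whatever cleanup they were waiting for, so the printed commands should be checked before anything is executed.

```shell
#!/bin/sh
# Sketch only: prints the workaround command for each pod name on stdin.
# Nothing is executed against the cluster.
# print_finalizer_patches is a hypothetical helper, not an oc subcommand.
print_finalizer_patches() {
  # Same JSON patch as in the workaround: drop /metadata/finalizers.
  patch='[ { "op":"remove", "path": "/metadata/finalizers" }]'
  while IFS= read -r pod; do
    # Skip blank lines so empty input produces no commands.
    [ -n "$pod" ] || continue
    printf "oc patch pod %s --type=json --patch='%s'\n" "$pod" "$patch"
  done
}

# Example (pod names are placeholders):
#   printf 'pod-a\npod-b\n' | print_finalizer_patches
```

The printed commands act on the current project; pods in other projects would need a namespace option added.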

This may already be fixed by the following pull requests, but I am looking for confirmation before another cluster upgrade is attempted:
https://github.com/openshift/origin/pull/15112
https://github.com/openshift/origin/pull/14988
https://github.com/openshift/origin/pull/14918

Comment 2 Jordan Liggitt 2017-07-13 14:40:24 UTC
Yes, https://github.com/openshift/origin/pull/15112 resolves this issue.

Comment 4 XiaochuanWang 2017-07-14 09:14:31 UTC
PR 15112 fixed bug 1462067. The scenario there: stop the node and delete a pod so that the pod stays in Terminating, then try to delete it with oc or the web console.
On openshift/oc v3.6.133, oc could delete the pod with [1], but the console could not delete it and even left it stuck, as detailed in [2].

On v3.6.144, the stuck Terminating pod could be deleted both with [1] and from the console. I am not sure whether that is enough to verify this bug as well.

[1] oc delete pod mypod --grace-period=0 --force
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c17
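
Before using [1], it can help to list which pods are actually stuck. The following is a minimal sketch, assuming the default table output of oc get pods with NAME and STATUS header columns; list_terminating is a hypothetical helper, and the sample rows in the comment are made up for illustration.

```shell
#!/bin/sh
# Sketch only: reads `oc get pods` table output on stdin and prints the
# names of pods whose STATUS column is Terminating.
# list_terminating is a hypothetical helper, not an oc subcommand.
list_terminating() {
  awk '
    NR == 1 {
      # Locate the NAME and STATUS columns from the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name_col = i
        if ($i == "STATUS") status_col = i
      }
      next
    }
    name_col && status_col && $status_col == "Terminating" { print $name_col }
  '
}

# Example against a live cluster:
#   oc get pods | list_terminating
#
# Example with made-up rows:
#   NAME      READY     STATUS        RESTARTS   AGE
#   mypod     1/1       Terminating   0          2h
```

Only pod names are printed, so with output that spans several projects the namespace would have to be carried along separately.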

Comment 5 Anping Li 2017-07-14 13:16:32 UTC
Verified; passes. Steps:

1. Set up an OCP v3.6.133 environment.
2. Create a pod stuck in the Terminating state, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c12
3. Confirm that the node drain hang can be reproduced:
   oadm drain nodename --force --delete-local-data --ignore-daemonsets
4. Upgrade to v3.6.144 using the upgrade playbook.

Result: the upgrade succeeded. No node drain hang occurred.
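
When reproducing the hang in step 3, a time limit keeps the shell from blocking indefinitely. The following is a minimal sketch, assuming the timeout utility from GNU coreutils is available; run_with_limit is a hypothetical wrapper and the 600-second limit in the example is an arbitrary choice, not a value from this bug.

```shell
#!/bin/sh
# Sketch only: run a command under a time limit and report a timeout.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
# run_with_limit is a hypothetical wrapper around coreutils `timeout`.
run_with_limit() {
  limit=$1
  shift
  status=0
  timeout "$limit" "$@" || status=$?
  # coreutils timeout exits with 124 when the time limit was reached.
  if [ "$status" -eq 124 ]; then
    echo "timed out after ${limit}s: $*" >&2
  fi
  return "$status"
}

# Example (nodename is a placeholder, as in step 3):
#   run_with_limit 600 oadm drain nodename --force --delete-local-data --ignore-daemonsets
```

An exit status of 124 then marks a drain that did not finish within the limit, which separates a hang from an ordinary drain failure.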

