Bug 1470373 - Pods hung in terminating state during cluster upgrade
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: Jordan Liggitt
QA Contact: Johnny Liu
Keywords: DeliveryBlocker
Depends On: 1460729
Reported: 2017-07-12 16:35 EDT by Justin Pierce
Modified: 2017-08-16 16:47 EDT
Last Closed: 2017-08-16 16:43:15 EDT
Type: Bug


Description Justin Pierce 2017-07-12 16:35:49 EDT
Description of problem:
During an upgrade from 3.5 to 3.6.126.1, the oadm drain command hung with several pods stuck in the terminating state. 

Version-Release number of the following components:
OCP 3.6.126.1

How reproducible:
Intermittent. Not all pods hung in the terminating state during the upgrade.

Steps to Reproduce:
1. Run a large-scale cluster upgrade with openshift-ansible.

Actual results:
Pods hung in the terminating state, and openshift-ansible hung indefinitely until the SSH sessions timed out.

http://file.rdu.redhat.com/~jupierce/share/hung.pod.master-controllers.log

The condition could be fixed with:
oc patch pod <pod-name> --type=json --patch='[ { "op":"remove", "path": "/metadata/finalizers" }]'
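
To find which pods are candidates for that patch, one option is to filter the pod list for objects that already carry a deletion timestamp but still have finalizers. The following is a minimal Python sketch, assuming its input is the parsed output of `oc get pods --all-namespaces -o json`. The function names are hypothetical and not part of oc, and pods that are Terminating for other reasons (no finalizers set) are deliberately ignored.

```python
import json

# JSON Patch body taken from the workaround command above.
REMOVE_FINALIZERS_PATCH = [{"op": "remove", "path": "/metadata/finalizers"}]


def stuck_terminating_pods(pod_list):
    """Return (namespace, name) for pods marked for deletion that still have finalizers.

    Assumption: pod_list is the dict parsed from `oc get pods -o json`.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # A pod is shown as Terminating once deletionTimestamp is set.
        if meta.get("deletionTimestamp") and meta.get("finalizers"):
            stuck.append((meta.get("namespace", ""), meta["name"]))
    return stuck


def patch_command(namespace, name):
    """Build the argv list for the workaround; the caller decides whether to run it."""
    return [
        "oc", "patch", "pod", name, "-n", namespace,
        "--type=json",
        "--patch=" + json.dumps(REMOVE_FINALIZERS_PATCH),
    ]
```

The sketch only builds the commands, so the list can be reviewed before anything is run against the cluster.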

This may be fixed by the following pull requests, but I am looking for confirmation before another cluster upgrade is attempted:
https://github.com/openshift/origin/pull/15112
https://github.com/openshift/origin/pull/14988
https://github.com/openshift/origin/pull/14918
Comment 2 Jordan Liggitt 2017-07-13 10:40:24 EDT
Yes, https://github.com/openshift/origin/pull/15112 resolves this issue.
Comment 4 XiaochuanWang 2017-07-14 05:14:31 EDT
PR 15112 fixed bug 1462067. The steps for that bug are: stop the node and delete a pod so that the pod stays in Terminating, then try to delete it with oc or the web console.
On openshift/oc v3.6.133, oc could delete it successfully with [1], but the console could not delete it and even left it stuck, as detailed in [2].

On v3.6.144, the stuck Terminating pod could be deleted both with [1] and from the console. I am not sure whether this bug can be verified the same way.

[1] oc delete pod mypod --grace-period=0 --force
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c17
Comment 5 Anping Li 2017-07-14 09:16:32 EDT
Verified; the test passed. The steps are below:

1. Set up an OCP v3.6.133 environment.
2. Create a pod stuck in the Terminating state as described in https://bugzilla.redhat.com/show_bug.cgi?id=1462067#c12
3. Confirm that the node drain hang can be reproduced:
   oadm drain nodename --force --delete-local-data --ignore-daemonsets
4. Upgrade to v3.6.144 with the upgrade playbook.

Result: the upgrade succeeded, and no node drain hang occurred.
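
When reproducing step 3, a hung drain blocks the session until it is interrupted, so it can help to wrap the command in a timeout. The following is a minimal Python sketch; the helper names and the 600-second limit in the usage note are assumptions, not part of oadm or openshift-ansible.

```python
import subprocess


def run_with_timeout(argv, timeout_seconds):
    """Run a command; return its exit code, or None if it exceeded the timeout."""
    try:
        completed = subprocess.run(argv, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return None
    return completed.returncode


def drain_argv(node_name):
    """The drain invocation from step 3, as an argv list."""
    return [
        "oadm", "drain", node_name,
        "--force", "--delete-local-data", "--ignore-daemonsets",
    ]
```

For example, `run_with_timeout(drain_argv("nodename"), 600)` returns None if the drain is still running after ten minutes, which is how the hang would show up here.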
