Bug 1505687
Summary: | Pods in unknown state, cannot be forcibly deleted. | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Sergi Jimenez Romero <sjr>
Component: | Node | Assignee: | Avesh Agarwal <avagarwa>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | DeShuai Ma <dma>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 3.5.1 | CC: | aos-bugs, avagarwa, fcami, jokerman, mmccomas, rkrawitz, rpuccini, sjenning, sjr, sreber
Target Milestone: | --- | |
Target Release: | 3.9.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-02-06 17:59:30 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Sergi Jimenez Romero
2017-10-24 06:49:28 UTC
I believe the pods being in Unknown state is the effect of the following proposal: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md

What doesn't seem to be working as expected is the "--force" parameter; as per https://github.com/kubernetes/kubernetes/pull/37263, `kubectl delete pods <pod> --grace-period=0 --force` should do the trick. As far as I have seen, oc delete will send any parameter directly to kubectl.

Avesh, PTAL.

Hi Sergi Jimenez Romero, I looked into it and here is my understanding. First of all, `oc delete --grace-period=0 --force` should have done the job (you tried `oc delete --force --grace-period=0`, but I think the order of --force and --grace-period does not matter). So now the question is why it did not, and that may be due to several reasons: for example, the kubelet might be wedged on the node, or there might be some other issue with the node. (There is a similar upstream issue: https://github.com/kubernetes/kubernetes/issues/43279.) As a next step, I'd suggest you provide:

1) logs from the node
2) logs from the master
3) oc describe node
4) oc describe pod

to really see what is going on.

Hi Sergi Jimenez Romero, I have been trying to reproduce this on my 1-master, 2-node cluster but have been unable to. Could you provide some details about the cluster, in addition to the info I asked for in https://bugzilla.redhat.com/show_bug.cgi?id=1505687#c3?

Since yesterday, I have been trying to reproduce this but am not able to. I have been running an rc with 50 pods (http://pastebin.test.redhat.com/527504) and running the following commands several times for both nodes:

    oadm drain 192.168.122.186 --config=./openshift.local.config/master/admin.kubeconfig
    oadm drain 192.168.122.239 --config=./openshift.local.config/master/admin.kubeconfig
    oadm uncordon 192.168.122.186 --config=./openshift.local.config/master/admin.kubeconfig
    oadm uncordon 192.168.122.239 --config=./openshift.local.config/master/admin.kubeconfig

I also tried Seth's suggestion to hold onto a shell in the mounted dirs on the host for one of the pods, basically by going into the pod's mounted dir and running `watch ls -al`. But I don't see any pod getting stuck, and drain is always successful.
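For reference, the reproduction attempt above can be expressed as a single loop. This is a minimal sketch, not part of the original report; the node IPs and kubeconfig path are taken from the commands above, and the iteration count is arbitrary.

```sh
# Hypothetical reproduction sketch: repeatedly drain and uncordon both
# nodes while the 50-pod rc is running, then check whether any pods are
# left in the Unknown state after each cycle.
KCFG="--config=./openshift.local.config/master/admin.kubeconfig"

for i in 1 2 3 4 5; do
  for node in 192.168.122.186 192.168.122.239; do
    oadm drain "$node" $KCFG
    oadm uncordon "$node" $KCFG
  done
  oc get pods -a -o wide $KCFG | grep Unknown || true
done
```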
But the bad news is that I can delete pods in Unknown state without any issue.

Before:

    # oc get pods -a -o wide --config=./openshift.local.config/master/admin.kubeconfig | grep Unknown
    nginx8-009hz   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-0zgcp   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-1mksh   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-3lghb   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-4cp93   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-4n9sm   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-66tkm   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-84llr   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-cb4v3   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-cxb6q   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-d4726   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-frk9n   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-g65xt   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-gdjfz   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-ktx78   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-ljb2f   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-m6lb4   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-pg93t   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-rq46z   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-s13pd   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-s4222   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-stjd5   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-stl9z   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-vkwt4   0/1   Unknown   0   17m   <none>   192.168.122.239
    nginx8-wkj4q   0/1   Unknown   0   17m   <none>   192.168.122.239

Now force-delete one of the above:

    # oc delete --force --grace-period=0 pod nginx8-009hz --config=./openshift.local.config/master/admin.kubeconfig
    warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
    pod "nginx8-009hz" deleted

After:

    # oc get pods -a -o wide --config=./openshift.local.config/master/admin.kubeconfig | grep Unknown
    nginx8-0zgcp   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-1mksh   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-3lghb   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-4cp93   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-4n9sm   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-66tkm   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-84llr   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-cb4v3   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-cxb6q   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-d4726   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-frk9n   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-g65xt   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-gdjfz   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-ktx78   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-ljb2f   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-m6lb4   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-pg93t   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-rq46z   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-s13pd   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-s4222   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-stjd5   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-stl9z   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-vkwt4   0/1   Unknown   0   18m   <none>   192.168.122.239
    nginx8-wkj4q   0/1   Unknown   0   18m   <none>   192.168.122.239

Reference bug 1557306
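For completeness, the per-pod deletion demonstrated above can be wrapped in a small loop to clean up every stuck pod at once. This is a minimal sketch, not part of the original report; it assumes the same admin kubeconfig as the commands above and the default `oc get pods` column layout, where STATUS is the third field.

```sh
# Hypothetical cleanup sketch: force-delete every pod currently reported
# as Unknown. Assumes the admin kubeconfig used in the commands above and
# the default `oc get pods` output (STATUS in the third column).
KCFG="--config=./openshift.local.config/master/admin.kubeconfig"

for pod in $(oc get pods -a $KCFG | awk '$3 == "Unknown" {print $1}'); do
  oc delete pod "$pod" --grace-period=0 --force $KCFG
done
```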