1505687 – Pods in unknown state, cannot be forcibly deleted.

Summary: Pods in unknown state, cannot be forcibly deleted.

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.5.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.9.0
Assignee:	Avesh Agarwal
QA Contact:	DeShuai Ma
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-10-24 06:49 UTC by Sergi Jimenez Romero
Modified:	2018-05-08 15:17 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-02-06 17:59:30 UTC
Target Upstream Version:
Embargoed:

Description Sergi Jimenez Romero 2017-10-24 06:49:28 UTC

Description of problem:

While evacuating pods from a node that was going into maintenance, some pods got stuck in the "Terminating" state. After the node was rebooted, those pods were in the "Unknown" state.

Trying to terminate the pods with `oc delete --force --grace-period=0` did not help. The pods disappeared on their own after some time (~1h).

Version-Release number of selected component (if applicable):

3.5.5.31

How reproducible:

At least once

Steps to Reproduce:
1. Evacuate the pods from the node. If pods stuck in the Terminating state never die, reboot the node.
2. Once the node is back, the pods are in the Unknown state.
3. Use oc delete --force --grace-period=0 to delete the pods in the Unknown state.

Actual results:

oc delete does not remove the pods in the Unknown state (they only disappeared on their own about an hour later).

Expected results:

The pods in the Unknown state are deleted.

Additional info:

Comment 1 Sergi Jimenez Romero 2017-10-24 06:53:22 UTC

I believe the pods ending up in the Unknown state is an effect of the following proposal:

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md

What doesn't seem to be working as expected is the "--force" parameter as per:

https://github.com/kubernetes/kubernetes/pull/37263

kubectl delete pods <pod> --grace-period=0 --force should do the trick.

As far as I can tell, oc delete passes its parameters straight through to kubectl.

Comment 2 Seth Jennings 2017-10-24 21:47:05 UTC

Avesh, PTAL

Comment 3 Avesh Agarwal 2017-10-25 14:29:56 UTC

Hi Sergi Jimenez Romero, 

I looked into it and here is my understanding:

First of all, oc delete --grace-period=0 --force should have done the job. (You ran oc delete --force --grace-period=0, but I don't think the order of --force and --grace-period matters.)

So the question is why it did not. There are several possible reasons: for example, the kubelet might be wedged on the node, or the node might have some other problem. (There is a similar upstream issue: https://github.com/kubernetes/kubernetes/issues/43279.)

As a next step, could you provide the following, so that we can see what is really going on:
1) logs from the node
2) logs from the master
3) output of oc describe node
4) output of oc describe pod
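
A minimal sketch of how the two oc describe outputs requested above could be collected into one directory, ready to attach to the bug. The function name collect_diag, the OC override variable and the output file layout are illustrative assumptions and not anything from this report. The node and master logs still have to be gathered on the hosts themselves.

```shell
# Client binary to use; override OC if needed. Assumption: you are already
# logged in and the current project is the one that owns the pod.
OC="${OC:-oc}"

# collect_diag NODE POD [OUTDIR]
# Hypothetical helper: saves `describe node` and `describe pod` output to files.
collect_diag() {
    node="$1"
    pod="$2"
    out="${3:-./bz-diag}"

    if [ -z "$node" ] || [ -z "$pod" ]; then
        echo "usage: collect_diag NODE POD [OUTDIR]" >&2
        return 2
    fi

    mkdir -p "$out" || return 1
    # Keep stderr as well, so API errors end up in the saved files.
    $OC describe node "$node" > "$out/node-$node.txt" 2>&1
    $OC describe pod "$pod" > "$out/pod-$pod.txt" 2>&1
}
```

Calling collect_diag with a node name and a pod name writes two text files under ./bz-diag unless a third argument names another directory.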

Comment 4 Avesh Agarwal 2017-10-25 18:56:04 UTC

Hi Sergi Jimenez Romero, 

I have been trying to reproduce this on my cluster (1 master, 2 nodes) but have been unable to. Could you provide some details about your cluster, in addition to the information I asked for in https://bugzilla.redhat.com/show_bug.cgi?id=1505687#c3?

Comment 5 Avesh Agarwal 2017-10-27 15:52:32 UTC

I have been trying to reproduce this since yesterday, without success.

I have been running an rc with 50 pods:
http://pastebin.test.redhat.com/527504

I have run the following commands several times for both nodes:
oadm drain 192.168.122.186 --config=./openshift.local.config/master/admin.kubeconfig
oadm drain 192.168.122.239 --config=./openshift.local.config/master/admin.kubeconfig
oadm uncordon 192.168.122.186 --config=./openshift.local.config/master/admin.kubeconfig
oadm uncordon 192.168.122.239 --config=./openshift.local.config/master/admin.kubeconfig

I also tried Seth's suggestion of keeping a shell open in a mounted directory on the host for one of the pods, basically by changing into the pod's mounted directory and running watch ls -al. Even so, I don't see any pod getting stuck, and the drain always succeeds.
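
The repeated drain/uncordon runs above could be scripted roughly as follows. This is only a sketch: the function name cycle_nodes and the OADM and CONFIG variables are made up for illustration, and it drains and uncordons one node at a time, which is an assumption about the order rather than a record of what was actually run.

```shell
# Admin binary and kubeconfig; both can be overridden from the environment.
OADM="${OADM:-oadm}"
CONFIG="${CONFIG:-./openshift.local.config/master/admin.kubeconfig}"

# cycle_nodes ROUNDS NODE...
# Hypothetical helper: for each round, drain and then uncordon every node
# in turn, stopping at the first command that fails.
cycle_nodes() {
    rounds="$1"
    shift
    i=0
    while [ "$i" -lt "$rounds" ]; do
        for node in "$@"; do
            $OADM drain "$node" --config="$CONFIG" || return 1
            $OADM uncordon "$node" --config="$CONFIG" || return 1
        done
        i=$((i + 1))
    done
}
```

For the two nodes in this test cluster that would be something like cycle_nodes 10 192.168.122.186 192.168.122.239, where the round count is arbitrary.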

Comment 13 Avesh Agarwal 2017-10-27 17:02:47 UTC

But the bad news is that I can delete pods in the Unknown state without any issue:

Before:

# oc get pods -a -o wide --config=./openshift.local.config/master/admin.kubeconfig |grep Unknown
nginx8-009hz   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-0zgcp   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-1mksh   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-3lghb   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-4cp93   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-4n9sm   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-66tkm   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-84llr   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-cb4v3   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-cxb6q   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-d4726   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-frk9n   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-g65xt   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-gdjfz   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-ktx78   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-ljb2f   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-m6lb4   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-pg93t   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-rq46z   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-s13pd   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-s4222   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-stjd5   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-stl9z   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-vkwt4   0/1       Unknown   0          17m       <none>        192.168.122.239
nginx8-wkj4q   0/1       Unknown   0          17m       <none>        192.168.122.239


Now force delete one of the above:

# oc delete --force --grace-period=0 pod nginx8-009hz  --config=./openshift.local.config/master/admin.kubeconfig
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "nginx8-009hz" deleted


After:
# oc get pods -a -o wide --config=./openshift.local.config/master/admin.kubeconfig |grep Unknown
nginx8-0zgcp   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-1mksh   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-3lghb   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-4cp93   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-4n9sm   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-66tkm   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-84llr   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-cb4v3   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-cxb6q   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-d4726   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-frk9n   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-g65xt   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-gdjfz   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-ktx78   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-ljb2f   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-m6lb4   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-pg93t   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-rq46z   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-s13pd   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-s4222   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-stjd5   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-stl9z   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-vkwt4   0/1       Unknown   0          18m       <none>        192.168.122.239
nginx8-wkj4q   0/1       Unknown   0          18m       <none>        192.168.122.239
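
For anyone who needs to clear many such pods at once, the listing and the force delete shown above can be combined along these lines. This is a sketch under assumptions: the function names and the OC variable are hypothetical, and STATUS is taken to be the third column, as in the single-project listings above (it would shift with --all-namespaces). The warning printed by oc still applies: immediate deletion does not wait for confirmation that the containers have stopped.

```shell
# Client binary to use; override OC if needed.
OC="${OC:-oc}"

# unknown_pods: read `oc get pods -a -o wide` output on stdin and print the
# names of pods whose STATUS column (field 3) is exactly "Unknown".
unknown_pods() {
    awk '$3 == "Unknown" { print $1 }'
}

# force_delete_unknown: force delete every pod in the current project that
# is reported as Unknown. Hypothetical helper, use with care.
force_delete_unknown() {
    $OC get pods -a -o wide | unknown_pods | while read -r name; do
        # stdin is redirected so the delete cannot consume the pod list.
        $OC delete --force --grace-period=0 pod "$name" < /dev/null
    done
}
```

unknown_pods can be tried on its own first, by piping the oc get pods output into it, to check which pods would be affected before anything is deleted.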

Comment 27 Robert Krawitz 2018-05-08 15:17:35 UTC

Reference bug 1557306
