Bug 1541476

Summary: Pods in crash loop backoff can't be deleted until the crash loop backoff period expires
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Node
Assignee: Robert Krawitz <rkrawitz>
Status: CLOSED ERRATA
QA Contact: weiwei jiang <wjiang>
Severity: medium
Priority: medium
Docs Contact:
Version: 3.9.0
CC: aos-bugs, avagarwa, dma, jokerman, mmccomas, sjenning
Target Milestone: ---
Keywords: NeedsTestCase
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-30 19:09:00 UTC
Type: Bug

Description Clayton Coleman 2018-02-02 16:44:58 UTC
It looks like the kubelet refuses to finalize the deletion of a pod while the pod is in crash loop backoff (i.e., still inside the backoff window), probably because it is waiting out the backoff and keeping the sync loop blocked.

Scenario:

1. Create a pod that crash loops
2. Wait until the backoff is > 1m (3-4 crashes)
3. Delete the pod

Expect:

1. Kubelet acknowledges the delete request immediately and cleans up the pod (the main container would be stopped already, so delete should be almost instantaneous)

Actual:

1. Pod sits in Terminating for more than 1m (apparently until the backoff period expires)
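
For reference, a minimal reproduction sketch of the scenario above (the pod name crashloop-test and the busybox image are just illustrative choices):

cat <<'EOF' | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-test
spec:
  restartPolicy: Always
  containers:
  - name: crasher
    image: busybox
    command: ["/bin/sh", "-c", "exit 1"]
EOF

# Wait until the pod shows CrashLoopBackOff with enough restarts (3-4)
# that the backoff window exceeds 1m.
oc get pod crashloop-test -w

# Delete the pod, then keep watching it: with this bug it lingers in
# Terminating until the backoff window expires, instead of disappearing
# within the termination grace period.
oc delete pod crashloop-test
oc get pod crashloop-test -w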

This bug is incredibly annoying when debugging or working with pods.  It makes me want to break things, which is sad :(

Comment 1 Seth Jennings 2018-02-02 16:46:50 UTC
Upstream issue:
https://github.com/kubernetes/kubernetes/issues/57865#issuecomment-358183236

Comment 2 Seth Jennings 2018-02-07 21:47:47 UTC
Still working to run this down upstream
https://github.com/kubernetes/kubernetes/issues/57865

The delay is only a function of the terminationGracePeriod (30s), not the backoff timeout (up to 5m), so we are looking at a delay of tens of seconds, not minutes.

Trying to figure out why the kubelet does not clean up the failed container once the pod gets its deletionTimestamp set.
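
One way to see which timer is in play (a sketch, reusing the illustrative crashloop-test pod from the reproduction above) is to look at the deletion metadata the API server stamps on the pod:

# After deleting the pod, deletionGracePeriodSeconds reflects the pod's
# terminationGracePeriodSeconds (30 by default), while the crash loop
# backoff can grow to several minutes.
oc get pod crashloop-test -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.deletionGracePeriodSeconds}{"\n"}'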

Comment 3 Seth Jennings 2018-02-13 16:02:00 UTC
This has been an issue since at least 1.6 according to the upstream issue, so it isn't a regression.  It is annoying and affects pods in general, not just StatefulSet pods.  Not a blocker in my mind though, so deferring to z-stream.

Clayton, if you really want this to be a blocker, feel free to move it back.

Comment 4 Seth Jennings 2018-04-10 03:50:31 UTC
WIP upstream PR:
https://github.com/kubernetes/kubernetes/pull/62170

Comment 5 Seth Jennings 2018-05-01 18:00:17 UTC
The previous upstream PR was abandoned.

New upstream PR:
https://github.com/kubernetes/kubernetes/pull/63321

Origin PR:
https://github.com/openshift/origin/pull/19580

Comment 7 weiwei jiang 2018-05-16 09:03:46 UTC
Checked on 
# oc version 
oc v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-127.ec2.internal:8443
openshift v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8

And terminating pods are now deleted immediately.
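
For anyone re-checking this, a rough timing of the same reproduction (a sketch; the crashloop-test pod name is illustrative and the loop assumes bash):

oc delete pod crashloop-test
# Poll until the pod object is actually gone and report how long it took;
# on fixed builds this should be well under the old backoff-length delay.
SECONDS=0
while oc get pod crashloop-test >/dev/null 2>&1; do sleep 1; done
echo "pod removed after ${SECONDS}s"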

Comment 9 errata-xmlrpc 2018-07-30 19:09:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816