Bug 1541476 - Pods in crash loop backoff can't be deleted until the crash loop backoff period expires
Summary: Pods in crash loop backoff can't be deleted until the crash loop backoff period expires
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Robert Krawitz
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-02 16:44 UTC by Clayton Coleman
Modified: 2018-07-30 19:09 UTC
CC: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:09:00 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2018:1816 (Last Updated: 2018-07-30 19:09:34 UTC)

Description Clayton Coleman 2018-02-02 16:44:58 UTC
It looks like the kubelet refuses to finalize the deletion of a pod while the pod is in crash loop backoff (i.e. inside the backoff window), probably because it is waiting out the backoff and keeping the sync loop blocked.

Scenario:

1. Create a pod that crash loops
2. Wait until the backoff is > 1m (3-4 crashes)
3. Delete the pod

Expect:

1. Kubelet acknowledges the delete request immediately and cleans up the pod (the main container would be stopped already, so delete should be almost instantaneous)

Actual:

1. Pod sits in Terminating for over 1m (looks like until the backoff period expires)

This bug is incredibly annoying when debugging or working with pods.  It makes me want to break things, which is sad :(
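
A minimal reproduction sketch of the scenario above, assuming a busybox image and a hypothetical pod name (crashloop) that is not from the original report:

# Create a pod whose only container exits immediately, so it ends up in CrashLoopBackOff
$ cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: crashloop
spec:
  restartPolicy: Always
  containers:
  - name: crasher
    image: busybox
    command: ["sh", "-c", "exit 1"]
EOF

# Watch until the pod has restarted 3-4 times and the back-off exceeds 1m
$ oc get pod crashloop -w

# Delete the pod and time how long it stays in Terminating
$ time oc delete pod crashloop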

Comment 1 Seth Jennings 2018-02-02 16:46:50 UTC
Upstream issue:
https://github.com/kubernetes/kubernetes/issues/57865#issuecomment-358183236

Comment 2 Seth Jennings 2018-02-07 21:47:47 UTC
Still working to run this down upstream
https://github.com/kubernetes/kubernetes/issues/57865

The delay is a function only of the terminationGracePeriod (30s), not of the backoff timeout (up to 5m), so we are looking at a delay in the tens of seconds, not minutes.

Trying to figure out why the kubelet does not clean up the failed container once the pod gets its deletionTimestamp set.
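
A quick way to inspect the two values discussed above on a stuck pod (again using the hypothetical crashloop pod, not from the original report):

# Show when the API server marked the pod for deletion and the configured grace period (default 30s)
$ oc get pod crashloop -o jsonpath='{.metadata.deletionTimestamp} {.spec.terminationGracePeriodSeconds}{"\n"}'

# Last-resort workaround while debugging: remove the pod from the API immediately.
# This only skips the wait on the client/API side; it does not address the kubelet-side delay itself.
$ oc delete pod crashloop --grace-period=0 --force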

Comment 3 Seth Jennings 2018-02-13 16:02:00 UTC
This has been an issue since at least 1.6 according to the upstream issue, so it isn't a regression. It is annoying and affects pods in general, not just StatefulSet pods. Not a blocker in my mind, though, so deferring to z-stream.

Clayton, if you really want this to be a blocker, feel free to move it back.

Comment 4 Seth Jennings 2018-04-10 03:50:31 UTC
WIP upstream PR:
https://github.com/kubernetes/kubernetes/pull/62170

Comment 5 Seth Jennings 2018-05-01 18:00:17 UTC
Previous upstream PR abandoned.

New upstream PR:
https://github.com/kubernetes/kubernetes/pull/63321

Origin PR:
https://github.com/openshift/origin/pull/19580

Comment 7 weiwei jiang 2018-05-16 09:03:46 UTC
Checked on 
# oc version 
oc v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-127.ec2.internal:8443
openshift v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8

And terminating pods are deleted immediately now.
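
For reference, a rough way to confirm this on a fixed build, reusing the hypothetical crashloop pod from the reproduction sketch in the description:

# With the fix, the delete should complete within the termination grace period
# (seconds), not only after the CrashLoopBackOff window (up to 5m) expires
$ time oc delete pod crashloop
$ oc get pod crashloop   # should report NotFound almost immediately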

Comment 9 errata-xmlrpc 2018-07-30 19:09:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

