Bug 1541476 - Pods in crash loop backoff can't be deleted until the crash loop backoff period expires
Summary: Pods in crash loop backoff can't be deleted until the crash loop backoff period expires
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Robert Krawitz
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-02 16:44 UTC by Clayton Coleman
Modified: 2018-07-30 19:09 UTC
CC: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:09:00 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2018:1816 (Last Updated: 2018-07-30 19:09:34 UTC)

Description Clayton Coleman 2018-02-02 16:44:58 UTC
It looks like the kubelet refuses to finalize the deletion of a pod while the pod is in crash loop backoff (i.e. inside the backoff window), probably because it is waiting out the backoff and keeping the sync loop blocked.

Scenario:

1. Create a pod that crash loops
2. Wait until the backoff is > 1m (3-4 crashes)
3. Delete the pod

Expect:

1. Kubelet acknowledges the delete request immediately and cleans up the pod (the main container would be stopped already, so delete should be almost instantaneous)

Actual:

1. Pod sits in Terminating for over 1m (looks like until the backoff period expires)

This bug is incredibly annoying when debugging or working with pods.  It makes me want to break things, which is sad :(
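
A minimal reproduction sketch of the scenario above, assuming a busybox image and a hypothetical pod name (crashloop) that is not from the original report:

# Create a pod whose only container exits immediately, so it ends up in CrashLoopBackOff
$ cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: crashloop
spec:
  restartPolicy: Always
  containers:
  - name: crasher
    image: busybox
    command: ["sh", "-c", "exit 1"]
EOF

# Watch until the pod has restarted 3-4 times and the back-off exceeds 1m
$ oc get pod crashloop -w

# Delete the pod and time how long it stays in Terminating
$ time oc delete pod crashloop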

Comment 1 Seth Jennings 2018-02-02 16:46:50 UTC
Upstream issue:
https://github.com/kubernetes/kubernetes/issues/57865#issuecomment-358183236

Comment 2 Seth Jennings 2018-02-07 21:47:47 UTC
Still working to run this down upstream
https://github.com/kubernetes/kubernetes/issues/57865

The delay is a function only of the terminationGracePeriod (30s), not of the backoff timeout (up to 5m), so we are looking at a delay in the tens of seconds, not minutes.

Trying to figure out why the kubelet does not clean up the failed container once the pod gets its deletionTimestamp set.
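
A quick way to inspect the two values discussed above on a stuck pod (again using the hypothetical crashloop pod, not from the original report):

# Show when the API server marked the pod for deletion and the configured grace period (default 30s)
$ oc get pod crashloop -o jsonpath='{.metadata.deletionTimestamp} {.spec.terminationGracePeriodSeconds}{"\n"}'

# Last-resort workaround while debugging: remove the pod from the API immediately.
# This only skips the wait on the client/API side; it does not address the kubelet-side delay itself.
$ oc delete pod crashloop --grace-period=0 --force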

Comment 3 Seth Jennings 2018-02-13 16:02:00 UTC
This has been an issue since at least 1.6 according to the upstream issue, so it isn't a regression. It is annoying and affects pods in general, not just StatefulSet pods. Not a blocker in my mind, though, so deferring to z-stream.

Clayton, if you really want this to be a blocker, feel free to move it back.

Comment 4 Seth Jennings 2018-04-10 03:50:31 UTC
WIP upstream PR:
https://github.com/kubernetes/kubernetes/pull/62170

Comment 5 Seth Jennings 2018-05-01 18:00:17 UTC
Previous upstream PR abandoned.

New upstream PR:
https://github.com/kubernetes/kubernetes/pull/63321

Origin PR:
https://github.com/openshift/origin/pull/19580

Comment 7 weiwei jiang 2018-05-16 09:03:46 UTC
Checked on 
# oc version 
oc v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-127.ec2.internal:8443
openshift v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8

And terminating pods are deleted immediately now.
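
For reference, a rough way to confirm this on a fixed build, reusing the hypothetical crashloop pod from the reproduction sketch in the description:

# With the fix, the delete should complete within the termination grace period
# (seconds), not only after the CrashLoopBackOff window (up to 5m) expires
$ time oc delete pod crashloop
$ oc get pod crashloop   # should report NotFound almost immediately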

Comment 9 errata-xmlrpc 2018-07-30 19:09:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

