Bug 1541476

Summary: Pods in crash loop backoff can't be deleted until the crash loop backoff period expires
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Node
Assignee: Robert Krawitz <rkrawitz>
Status: CLOSED ERRATA
QA Contact: weiwei jiang <wjiang>
Severity: medium
Priority: medium
Docs Contact:
Version: 3.9.0
CC: aos-bugs, avagarwa, dma, jokerman, mmccomas, sjenning
Target Milestone: ---
Keywords: NeedsTestCase
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-30 19:09:00 UTC
Type: Bug

Description Clayton Coleman 2018-02-02 16:44:58 UTC
It looks like the kubelet refuses to finalize the deletion of a pod while the pod is in crash loop backoff (i.e., still inside the backoff window), probably because it is waiting out the backoff and keeping the sync loop blocked.

Scenario:

1. Create a pod that crash loops
2. Wait until the backoff is > 1m (3-4 crashes)
3. Delete the pod

Expect:

1. Kubelet acknowledges the delete request immediately and cleans up the pod (the main container would be stopped already, so delete should be almost instantaneous)

Actual:

1. Pod sits in Terminating for more than 1m (apparently until the backoff period expires)
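
For reference, a minimal reproduction sketch of the scenario above (the pod name crashloop-test and the busybox image are just illustrative choices):

cat <<'EOF' | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-test
spec:
  restartPolicy: Always
  containers:
  - name: crasher
    image: busybox
    command: ["/bin/sh", "-c", "exit 1"]
EOF

# Wait until the pod shows CrashLoopBackOff with enough restarts (3-4)
# that the backoff window exceeds 1m.
oc get pod crashloop-test -w

# Delete the pod, then keep watching it: with this bug it lingers in
# Terminating until the backoff window expires, instead of disappearing
# within the termination grace period.
oc delete pod crashloop-test
oc get pod crashloop-test -w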

This bug is incredibly annoying when debugging or working with pods.  It makes me want to break things, which is sad :(

Comment 1 Seth Jennings 2018-02-02 16:46:50 UTC
Upstream issue:
https://github.com/kubernetes/kubernetes/issues/57865#issuecomment-358183236

Comment 2 Seth Jennings 2018-02-07 21:47:47 UTC
Still working to run this down upstream
https://github.com/kubernetes/kubernetes/issues/57865

The delay is only a function of the terminationGracePeriod (30s), not the backoff timeout (up to 5m), so we are looking at a delay of tens of seconds, not minutes.

Trying to figure out why the kubelet does not clean up the failed container once the pod gets its deletionTimestamp set.
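
One way to see which timer is in play (a sketch, reusing the illustrative crashloop-test pod from the reproduction above) is to look at the deletion metadata the API server stamps on the pod:

# After deleting the pod, deletionGracePeriodSeconds reflects the pod's
# terminationGracePeriodSeconds (30 by default), while the crash loop
# backoff can grow to several minutes.
oc get pod crashloop-test -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.deletionGracePeriodSeconds}{"\n"}'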

Comment 3 Seth Jennings 2018-02-13 16:02:00 UTC
This has been an issue since at least 1.6 according to the upstream issue, so it isn't a regression.  It is annoying and affects pods in general, not just StatefulSet pods.  Not a blocker in my mind though, so deferring to z-stream.

Clayton, if you really want this to be a blocker, feel free to move it back.

Comment 4 Seth Jennings 2018-04-10 03:50:31 UTC
WIP upstream PR:
https://github.com/kubernetes/kubernetes/pull/62170

Comment 5 Seth Jennings 2018-05-01 18:00:17 UTC
The previous upstream PR was abandoned.

New upstream PR:
https://github.com/kubernetes/kubernetes/pull/63321

Origin PR:
https://github.com/openshift/origin/pull/19580

Comment 7 weiwei jiang 2018-05-16 09:03:46 UTC
Checked on 
# oc version 
oc v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-127.ec2.internal:8443
openshift v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8

And terminating pods are now deleted immediately.
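
For anyone re-checking this, a rough timing of the same reproduction (a sketch; the crashloop-test pod name is illustrative and the loop assumes bash):

oc delete pod crashloop-test
# Poll until the pod object is actually gone and report how long it took;
# on fixed builds this should be well under the old backoff-length delay.
SECONDS=0
while oc get pod crashloop-test >/dev/null 2>&1; do sleep 1; done
echo "pod removed after ${SECONDS}s"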

Comment 9 errata-xmlrpc 2018-07-30 19:09:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816