Created attachment 1347405 [details]
LimitRange used to replicate issue

Description of problem:
A LimitRange applied to the project seems to be affecting terminating Pods, and it is not related to specific requests / limits applied to the Pods. It appears to be restricted to Pods that were created before the LimitRange was introduced - which I appreciate is somewhat of an edge case, but odd nonetheless.

Version-Release number of selected component (if applicable):
OpenShift 3.6.1

How reproducible:

Steps to Reproduce:
1. Project has no LimitRange.
2. Create an application (e.g. a simple Java app) and scale it to 4 Pods.
3. Delete one of the Pods.
4. Pod sits in the 'Terminating' state for up to 30 seconds (the default grace period), then disappears.
5. A new Pod replaces the deleted Pod.
6. Add the LimitRange to the project (note: it will not have been applied to these containers).
7. Delete one of the Pods.
8. Pod stays in the 'Terminating' state indefinitely.
9. Delete the LimitRange.
10. The Pod in the 'Terminating' state is cleaned up.
11. Scale the app to zero.
12. Add the LimitRange.
13. Scale the app to 4 Pods.
14. Delete a Pod.
15. Pod stays in the 'Terminating' state for about 40 seconds, then disappears.

Actual results:
Pod stays in the 'Terminating' state indefinitely.

Expected results:
Pod stays in the 'Terminating' state for little more than the grace period.

Additional info:
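For reference, a rough command-line version of the steps above. The attached LimitRange is the one actually used; the limits.yaml, project name, image and resource values below are only illustrative placeholders, not the attachment contents:

    $ cat limits.yaml
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: resource-limits
    spec:
      limits:
      - type: Container
        default:
          cpu: 500m
          memory: 512Mi
        defaultRequest:
          cpu: 100m
          memory: 256Mi

    $ oc new-project limitrange-test
    $ oc new-app openshift/hello-openshift        # any simple app will do
    $ oc scale dc/hello-openshift --replicas=4    # steps 1-2
    $ oc delete pod <pod-name>                    # steps 3-5: gone within ~30s, replacement appears
    $ oc create -f limits.yaml                    # step 6: add the LimitRange
    $ oc delete pod <pod-name>                    # steps 7-8: pod hangs in 'Terminating'
    $ oc delete limitrange resource-limits        # steps 9-10: stuck pod is cleaned up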
Andrew, PTAL. See if you can reproduce on 3.7 first and, if so, we'll look at fixing it upstream. Probably get Derek to take a quick look as well to make sure the test case is valid.
I was able to reproduce this on 3.6.1 and 3.7.0-rc0 using the steps above via `oc cluster up` on Fedora 26.

Initially I could not reproduce it at all. What I found was that I needed to be patient between adding/creating the limits and deleting the pod. If I did these successively from the CLI (i.e., as quickly as possible) then everything appeared to behave as expected: the pod went 'Terminating', another spun up, and the terminating pod eventually disappeared. However, even with a 1-2 minute delay between creating the limit and issuing the delete it was not 100% reproducible.

Once I had a pod stuck in the terminating state I also went on to delete other pods, and the general pattern seemed to hold in those cases: the pods stayed 'Terminating' and new ones appeared, but the 'Terminating' pods hung around. I then deleted the limits (steps 9 & 10), expecting the stuck pods to get cleaned up. That reliably seems to happen for at least one of them, but not all.

I did not really look at the logs to see what was happening, because it took some time just with these relatively high-level actions to understand the behaviour/pattern that triggers this bug.
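For completeness, the rough shape of the sequence that eventually triggered it for me is sketched below; the pod name and the exact delay are placeholders rather than an exact transcript:

    $ oc create -f limits.yaml
    $ sleep 120                     # waiting here seems to matter; back-to-back commands did not trigger it
    $ oc get pods
    $ oc delete pod <pod-name>
    $ oc get pods -w                # deleted pod stays in 'Terminating' while a replacement comes up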
UPSTREAM PR: https://github.com/kubernetes/kubernetes/pull/56971
Origin PR: https://github.com/openshift/origin/pull/17978
Checked with:

# openshift version
openshift v3.6.173.0.96
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

The patch is not in this version, so will retry after the patch is packaged.
Sorry, wrong target release. Since there is no customer issue or request for a backport here, we'll just fix it in master (3.9).
Will give it a check once a new puddle with this patch comes out, since the latest puddle does not contain this patch.

openshift v3.9.0-0.16.0
kubernetes v1.9.0-beta1
etcd 3.2.8
Checked with:

# openshift version
openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8

The issue can no longer be reproduced, so marking this verified.