Created attachment 1347405 [details]
LimitRange used to replicate issue

Description of problem:
A LimitRange applied to the project seems to be affecting terminating Pods, and it is not related to specific requests / limits applied to the Pods. It appears to be restricted to Pods that were created before the LimitRange was introduced - which I appreciate is somewhat of an edge case, but odd nonetheless.

Version-Release number of selected component (if applicable):
OpenShift 3.6.1

How reproducible:

Steps to Reproduce:
1. Project has no LimitRange.
2. Create an application (e.g. a simple Java app) and scale it to 4 Pods.
3. Delete one of the Pods.
4. Pod sits in the 'Terminating' state for up to 30 seconds (the default grace period), then disappears.
5. A new Pod replaces the deleted Pod.
6. Add the LimitRange to the project (note: it will not have been applied to these containers).
7. Delete one of the Pods.
8. Pod stays in the 'Terminating' state indefinitely.
9. Delete the LimitRange.
10. The Pod in the 'Terminating' state is cleaned up.
11. Scale the app to zero.
12. Add the LimitRange.
13. Scale the app to 4 Pods.
14. Delete a Pod.
15. Pod stays in the 'Terminating' state for about 40 seconds, then disappears.

Actual results:
Pod stays in the 'Terminating' state indefinitely.

Expected results:
Pod stays in the 'Terminating' state for little more than the grace period.

Additional info:
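For reference, a rough command-line version of the steps above. The attached LimitRange is the one actually used; the limits.yaml, project name, image and resource values below are only illustrative placeholders, not the attachment contents:

    $ cat limits.yaml
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: resource-limits
    spec:
      limits:
      - type: Container
        default:
          cpu: 500m
          memory: 512Mi
        defaultRequest:
          cpu: 100m
          memory: 256Mi

    $ oc new-project limitrange-test
    $ oc new-app openshift/hello-openshift        # any simple app will do
    $ oc scale dc/hello-openshift --replicas=4    # steps 1-2
    $ oc delete pod <pod-name>                    # steps 3-5: gone within ~30s, replacement appears
    $ oc create -f limits.yaml                    # step 6: add the LimitRange
    $ oc delete pod <pod-name>                    # steps 7-8: pod hangs in 'Terminating'
    $ oc delete limitrange resource-limits        # steps 9-10: stuck pod is cleaned up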
Andrew, PTAL. See if you can reproduce on 3.7 first and, if so, we'll look at fixing it upstream. Probably get Derek to take a quick look as well to make sure the test case is valid.
I was able to reproduce this on 3.6.1 and 3.7.0-rc0 using the steps above via `oc cluster up` on Fedora 26.

Initially I could not reproduce it at all. What I found was that I needed to be patient between adding/creating the limits and deleting the pod. If I did these successively from the CLI (i.e., as quickly as possible) then everything appeared to behave as expected: the pod went 'Terminating', another spun up, and the terminating pod eventually disappeared. However, even with a 1-2 minute delay between creating the limit and issuing the delete it was not 100% reproducible.

Once I had a pod stuck in the terminating state I also went on to delete other pods, and the general pattern seemed to hold in those cases: the pods stayed 'Terminating' and new ones appeared, but the 'Terminating' pods hung around. I then deleted the limits (steps 9 & 10), expecting the stuck pods to get cleaned up. That reliably seems to happen for at least one of them, but not all.

I did not really look at the logs to see what was happening, because it took some time just with these relatively high-level actions to understand the behaviour/pattern that triggers this bug.
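For completeness, the rough shape of the sequence that eventually triggered it for me is sketched below; the pod name and the exact delay are placeholders rather than an exact transcript:

    $ oc create -f limits.yaml
    $ sleep 120                     # waiting here seems to matter; back-to-back commands did not trigger it
    $ oc get pods
    $ oc delete pod <pod-name>
    $ oc get pods -w                # deleted pod stays in 'Terminating' while a replacement comes up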
UPSTREAM PR: https://github.com/kubernetes/kubernetes/pull/56971
Origin PR: https://github.com/openshift/origin/pull/17978
Checked with:

# openshift version
openshift v3.6.173.0.96
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

The patch is not in this version, so will retry after the patch is packaged.
Sorry, wrong target release. Since there is no customer issue or request for a backport here, we'll just fix it in master (3.9).
Will give it a check once a new puddle with this patch comes out, since the latest puddle does not contain this patch.

openshift v3.9.0-0.16.0
kubernetes v1.9.0-beta1
etcd 3.2.8
Checked with:

# openshift version
openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8

The issue can no longer be reproduced, so marking this verified.