Bug 1364243
| Summary: | Terminating Pod does not get rescheduled to another node when node is NotReady | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vikas Laad <vlaad> |
| Component: | Node | Assignee: | Derek Carr <decarr> |
| Status: | CLOSED ERRATA | QA Contact: | Vikas Laad <vlaad> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.3.0 | CC: | agoldste, aos-bugs, jokerman, mmccomas, tdawson, vlaad, weliang, wmeng, xtian |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-18 12:51:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description: Vikas Laad, 2016-08-04 19:19:59 UTC
If you wait > 5 minutes, does the DeploymentConfig create a new pod on another node?

No, this Terminating pod has been stuck for a day. The node has been NotReady for a few hours now, and still nothing was created on another node.

Derek, would you mind looking at this? I think this may reproduce on a multi-node cluster by just stopping Docker on one node and waiting >5 minutes to see whether the NodeController evicts the pods on the NotReady node.

I do want to clarify that pods never get rescheduled. If you have a scalable resource (replication controller, deployment config), that will attempt to create new pods to replace failed ones, but a pod by itself is never moved or rescheduled. Just wanted to make sure that's clear :-)

*** Bug 1365657 has been marked as a duplicate of this bug. ***

To summarize the full set of discussion topics in this thread:

1. The kubelet will wait 5 minutes before transitioning from a Ready to NotReady state if the kubelet's container runtime goes down. I think this time is too long, and it is not tunable by operators since it is hard-coded. See the upstream issue to try to come to a consensus: https://github.com/kubernetes/kubernetes/issues/30534

2. The node controller does not evict a pod if it is in Terminating state and it is the ONLY pod scheduled to that node that requires eviction. The node controller identifies that the pods on the node should be evicted, but because it is the only pod on the node and it has a TerminationGracePeriodSeconds, the current logic skips the delete on it, and it never goes into the terminating evictor queue. See the upstream issue to try to determine how to refactor: https://github.com/kubernetes/kubernetes/issues/30536

The operator can forcefully delete the pod in question by doing:

$ oc delete pods <pod> --grace-period=0

Given this is an edge case and its fix requires a larger refactor, I am marking this UPCOMING_RELEASE and hope to get fixes into Kubernetes 1.4 to be picked up by OpenShift upon that rebase.

*** Bug 1343157 has been marked as a duplicate of this bug. ***

Upstream PR for the node controller not removing terminating pods from a node when it was the only pod on the node: https://github.com/kubernetes/kubernetes/pull/30624

This should be fixed, as the requisite Origin PR referenced above has merged.

Tested with the following scenario (see the command sketch at the end of this report):

- Created a two-node cluster
- Created projects that have pods on both nodes
- Stopped Docker on one of the nodes
- Deleted the projects immediately
- The node becomes NotReady and the pods stay in Terminating state (this is where they were previously stuck)
- After a few minutes the pods are gone; the node is still NotReady
- Started Docker back on that node; the node is Ready and everything is good

Verified in the following version:

openshift v3.4.0.16+cc70b72
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
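For reference, the reproduction, workaround, and verification flow described above expressed as a rough shell sketch. This is not taken verbatim from the report: the project and pod names (myproject, mypod) are placeholders, and it assumes a two-node cluster where you have root access on the worker node and cluster-admin access via oc.

```
# On one worker node: stop the container runtime so the node eventually goes NotReady.
# The kubelet currently waits ~5 minutes before reporting NotReady (hard-coded, see issue 30534).
sudo systemctl stop docker

# From a master: watch the node transition from Ready to NotReady.
oc get nodes -w

# While the node is NotReady, delete the project; its pods move to Terminating.
oc delete project myproject

# Before the fix, a lone Terminating pod on the NotReady node stays stuck indefinitely.
# With the fix from PR 30624, the node controller cleans it up after a few minutes.
oc get pods -n myproject -o wide

# Manual workaround while the bug is present: force-delete the stuck pod.
oc delete pod mypod -n myproject --grace-period=0

# Restart the runtime and confirm the node returns to Ready.
sudo systemctl start docker
oc get nodes
```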