| Summary: | forceful node evacuation leads to stuck pods | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Harald Klein <hklein> |
| Component: | Node | Assignee: | Andy Goldstein <agoldste> |
| Status: | CLOSED NOTABUG | QA Contact: | DeShuai Ma <dma> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.1.0 | CC: | hklein, jokerman, mmccomas, sjenning |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-27 13:47:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |

Description
Harald Klein
2016-04-08 08:47:26 UTC
I attempted to recreate the reported issue, but had no luck.

```
# openshift version
openshift v3.1.1.6-33-g81eabcc
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2
```

I set up an OpenShift cluster (single master, two nodes) on three OpenStack instances. I created a project and an application and scaled it up so that the application had two pods, one running on each node. I then stopped one node with an immediate, ungraceful shutdown. It took about 30 seconds for the node to switch to NotReady and about 5 minutes for the old pod to be considered dead and a replacement pod to be scheduled onto the remaining node. However, I did not observe the pod on the terminated node getting stuck in the Terminating state. When I brought the node back up, I scaled the application up to 3 and back down to 2, and the pods rebalanced across the two nodes. Other than the 5-minute delay, which is arguably too long, this worked as I expected.

I also tried gracefully evacuating the node, which likewise worked as expected:

```
# oadm manage-node node1 --schedulable=false
# oadm manage-node node1 --evacuate
```

(A new pod was immediately rescheduled to node2, with no pods stuck in Terminating.)

```
# oadm manage-node node1 --schedulable=true
```

I then scaled up to 3 and back down to 2, and the pods again rebalanced across the two nodes.

In neither scenario was I able to reproduce a pod hung in the Terminating state. Is there any additional information on how I might recreate this issue?

Harald, can we close this?

Closing, as the customer was unable to reproduce. Please reopen in the future if necessary.
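For context on the roughly five-minute rescheduling window observed above: it matches the Kubernetes controller manager's default pod eviction timeout. On OpenShift 3.x these timings can be tuned by passing controller-manager flags through `controllerArguments` in the master configuration. The following is a minimal sketch, assuming the standard OpenShift 3.x `master-config.yaml` layout; the chosen values are illustrative, not recommendations:

```yaml
# Sketch: excerpt of /etc/origin/master/master-config.yaml (OpenShift 3.x).
# pod-eviction-timeout controls how long the controller manager waits after
# a node goes NotReady before evicting its pods so they can be rescheduled.
# node-monitor-grace-period controls how long a silent node may go without
# posting status before it is marked NotReady.
kubernetesMasterConfig:
  controllerArguments:
    pod-eviction-timeout:
      - "2m0s"   # default is 5m0s; lowering it reschedules pods sooner
    node-monitor-grace-period:
      - "40s"    # default; roughly matches the ~30s NotReady delay seen above
```

Shortening `pod-eviction-timeout` trades faster failover for a higher risk of double-running pods during transient network partitions, so the 5-minute default is deliberately conservative.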