Bug 1325102

Summary: forceful node evacuation leads to stuck pods
Product: OpenShift Container Platform
Component: Node
Version: 3.1.0
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: Harald Klein <hklein>
Assignee: Andy Goldstein <agoldste>
QA Contact: DeShuai Ma <dma>
CC: hklein, jokerman, mmccomas, sjenning
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-05-27 13:47:15 UTC

Description Harald Klein 2016-04-08 08:47:26 UTC
Description of problem:

When a node is forcefully evacuated, its pods get stuck in the "Terminating" state. The replication controller creates replacement pods, but they are not scheduled to any node (even with sufficient resources available on other nodes). They remain stuck in Pending, scheduled to no node, until the initially evicted node rejoins the cluster.
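For diagnosing this, the stuck pods can be picked out of `oc get pods -o wide` output with a small shell filter. This is a minimal sketch: the pod names and sample output below are illustrative, not from the original report; on a live cluster you would pipe the real `oc get pods -o wide` into the same awk expression.

```shell
# Hypothetical sample of `oc get pods -o wide` output (columns:
# NAME READY STATUS RESTARTS AGE NODE), standing in for a live cluster.
sample_output() {
cat <<'EOF'
NAME          READY   STATUS        RESTARTS   AGE   NODE
app-1-abcde   1/1     Terminating   0          1h    node1
app-1-fghij   0/1     Pending       0          5m    <none>
app-1-klmno   1/1     Running       0          1h    node2
EOF
}

# Keep pods stuck in Terminating, plus Pending pods with no node assigned.
stuck=$(sample_output | awk 'NR > 1 && ($3 == "Terminating" || ($3 == "Pending" && $6 == "<none>")) { print $1, $3 }')
echo "$stuck"
```

With the sample data above, both the Terminating pod on the evacuated node and the unschedulable Pending replacement are listed, while the healthy Running pod is filtered out.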

Version-Release number of selected component (if applicable):

OSE 3.1

How reproducible:

Forcefully evacuate a node

Steps to Reproduce:
1. oadm manage-node <nodeName> --evacuate --force

Actual results:

Pods are not started on other nodes

Expected results:

Pods should be started on other nodes

Comment 3 Seth Jennings 2016-04-19 02:59:01 UTC
I attempted to recreate the reported issue, but was unable to.

# openshift version
openshift v3.1.1.6-33-g81eabcc
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

I set up an OpenShift cluster (single master, two nodes) on three OpenStack instances. I created a project and an application, and scaled the application up to two pods, one running on each node.

I stopped one node with an immediate ungraceful shutdown. It took about 30s for the node to switch to NotReady, and about 5 minutes for the old pod to be considered dead and the new pod to be scheduled onto the remaining node. However, I did not observe the pod on the terminated node getting stuck in Terminating state. When I brought the node back up, I scaled up to 3, then down to 2, and the pods rebalanced across the two nodes. Other than the 5-minute delay, which might arguably be too long, this worked as I expected.
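The two delays observed here correspond to upstream kube-controller-manager settings: `node-monitor-grace-period` (default 40s, how long a node may be unresponsive before being marked NotReady) and `pod-eviction-timeout` (default 5m0s, how long pods on a NotReady node wait before eviction). A sketch of how these could be tuned in an OpenShift 3.x `master-config.yaml`; the values shown are the upstream defaults, and whether shortening them is advisable for a given cluster is an assumption left to the operator:

```yaml
kubernetesMasterConfig:
  controllerArguments:
    node-monitor-grace-period:
    - "40s"
    pod-eviction-timeout:
    - "5m0s"
```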

I also tried gracefully evacuating the node, which also worked as expected.

# oadm manage-node node1 --schedulable=false
# oadm manage-node node1 --evacuate
(a new pod was immediately rescheduled to node2 with no pods stuck in Terminating)
# oadm manage-node node1 --schedulable=true
Then I scaled up to 3, then down to 2, and the pods rebalanced across the two nodes.

In both situations, I was not able to reproduce the pod hung in Terminating state.

Any additional information on how I might recreate this issue?

Comment 6 Andy Goldstein 2016-05-18 16:56:34 UTC
Harald, can we close this?

Comment 7 Andy Goldstein 2016-05-27 13:47:15 UTC
Closing, as the customer was unable to reproduce. Please reopen in the future if necessary.