Bug 1902690

Summary: Worker node is not evicted at soft shutdown
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub Component: baremetal-operator
Version: 4.7
Reporter: Daniel <dmaizel>
Assignee: Steven Hardy <shardy>
QA Contact: Amit Ugol <augol>
Docs Contact:
CC: dhellmann, zbitter
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Type: Bug
Last Closed: 2020-12-01 17:41:01 UTC

Description Daniel 2020-11-30 12:27:49 UTC
Description of problem:
According to the epic https://issues.redhat.com/browse/KNIDEPLOY-3821, soft shutdown should let us identify when a node is about to be shut down and give us the option to evict it immediately, so the workload can move to a free worker faster.

Version-Release number of selected component (if applicable):
OCP 4.7

How reproducible:
Constantly

Steps to Reproduce:
1. Create an httpd pod.
2. Find the worker node the pod is running on.
3. Set the online flag of the corresponding BareMetalHost (bmh) to false (see the commands sketched below).
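For reference, a minimal sketch of these steps using oc (the pod image, resource names, and namespaces are assumptions; adjust them to your cluster):

  # 1. Create an httpd pod (any httpd image should do)
  oc run httpd --image=registry.access.redhat.com/ubi8/httpd-24 -n default

  # 2. Find the worker node the pod was scheduled on
  oc get pod httpd -n default -o wide

  # 3. Request a soft power-off by setting spec.online=false on the matching BareMetalHost
  oc patch bmh/<worker-bmh-name> -n openshift-machine-api \
    --type merge -p '{"spec": {"online": false}}'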

Actual results:
The node is not evicted; it takes about 10-15 minutes for the pod to be terminated and for a new pod to be created on another worker.

Expected results:
The node on which the pod is running is evicted.

Additional info:
* I also looked in the baremetal-operator logs to verify that a soft shutdown signal was actually sent...
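(For reference, the operator logs can be checked with something like the following; the namespace and deployment name assume a standard baremetal IPI cluster, and the grep pattern is only an example:)

  # Scan the baremetal-operator logs for power/online-related messages
  oc logs -n openshift-machine-api deployment/metal3 --all-containers --tail=500 | grep -iE 'power|online'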

Comment 1 Zane Bitter 2020-12-01 17:41:01 UTC
It appears to me that graceful shutdown of kubelet was only implemented in upstream k8s in the last month, and it is behind a feature gate: https://github.com/kubernetes/kubernetes/pull/96129

If users want to shut down a host, they should cordon and drain it first. (Deleting the Machine object will do this automatically, though it will also deprovision.) There are valid reasons to want to soft-power-off (e.g. to avoid data corruption/lost disk writes), but relying on it to make k8s drain the Node is not one of them (at least not yet).
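For example, something along these lines (the node name is a placeholder, and drain flags may vary slightly between oc versions):

  # Mark the node unschedulable, then evict its workloads before powering the host off
  oc adm cordon <node-name>
  oc adm drain <node-name> --ignore-daemonsets --delete-local-data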

I'm going to close this bug, but feel free to open another one if it turns out that we are not actually successfully soft powering down the host.