Bug 1927678

Summary: Reboot interface defaults to softPowerOff so fencing is too slow
Product: OpenShift Container Platform Reporter: Rhys Oxenham <roxenham>
Component: Bare Metal Hardware Provisioning    Assignee: Rhys Oxenham <roxenham>
Bare Metal Hardware Provisioning sub component: baremetal-operator QA Contact: Shelly Miron <smiron>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: rbartal, shardy
Version: 4.8    Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Feature: Added new capabilities to the baremetal-operator to allow different reboot modes to be utilised.
Reason: For high availability purposes, workloads need to be relocated as quickly as possible in the event of a node failure. This provides a path for clients to quickly power down systems for remediation purposes and to recover workloads.
Result: Depending on the environment, workload recovery time can be significantly reduced.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:43:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Embargoed:
Bug Depends On:    
Bug Blocks: 1936407    

Description Rhys Oxenham 2021-02-11 10:25:31 UTC
Description of problem:

The current implementation of the reboot interface relies on the `reboot.metal3.io` annotation being applied via the MachineHealthCheck system. Whilst this works, the Ironic provisioner implementation first attempts a soft power off, as introduced in this PR (https://github.com/metal3-io/baremetal-operator/pull/294); the specific line of code is here: https://github.com/metal3-io/baremetal-operator/blob/master/pkg/provisioner/ironic/ironic.go#L1678

In some circumstances, e.g. when a soft power off is issued but never enacted by the host, a three minute timeout must elapse before the hard power off is attempted. Unfortunately, for use-cases where we want to rely on this interface to rapidly reboot hosts and recover workloads for high-availability purposes, e.g. virtual machine failover, this can lead to a circa 5 minute total recovery time, which is too long to be widely adopted for high-availability requirements. For high availability use-cases we should default to hardPoweroff().

We recognise that softPowerOff is preferable when we want hosts to reboot gracefully, e.g. for a software update, so the reboot interface needs to retain the option for a soft reboot. Ideally we need to be able to select whether the hard reboot is invoked first, and make that configurable by the operator: perhaps on a per-host basis with an annotation, with a global config-map, or even via a MachineHealthCheck option.

Some further discussion on this can be found here: https://github.com/metal3-io/metal3-docs/pull/164. The suggestion is to implement an extension to the reboot annotation, e.g. `reboot.metal3.io = {'mode': 'hard'}`, overriding the default behaviour in circumstances where customers need much quicker recovery of workloads; if the mode is left off, we default to soft. I'm just not sure which mechanism could be used to get the MHC to label it this way.
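As a rough sketch of how that could look on a BareMetalHost object (the annotation key suffix and exact JSON shape here are assumptions; the proposal in the metal3-docs PR is authoritative):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker2
  namespace: openshift-machine-api
  annotations:
    # hypothetical client-chosen key suffix; the value requests a non-graceful power cycle
    reboot.metal3.io/my-client: '{"mode": "hard"}'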

Version-Release number of selected component (if applicable):

Tested with OpenShift 4.6.16 on virtualised baremetal (IPI install with Metal3 and virtualbmc).

How reproducible:
Every time

Steps to Reproduce:
1. Deploy OpenShift baremetal IPI cluster via Metal3
2. Apply a MachineHealthCheck (see example below)
3. Cause the machine to go NotReady, e.g. sysrq or something
4. Observe the MHC logs to see that it recognises the node as unhealthy and triggers remediation after ~100 seconds (40s k8s timeout + 60s health check).
5. Remediation then takes a further ~200s, as 180s is lost waiting on softPowerOff(), which fails under certain circumstances.

Actual results:

Recovery of the workload takes circa 5 minutes, which is too long for a true HA scenario.

Expected results:

Recovery of the workload takes circa 2 minutes maximum (this can be further tweaked with tighter MHC timings, but we have to give k8s 40s by default: 4 x 10s notification timeouts).
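A rough worked breakdown of the two timelines, assuming the default timings above: detection takes ~100s in both cases (4 x 10s kubelet status timeouts = 40s for the node to go Unknown, plus the 60s MHC condition timeout). With softPowerOff() attempted first and the graceful shutdown failing, a further 180s soft timeout elapses before the hard power off, giving ~100s + 180s + the power cycle and rescheduling time, i.e. circa 5 minutes. Defaulting to a hard power off removes that 180s entirely, leaving ~100s plus a few seconds to cut power and reschedule, i.e. circa 2 minutes.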

Additional info:

MHC example-

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: workers
 namespace: openshift-machine-api
 annotations:
   machine.openshift.io/remediation-strategy: external-baremetal
spec:
 maxUnhealthy: 100%
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: worker
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 60s
 - type: Ready
   status: 'False'
   timeout: 60s
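
To use the example, save it to a file (the filename below is arbitrary) and apply it; the resource should then be visible in the openshift-machine-api namespace:

$ oc apply -f mhc-workers.yaml
$ oc get machinehealthcheck -n openshift-machine-api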


Other resources that help explain where the softPowerOff() default came from:

BMO bug report tracking the fact that we should do soft-shutdown
https://github.com/metal3-io/baremetal-operator/issues/273

BMO bugfix to make soft-shutdown the default
https://github.com/metal3-io/baremetal-operator/pull/294

BMO proposal for reboot API
https://github.com/metal3-io/metal3-docs/pull/48

BMO PR to implement reboot API
https://github.com/metal3-io/baremetal-operator/pull/424

Comment 1 Rhys Oxenham 2021-03-10 12:12:21 UTC
Additional information for verification purposes.

1. Make sure test clusters have the right code pulled-

For 4.8 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/138 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/128 (BMO)
For 4.7 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/144 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/132 (BMO)

2. Manually power down a worker baremetal machine via the UI. You'll see that this defaults to a *soft* power off (the OS is shut down gracefully), and the host is held down until the node is powered back on by the UI client-

$ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i soft

{"level":"info","ts":1615376435.4487739,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"soft","reboot process":true}
{"level":"info","ts":1615376435.448852,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: soft)","host":"worker2"}
{"level":"info","ts":1615376435.4488945,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"soft power off\" command","host":"worker2"}

3. After powering the node back on via the UI, apply a manual annotation to the worker BMH object to verify that hard power-off works (the OS is *not* shut down gracefully; it is immediately powered off):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    reboot.metal3.io/testing: '{"mode":"hard"}'
  selfLink: >-
(...)
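
If you'd rather not edit the full YAML, the same annotation can be added in one step (the host name here matches the one used in the logs above; the key suffix is arbitrary):

$ oc annotate bmh worker2 -n openshift-machine-api reboot.metal3.io/testing='{"mode":"hard"}'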

$ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i hard

{"level":"info","ts":1615376529.5041094,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"hard","reboot process":true}
{"level":"info","ts":1615376529.5041282,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: hard)","host":"worker2"}
{"level":"info","ts":1615376529.5041316,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"hard power off\" command","host":"worker2"}

4. Remove the annotation, save the YAML definition, and watch the machine immediately power back on. Wait for it to come back online.
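
Equivalently, the annotation can be dropped without opening an editor (the trailing dash removes it):

$ oc annotate bmh worker2 -n openshift-machine-api reboot.metal3.io/testing-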

5. Apply a MachineHealthCheck to verify that CAPBM applies the hard reboot flag automatically (note the following timings are quite aggressive, but they are fine for testing)-

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: workers
 namespace: openshift-machine-api
 annotations:
   machine.openshift.io/remediation-strategy: external-baremetal
spec:
 nodeStartupTimeout: 5m
 maxUnhealthy: 100%
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: worker
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 20s
 - type: Ready
   status: 'False'
   timeout: 20s
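
After applying it, the MHC status gives a quick sanity check that the selector matched your workers (while everything is still healthy, currentHealthy should equal expectedMachines):

$ oc get machinehealthcheck workers -n openshift-machine-api -o yaml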

6. Manually stop the kubelet service on the target machine, or even better hold it in a state where the machine shows as powered-on but is no longer reporting into k8s (for this I usually reboot the virtual BMH and hold it at the grub prompt). After the 40s k8s timeout plus the 20s MachineHealthCheck timeout you should see remediation starting:
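
For the kubelet variant, a minimal sketch (assuming the node is reachable; the node name matches the one in the logs below) is:

$ oc debug node/ocp4-worker2.cnv.example.com -- chroot /host systemctl stop kubelet.service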

$ oc logs -n openshift-machine-api machine-api-controllers-78474bd74b-hvfgz -c machine-healthcheck-controller -f

I0310 11:49:04.669780       1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:04.670264       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:04.670430       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:04.670694       1 machinehealthcheck_controller.go:346] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: is likely to go unhealthy in 1.329319222s
I0310 11:49:04.671999       1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2,  max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:04.708765       1 machinehealthcheck_controller.go:259] Reconciling openshift-machine-api/workers: some targets might go unhealthy. Ensuring a requeue happens in 1.329319222s
I0310 11:49:06.014136       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/workers
I0310 11:49:06.014673       1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:06.016758       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:06.017017       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:06.017231       1 machinehealthcheck_controller.go:660] openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: unhealthy: condition Ready in state False longer than {10s}
I0310 11:49:06.018255       1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2,  max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:06.057770       1 machinehealthcheck_controller.go:244] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: meet unhealthy criteria, triggers remediation
I0310 11:49:06.058069       1 machinehealthcheck_controller.go:492]  openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: start remediation logic
I0310 11:49:06.058261       1 machinehealthcheck_controller.go:563] Machine cnv-pdkrb-worker-0-gtbkc has been unhealthy for too long, adding external annotation

Then the CAPBM remediation annotation should be automatically added to the BMH object (the user doesn't add this; it's automated by CAPBM):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    reboot.metal3.io/capbm-requested-power-off: '{"mode":"hard"}'
  selfLink: >-

The BMH should then immediately power down, the node will be deleted (to allow workload recovery), and the host will come back up and re-register with the cluster, fully automated.
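
To follow the tail end of this, the BMH power state and node lifecycle can be watched while remediation runs (-w streams updates):

$ oc get bmh -n openshift-machine-api -w
$ oc get nodes -w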

Comment 11 errata-xmlrpc 2021-07-27 22:43:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438