Bug 1927678 - Reboot interface defaults to softPowerOff so fencing is too slow
Summary: Reboot interface defaults to softPowerOff so fencing is too slow
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Rhys Oxenham
QA Contact: Shelly Miron
URL:
Whiteboard:
Depends On:
Blocks: 1936407
Reported: 2021-02-11 10:25 UTC by Rhys Oxenham
Modified: 2021-07-27 22:44 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Added new capabilities to the baremetal-operator to allow different reboot modes to be utilised.
Reason: For high-availability purposes, workloads need to be relocated as quickly as possible in the event of a node failure. This provides a path for clients to quickly power down systems for remediation purposes and to recover workloads.
Result: Depending on the environment, workload recovery time can be significantly reduced.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:43:44 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github metal3-io baremetal-operator pull 795 0 None closed Implement explicit reboot mode options 2021-03-08 17:26:18 UTC
Github metal3-io metal3-docs pull 164 0 None closed Add explicit reboot mode options 2021-03-08 17:26:20 UTC
Github openshift baremetal-operator pull 128 0 None closed Bug 1927678: Backporting BMO extensions to support different reboot modes 2021-03-08 17:26:21 UTC
Github openshift cluster-api-provider-baremetal pull 138 0 None closed Changing the default behaviour of the CAPBM to request hard reboot 2021-03-10 11:07:18 UTC
Red Hat Bugzilla 1937122 1 high CLOSED CAPBM changes to support flexible reboot modes 2021-07-27 22:52:54 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:44:07 UTC

Internal Links: 1936844

Description Rhys Oxenham 2021-02-11 10:25:31 UTC
Description of problem:

The current implementation of the reboot interface relies on the `reboot.metal3.io` annotation being applied via the MachineHealthCheck system, and whilst this works, the implementation for the Ironic provisioner first attempts to use a soft power off function, as implemented with this PR (https://github.com/metal3-io/baremetal-operator/pull/294), with the specific line of code here: https://github.com/metal3-io/baremetal-operator/blob/master/pkg/provisioner/ironic/ironic.go#L1678

In some circumstances, e.g. when a soft power off is issued but never enacted, we can run into a situation where a three-minute timeout must elapse before the hard power off is attempted. Unfortunately, for use-cases where we want to rely on this interface to rapidly reboot hosts and recover workloads for high-availability purposes, e.g. virtual machine failover, this can lead to a circa 5-minute total recovery time, which is too long to be widely adopted for high-availability requirements. For high-availability use-cases we should default to hardPoweroff().

We recognise that softPowerOff is favourable when we want to instruct the hosts to reboot but we want them to do this gracefully, e.g. software update, so the reboot interface needs to maintain the option for a soft reboot. Ideally we need to be able to select whether we first invoke the hard reboot or not, and make that configurable to the operator. Perhaps on a per-host basis with an annotation, or with a global config-map, or perhaps even via a MachineHealthCheck option.

Some further discussion on this can be found here: https://github.com/metal3-io/metal3-docs/pull/164. The suggestion is that we implement an extension to the reboot annotation, e.g. `reboot.metal3.io = {'mode': 'hard'}`, where we override the default behaviour for circumstances in which customers need much quicker recovery of workloads; if the mode is left off, we default to soft. I'm just not sure which mechanism could be used to get the MHC to label it this way.
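
For illustration, the proposed annotation would look roughly like this on a BareMetalHost (the host name and annotation suffix below are placeholders; the exact final format is what the metal3-docs proposal above settles on):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
  namespace: openshift-machine-api
  annotations:
    # hypothetical suffix; requests an immediate hard power-off instead of the default soft power-off
    reboot.metal3.io/example-key: '{"mode": "hard"}'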

Version-Release number of selected component (if applicable):

Tested with OpenShift 4.6.16 on virtualised baremetal (IPI install with Metal3 and virtualbmc).

How reproducible:
Every time

Steps to Reproduce:
1. Deploy OpenShift baremetal IPI cluster via Metal3
2. Apply a MachineHealthCheck (see example below)
3. Cause the machine to go NotReady, e.g. sysrq or something
4. Observe the MHC logs to see that it recognises the node as unhealthy and triggers remediation after ~100 seconds (40s k8s timeout + 60s health check).
5. Remediation then takes a further ~200s, as 180s is lost waiting on softPowerOff(), which will fail under certain circumstances.
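
Roughly, putting the numbers above together (timings are approximate and environment dependent):

  40s  k8s node notification timeout (4x10s)
+ 60s  MachineHealthCheck unhealthy condition timeout
+ 180s softPowerOff() timeout before the hard power off is attempted
+ ~20s power cycle, node deletion and workload rescheduling
= circa 300s (~5 minutes) total recovery time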

Actual results:

Recovery of workload takes circa 5 minutes, too long for a true HA scenario.

Expected results:

Recovery of workload takes circa 2 minutes maximum (can be further tweaked with tighter MHC timings, but we have to give k8s 40s by default - 4x10s notification timeouts)

Additional info:

MHC example-

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: workers
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  maxUnhealthy: 100%
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
  - type: Ready
    status: 'False'
    timeout: 60s


Other resources that help explain where the softPowerOff() default came from:

BMO bug report tracking the fact that we should do soft-shutdown
https://github.com/metal3-io/baremetal-operator/issues/273

BMO bugfix to make soft-shutdown the default
https://github.com/metal3-io/baremetal-operator/pull/294

BMO proposal for reboot API
https://github.com/metal3-io/metal3-docs/pull/48

BMO PR to implement reboot API
https://github.com/metal3-io/baremetal-operator/pull/424

Comment 1 Rhys Oxenham 2021-03-10 12:12:21 UTC
Additional information for verification purposes.

1. Make sure test clusters have the right code pulled-

For 4.8 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/138 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/128 (BMO)
For 4.7 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/144 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/132 (BMO)
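
One quick way to confirm which commits are actually in the cluster's payload (assuming the release image is pullable from where you run oc) is something like:

$ oc adm release info --commits "$(oc get clusterversion version -o jsonpath='{.status.desired.image}')" | grep -i baremetal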

2. Manually set a worker baremetal machine to power down via the UI. You'll recognise that this defaults to a *soft* power off (the OS is powered down gracefully), and the host is held down until the node is powered back on by the UI client-

$ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i soft

{"level":"info","ts":1615376435.4487739,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"soft","reboot process":true}
{"level":"info","ts":1615376435.448852,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: soft)","host":"worker2"}
{"level":"info","ts":1615376435.4488945,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"soft power off\" command","host":"worker2"}

3. After powering the node back on via the UI, apply a manual annotation to the worker BMH object to verify that hard power-off works (the OS is *not* powered down gracefully and is immediately shut down):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    reboot.metal3.io/testing: '{"mode":"hard"}'
  selfLink: >-
(...)

$ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i hard

{"level":"info","ts":1615376529.5041094,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"hard","reboot process":true}
{"level":"info","ts":1615376529.5041282,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: hard)","host":"worker2"}
{"level":"info","ts":1615376529.5041316,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"hard power off\" command","host":"worker2"}

4. Remove the annotation, save the YAML definition, and witness the machine immediately power back on. Wait for it to come back online.
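
If you prefer the CLI over editing the YAML, the same annotation can be added and removed with oc annotate (host name as per the logs above; the annotation suffix is arbitrary):

$ oc annotate baremetalhost worker2 -n openshift-machine-api reboot.metal3.io/testing='{"mode":"hard"}'
$ oc annotate baremetalhost worker2 -n openshift-machine-api reboot.metal3.io/testing-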

5. Apply a MachineHealthCheck to verify that the CAPBM will apply the hard reboot flag automatically (note the following timings are quite aggressive, but fine for testing)-

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: workers
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  nodeStartupTimeout: 5m
  maxUnhealthy: 100%
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 20s
  - type: Ready
    status: 'False'
    timeout: 20s
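
Apply it in the usual way (the file name here is arbitrary):

$ oc apply -f mhc-workers.yaml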

6. Manually stop the kubelet service on the target machine, or even better hold it in a position where the machine shows as powered on but is no longer reporting into k8s (for this I usually reboot the virtual BMH and hold it at the GRUB prompt). After the 40s k8s timeout and the 20s MachineHealthCheck timeout, you should see the remediation starting:

$ oc logs -n openshift-machine-api machine-api-controllers-78474bd74b-hvfgz -c machine-healthcheck-controller -f

I0310 11:49:04.669780       1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:04.670264       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:04.670430       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:04.670694       1 machinehealthcheck_controller.go:346] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: is likely to go unhealthy in 1.329319222s
I0310 11:49:04.671999       1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2,  max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:04.708765       1 machinehealthcheck_controller.go:259] Reconciling openshift-machine-api/workers: some targets might go unhealthy. Ensuring a requeue happens in 1.329319222s
I0310 11:49:06.014136       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/workers
I0310 11:49:06.014673       1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:06.016758       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:06.017017       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:06.017231       1 machinehealthcheck_controller.go:660] openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: unhealthy: condition Ready in state False longer than {10s}
I0310 11:49:06.018255       1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2,  max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:06.057770       1 machinehealthcheck_controller.go:244] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: meet unhealthy criteria, triggers remediation
I0310 11:49:06.058069       1 machinehealthcheck_controller.go:492]  openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: start remediation logic
I0310 11:49:06.058261       1 machinehealthcheck_controller.go:563] Machine cnv-pdkrb-worker-0-gtbkc has been unhealthy for too long, adding external annotation

Then the capbm remediation annotation should be automatically added to the BMH object (the user doesn't add this; it's automated by CAPBM):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    reboot.metal3.io/capbm-requested-power-off: '{"mode":"hard"}'
  selfLink: >-

The BMH should then immediately power down, the node will be deleted (to allow workload recovery), and the host will come back up and re-register with the cluster, fully automated.
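
To watch the full cycle, a couple of standard watches are enough:

$ oc get baremetalhosts -n openshift-machine-api -w
$ oc get nodes -w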

Comment 11 errata-xmlrpc 2021-07-27 22:43:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

