This bug was initially created as a copy of Bug #1927678

I am copying this bug because:
This is a backport BZ for the BMO code that just landed in 4.8 (https://github.com/openshift/baremetal-operator/pull/128) so we can pull it into 4.7.z. Note that this code relies on CAPBM code also being pulled in and backported; that change is currently unmerged (https://github.com/openshift/cluster-api-provider-baremetal/pull/138).

Description of problem:
The current implementation of the reboot interface relies on the `reboot.metal3.io` annotation being applied via the MachineHealthCheck system. Whilst this works, the Ironic provisioner first attempts a soft power off, as implemented in this PR (https://github.com/metal3-io/baremetal-operator/pull/294), with the specific line of code here: https://github.com/metal3-io/baremetal-operator/blob/master/pkg/provisioner/ironic/ironic.go#L1678

In some circumstances, e.g. when a soft power off is issued but never enacted, a three minute timeout must expire before the hard power off is attempted. For use-cases where we rely on this interface to reboot hosts rapidly and recover workloads for high-availability purposes, e.g. virtual machine failover, this can lead to a total recovery time of circa 5 minutes, which is too long to be widely adopted for high-availability requirements.

For high-availability use-cases we should default to hardPoweroff(). We recognise that softPowerOff is preferable when we want hosts to reboot gracefully, e.g. for a software update, so the reboot interface needs to keep the option of a soft reboot. Ideally the operator should be able to configure whether the hard reboot is invoked first, perhaps on a per-host basis with an annotation, or with a global config-map, or perhaps even via a MachineHealthCheck option.
Some further discussion on this can be found here: https://github.com/metal3-io/metal3-docs/pull/164. The suggestion is that we implement an extension to the reboot annotation, e.g. `reboot.metal3.io = {'mode': 'hard'}`, to override the default behaviour where customers need much quicker recovery of workloads; if the mode is left off, we default to soft. I'm just not sure which mechanism could be used to get the MHC to apply the annotation this way.

Version-Release number of selected component (if applicable):
Tested with OpenShift 4.6.16 on virtualised baremetal (IPI install with Metal3 and virtualbmc).

How reproducible:
Every time

Steps to Reproduce:
1. Deploy an OpenShift baremetal IPI cluster via Metal3.
2. Apply a MachineHealthCheck (see example below).
3. Cause the machine to go NotReady, e.g. via sysrq or similar.
4. Observe the MHC logs to see that it recognises the node as unhealthy and triggers remediation after ~100 seconds (40s k8s timeout, 60s health check).
5. Remediation then takes a further ~200s, as 180s is lost on softPowerOff(), which will fail under certain circumstances.

Actual results:
Recovery of workload takes circa 5 minutes, too long for a true HA scenario.
Expected results:
Recovery of workload takes circa 2 minutes maximum (this can be tweaked further with tighter MHC timings, but we have to give k8s 40s by default: 4x10s notification timeouts).

Additional info:
MHC example:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: workers
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  maxUnhealthy: 100%
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
  - type: Ready
    status: 'False'
    timeout: 60s

Other resources that help explain where the softPowerOff() default came from:

BMO bug report tracking the fact that we should do soft-shutdown
https://github.com/metal3-io/baremetal-operator/issues/273

BMO bugfix to make soft-shutdown the default
https://github.com/metal3-io/baremetal-operator/pull/294

BMO proposal for reboot API
https://github.com/metal3-io/metal3-docs/pull/48

BMO PR to implement reboot API
https://github.com/metal3-io/baremetal-operator/pull/424
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.6 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1075