Bug 1936407 - 4.7 Backport - Reboot interface defaults to softPowerOff so fencing is too slow
Summary: 4.7 Backport - Reboot interface defaults to softPowerOff so fencing is too slow
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: x86_64
OS: Linux
Target Milestone: ---
Target Release: 4.7.z
Assignee: Rhys Oxenham
QA Contact: Shelly Miron
Depends On: 1927678
Reported: 2021-03-08 12:24 UTC by Rhys Oxenham
Modified: 2021-04-12 23:23 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Added new capabilities to the baremetal-operator to allow different reboot modes to be utilised. Reason: For high availability purposes, workloads need to be relocated as quickly as possible in the event of a node failure. This provides a path for clients to quickly power down systems for remediation purposes and to recover workloads. Result: Depending on the environment, workload recovery time can be significantly reduced.
Clone Of:
Last Closed: 2021-04-12 23:22:56 UTC
Target Upstream Version:


System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift baremetal-operator pull 132 | 0 | None | open | Bug 1936407: Backport of BMO code to 4.7 to support different reboot modes | 2021-03-10 15:07:23 UTC
Red Hat Product Errata | RHBA-2021:1075 | 0 | None | None | None | 2021-04-12 23:23:10 UTC

Internal Links: 1936844

Description Rhys Oxenham 2021-03-08 12:24:06 UTC
This bug was initially created as a copy of Bug #1927678

I am copying this bug because: 

This is a backport BZ for the BMO code that just landed in 4.8 (https://github.com/openshift/baremetal-operator/pull/128) so we can pull it into 4.7.z.

Note that this code relies on the corresponding CAPBM code also being backported; that change is currently unmerged (https://github.com/openshift/cluster-api-provider-baremetal/pull/138).

Description of problem:

The current implementation of the reboot interface relies on the `reboot.metal3.io` annotation being applied via the MachineHealthCheck system. Whilst this works, the Ironic provisioner first attempts a soft power off, as implemented in this PR (https://github.com/metal3-io/baremetal-operator/pull/294), with the specific line of code here: https://github.com/metal3-io/baremetal-operator/blob/master/pkg/provisioner/ironic/ironic.go#L1678

In some circumstances, e.g. when a soft power off is issued but never enacted, a three-minute timeout must expire before the hard power off is attempted. For use-cases that rely on this interface to rapidly reboot hosts and recover workloads for high-availability purposes, e.g. virtual machine failover, this leads to a total recovery time of circa 5 minutes, which is too long for the interface to be widely adopted for high-availability requirements. For high-availability use-cases we should default to hardPowerOff().
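The fallback behaviour described above can be sketched roughly as follows. This is an illustrative Python model, not the baremetal-operator's Go code: `FakeHost`, `power_off`, the 180-second `SOFT_TIMEOUT_S` constant (taken from the three-minute figure in this report), and the per-host shutdown durations are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption: the three-minute soft power-off timeout cited in this report.
SOFT_TIMEOUT_S = 180


@dataclass
class FakeHost:
    """Hypothetical stand-in for a bare-metal host; not a real BMO or Ironic type."""
    honours_soft_power_off: bool
    soft_shutdown_s: int = 30   # assumed duration of a graceful shutdown
    hard_power_off_s: int = 5   # assumed duration of a BMC-level power cut


def power_off(host: FakeHost, mode: str = "soft") -> int:
    """Return the simulated number of seconds until the host is powered off."""
    if mode == "hard":
        # Skip the graceful attempt entirely.
        return host.hard_power_off_s
    if host.honours_soft_power_off:
        return host.soft_shutdown_s
    # The soft request was issued but never enacted: wait out the timeout,
    # then fall back to a hard power off.
    return SOFT_TIMEOUT_S + host.hard_power_off_s


hung = FakeHost(honours_soft_power_off=False)
print(power_off(hung, "soft"))  # 185
print(power_off(hung, "hard"))  # 5
```

With a host that ignores the soft request, the soft-first path costs the full timeout plus the hard power off, whereas the hard mode costs only the latter.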

We recognise that softPowerOff is favourable when we want to instruct the hosts to reboot but we want them to do this gracefully, e.g. software update, so the reboot interface needs to maintain the option for a soft reboot. Ideally we need to be able to select whether we first invoke the hard reboot or not, and make that configurable to the operator. Perhaps on a per-host basis with an annotation, or with a global config-map, or perhaps even via a MachineHealthCheck option.

Some further discussion on this can be found here: https://github.com/metal3-io/metal3-docs/pull/164. The suggestion is that we implement an extension to the reboot annotation, e.g. `reboot.metal3.io = {'mode': 'hard'}`, which overrides the default behaviour for circumstances in which customers need much quicker recovery of workloads. If the mode is left off, we default to soft. I'm just not sure which mechanism could be used to get the MHC to label it this way.
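A minimal sketch of how the proposed annotation value might be interpreted, assuming the value is a JSON object with an optional `mode` key. The function name `reboot_mode` and the fallback-to-soft handling of malformed values are hypothetical choices for illustration, not the operator's actual implementation. Note that the single-quoted form written above is not valid JSON, so this sketch expects double quotes.

```python
import json

REBOOT_ANNOTATION = "reboot.metal3.io"  # annotation key named in this report
VALID_MODES = {"soft", "hard"}


def reboot_mode(annotations):
    """Return 'soft' or 'hard' if a reboot is requested, else None.

    Hypothetical helper: an absent, empty, or malformed mode falls back
    to 'soft', matching the default described in this report.
    """
    if REBOOT_ANNOTATION not in annotations:
        return None  # no reboot requested
    raw = annotations[REBOOT_ANNOTATION]
    if not raw:
        return "soft"
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return "soft"
    if not isinstance(args, dict):
        return "soft"
    mode = args.get("mode", "soft")
    return mode if mode in VALID_MODES else "soft"


print(reboot_mode({"reboot.metal3.io": '{"mode": "hard"}'}))  # hard
print(reboot_mode({"reboot.metal3.io": ""}))                  # soft
print(reboot_mode({}))                                        # None
```

Falling back to soft on any unrecognised value keeps the existing graceful behaviour as the safe default; a stricter design could reject malformed values instead.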

Version-Release number of selected component (if applicable):

Tested with OpenShift 4.6.16 on virtualised baremetal (IPI install with Metal3 and virtualbmc).

How reproducible:
Every time

Steps to Reproduce:
1. Deploy OpenShift baremetal IPI cluster via Metal3
2. Apply a MachineHealthCheck (see example below)
3. Cause the machine to go NotReady, e.g. via sysrq
4. Observe the MHC logs to see that it recognises the node as unhealthy and triggers remediation after ~100 seconds (40s k8s timeout, 60s health check).
5. Remediation then takes a further ~200s, as 180s is lost on softPowerOff() which will fail under certain circumstances.

Actual results:

Recovery of workload takes circa 5 minutes, too long for a true HA scenario.

Expected results:

Recovery of workload takes circa 2 minutes maximum (can be further tweaked with tighter MHC timings, but we have to give k8s 40s by default - 4x10s notification timeouts)
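The timings quoted in the steps and results above add up as follows. This is only a worked calculation using the approximate figures from this report; the variable names are illustrative, and the hard power-off cost is derived as the ~200s remediation time minus the 180s soft power-off timeout.

```python
K8S_NOTREADY_S = 40    # 4 x 10s notification timeouts before the node is NotReady
MHC_TIMEOUT_S = 60     # MachineHealthCheck unhealthy-condition timeout
REMEDIATION_S = 200    # approximate remediation time with the soft-first default
SOFT_TIMEOUT_S = 180   # portion of remediation lost waiting on softPowerOff()

detection = K8S_NOTREADY_S + MHC_TIMEOUT_S           # time until remediation starts
actual = detection + REMEDIATION_S                   # soft-first recovery time
hard_only = REMEDIATION_S - SOFT_TIMEOUT_S           # remediation without the soft wait
expected = detection + hard_only                     # hard-first recovery time

print(detection)                  # 100
print(actual, actual / 60)        # 300 5.0
print(expected, expected / 60)    # 120 2.0
```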

Additional info:

MHC example:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: workers
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  maxUnhealthy: 100%
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
  - type: Ready
    status: 'False'
    timeout: 60s

Other resources that help explain where the softPowerOff() default came from:

BMO bug report tracking the fact that we should do soft-shutdown

BMO bugfix to make soft-shutdown the default

BMO proposal for reboot API

BMO PR to implement reboot API

Comment 16 errata-xmlrpc 2021-04-12 23:22:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.6 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

