Bug 1937122 - CAPBM changes to support flexible reboot modes
Summary: CAPBM changes to support flexible reboot modes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Rhys Oxenham
QA Contact: Shelly Miron
URL:
Whiteboard:
Depends On:
Blocks: 1936844
 
Reported: 2021-03-09 21:50 UTC by Rhys Oxenham
Modified: 2021-07-27 22:52 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Adds the capability for the CAPBM to request a hard power-off during remediation, leveraging the recent changes to the baremetal-operator that support new reboot modes.
Reason: The baremetal-operator was recently extended to support flexible reboot modes, either hard or soft. The default mode is a soft reboot, but for remediation purposes we want a hard reboot so that workloads recover as quickly as possible.
Result: The CAPBM requests a hard reboot when remediation is required, bypassing the default soft power-off that the baremetal-operator issues.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:52:25 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-baremetal pull 138 0 None closed Changing the default behaviour of the CAPBM to request hard reboot 2021-03-09 21:50:58 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:52:54 UTC

Internal Links: 1927678

Description Rhys Oxenham 2021-03-09 21:50:59 UTC
This bug was initially created as a copy of Bug #1936844

I am copying this bug because we need a 4.8 BZ to track the CAPBM changes that were recently introduced here: https://github.com/openshift/cluster-api-provider-baremetal/pull/138. This code has merged upstream into 4.8 and will need to land downstream so we can backport it to 4.7.

This work is split into two components:

1) The BMO changes to support the reboot mode

2) The CAPBM changes to apply the hard reboot mode for remediation purposes
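As background, the BMO interprets `reboot.metal3.io` annotations (optionally suffixed, e.g. `reboot.metal3.io/testing`) whose JSON value can select the reboot mode, defaulting to soft. A minimal sketch of that selection logic, with illustrative helper names rather than the operator's actual Go API:

```python
import json

REBOOT_ANNOTATION_PREFIX = "reboot.metal3.io"

def requested_reboot_mode(annotations):
    """Return 'hard' if any reboot annotation requests it, else 'soft'.

    Sketch of the BMO behaviour: the default reboot mode is soft; an
    annotation value like '{"mode": "hard"}' escalates it. Names here
    are illustrative, not the operator's real implementation.
    """
    mode = "soft"  # default when no annotation asks for more
    for key, value in annotations.items():
        if key != REBOOT_ANNOTATION_PREFIX and not key.startswith(
            REBOOT_ANNOTATION_PREFIX + "/"
        ):
            continue
        if not value:
            continue  # bare annotation: keep the default soft mode
        try:
            if json.loads(value).get("mode") == "hard":
                mode = "hard"  # any hard request wins
        except ValueError:
            pass  # malformed values fall back to the default
    return mode
```

This mirrors why the `reboot.metal3.io/capbm-requested-power-off: '{"mode":"hard"}'` annotation shown in the verification steps below triggers an immediate hard power-off.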

Additional info:

PR that pulled the BMO changes into 4.8
https://github.com/openshift/baremetal-operator/pull/128
https://bugzilla.redhat.com/show_bug.cgi?id=1927678

PR for backporting above BMO changes into 4.7
https://github.com/openshift/baremetal-operator/pull/130
https://bugzilla.redhat.com/show_bug.cgi?id=1936407

PR that pulled the CAPBM changes into 4.8
https://github.com/openshift/cluster-api-provider-baremetal/pull/138

PR for backporting above CAPBM changes into 4.7
https://github.com/openshift/cluster-api-provider-baremetal/pull/144

Comment 1 Rhys Oxenham 2021-03-10 13:01:07 UTC
Additional information for verification purposes.

1. Make sure test clusters have the right code pulled:

For 4.8 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/138 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/128 (BMO)
For 4.7 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/144 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/132 (BMO)

2. Manually set a worker baremetal machine to power down via the UI. You'll notice that this defaults to a *soft* power off (the OS is shut down gracefully), and the host will be held down until the node is powered back on via the UI client:

$ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i soft

{"level":"info","ts":1615376435.4487739,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"soft","reboot process":true}
{"level":"info","ts":1615376435.448852,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: soft)","host":"worker2"}
{"level":"info","ts":1615376435.4488945,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"soft power off\" command","host":"worker2"}

3. After powering the node back on via the UI, apply a manual annotation to the worker BMH object to verify that hard power-off works (the OS is *not* shut down gracefully; it is powered off immediately):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    reboot.metal3.io/testing: '{"mode":"hard"}'
  selfLink: >-
(...)

$ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i hard

{"level":"info","ts":1615376529.5041094,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"hard","reboot process":true}
{"level":"info","ts":1615376529.5041282,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: hard)","host":"worker2"}
{"level":"info","ts":1615376529.5041316,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"hard power off\" command","host":"worker2"}

4. Remove the annotation and save the YAML definition; the machine will immediately power back on. Wait for it to come back online.

5. Apply a MachineHealthCheck to verify that the CAPBM applies the hard reboot flag automatically (note the following settings are quite aggressive, but suitable for testing):

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: workers
 namespace: openshift-machine-api
 annotations:
   machine.openshift.io/remediation-strategy: external-baremetal
spec:
 nodeStartupTimeout: 5m
 maxUnhealthy: 100%
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: worker
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 20s
 - type: Ready
   status: 'False'
   timeout: 20s
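The unhealthyConditions above mean a node is considered unhealthy once its Ready condition has been Unknown or False for longer than 20s. A rough sketch of that check, with illustrative names (not the controller's actual code):

```python
from datetime import datetime, timedelta, timezone

# (type, status, timeout) tuples mirroring the MachineHealthCheck spec above
UNHEALTHY_CONDITIONS = [
    ("Ready", "Unknown", timedelta(seconds=20)),
    ("Ready", "False", timedelta(seconds=20)),
]

def is_unhealthy(node_conditions, now):
    """node_conditions: list of (type, status, last_transition_time).

    A node is unhealthy if any condition matches an unhealthyConditions
    entry and has been in that state longer than the entry's timeout.
    """
    for ctype, status, last_transition in node_conditions:
        for utype, ustatus, timeout in UNHEALTHY_CONDITIONS:
            if ctype == utype and status == ustatus and now - last_transition > timeout:
                return True
    return False
```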

6. Manually stop the kubelet service on the target machine, or better still, hold it in a state where the machine shows as powered on but is no longer reporting into Kubernetes (for this I usually reboot the virtual BMH and hold it at the GRUB prompt). After the 40s Kubernetes node-monitor timeout plus the 20s MachineHealthCheck timeout, you should see the remediation starting:

$ oc logs -n openshift-machine-api machine-api-controllers-78474bd74b-hvfgz -c machine-healthcheck-controller -f

I0310 11:49:04.669780       1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:04.670264       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:04.670430       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:04.670694       1 machinehealthcheck_controller.go:346] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: is likely to go unhealthy in 1.329319222s
I0310 11:49:04.671999       1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2,  max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:04.708765       1 machinehealthcheck_controller.go:259] Reconciling openshift-machine-api/workers: some targets might go unhealthy. Ensuring a requeue happens in 1.329319222s
I0310 11:49:06.014136       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/workers
I0310 11:49:06.014673       1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:06.016758       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:06.017017       1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:06.017231       1 machinehealthcheck_controller.go:660] openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: unhealthy: condition Ready in state False longer than {10s}
I0310 11:49:06.018255       1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2,  max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:06.057770       1 machinehealthcheck_controller.go:244] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: meet unhealthy criteria, triggers remediation
I0310 11:49:06.058069       1 machinehealthcheck_controller.go:492]  openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: start remediation logic
I0310 11:49:06.058261       1 machinehealthcheck_controller.go:563] Machine cnv-pdkrb-worker-0-gtbkc has been unhealthy for too long, adding external annotation
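The "Remediations are allowed" lines above reflect the maxUnhealthy gate: remediation proceeds only while the unhealthy count stays within the configured percentage of total targets (here 100%, so it always proceeds). Roughly, as an illustrative sketch rather than the controller's code:

```python
def remediation_allowed(total_targets, unhealthy_targets, max_unhealthy_percent):
    """Short-circuit check: with maxUnhealthy 100%, any number of unhealthy
    targets may be remediated; lower percentages cap remediation so a mass
    outage doesn't trigger a remediation storm."""
    allowed = total_targets * max_unhealthy_percent // 100
    return unhealthy_targets <= allowed
```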

Then the CAPBM remediation annotation should be added to the BMH object automatically (the user does not add this; it is automated by the CAPBM):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    reboot.metal3.io/capbm-requested-power-off: '{"mode":"hard"}'
  selfLink: >-

The BMH should then immediately power down, the node will be deleted (to allow workload recovery), and the machine will come back up and re-register with the cluster, fully automated.

Comment 6 errata-xmlrpc 2021-07-27 22:52:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

