Bug 1937122
| Summary: | CAPBM changes to support flexible reboot modes | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Rhys Oxenham <roxenham> |
| Component: | Bare Metal Hardware Provisioning | Assignee: | Rhys Oxenham <roxenham> |
| Bare Metal Hardware Provisioning sub component: | baremetal-operator | QA Contact: | Shelly Miron <smiron> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | beth.white, smiron |
| Version: | 4.8 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Feature: Adds the ability for the CAPBM to request a hard power-off during remediation, leveraging the recent baremetal-operator changes that support new reboot modes.<br>Reason: The baremetal-operator was recently extended to support flexible reboot modes, either hard or soft. The default is a soft reboot, but for remediation we want a hard power-off so that workloads can be recovered as quickly as possible.<br>Result: The CAPBM requests a hard reboot when remediation is required, bypassing the default soft power-off that the baremetal-operator would otherwise issue. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:52:25 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1936844 | | |
Description
Rhys Oxenham
2021-03-09 21:50:59 UTC
Additional information for verification purposes.

1. Make sure the test clusters have the right code pulled in:
   - For 4.8 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/138 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/128 (BMO).
   - For 4.7 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/144 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/132 (BMO).

2. Manually set a worker baremetal machine to power down via the UI. You'll notice that this defaults to a *soft* power off (the OS is shut down gracefully), and the host is held down until the node is powered back on by the UI client:

   ```
   $ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i soft
   {"level":"info","ts":1615376435.4487739,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"soft","reboot process":true}
   {"level":"info","ts":1615376435.448852,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: soft)","host":"worker2"}
   {"level":"info","ts":1615376435.4488945,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"soft power off\" command","host":"worker2"}
   ```

3. After powering the node back on via the UI, apply a manual annotation to the worker BMH object to verify that hard power-off works (the OS is *not* shut down gracefully and is immediately powered off); see the command sketch after this list for an `oc annotate` alternative to editing the YAML:

   ```yaml
   apiVersion: metal3.io/v1alpha1
   kind: BareMetalHost
   metadata:
     annotations:
       reboot.metal3.io/testing: '{"mode":"hard"}'
     selfLink: >-
       (...)
   ```

   ```
   $ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i hard
   {"level":"info","ts":1615376529.5041094,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"hard","reboot process":true}
   {"level":"info","ts":1615376529.5041282,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: hard)","host":"worker2"}
   {"level":"info","ts":1615376529.5041316,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"hard power off\" command","host":"worker2"}
   ```

4. Remove the annotation, save the YAML definition, and watch the machine immediately power back on. Wait for it to come back online.

5. Apply a MachineHealthCheck to verify that the CAPBM applies the hard reboot flag automatically (note: the following is quite aggressive, but fine for testing):

   ```yaml
   apiVersion: machine.openshift.io/v1beta1
   kind: MachineHealthCheck
   metadata:
     name: workers
     namespace: openshift-machine-api
     annotations:
       machine.openshift.io/remediation-strategy: external-baremetal
   spec:
     nodeStartupTimeout: 5m
     maxUnhealthy: 100%
     selector:
       matchLabels:
         machine.openshift.io/cluster-api-machine-role: worker
     unhealthyConditions:
       - type: Ready
         status: Unknown
         timeout: 20s
       - type: Ready
         status: 'False'
         timeout: 20s
   ```

6. Manually stop the kubelet service on the target machine (see the command sketch after this list), or even better hold it in a state where the machine shows as powered on but is no longer reporting into k8s (for this I usually reboot the virtual BMH and hold it at the GRUB prompt).
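For convenience, the manual annotation in steps 3 and 4 and the kubelet stop in step 6 can also be driven from the command line. The following is only a minimal sketch, not part of the original verification notes: the BMH name (`worker2`), the node hostname (`ocp4-worker2.cnv.example.com`) and the SSH user (`core`) are assumptions based on the logs above, so adjust them for your environment.

```bash
# Hypothetical helper commands; all names below are assumptions taken from the logs above.

# Step 3 (alternative to editing the YAML): add the hard-reboot annotation to the BMH
oc annotate baremetalhost worker2 -n openshift-machine-api \
  reboot.metal3.io/testing='{"mode":"hard"}'

# Step 4 (alternative): remove the annotation again (the trailing "-" deletes it)
oc annotate baremetalhost worker2 -n openshift-machine-api \
  reboot.metal3.io/testing-

# Step 6: stop kubelet on the target node so it stops reporting into k8s
ssh core@ocp4-worker2.cnv.example.com 'sudo systemctl stop kubelet'
```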
After the 40s k8s timeout and the 20s MachineHealthCheck timeout, you should see the remediation start:

```
$ oc logs -n openshift-machine-api machine-api-controllers-78474bd74b-hvfgz -c machine-healthcheck-controller -f
I0310 11:49:04.669780 1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:04.670264 1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:04.670430 1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:04.670694 1 machinehealthcheck_controller.go:346] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: is likely to go unhealthy in 1.329319222s
I0310 11:49:04.671999 1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2, max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:04.708765 1 machinehealthcheck_controller.go:259] Reconciling openshift-machine-api/workers: some targets might go unhealthy. Ensuring a requeue happens in 1.329319222s
I0310 11:49:06.014136 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/workers
I0310 11:49:06.014673 1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:06.016758 1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:06.017017 1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:06.017231 1 machinehealthcheck_controller.go:660] openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: unhealthy: condition Ready in state False longer than {10s}
I0310 11:49:06.018255 1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2, max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:06.057770 1 machinehealthcheck_controller.go:244] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: meet unhealthy criteria, triggers remediation
I0310 11:49:06.058069 1 machinehealthcheck_controller.go:492] openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: start remediation logic
I0310 11:49:06.058261 1 machinehealthcheck_controller.go:563] Machine cnv-pdkrb-worker-0-gtbkc has been unhealthy for too long, adding external annotation
```

The CAPBM remediation annotation should then be added to the BMH object automatically (the user does not add this; it is automated by the CAPBM):

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    reboot.metal3.io/capbm-requested-power-off: '{"mode":"hard"}'
  selfLink: >-
    (...)
```

The BMH should then immediately power down, the node will be deleted (to allow workload recovery), and the host will come back up and re-register with the cluster, fully automated.
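To watch the automated recovery end to end, a few standard `oc` commands are enough. A minimal sketch, assuming the same `worker2` BMH as in the logs above:

```bash
# Confirm the CAPBM added the hard power-off annotation (added automatically, not by the user)
oc get baremetalhost worker2 -n openshift-machine-api -o jsonpath='{.metadata.annotations}'

# Watch the host power down and come back, and the node get deleted and re-register
oc get baremetalhost -n openshift-machine-api -w
oc get machines -n openshift-machine-api -w
oc get nodes -w
```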
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438