This bug was initially created as a copy of Bug #1936844.

I am copying this bug because we need a 4.8 BZ to track the CAPBM changes that were recently introduced here: https://github.com/openshift/cluster-api-provider-baremetal/pull/138. This code has merged into 4.8 upstream and will need to go downstream so we can pull it back into 4.7.

This work is split into two components:
1) The BMO changes to support the reboot mode
2) The CAPBM changes to apply the hard reboot mode for remediation purposes

Additional info:

PR that pulled the BMO changes into 4.8:
https://github.com/openshift/baremetal-operator/pull/128
https://bugzilla.redhat.com/show_bug.cgi?id=1927678

PR for backporting the above BMO changes into 4.7:
https://github.com/openshift/baremetal-operator/pull/130
https://bugzilla.redhat.com/show_bug.cgi?id=1936407

PR that pulled the CAPBM changes into 4.8:
https://github.com/openshift/cluster-api-provider-baremetal/pull/138

PR for backporting the above CAPBM changes into 4.7:
https://github.com/openshift/cluster-api-provider-baremetal/pull/144
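For context on how the two components fit together: BMO honours a reboot mode carried in the metal3 reboot annotation on the BareMetalHost, and CAPBM sets that mode to hard when remediating a host. The same request can also be made by hand from the CLI. This is a rough sketch (not part of the original notes), using the host name and annotation key that appear in the verification steps below:

# Request a hard power-off of the host (annotation key and host name as used in the steps below)
$ oc annotate bmh worker2 -n openshift-machine-api 'reboot.metal3.io/testing={"mode":"hard"}'

# Remove the request again so the host is powered back on
$ oc annotate bmh worker2 -n openshift-machine-api reboot.metal3.io/testing-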
Additional information for verification purposes.

1. Make sure test clusters have the right code pulled:

   - For 4.8 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/138 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/128 (BMO).
   - For 4.7 clusters, make sure you've got https://github.com/openshift/cluster-api-provider-baremetal/pull/144 (CAPBM) *and* https://github.com/openshift/baremetal-operator/pull/132 (BMO).

2. Manually set a worker baremetal machine to power down via the UI. You'll see that this defaults to a *soft* power off (the OS is powered down gracefully) and that the host is held down until the node is powered back on via the UI:

$ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i soft
{"level":"info","ts":1615376435.4487739,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"soft","reboot process":true}
{"level":"info","ts":1615376435.448852,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: soft)","host":"worker2"}
{"level":"info","ts":1615376435.4488945,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"soft power off\" command","host":"worker2"}

3. After powering the node back on via the UI, apply a manual annotation to the worker BMH object to verify that a hard power-off works (the OS is *not* powered down gracefully and is immediately shut down):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    reboot.metal3.io/testing: '{"mode":"hard"}'
  selfLink: >-
    (...)

$ oc logs metal3-595cd58b4c-47xqf -n openshift-machine-api -c metal3-baremetal-operator -f | grep -i hard
{"level":"info","ts":1615376529.5041094,"logger":"controllers.BareMetalHost","msg":"power state change needed","baremetalhost":"openshift-machine-api/worker2","provisioningState":"provisioned","expected":false,"actual":true,"reboot mode":"hard","reboot process":true}
{"level":"info","ts":1615376529.5041282,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: hard)","host":"worker2"}
{"level":"info","ts":1615376529.5041316,"logger":"provisioner.ironic","msg":"ensuring host is powered off by \"hard power off\" command","host":"worker2"}

4. Remove the annotation, save the YAML definition, and watch the machine be powered back on immediately. Wait for it to come back online.

5. Apply a MachineHealthCheck to verify that CAPBM applies the hard reboot flag automatically (note that the following is quite aggressive, but fine for testing):

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: workers
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  nodeStartupTimeout: 5m
  maxUnhealthy: 100%
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 20s
  - type: Ready
    status: 'False'
    timeout: 20s

6. Manually stop the kubelet service on the target machine (a CLI sketch for this follows below), or even better hold it in a position where the machine shows as powered on but is no longer reporting into k8s (for this I usually reboot the virtual BMH and hold it at the grub prompt).
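For step 6, if you'd rather not hold the host at the grub prompt, the kubelet can be stopped from a debug pod. This is a rough sketch (not part of the original notes); the node name is assumed from the remediation logs below:

# Stop the kubelet on the unhealthy worker via a debug pod (node name assumed from the logs below)
$ oc debug node/ocp4-worker2.cnv.example.com -- chroot /host systemctl stop kubelet

# Watch the node drop to NotReady before the MachineHealthCheck timeouts kick in
$ oc get nodes -w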
After the 40s k8s timeout and the 20s MachineHealthCheck timeout, you should see the remediation starting:

$ oc logs -n openshift-machine-api machine-api-controllers-78474bd74b-hvfgz -c machine-healthcheck-controller -f
I0310 11:49:04.669780 1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:04.670264 1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:04.670430 1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:04.670694 1 machinehealthcheck_controller.go:346] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: is likely to go unhealthy in 1.329319222s
I0310 11:49:04.671999 1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2, max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:04.708765 1 machinehealthcheck_controller.go:259] Reconciling openshift-machine-api/workers: some targets might go unhealthy. Ensuring a requeue happens in 1.329319222s
I0310 11:49:06.014136 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/workers
I0310 11:49:06.014673 1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/workers: finding targets
I0310 11:49:06.016758 1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-dxmgq/ocp4-worker1.cnv.example.com: health checking
I0310 11:49:06.017017 1 machinehealthcheck_controller.go:332] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: health checking
I0310 11:49:06.017231 1 machinehealthcheck_controller.go:660] openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: unhealthy: condition Ready in state False longer than {10s}
I0310 11:49:06.018255 1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/workers: total targets: 2, max unhealthy: 100%, unhealthy targets: 1
I0310 11:49:06.057770 1 machinehealthcheck_controller.go:244] Reconciling openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: meet unhealthy criteria, triggers remediation
I0310 11:49:06.058069 1 machinehealthcheck_controller.go:492] openshift-machine-api/workers/cnv-pdkrb-worker-0-gtbkc/ocp4-worker2.cnv.example.com: start remediation logic
I0310 11:49:06.058261 1 machinehealthcheck_controller.go:563] Machine cnv-pdkrb-worker-0-gtbkc has been unhealthy for too long, adding external annotation

Then the CAPBM remediation annotation should be automatically added to the BMH object (the user doesn't add this; it's automated by CAPBM):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    reboot.metal3.io/capbm-requested-power-off: '{"mode":"hard"}'
  selfLink: >-

The BMH should then immediately power down, the node will be deleted (to allow workload recovery), and the host will come back up and re-register with the cluster, fully automated.
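While the remediation runs, it helps to watch the BMH object directly to confirm the annotation and the power cycle. A minimal sketch (not part of the original notes), assuming the host is named worker2 as in the baremetal-operator logs above:

# Confirm CAPBM has added the hard power-off request to the BareMetalHost
$ oc get bmh worker2 -n openshift-machine-api -o jsonpath='{.metadata.annotations}' | grep capbm-requested-power-off

# Follow the power state reported for the host while it goes down and comes back
$ oc get bmh worker2 -n openshift-machine-api -o jsonpath='{.status.poweredOn}'
$ oc get bmh -n openshift-machine-api -w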
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438