Bug 1834172 - Node Health Check does not account for different remediation strategies
Summary: Node Health Check does not account for different remediation strategies
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.5.0
Assignee: Yadan Pei
QA Contact: Yadan Pei
Depends On:
Reported: 2020-05-11 08:46 UTC by Rastislav Wagner
Modified: 2020-07-13 17:37 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2020-07-13 17:36:48 UTC
Target Upstream Version:

Attachments
MachineHealthCheck external-baremetal (462.72 KB, image/png)
2020-05-18 06:47 UTC, Yadan Pei

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift console pull 5373 | 0 | None | closed | Bug 1834172: Detect remediation strategy for Machine | 2020-06-24 02:15:33 UTC
Red Hat Product Errata | RHBA-2020:2409 | 0 | None | None | None | 2020-07-13 17:37:26 UTC

Description Rastislav Wagner 2020-05-11 08:46:30 UTC
A MachineHealthCheck CR can be set to the `external-baremetal` remediation strategy, which means that the machine is rebooted instead of reprovisioned. The Health Checks item on the Node Overview page does not recognize this strategy and always shows that reprovisioning is pending.
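
The expected behavior could be sketched roughly as follows. This is only an illustration, not the console's actual code: the type, the function names, and the message strings are hypothetical, and it assumes that the strategy is carried in a `machine.openshift.io/remediation-strategy` annotation on the MachineHealthCheck.

```typescript
// Hypothetical, trimmed-down shape of a MachineHealthCheck resource.
type MachineHealthCheck = {
  metadata: {
    name: string;
    annotations?: Record<string, string>;
  };
};

// Assumption: the remediation strategy is stored in this annotation.
const REMEDIATION_ANNOTATION = 'machine.openshift.io/remediation-strategy';
const EXTERNAL_BAREMETAL = 'external-baremetal';

// Illustrative message strings, not the console's real labels.
type PendingAction = 'Reboot pending' | 'Reprovision pending';

// True when the health check asks for a reboot rather than a reprovision.
const isExternalRemediation = (mhc: MachineHealthCheck): boolean =>
  mhc.metadata.annotations?.[REMEDIATION_ANNOTATION] === EXTERNAL_BAREMETAL;

// Pick the pending-action message based on the detected strategy.
const getPendingAction = (mhc: MachineHealthCheck): PendingAction =>
  isExternalRemediation(mhc) ? 'Reboot pending' : 'Reprovision pending';
```

With this shape, a health check without the annotation falls back to the reprovision message, which matches the default behavior described above.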

Comment 3 Yadan Pei 2020-05-18 06:47:09 UTC
Created attachment 1689534
MachineHealthCheck external-baremetal

1. Create a MachineHealthCheck with the YAML below, which targets one machine:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: qe-ui4-l99xm
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: qe-ui4-l99xm-w-c
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: 'False'
      timeout: 300s
  maxUnhealthy: 40%

2. Go to the cloud provider console and stop the instance targeted by the MHC.
3. The node first becomes NotReady; after 300s the MHC reports the accurate status. On the Node Overview page's Status card, click 'Health Checks': it shows `Warning alert:Reboot pending` instead of `reprovisioning pending`.
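
The 300s wait in step 3 comes from the `timeout` values in the YAML above. A minimal sketch of that check, assuming a node condition counts as unhealthy once it has matched a configured type and status for at least the timeout, might look like this. The types and function names are hypothetical, and only timeouts given in whole seconds (such as `300s`) are handled.

```typescript
// Hypothetical, simplified shapes for this illustration only.
type UnhealthyCondition = { type: string; status: string; timeout: string };
type NodeCondition = { type: string; status: string; lastTransitionTime: string };

// Parse a timeout such as "300s"; other duration units are out of scope here.
const parseSeconds = (timeout: string): number => {
  const match = /^(\d+)s$/.exec(timeout);
  if (!match) {
    throw new Error(`unsupported timeout: ${timeout}`);
  }
  return Number(match[1]);
};

// True when any node condition has matched a configured unhealthy condition
// for at least that condition's timeout.
const isRemediationDue = (
  nodeConditions: NodeCondition[],
  unhealthy: UnhealthyCondition[],
  now: Date,
): boolean =>
  unhealthy.some((u) =>
    nodeConditions.some(
      (c) =>
        c.type === u.type &&
        c.status === u.status &&
        now.getTime() - new Date(c.lastTransitionTime).getTime() >=
          parseSeconds(u.timeout) * 1000,
    ),
  );
```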

Verified on 4.5.0-0.nightly-2020-05-17-220731

Comment 4 errata-xmlrpc 2020-07-13 17:36:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

