+++ This bug was initially created as a clone of Bug #1814582 +++

Description of problem:
If a Machine does not get a node within 10 minutes, the MHC determines the Machine to be unhealthy and remediates it. This 10-minute default is currently hard-coded and should instead be user-configurable so that different platforms can specify different values.

Version-Release number of selected component (if applicable):
4.4 MachineHealthCheck

How reproducible:
Prevent the Machine from getting a Node for 10 minutes.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This is related to https://github.com/openshift/machine-api-operator/pull/501
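The check requested above can be sketched as follows. This is only an illustration of the described behavior, not the machine-api-operator code: the function name, its parameters, and the strict ">" comparison at the boundary are assumptions made for the example. The only values taken from this report are the 10-minute default and the idea that the timeout should be overridable.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default taken from the bug description; everything else here is hypothetical.
DEFAULT_NODE_STARTUP_TIMEOUT = timedelta(minutes=10)


def startup_timed_out(
    created_at: datetime,
    node_name: Optional[str],
    now: datetime,
    node_startup_timeout: timedelta = DEFAULT_NODE_STARTUP_TIMEOUT,
) -> bool:
    """Return True if a Machine still has no node after the timeout.

    A Machine that already has a node is never flagged by this check;
    its health would be judged by node conditions instead.
    """
    if node_name:
        return False
    # Assumption: the boundary is exclusive (exactly at the timeout is still OK).
    return now - created_at > node_startup_timeout


created = datetime(2020, 3, 18, 8, 0, 0)
eleven_minutes_later = created + timedelta(minutes=11)

# With the hard-coded default, 11 minutes without a node is unhealthy.
print(startup_timed_out(created, None, eleven_minutes_later))
# With a platform-specific 12-minute timeout, the same Machine is still fine.
print(startup_timed_out(created, None, eleven_minutes_later, timedelta(minutes=12)))
```

Passing the timeout as a parameter with a default is what lets slower platforms raise the value without changing behavior for everyone else.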
Description of problem:
If a Machine does not get a node within 10 minutes, the MHC determines the Machine to be unhealthy and remediates it. This 10-minute default is currently hard-coded and should instead be user-configurable so that different platforms can specify different values.

[miyadav@miyadav cloudtestcasefiles]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-03-17-225152   True        False         7h31m   Cluster version is 4.5.0-0.nightly-2020-03-17-225152

Steps:

1. Create an MHC to monitor the existing machineset:

---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: mh1
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-1803-sltw6
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: miyadav-1803-sltw6-worker-us-east-2a
  nodeStartupTimeout: 2m
  unhealthyConditions:
  - status: "False"
    timeout: 300s
    type: Ready
  - status: Unknown
    timeout: 300s
    type: Ready
  maxUnhealthy: 2

MHC mh1 was created successfully.

2. Set the MHC field nodeStartupTimeout to <timetotest> (for example, 12 minutes) so that remediation starts only after that period, then restart or delete a machine in the monitored machineset.

Use oc edit mhc mh1, then update nodeStartupTimeout to ~6m.

MHC mh1 was updated successfully.

3. Monitor the MHC logs to validate that remediation started only after the nodeStartupTimeout period.
oc logs -f mhc mh1

Actual & Expected:

I0318 08:46:54.780790 1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:54.780880 1 machinehealthcheck_controller.go:434] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:54.780911 1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:54.791150 1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 6m0.227788757s
I0318 08:46:56.290424 1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/mh1
I0318 08:46:56.290479 1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/mh1: finding targets
I0318 08:46:56.290668 1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: health checking
I0318 08:46:56.290710 1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: health checking
I0318 08:46:56.290728 1 machinehealthcheck_controller.go:288] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: is likely to go unhealthy in 5m58.709282835s
I0318 08:46:56.300152 1 machinehealthcheck_controller.go:201] Reconciling openshift-machine-api/mh1: monitoring MHC: total targets: 2, maxUnhealthy: 2, unhealthy: 2. Remediations are allowed
I0318 08:46:56.300205 1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:56.300218 1 machinehealthcheck_controller.go:434] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:56.300233 1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:56.311066 1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 5m58.709282835s
I0318 08:46:56.311141 1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/mh1
I0318 08:46:56.311250 1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/mh1: finding targets
I0318 08:46:56.311434 1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: health checking
I0318 08:46:56.311455 1 machinehealthcheck_controller.go:288] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: is likely to go unhealthy in 5m58.688555109s
I0318 08:46:56.311484 1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: health checking
I0318 08:46:56.317180 1 machinehealthcheck_controller.go:201] Reconciling openshift-machine-api/mh1: monitoring MHC: total targets: 2, maxUnhealthy: 2, unhealthy: 2. Remediations are allowed
I0318 08:46:56.317237 1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:56.317250 1 machinehealthcheck_controller.go:434] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:56.317264 1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:56.327079 1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 5m58.688555109s
. . .

Additional info:
Also tested with nodeStartupTimeout set to 2m. Because that is too short for a node to come back up, the MHC keeps remediating continuously.
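The log check in step 3 can be partly automated. The sketch below is a hypothetical helper, not part of oc or the operator: it pulls the "is likely to go unhealthy in" and "Ensuring a requeue happens in" durations out of log text and compares them with the configured nodeStartupTimeout. The duration parser covers only the h/m/s/ms units seen in values such as 6m0.227788757s and 300s, and the one-second slack is an assumption, chosen because the first requeue above is slightly longer than 6m.

```python
import re
from datetime import timedelta
from typing import List

# Seconds per unit for the Go-style duration units handled by this sketch.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed before "m" so that milliseconds are not read as minutes.
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_REMAINING = re.compile(
    r"(?:is likely to go unhealthy in|Ensuring a requeue happens in) (\S+)"
)


def parse_go_duration(text: str) -> timedelta:
    """Parse strings such as '2m', '300s' or '5m58.709282835s'."""
    pos = 0
    total = 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected text in duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"not a duration: {text!r}")
    return timedelta(seconds=total)


def remaining_times(log_text: str) -> List[timedelta]:
    """Collect every time-until-unhealthy or requeue duration in the log."""
    return [parse_go_duration(m.group(1)) for m in _REMAINING.finditer(log_text)]


def consistent_with_timeout(
    log_text: str,
    node_startup_timeout: timedelta,
    slack: timedelta = timedelta(seconds=1),  # assumed tolerance
) -> bool:
    """True if no reported wait exceeds the configured timeout plus slack."""
    return all(d <= node_startup_timeout + slack for d in remaining_times(log_text))


sample = (
    "Reconciling openshift-machine-api/mh1: some targets might go unhealthy. "
    "Ensuring a requeue happens in 6m0.227788757s\n"
    "Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: "
    "is likely to go unhealthy in 5m58.709282835s\n"
)
print(remaining_times(sample))
print(consistent_with_timeout(sample, parse_go_duration("6m")))
```

With nodeStartupTimeout set to 6m, waits of roughly six minutes are what the test expects to see. Waits near ten minutes would suggest the hard-coded default is still in effect.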
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409