Bug 1814589

Summary: NodeStartupTimeout should be user configurable
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: Other Providers
Reporter: Joel Speed <jspeed>
Assignee: Joel Speed <jspeed>
QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: jhou, miyadav
Version: 4.5
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Doc Text:
Feature: Add nodeStartupTimeout to the MachineHealthCheck specification.
Reason: The MHC remediates a Machine if it does not start up within a given time period. Previously this period was fixed at 10 minutes; it is now configurable via the new field and defaults to the original 10 minutes.
Result: Users can now configure how long to allow a Machine to start before assuming it needs remediation.
Clone Of: 1814582
Last Closed: 2020-07-13 17:22:24 UTC
Bug Blocks: 1814582

Description Joel Speed 2020-03-18 10:34:33 UTC
+++ This bug was initially created as a clone of Bug #1814582 +++

Description of problem:

If a Machine does not get a Node within 10 minutes, the MHC determines the Machine to be unhealthy and remediates it.

This 10-minute default is currently hard-coded and should instead be user configurable so that different platforms can specify different values.
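
For illustration, once configurable the field would be set on the MachineHealthCheck spec roughly like this (a sketch only; the 20m value and the object name are made up, and required fields such as selector and unhealthyConditions are omitted):

---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-mhc               # hypothetical name
  namespace: openshift-machine-api
spec:
  # How long to wait for a Machine to get a Node before the MHC considers it
  # unhealthy; when omitted, it defaults to the original 10 minutes.
  nodeStartupTimeout: 20m         # example platform-specific value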

Version-Release number of selected component (if applicable):
4.4
MachineHealthCheck

How reproducible:
Prevent a Machine from getting a Node for 10 minutes

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

This is related to https://github.com/openshift/machine-api-operator/pull/501

Comment 1 Milind Yadav 2020-03-18 10:52:26 UTC
Description of problem:

If a Machine does not get a Node within 10 minutes, the MHC determines the Machine to be unhealthy and remediates it.

This 10-minute default is currently hard-coded and should instead be user configurable so that different platforms can specify different values.

[miyadav@miyadav cloudtestcasefiles]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-03-17-225152   True        False         7h31m   Cluster version is 4.5.0-0.nightly-2020-03-17-225152

Steps:
1. Create an MHC to monitor the existing MachineSet:
---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: mh1
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-1803-sltw6
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: miyadav-1803-sltw6-worker-us-east-2a
  nodeStartupTimeout: 2m
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s
    - type: Ready
      status: Unknown
      timeout: 300s
  maxUnhealthy: 2

mhc mh1 got created successfully

2. Set the MHC field nodeStartupTimeout to the time to test (for example, 12m) so that remediation starts only after that period, then restart/delete the machine in the monitored MachineSet.

Use oc edit mhc mh1 and update nodeStartupTimeout to ~6m; the resulting spec excerpt is sketched below.

mhc mh1 updated successfully
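
The edited object then contains (a sketch of just the changed spec excerpt; other fields are as created in step 1):

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: mh1
  namespace: openshift-machine-api
spec:
  # Remediation of a Machine that has no Node now starts only after 6 minutes
  # instead of the 10-minute default.
  nodeStartupTimeout: 6m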

3. Monitor the machine-healthcheck controller logs to validate that remediation starts only after nodeStartupTimeout has elapsed:

oc logs -f <machine-api-controllers pod> -n openshift-machine-api

Actual & Expected:
I0318 08:46:54.780790       1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:54.780880       1 machinehealthcheck_controller.go:434]  openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:54.780911       1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:54.791150       1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 6m0.227788757s
I0318 08:46:56.290424       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/mh1
I0318 08:46:56.290479       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/mh1: finding targets
I0318 08:46:56.290668       1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: health checking
I0318 08:46:56.290710       1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: health checking
I0318 08:46:56.290728       1 machinehealthcheck_controller.go:288] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: is likely to go unhealthy in 5m58.709282835s
I0318 08:46:56.300152       1 machinehealthcheck_controller.go:201] Reconciling openshift-machine-api/mh1: monitoring MHC: total targets: 2,  maxUnhealthy: 2, unhealthy: 2. Remediations are allowed
I0318 08:46:56.300205       1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:56.300218       1 machinehealthcheck_controller.go:434]  openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:56.300233       1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:56.311066       1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 5m58.709282835s
I0318 08:46:56.311141       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/mh1
I0318 08:46:56.311250       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/mh1: finding targets
I0318 08:46:56.311434       1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: health checking
I0318 08:46:56.311455       1 machinehealthcheck_controller.go:288] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: is likely to go unhealthy in 5m58.688555109s
I0318 08:46:56.311484       1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: health checking
I0318 08:46:56.317180       1 machinehealthcheck_controller.go:201] Reconciling openshift-machine-api/mh1: monitoring MHC: total targets: 2,  maxUnhealthy: 2, unhealthy: 2. Remediations are allowed
I0318 08:46:56.317237       1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:56.317250       1 machinehealthcheck_controller.go:434]  openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:56.317264       1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:56.327079       1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 5m58.688555109s
.
.
.

Additional info:

Also tested with nodeStartupTimeout set to 2m; since that leaves very little time for the node to come back up, the MHC keeps remediating the machine continuously.
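
A sketch of a setting that avoids this churn is to keep the timeout at (or above) the default called out in this report:

spec:
  # Values shorter than the typical provision + boot + node-join time make the
  # MHC delete the Machine over and over; 10m is the original default.
  nodeStartupTimeout: 10m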

Comment 4 errata-xmlrpc 2020-07-13 17:22:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409