Bug 1814589 - NodeStartupTimeout should be user configurable
Summary: NodeStartupTimeout should be user configurable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.5.0
Assignee: Joel Speed
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks: 1814582
 
Reported: 2020-03-18 10:34 UTC by Joel Speed
Modified: 2020-07-13 17:22 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Add nodeStartupTimeout to the MachineHealthCheck specification. Reason: The MHC remediates Machines if they do not start up within a given time period. Previously this period was fixed at 10 minutes; it is now configurable via this new field and defaults to the original 10 minutes. Result: Users can now configure how long to allow a Machine to start before assuming it needs remediation.
Clone Of: 1814582
Environment:
Last Closed: 2020-07-13 17:22:24 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:22:45 UTC

Description Joel Speed 2020-03-18 10:34:33 UTC
+++ This bug was initially created as a clone of Bug #1814582 +++

Description of problem:

If a Machine does not get a Node within 10 minutes, the MHC determines the Machine to be unhealthy and remediates it.

This 10-minute default is currently hard-coded and should instead be user-configurable so that different platforms can specify different values.

Version-Release number of selected component (if applicable):
4.4
MachineHealthCheck

How reproducible:
Prevent a Machine from getting a Node for 10 minutes

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

This is related to https://github.com/openshift/machine-api-operator/pull/501
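
For reference, a minimal sketch of where the new field is expected to sit in a MachineHealthCheck spec (names follow the linked PR; the selector and values below are illustrative only):

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  nodeStartupTimeout: 10m   # new field; defaults to 10m when omitted
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: <machineset-name>
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s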

Comment 1 Milind Yadav 2020-03-18 10:52:26 UTC
Description of problem:

If a Machine does not get a Node within 10 minutes, the MHC determines the Machine to be unhealthy and remediates it.

This 10-minute default is currently hard-coded and should instead be user-configurable so that different platforms can specify different values.

[miyadav@miyadav cloudtestcasefiles]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-03-17-225152   True        False         7h31m   Cluster version is 4.5.0-0.nightly-2020-03-17-225152

Steps:
1. Create an MHC to monitor the existing machineset (manifest and create command below):
---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: mh1
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-1803-sltw6
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: miyadav-1803-sltw6-worker-us-east-2a
  nodeStartupTimeout: 2m
  unhealthyConditions:
    - 
      status: "False"
      timeout: 300s
      type: Ready
    - 
      status: Unknown
      timeout: 300s
      type: Ready
  maxUnhealthy: 2
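
Assuming the manifest above is saved as mhc.yaml (filename is illustrative), it can be created with:

oc create -f mhc.yaml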

mhc mh1 got created successfully

2. Set the MHC field nodeStartupTimeout to <timetotest> (for example, 12m) so that remediation starts only after that period, then restart/delete a machine in the monitored machineset.

Use oc edit mhc mh1 -n openshift-machine-api and update nodeStartupTimeout (~6m was used here).

mhc mh1 updated successfully
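
The same change can also be applied non-interactively; assuming the MHC name and namespace above, something like:

oc patch machinehealthcheck mh1 -n openshift-machine-api --type merge -p '{"spec":{"nodeStartupTimeout":"6m"}}'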

3. Monitor the logs of the machine-healthcheck controller to validate that remediation starts only after nodeStartupTimeout has elapsed.

oc logs -f deployment/machine-api-controllers -c machine-healthcheck-controller -n openshift-machine-api

Actual & Expected :
I0318 08:46:54.780790       1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:54.780880       1 machinehealthcheck_controller.go:434]  openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:54.780911       1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:54.791150       1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 6m0.227788757s
I0318 08:46:56.290424       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/mh1
I0318 08:46:56.290479       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/mh1: finding targets
I0318 08:46:56.290668       1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: health checking
I0318 08:46:56.290710       1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: health checking
I0318 08:46:56.290728       1 machinehealthcheck_controller.go:288] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: is likely to go unhealthy in 5m58.709282835s
I0318 08:46:56.300152       1 machinehealthcheck_controller.go:201] Reconciling openshift-machine-api/mh1: monitoring MHC: total targets: 2,  maxUnhealthy: 2, unhealthy: 2. Remediations are allowed
I0318 08:46:56.300205       1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:56.300218       1 machinehealthcheck_controller.go:434]  openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:56.300233       1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:56.311066       1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 5m58.709282835s
I0318 08:46:56.311141       1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/mh1
I0318 08:46:56.311250       1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/mh1: finding targets
I0318 08:46:56.311434       1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: health checking
I0318 08:46:56.311455       1 machinehealthcheck_controller.go:288] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: is likely to go unhealthy in 5m58.688555109s
I0318 08:46:56.311484       1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: health checking
I0318 08:46:56.317180       1 machinehealthcheck_controller.go:201] Reconciling openshift-machine-api/mh1: monitoring MHC: total targets: 2,  maxUnhealthy: 2, unhealthy: 2. Remediations are allowed
I0318 08:46:56.317237       1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:56.317250       1 machinehealthcheck_controller.go:434]  openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:56.317264       1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:56.327079       1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 5m58.688555109s
.
.
.

Additional info :

Also tested with the time set to 2m; since that is too little time for the node to come back up, the MHC keeps remediating continuously.
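
Since too short a value causes continuous remediation, the configured value can be read back before testing with something like:

oc get machinehealthcheck mh1 -n openshift-machine-api -o jsonpath='{.spec.nodeStartupTimeout}'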

Comment 4 errata-xmlrpc 2020-07-13 17:22:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

