Bug 1814589
| Summary: | NodeStartupTimeout should be user configurable | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Joel Speed <jspeed> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | jhou, miyadav |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Feature: Add nodeStartupTimeout to the MachineHealthCheck specification. Reason: The MHC remediates a Machine if it does not start up within a given time period. Previously this was fixed at 10 minutes; it is now configurable via the new field, defaulting to the original 10 minutes. Result: Users can now configure how long to allow a Machine to start before assuming it needs remediation. | | |
| Story Points: | --- | | |
| Clone Of: | 1814582 | Environment: | |
| Last Closed: | 2020-07-13 17:22:24 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1814582 | | |
Description
Joel Speed
2020-03-18 10:34:33 UTC
Description of problem:
If a Machine does not get a Node within 10 minutes, the MHC determines the Machine to be unhealthy and remediates it.
This 10-minute default is currently hard-coded and should instead be user configurable so that different platforms can specify different values.
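For context, a minimal sketch of where the new field lives, assuming it sits directly under spec as in the test manifest further below (the resource name example-mhc is illustrative; 10m is the previously hard-coded default that the field now exposes):
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-mhc              # illustrative name, not from this test
  namespace: openshift-machine-api
spec:
  # How long the MHC waits for a Machine to get a Node before
  # considering it unhealthy; defaults to 10m when omitted.
  nodeStartupTimeout: 10m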
[miyadav@miyadav cloudtestcasefiles]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.5.0-0.nightly-2020-03-17-225152 True False 7h31m Cluster version is 4.5.0-0.nightly-2020-03-17-225152
Steps:
1. Create an MHC to monitor the existing machineset:
---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: mh1
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-1803-sltw6
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: miyadav-1803-sltw6-worker-us-east-2a
  nodeStartupTimeout: 2m
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s
    - type: Ready
      status: Unknown
      timeout: 300s
  maxUnhealthy: 2
MHC mh1 was created successfully.
2. Set the MHC field nodeStartupTimeout to the value under test (e.g. 12m) so that remediation starts only after that period has elapsed.
Restart or delete the machine in the monitored machineset.
Use oc edit mhc mh1, then update nodeStartupTimeout to ~6m (the resulting spec fragment is sketched below).
MHC mh1 was updated successfully.
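For illustration, the relevant fragment of the MHC spec after the oc edit above, with the value used in this test (other fields unchanged):
spec:
  # a Machine without a Node is now only remediated after ~6 minutes
  nodeStartupTimeout: 6m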
3. Monitor the MHC controller logs to validate that remediation starts only after the nodeStartupTimeout period has elapsed.
oc logs -f mhc mh1
Actual & Expected:
I0318 08:46:54.780790 1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:54.780880 1 machinehealthcheck_controller.go:434] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:54.780911 1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:54.791150 1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 6m0.227788757s
I0318 08:46:56.290424 1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/mh1
I0318 08:46:56.290479 1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/mh1: finding targets
I0318 08:46:56.290668 1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: health checking
I0318 08:46:56.290710 1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: health checking
I0318 08:46:56.290728 1 machinehealthcheck_controller.go:288] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: is likely to go unhealthy in 5m58.709282835s
I0318 08:46:56.300152 1 machinehealthcheck_controller.go:201] Reconciling openshift-machine-api/mh1: monitoring MHC: total targets: 2, maxUnhealthy: 2, unhealthy: 2. Remediations are allowed
I0318 08:46:56.300205 1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:56.300218 1 machinehealthcheck_controller.go:434] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:56.300233 1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:56.311066 1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 5m58.709282835s
I0318 08:46:56.311141 1 machinehealthcheck_controller.go:149] Reconciling openshift-machine-api/mh1
I0318 08:46:56.311250 1 machinehealthcheck_controller.go:162] Reconciling openshift-machine-api/mh1: finding targets
I0318 08:46:56.311434 1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: health checking
I0318 08:46:56.311455 1 machinehealthcheck_controller.go:288] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-62smj/: is likely to go unhealthy in 5m58.688555109s
I0318 08:46:56.311484 1 machinehealthcheck_controller.go:274] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: health checking
I0318 08:46:56.317180 1 machinehealthcheck_controller.go:201] Reconciling openshift-machine-api/mh1: monitoring MHC: total targets: 2, maxUnhealthy: 2, unhealthy: 2. Remediations are allowed
I0318 08:46:56.317237 1 machinehealthcheck_controller.go:210] Reconciling openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: meet unhealthy criteria, triggers remediation
I0318 08:46:56.317250 1 machinehealthcheck_controller.go:434] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: start remediation logic
I0318 08:46:56.317264 1 machinehealthcheck_controller.go:458] openshift-machine-api/mh1/miyadav-1803-sltw6-worker-us-east-2a-whbsk/ip-10-0-133-191.us-east-2.compute.internal: deleting
I0318 08:46:56.327079 1 machinehealthcheck_controller.go:225] Reconciling openshift-machine-api/mh1: some targets might go unhealthy. Ensuring a requeue happens in 5m58.688555109s
...
Additional info:
Also tested with nodeStartupTimeout set to 2m; since that is too little time for the node to come back up, the MHC keeps remediating continuously.
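Given that behaviour, the timeout needs to be comfortably longer than the platform's typical machine boot plus node registration time; a sketch with an assumed, purely illustrative value:
spec:
  # must exceed typical boot + node-join time, otherwise the MHC keeps
  # deleting the Machine before its Node can register (as seen with 2m above)
  nodeStartupTimeout: 15m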
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409