Bug 1816606

Summary:	MHC MaxUnhealthy string value can have unexpected behaviour
Product:	OpenShift Container Platform	Reporter:	Joel Speed <jspeed>
Component:	Cloud Compute	Assignee:	Joel Speed <jspeed>
Cloud Compute sub component:	Other Providers	QA Contact:	Milind Yadav <miyadav>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	low
Priority:	unspecified	CC:	agarcial, jhou
Version:	4.4
Target Milestone:	---
Target Release:	4.4.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: A value for the MaxUnhealthy field on a MachineHealthCheck can take multiple value formats (eg 10, "10", "10%). Any quoted value was interpreted as a percent value even if it did not contain a percentage sign. Consequence: The interpreted value of MaxUnhealthy may not have matched the users intention and Machines may have been remediated when they were not meant to be/may have not been remediated when they were meant to be. Fix: Check if the value contains a percentage sign before marking the value as a percentage value. Result: A value of 10 or "10" now has the same behaviour and "10" is not interpreted as "10%".	Story Points:	---
Clone Of:	1812862	Environment:
Last Closed:	2020-06-29 15:33:54 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1812862
Bug Blocks:

Description Joel Speed 2020-03-24 11:08:30 UTC

+++ This bug was initially created as a clone of Bug #1812862 +++

Description of problem:

When checking if remediations are allowed, there is a difference in behaviour between `{"maxUnhealthy": 1}` and `{"maxUnhealthy": "1"}`, but no difference in behaviour between `{"maxUnhealthy": "1"}` and `{"maxUnhealthy": "1%"}`.

The code that checks if the value should be a int or a percentage does not check for the presence of a percentage symbol and assumes that any string should be a percentage

This can cause unexpected behaviour as users could specify `"1"` expecting this to allow 1 unhealthy machine, but in-fact it will only allow at most 1% of the total number of Machines


Version-Release number of selected component (if applicable):

4.3, 4.4, 4.5
MachineHealthCheck


How reproducible:

Create a MachineHealthCheck and specify `maxUnhealthy: "1"` including the quotes, then attempt to get the MHC controller to remediate an unhealthy node


Actual results:

Remediation is blocked as the number of unhealthy nodes exceeds the threshold


Expected results:

The unhealthy machine should be remediated

Additional info:

This is not a problem in most cases where intstr is used as they check with validation that any string is a valid percentage, we don not have validation on our types so this is more difficult to achieve

Fix will likely include copying some code from the intstr package and ensuring that the percentage symbol is checked for before asserting that the value is a percentage

Comment 1 Joel Speed 2020-04-20 11:18:29 UTC

PR https://github.com/openshift/machine-api-operator/pull/539 will update BZ once 4.4.z release branch opens

Comment 4 Milind Yadav 2020-06-22 04:30:36 UTC

Validated at :
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-06-21-210301   True        False         36m     Cluster version is 4.4.0-0.nightly-2020-06-21-210301


Step 1 . Create a mhc with maxUnhealthy value as “1”  refer yaml :
---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  creationTimestamp: "2020-02-14T09:47:08Z"
  generation: 1
  name: mhc1
  namespace: openshift-machine-api
  resourceVersion: "71059"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc-miyadav-1402-drlvf-worker-us-east-2c
  uid: ef74b735-e58e-4c24-aa69-015d90998b77
spec:
  maxUnhealthy: "1"
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-0622-cpsfs
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: miyadav-0622-cpsfs-worker-us-east-2a
  unhealthyConditions:
    - 
      status: "False"
      timeout: 300s
      type: Ready
    - 
      status: Unknown
      timeout: 300s
      type: Ready

[miyadav@miyadav bugzilla]$ oc create -f mhc_1816606.yml 
machinehealthcheck.machine.openshift.io/mhc1 created

[miyadav@miyadav bugzilla]$ oc get mhc
NAME   MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
mhc1   1              1                  1


Step 2:[miyadav@miyadav bugzilla]$ oc  delete machine miyadav-0622-cpsfs-worker-us-east-2a-72qnw
machine.machine.openshift.io "miyadav-0622-cpsfs-worker-us-east-2a-72qnw" deleted
 , check the logs
      

oc logs -f machine-api-controllers-77d9ccd587-d6hp6 -c machine-healthcheck-controller 
.
.
.I0622 04:15:52.678252       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0622 04:15:52.678389       1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: health checking
I0622 04:15:52.678452       1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: is likely to go unhealthy in 5m0.321562927s
I0622 04:15:52.685076       1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 1,  maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0622 04:15:52.685114       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 5m0.321562927s
I0622 04:15:53.027627       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1
I0622 04:15:53.028123       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0622 04:15:53.028245       1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: health checking
I0622 04:15:53.028278       1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: is likely to go unhealthy in 4m59.971736102s
I0622 04:15:53.037795       1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 1,  maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0622 04:15:53.037830       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 4m59.971736102s
I0622 04:16:02.602147       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1
I0622 04:16:02.602182       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0622 04:16:02.602263       1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: health checking
I0622 04:16:02.602288       1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: is likely to go unhealthy in 4m50.397725768s
I0622 04:16:02.608958       1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 1,  maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0622 04:16:02.608992       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 4m50.397725768s
I0622 04:16:52.747694       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1
I0622 04:16:52.748552       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0622 04:16:52.748671       1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: health checking
I0622 04:16:52.748701       1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: is likely to go unhealthy in 4m0.251310018s
I0622 04:16:52.755874       1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 1,  maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0622 04:16:52.755969       1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 4m0.251310018s
.
.
.
Actual & Expected:Remediation happened successfully as maxUnhealthy value is 1 

 Step 3: Edit the mhc mhc1 with value of maxUnhealthy as “1%”

[miyadav@miyadav bugzilla]$ oc edit mhc mhc1
machinehealthcheck.machine.openshift.io/mhc1 edited


Step 4: Repeat step 2 

Step 5 : Monitor mhc logs  , oc logs -f machine-api-controllers-77d9ccd587-d6hp6 -c machine-healthcheck-controller 

I0622 04:27:51.415666       1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-5pb2d/: health checking
I0622 04:27:51.415716       1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-5pb2d/: is likely to go unhealthy in 9m56.584292844s
W0622 04:27:51.421869       1 machinehealthcheck_controller.go:182] Reconciling openshift-machine-api/mhc1: total targets: 2,  maxUnhealthy: 1%, unhealthy: 2. Short-circuiting remediation

Actual:Remediation did not happen as maxUnhealthy value is 1 percent
Expected : Remediation should not happen as maxunhealthy value is 1 percent ( exceeded max condition to allow remediation)

Additional Info:
Moved to VERIFIED
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-28859

Comment 6 errata-xmlrpc 2020-06-29 15:33:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2713