1862556 – [OCP 4.6][Machine Health Check] MHC CR accepts negative value for spec.maxUnhealthy

Bug 1862556 - [OCP 4.6][Machine Health Check] MHC CR accepts negative value for spec.maxUnhealthy

Summary: [OCP 4.6][Machine Health Check] MHC CR accepts negative value for spec.maxUnh...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Joel Speed
QA Contact:	Milind Yadav
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1862585
TreeView+	depends on / blocked

Reported:	2020-07-31 17:19 UTC by gharden
Modified:	2020-10-27 16:22 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: A negative value cannot be blocked but would be considered to block remediation even when no machines need remediation. Consequence: An event was produced on every reconcile. Fix: Interpret a negative value as meaning 0. Result: No more spurious log messages.
Clone Of:
Clones:	1862585 (view as bug list)
Environment:
Last Closed:	2020-10-27 16:21:54 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift machine-api-operator pull 650	None	closed	[OCPCLOUD-911] Add OpenAPI Validation for MHC Timeout fields and MaxUnhealthy field	2020-10-15 20:14:48 UTC
Github	openshift machine-api-operator pull 680	None	closed	BUG 1862556: Ensure that negative values for maxUnhealthy do not produce events	2020-10-15 20:14:48 UTC
Red Hat Product Errata	RHBA-2020:4196	None	None	None	2020-10-27 16:22:21 UTC

Description gharden 2020-07-31 17:19:25 UTC

Description of problem:
Machine Health Check CR accepts negative value for spec.maxUnhealthy. spec.macUnhealthy should only accept values greater than or equal to 0

Version-Release number of selected component (if applicable):
4.6

How reproducible:
100%


Steps to Reproduce:
1. oc create -f <mhc cr> with negaitive value for spec.maxUnhealthy

Actual results:
[root@sealusa8 ~]# cat mHc_Ready.yaml 
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: ready-example
 namespace: openshift-machine-api
 annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
 maxUnhealthy: -1
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: worker
 unhealthyConditions:
 - type: Ready
   status: InvalidStatus
   timeout: 60s
[root@sealusa8 ~]# oc create -f mHc_Ready.yaml 
machinehealthcheck.machine.openshift.io/ready-example created
[root@sealusa8 ~]# oc get mhc
NAME            MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
ready-example   -1             2                  2



Expected results:
[root@sealusa8 ~]# cat mHc_Ready.yaml 
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: ready-example
 namespace: openshift-machine-api
 annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
 maxUnhealthy: -1
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: worker
 unhealthyConditions:
 - type: Ready
   status: InvalidStatus
   timeout: 60s
[root@sealusa8 ~]# oc create -f mHc_Ready.yaml 
Error: spec.maxUnhealthy must be greater than or equal to 0

Additional info:

Comment 3 Andrew Beekhof 2020-08-11 03:49:21 UTC

I think this is something for the Cloud team.  
Input validation is not specific to bare metal.

Comment 4 Joel Speed 2020-08-11 09:09:39 UTC

This should no longer be the case in 4.6 since we added https://github.com/openshift/machine-api-operator/pull/650 & https://issues.redhat.com/browse/OCPCLOUD-911

Since that card hasn't yet been verified, I'll move this to modified so that QE are aware of this case when they come to verify that work.

Comment 7 Andrew Beekhof 2020-08-12 03:42:05 UTC

(In reply to Joel Speed from comment #4)
> This should no longer be the case in 4.6 since we added
> https://github.com/openshift/machine-api-operator/pull/650 &
> https://issues.redhat.com/browse/OCPCLOUD-911
> 
> Since that card hasn't yet been verified, I'll move this to modified so that
> QE are aware of this case when they come to verify that work.

Thanks Joel.
Backportable to 4.5 for 1862585?

Comment 8 sunzhaohua 2020-08-12 06:31:09 UTC

Verify failed
clusterversion: 4.6.0-0.nightly-2020-08-12-003456

1. oc create -f <mhc cr> with negaitive value for spec.maxUnhealthy
apiVersion: "machine.openshift.io/v1beta1"
kind: "MachineHealthCheck"
metadata:
  name: mhc2
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: yinzhou-tsc-mggh6
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: yinzhou-tsc-mggh6-worker-us-east-2c
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: -2


$ oc create -f mhc.yml 
machinehealthcheck.machine.openshift.io/mhc2 created

Comment 9 Joel Speed 2020-08-12 10:14:39 UTC

Ok so there is an annoying thing about OpenAPI validation, which is that the `maxUnhealthy` field is an `intstr.IntOrString`, so the field is allowed to be either an integer or a string. Integers will always trump strings in this case due to the ordering of the unmarshalling code (which we won't be able to change as this is core Kubernetes code). 
We can add validation for the string, which we have done, that ensures it is either numeric, or a percentage between 0% and 100%, however, the value in the example is being interpreted as an integer, which skips the string validation.
Because we have string validation, we can't also have integer validation.

These are the options I can see to fix this:
- Restrict the type so that it can only be a string (means you would have to quote as in `maxUnhealthy: "-2"`), this would reduce the UX but mean that we can validate it to not be negative
- Implement a validating webhook that will check the value is great than zero (quite a lot of work for such a small validation)
- Not fix this and warn users somehow (via a condition on the MHC status?) that the value they have entered is invalid.
- Drop the string validation and use integer validation only, this would allow any value in the string which would be worse and more likely to cause errors.

Comment 10 Andrew Beekhof 2020-08-12 11:26:10 UTC

Would it be valid to simply treat all numbers < 0 as 0?

Comment 11 Joel Speed 2020-08-12 11:31:32 UTC

Within the controller I believe we already do so, or at least, they have the same effect. This bug appears to be more about the field value that's being stored in etcd and presented to the user I thought.

I don't think the value being negative causes any issue to the controller operation.

Comment 12 Joel Speed 2020-08-17 11:19:16 UTC

Since the only way to fix this would be to create a webhook, which is a substantial amount of work, and the change is only cosmetic since the controller treats values less than zero as zero anyway, I'm going to close this for now.

If you feel like this should be fixed, please re-open the bug and we will defer to the 4.7 release cycle to introduce a fix.

Comment 13 gharden 2020-08-17 14:09:58 UTC

Hi Joel, 

Not sure what you mean by, "controller treats values less than zero as zero anyway". Because if you set maxUhealthy to 0, you will get a different result than if you set maxUnheathy to -1. In fact, when you set it to a negative value you will automatically get a RemediationRestricted event.

Events:
  Type     Reason                 Age               From                           Message
  ----     ------                 ----              ----                           -------
  Warning  RemediationRestricted  2s (x11 over 7s)  machinehealthcheck-controller  Remediation restricted due to exceeded number of unhealthy machines (total: 2, unhealthy: 0, maxUnhealthy: -1)

Comment 14 Joel Speed 2020-08-17 14:19:07 UTC

> In fact, when you set it to a negative value you will automatically get a RemediationRestricted event.


Ok I hadn't realised this difference, I've double checked the code and can see why this is. I'll add a patch so that we don't create the stream of remediation restricted events when maxUnhealthy is negative.
It wasn't obvious from the previous discussion that this was actually causing an issue, since the calculation logic shouldn't have any functional difference between 0 and -1, either will prevent any unhealthy machine from being remediated.

Comment 16 Milind Yadav 2020-08-24 04:53:22 UTC

VALIDATED ON - 

oc get clusterversion --config aws
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-23-214712   True        False         53m     Cluster version is 4.6.0-0.nightly-2020-08-23-214712


Steps:
1.Created mhc with negative value for maxUnheathy 

mhc created successfully 
2.oc get mhc --config aws
NAME   MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
mhc1   -3             0                  0
mhc2   0              0                  0

3.Create a unhealthy condition and monitor mhc logs - 
Maxunhealthy negative and zero shows same logs:

I0824 04:36:56.039682       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1
I0824 04:36:56.039713       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0824 04:36:56.048189       1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 0,  maxUnhealthy: -3, unhealthy: 0. Remediations are allowed
I0824 04:36:56.048219       1 machinehealthcheck_controller.go:227] Reconciling openshift-machine-api/mhc1: no more targets meet unhealthy criteria
I0824 04:36:56.048243       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1
I0824 04:36:56.048265       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0824 04:36:56.053144       1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 0,  maxUnhealthy: -3, unhealthy: 0. Remediations are allowed
I0824 04:36:56.053161       1 machinehealthcheck_controller.go:227] Reconciling openshift-machine-api/mhc1: no more targets meet unhealthy criteria


..
.
.
I0824 04:48:48.655113       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc2
I0824 04:48:48.655150       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc2: finding targets
I0824 04:48:48.663987       1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc2: monitoring MHC: total targets: 0,  maxUnhealthy: 0, unhealthy: 0. Remediations are allowed
I0824 04:48:48.664059       1 machinehealthcheck_controller.go:227] Reconciling openshift-machine-api/mhc2: no more targets meet unhealthy criteria
I0824 04:48:48.664080       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc2
I0824 04:48:48.664095       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc2: finding targets
I0824 04:48:48.671638       1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc2: monitoring MHC: total targets: 0,  maxUnhealthy: 0, unhealthy: 0. Remediations are allowed
I0824 04:48:48.671659       1 machinehealthcheck_controller.go:227] Reconciling openshift-machine-api/mhc2: no more targets meet unhealthy criteria

Comment 17 Milind Yadav 2020-08-24 04:56:46 UTC

Adding one more scenario that was tested and is good - 
Adding maxunhealthy as "-2" instead of -3


it validates the string and gives expected output 

[miyadav@miyadav aws]$ oc create -f mhc_malformed.yaml --config aws
The MachineHealthCheck "mhc3" is invalid: spec.maxUnhealthy: Invalid value: "": spec.maxUnhealthy in body should match '^((100|[0-9]{1,2})%|[0-9]+)$'

Comment 19 errata-xmlrpc 2020-10-27 16:21:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Note You need to log in before you can comment on or make changes to this bug.