Bug 1862556
Summary: | [OCP 4.6][Machine Health Check] MHC CR accepts negative value for spec.maxUnhealthy | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | gharden | |
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> | |
Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | low | |||
Priority: | low | CC: | abeekhof, bjacot, gharden, jspeed, mlammon, zhsun | |
Version: | 4.5 | Keywords: | Reopened, Triaged | |
Target Milestone: | --- | |||
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: A negative value cannot be blocked but would be considered to block remediation even when no machines need remediation.
Consequence: An event was produced on every reconcile.
Fix: Interpret a negative value as meaning 0.
Result: No more spurious log messages.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1862585 (view as bug list) | Environment: | ||
Last Closed: | 2020-10-27 16:21:54 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1862585 |
Description
gharden
2020-07-31 17:19:25 UTC
I think this is something for the Cloud team. Input validation is not specific to bare metal. This should no longer be the case in 4.6 since we added https://github.com/openshift/machine-api-operator/pull/650 & https://issues.redhat.com/browse/OCPCLOUD-911 Since that card hasn't yet been verified, I'll move this to modified so that QE are aware of this case when they come to verify that work. (In reply to Joel Speed from comment #4) > This should no longer be the case in 4.6 since we added > https://github.com/openshift/machine-api-operator/pull/650 & > https://issues.redhat.com/browse/OCPCLOUD-911 > > Since that card hasn't yet been verified, I'll move this to modified so that > QE are aware of this case when they come to verify that work. Thanks Joel. Backportable to 4.5 for 1862585? Verify failed clusterversion: 4.6.0-0.nightly-2020-08-12-003456 1. oc create -f <mhc cr> with negaitive value for spec.maxUnhealthy apiVersion: "machine.openshift.io/v1beta1" kind: "MachineHealthCheck" metadata: name: mhc2 namespace: openshift-machine-api spec: selector: matchLabels: machine.openshift.io/cluster-api-cluster: yinzhou-tsc-mggh6 machine.openshift.io/cluster-api-machine-role: worker machine.openshift.io/cluster-api-machine-type: worker machine.openshift.io/cluster-api-machineset: yinzhou-tsc-mggh6-worker-us-east-2c unhealthyConditions: - type: Ready status: "False" timeout: 300s - type: Ready status: Unknown timeout: 300s maxUnhealthy: -2 $ oc create -f mhc.yml machinehealthcheck.machine.openshift.io/mhc2 created Ok so there is an annoying thing about OpenAPI validation, which is that the `maxUnhealthy` field is an `intstr.IntOrString`, so the field is allowed to be either an integer or a string. Integers will always trump strings in this case due to the ordering of the unmarshalling code (which we won't be able to change as this is core Kubernetes code). We can add validation for the string, which we have done, that ensures it is either numeric, or a percentage between 0% and 100%, however, the value in the example is being interpreted as an integer, which skips the string validation. Because we have string validation, we can't also have integer validation. These are the options I can see to fix this: - Restrict the type so that it can only be a string (means you would have to quote as in `maxUnhealthy: "-2"`), this would reduce the UX but mean that we can validate it to not be negative - Implement a validating webhook that will check the value is great than zero (quite a lot of work for such a small validation) - Not fix this and warn users somehow (via a condition on the MHC status?) that the value they have entered is invalid. - Drop the string validation and use integer validation only, this would allow any value in the string which would be worse and more likely to cause errors. Would it be valid to simply treat all numbers < 0 as 0? Within the controller I believe we already do so, or at least, they have the same effect. This bug appears to be more about the field value that's being stored in etcd and presented to the user I thought. I don't think the value being negative causes any issue to the controller operation. Since the only way to fix this would be to create a webhook, which is a substantial amount of work, and the change is only cosmetic since the controller treats values less than zero as zero anyway, I'm going to close this for now. If you feel like this should be fixed, please re-open the bug and we will defer to the 4.7 release cycle to introduce a fix. Hi Joel, Not sure what you mean by, "controller treats values less than zero as zero anyway". Because if you set maxUhealthy to 0, you will get a different result than if you set maxUnheathy to -1. In fact, when you set it to a negative value you will automatically get a RemediationRestricted event. Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning RemediationRestricted 2s (x11 over 7s) machinehealthcheck-controller Remediation restricted due to exceeded number of unhealthy machines (total: 2, unhealthy: 0, maxUnhealthy: -1) > In fact, when you set it to a negative value you will automatically get a RemediationRestricted event.
Ok I hadn't realised this difference, I've double checked the code and can see why this is. I'll add a patch so that we don't create the stream of remediation restricted events when maxUnhealthy is negative.
It wasn't obvious from the previous discussion that this was actually causing an issue, since the calculation logic shouldn't have any functional difference between 0 and -1, either will prevent any unhealthy machine from being remediated.
VALIDATED ON - oc get clusterversion --config aws NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.6.0-0.nightly-2020-08-23-214712 True False 53m Cluster version is 4.6.0-0.nightly-2020-08-23-214712 Steps: 1.Created mhc with negative value for maxUnheathy mhc created successfully 2.oc get mhc --config aws NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY mhc1 -3 0 0 mhc2 0 0 0 3.Create a unhealthy condition and monitor mhc logs - Maxunhealthy negative and zero shows same logs: I0824 04:36:56.039682 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1 I0824 04:36:56.039713 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets I0824 04:36:56.048189 1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 0, maxUnhealthy: -3, unhealthy: 0. Remediations are allowed I0824 04:36:56.048219 1 machinehealthcheck_controller.go:227] Reconciling openshift-machine-api/mhc1: no more targets meet unhealthy criteria I0824 04:36:56.048243 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1 I0824 04:36:56.048265 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets I0824 04:36:56.053144 1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 0, maxUnhealthy: -3, unhealthy: 0. Remediations are allowed I0824 04:36:56.053161 1 machinehealthcheck_controller.go:227] Reconciling openshift-machine-api/mhc1: no more targets meet unhealthy criteria .. . . I0824 04:48:48.655113 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc2 I0824 04:48:48.655150 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc2: finding targets I0824 04:48:48.663987 1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc2: monitoring MHC: total targets: 0, maxUnhealthy: 0, unhealthy: 0. Remediations are allowed I0824 04:48:48.664059 1 machinehealthcheck_controller.go:227] Reconciling openshift-machine-api/mhc2: no more targets meet unhealthy criteria I0824 04:48:48.664080 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc2 I0824 04:48:48.664095 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc2: finding targets I0824 04:48:48.671638 1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc2: monitoring MHC: total targets: 0, maxUnhealthy: 0, unhealthy: 0. Remediations are allowed I0824 04:48:48.671659 1 machinehealthcheck_controller.go:227] Reconciling openshift-machine-api/mhc2: no more targets meet unhealthy criteria Adding one more scenario that was tested and is good - Adding maxunhealthy as "-2" instead of -3 it validates the string and gives expected output [miyadav@miyadav aws]$ oc create -f mhc_malformed.yaml --config aws The MachineHealthCheck "mhc3" is invalid: spec.maxUnhealthy: Invalid value: "": spec.maxUnhealthy in body should match '^((100|[0-9]{1,2})%|[0-9]+)$' Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196 |