+++ This bug was initially created as a clone of Bug #1812862 +++

Description of problem:
When checking whether remediations are allowed, `{"maxUnhealthy": 1}` and `{"maxUnhealthy": "1"}` behave differently, but `{"maxUnhealthy": "1"}` and `{"maxUnhealthy": "1%"}` behave identically.

The code that decides whether the value is an int or a percentage does not check for the presence of a percentage symbol and assumes that any string is a percentage.

This can cause unexpected behaviour: a user could specify `"1"` expecting this to allow 1 unhealthy machine, but in fact it will only allow at most 1% of the total number of Machines.

Version-Release number of selected component (if applicable):
4.3, 4.4, 4.5 MachineHealthCheck

How reproducible:
Create a MachineHealthCheck and specify `maxUnhealthy: "1"`, including the quotes, then attempt to get the MHC controller to remediate an unhealthy node.

Actual results:
Remediation is blocked as the number of unhealthy nodes exceeds the threshold.

Expected results:
The unhealthy machine should be remediated.

Additional info:
This is not a problem in most cases where intstr is used, as they check with validation that any string is a valid percentage. We do not have validation on our types, so this is more difficult to achieve.

The fix will likely include copying some code from the intstr package and ensuring that the percentage symbol is checked for before asserting that the value is a percentage.
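The check described above can be sketched in Go as follows. This is a minimal illustration using only the standard library, not the operator's actual code: the function name `resolveMaxUnhealthy`, its signature, and the choice to round percentages down are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveMaxUnhealthy is a hypothetical helper (not the operator's real API).
// It turns a string-typed maxUnhealthy value into an absolute machine count.
// Only a value ending in "%" is treated as a percentage of total; any other
// string must parse as a plain integer.
func resolveMaxUnhealthy(value string, total int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %v", value, err)
		}
		// Integer division rounds down for non-negative inputs (assumption).
		return pct * total / 100, nil
	}
	n, err := strconv.Atoi(value)
	if err != nil {
		return 0, fmt.Errorf("invalid value %q: neither int nor percentage: %v", value, err)
	}
	return n, nil
}

func main() {
	// Compare how "1" and "1%" resolve against 10 machines.
	for _, v := range []string{"1", "1%", "50%"} {
		n, err := resolveMaxUnhealthy(v, 10)
		fmt.Println(v, n, err)
	}
}
```

With 10 machines, `"1"` resolves to 1 machine while `"1%"` resolves to 0, which is the distinction the bug report says was being lost.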
PR: https://github.com/openshift/machine-api-operator/pull/539
Will update the BZ once the 4.4.z release branch opens.
Validated at:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-06-21-210301   True        False         36m     Cluster version is 4.4.0-0.nightly-2020-06-21-210301

Step 1: Create an MHC with a maxUnhealthy value of "1". Refer to this YAML:

---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  creationTimestamp: "2020-02-14T09:47:08Z"
  generation: 1
  name: mhc1
  namespace: openshift-machine-api
  resourceVersion: "71059"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc-miyadav-1402-drlvf-worker-us-east-2c
  uid: ef74b735-e58e-4c24-aa69-015d90998b77
spec:
  maxUnhealthy: "1"
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-0622-cpsfs
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: miyadav-0622-cpsfs-worker-us-east-2a
  unhealthyConditions:
  - status: "False"
    timeout: 300s
    type: Ready
  - status: Unknown
    timeout: 300s
    type: Ready

[miyadav@miyadav bugzilla]$ oc create -f mhc_1816606.yml
machinehealthcheck.machine.openshift.io/mhc1 created

[miyadav@miyadav bugzilla]$ oc get mhc
NAME   MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
mhc1   1              1                  1

Step 2: Delete a machine, then check the logs.

[miyadav@miyadav bugzilla]$ oc delete machine miyadav-0622-cpsfs-worker-us-east-2a-72qnw
machine.machine.openshift.io "miyadav-0622-cpsfs-worker-us-east-2a-72qnw" deleted

oc logs -f machine-api-controllers-77d9ccd587-d6hp6 -c machine-healthcheck-controller
.
.
.
I0622 04:15:52.678252 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0622 04:15:52.678389 1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: health checking
I0622 04:15:52.678452 1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: is likely to go unhealthy in 5m0.321562927s
I0622 04:15:52.685076 1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 1, maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0622 04:15:52.685114 1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 5m0.321562927s
I0622 04:15:53.027627 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1
I0622 04:15:53.028123 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0622 04:15:53.028245 1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: health checking
I0622 04:15:53.028278 1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: is likely to go unhealthy in 4m59.971736102s
I0622 04:15:53.037795 1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 1, maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0622 04:15:53.037830 1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 4m59.971736102s
I0622 04:16:02.602147 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1
I0622 04:16:02.602182 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0622 04:16:02.602263 1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: health checking
I0622 04:16:02.602288 1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: is likely to go unhealthy in 4m50.397725768s
I0622 04:16:02.608958 1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 1, maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0622 04:16:02.608992 1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 4m50.397725768s
I0622 04:16:52.747694 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/mhc1
I0622 04:16:52.748552 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/mhc1: finding targets
I0622 04:16:52.748671 1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: health checking
I0622 04:16:52.748701 1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-mpcmm/ip-10-0-135-30.us-east-2.compute.internal: is likely to go unhealthy in 4m0.251310018s
I0622 04:16:52.755874 1 machinehealthcheck_controller.go:199] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 1, maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0622 04:16:52.755969 1 machinehealthcheck_controller.go:223] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 4m0.251310018s
.
.
.

Actual & Expected: Remediation happened successfully as the maxUnhealthy value is 1.

Step 3: Edit the MHC mhc1, setting maxUnhealthy to "1%".

[miyadav@miyadav bugzilla]$ oc edit mhc mhc1
machinehealthcheck.machine.openshift.io/mhc1 edited

Step 4: Repeat step 2.

Step 5: Monitor the MHC logs.

oc logs -f machine-api-controllers-77d9ccd587-d6hp6 -c machine-healthcheck-controller
I0622 04:27:51.415666 1 machinehealthcheck_controller.go:272] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-5pb2d/: health checking
I0622 04:27:51.415716 1 machinehealthcheck_controller.go:286] Reconciling openshift-machine-api/mhc1/miyadav-0622-cpsfs-worker-us-east-2a-5pb2d/: is likely to go unhealthy in 9m56.584292844s
W0622 04:27:51.421869 1 machinehealthcheck_controller.go:182] Reconciling openshift-machine-api/mhc1: total targets: 2, maxUnhealthy: 1%, unhealthy: 2. Short-circuiting remediation

Actual: Remediation did not happen as the maxUnhealthy value is 1 percent.
Expected: Remediation should not happen as the maxUnhealthy value is 1 percent (the unhealthy count exceeds the maximum that allows remediation).

Additional info:
Moved to VERIFIED.
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-28859
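The two outcomes in the verification logs can be reproduced with a little arithmetic. The Go sketch below is only an illustration: `remediationAllowed` and `percentOf` are hypothetical names, and both the "unhealthy count must not exceed the limit" comparison and the round-down behaviour are assumptions inferred from the log messages rather than taken from the controller source.

```go
package main

import "fmt"

// remediationAllowed is a hypothetical stand-in for the controller's check:
// remediation proceeds only while the unhealthy count is within the limit.
func remediationAllowed(unhealthy, maxUnhealthy int) bool {
	return unhealthy <= maxUnhealthy
}

// percentOf resolves a percentage of total to a machine count.
// Integer division rounds down for non-negative inputs (assumption).
func percentOf(pct, total int) int {
	return pct * total / 100
}

func main() {
	// Step 2: maxUnhealthy "1" read as an absolute count; 1 target, 1 unhealthy.
	fmt.Println(remediationAllowed(1, 1))

	// Step 5: maxUnhealthy "1%" of 2 targets; 2 unhealthy.
	fmt.Println(remediationAllowed(2, percentOf(1, 2)))
}
```

In the second case the result would be the same if the percentage were rounded up instead: 1% of 2 targets yields a limit of at most 1 machine, and 2 unhealthy machines exceed it either way.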
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2713