A user creating an invalid machine health check has no indication that the controller is ignoring it.

1. If possible, the CRD schema for MachineHealthCheck should make the selector required.
2. After that, the controller should write conditions to the MachineHealthCheck status.
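To illustrate point 2, the controller could record something like the condition below when it decides to ignore an MHC. This is purely hypothetical: the MachineHealthCheck status has no Conditions field today, so the type, field, and condition names here are all assumptions.

```go
import corev1 "k8s.io/api/core/v1"

// mhcCondition is a hypothetical condition shape; it is not part of the
// machine.openshift.io/v1beta1 API.
type mhcCondition struct {
	Type    string
	Status  corev1.ConditionStatus
	Reason  string
	Message string
}

// markNotReconciled is a hypothetical helper the controller could use instead
// of silently ignoring an invalid MachineHealthCheck.
func markNotReconciled(conditions []mhcCondition, reason, message string) []mhcCondition {
	return append(conditions, mhcCondition{
		Type:    "RemediationAllowed",
		Status:  corev1.ConditionFalse,
		Reason:  reason,
		Message: message,
	})
}
```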
I just spun up a 4.4 nightly cluster and tried the following two MHCs:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-unready-5m
  namespace: openshift-machine-api
  labels:
    nodepool: nodepool-0
spec:
  maxUnhealthy: 1
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
```

This resulted in the error `The MachineHealthCheck "worker-unready-5m" is invalid: spec.selector: Required value` when I tried to create it, so there is validation that `selector` has to exist. However, since the fields within it (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071) are optional, we can't require one of them to be set via validation (and I don't think we should, since users might want to use either one).

And:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-unready-5m
  namespace: openshift-machine-api
  labels:
    nodepool: nodepool-0
spec:
  selector: {}
  maxUnhealthy: 1
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
```

This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071), where it says "An empty label selector matches all objects."

The controller started logging `"worker-unready-5m" machineHealthCheck has empty selector` as part of watching the Machine and Node objects (https://github.com/openshift/machine-api-operator/blob/70d185ab215d032004ab5df7bf4997d578ea31a9/pkg/controller/machinehealthcheck/machinehealthcheck_controller.go#L664-L668). So while a MachineHealthCheck with an empty selector is currently reconciled, we don't get any events triggering reconciliation from Machines and Nodes, so the behaviour will be broken.

I think we have two options going forward:

1) Explicitly don't reconcile an MHC with an empty selector, and put a condition on the status to tell the user why it's not being accepted.
2) Allow the empty selector and let an MHC cover all Machines within a cluster (this would involve changing the emptiness check to return true for empty rather than false; see the sketch below).
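For reference, the emptiness check mentioned in option 2 is along these lines. This is a simplified sketch rather than the exact code at the linked revision; the function name, variable names, and import paths are assumptions for illustration.

```go
import (
	mapiv1beta1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/klog"
)

// hasMatchingLabels (illustrative name) decides whether a Machine event should
// trigger reconciliation of a given MachineHealthCheck.
func hasMatchingLabels(mhc *mapiv1beta1.MachineHealthCheck, machine *mapiv1beta1.Machine) bool {
	selector, err := metav1.LabelSelectorAsSelector(&mhc.Spec.Selector)
	if err != nil {
		klog.Warningf("%q machineHealthCheck has invalid selector: %v", mhc.Name, err)
		return false
	}
	if selector.Empty() {
		// Current behaviour: an empty selector is treated as matching nothing,
		// so Machine/Node events never map back to this MHC. Option 2 above
		// would return true here instead, letting one MHC cover all Machines.
		klog.V(2).Infof("%q machineHealthCheck has empty selector", mhc.Name)
		return false
	}
	return selector.Matches(labels.Set(machine.Labels))
}
```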
> This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071), where it says "An empty label selector matches all objects."

What about master machines? Is this going to automatically select master machines, or machines not otherwise in a MachineSet? If so, this is a defect.

> A user creating an invalid machine health check has no indication that the controller is ignoring it.

What specific conditions are we considering invalid? This is more or less part of the problem I have with making this thing too configurable. It's going to be easy to over- or under-select machines because misconfiguration is easy.
> What about master machines? Is this going to automatically select master machines, or machines not otherwise in a MachineSet? If so, this is a defect.

Yes, it will select all machines, including control plane machines. As it is, we have nothing preventing users from selecting control plane machines. There is logic within the controller that prevents the MHC from remediating a control plane machine, but there is currently nothing to say you can't create an MHC that includes control plane machines; we just recommend against it because it isn't helpful.

> What specific conditions are we considering invalid?

I was interpreting this as the suggestion that an empty selector (`selector: {}`) is invalid, since it appears to cause the controller to not reconcile the MHC.
> This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071), where it says "An empty label selector matches all objects."

Monitoring all machines with one MHC is a valid use case, and an empty selector is how you declaratively express that. We programmatically ensure this is safe by skipping remediation for control plane machines and for machines which are not owned by a MachineSet.

> 2) Allow the empty selector and let an MHC cover all Machines within a cluster (this would involve changing the emptiness check to return true for empty rather than false)

So I'd be in favour of 2.
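To make the safety argument concrete, the guard described above looks roughly like the sketch below. It is illustrative only: it assumes control plane machines are identified by the `machine.openshift.io/cluster-api-machine-role: master` label and that remediable machines carry a MachineSet owner reference (same assumed `mapiv1beta1` import as the earlier sketch); the real controller logic may differ in detail.

```go
// remediationAllowed (illustrative name): even when an MHC's selector matches
// a machine, remediation is skipped for control plane machines and for
// machines that are not owned by a MachineSet.
func remediationAllowed(machine *mapiv1beta1.Machine) bool {
	// Assumed convention: control plane machines carry the "master" role label.
	if machine.Labels["machine.openshift.io/cluster-api-machine-role"] == "master" {
		return false
	}
	// Only machines owned by a MachineSet can safely be deleted and recreated.
	for _, ref := range machine.OwnerReferences {
		if ref.Kind == "MachineSet" {
			return true
		}
	}
	return false
}
```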
Okay, as long as we're ensuring that a machine is owned by a MachineSet, covering all machines is fine by me.
Validated on:

NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-01-204204    True        False         75m     Cluster version is 4.5.0-0.nightly-2020-04-01-204204

Steps:

1. Create an MHC with an empty selector; use the YAML below for reference:

---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  creationTimestamp: "2020-02-14T09:47:08Z"
  generation: 1
  name: mhc1
  namespace: openshift-machine-api
  resourceVersion: "71059"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc-miyadav-1402-drlvf-worker-us-east-2c
  uid: ef74b735-e58e-4c24-aa69-015d90998b77
spec:
  selector: {}
  maxUnhealthy: "2"
  unhealthyConditions:
  - status: "False"
    timeout: 300s
    type: Ready
  - status: Unknown
    timeout: 300s
    type: Ready

The MHC gets created successfully:

[miyadav@miyadav ManualRun]$ oc create -f mhc_bzapr04.yml
machinehealthcheck.machine.openshift.io/mhc1 created

[miyadav@miyadav verification-tests]$ oc get mhc
NAME   MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
mhc1   2              6                  6

2. Delete any machine from the IaaS provider (other than a control plane machine or a machine that is not part of any MachineSet).

The machine gets deleted successfully and remediation starts. Monitor the logs with: oc logs -f <machine-api-controller> -c machine-healthcheck-controller

I0403 03:25:33.909158 1 machinehealthcheck_controller.go:277] Reconciling openshift-machine-api/mhc1/miyadav-0304-9bwlt-worker-us-east-2c-8f8tv/ip-10-0-174-47.us-east-2.compute.internal: health checking
I0403 03:25:33.909174 1 machinehealthcheck_controller.go:277] Reconciling openshift-machine-api/mhc1/miyadav-0304-9bwlt-worker-us-east-2a-wpzrn/ip-10-0-142-166.us-east-2.compute.internal: health checking
I0403 03:25:33.909188 1 machinehealthcheck_controller.go:277] Reconciling openshift-machine-api/mhc1/miyadav-0304-9bwlt-master-0/ip-10-0-128-101.us-east-2.compute.internal: health checking
I0403 03:25:33.915349 1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 6, maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0403 03:25:33.915422 1 machinehealthcheck_controller.go:228] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 9m20.090880335s

3. The machine should get remediated:

[miyadav@miyadav verification-tests]$ oc get machines -o wide
NAME                                         PHASE         TYPE        REGION      ZONE         AGE     NODE                                         PROVIDERID                              STATE
miyadav-0304-9bwlt-master-0                  Running       m4.xlarge   us-east-2   us-east-2a   57m     ip-10-0-128-101.us-east-2.compute.internal   aws:///us-east-2a/i-0e90d43018b1c2184   running
miyadav-0304-9bwlt-master-1                  Running       m4.xlarge   us-east-2   us-east-2b   57m     ip-10-0-147-227.us-east-2.compute.internal   aws:///us-east-2b/i-0f08160d86a3d7326   running
miyadav-0304-9bwlt-master-2                  Running       m4.xlarge   us-east-2   us-east-2c   57m     ip-10-0-167-78.us-east-2.compute.internal    aws:///us-east-2c/i-043d3f455ce2e7b03   running
miyadav-0304-9bwlt-worker-us-east-2a-wpzrn   Running       m4.large    us-east-2   us-east-2a   44m     ip-10-0-142-166.us-east-2.compute.internal   aws:///us-east-2a/i-033e04f56e5b1eb23   running
miyadav-0304-9bwlt-worker-us-east-2b-l2885   Provisioned   m4.large    us-east-2   us-east-2b   2m29s                                                aws:///us-east-2b/i-00599f1a6d0676d9b   running
miyadav-0304-9bwlt-worker-us-east-2c-8f8tv   Running       m4.large    us-east-2   us-east-2c   44m     ip-10-0-174-47.us-east-2.compute.internal    aws:///us-east-2c/i-01cb89af56fedeae3   running
...
miyadav-0304-9bwlt-worker-us-east-2b-l2885 Running m4.large us-east-2 us-east-2b 3m17s ip-10-0-145-115.us-east-2.compute.internal aws:///us-east-2b/i-00599f1a6d0676d9b running
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409