+++ This bug was initially created as a clone of Bug #1815680 +++

A user creating an invalid machine health check has no indication that the controller is ignoring it.

1. If possible, the CRD schema for MachineHealthCheck should make this required
2. After that, the controller should be writing conditions to machine health check status

--- Additional comment from Joel Speed on 2020-03-23 13:01:16 UTC ---

I just spun up a 4.4 nightly cluster and tried the two following MHCs:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-unready-5m
  namespace: openshift-machine-api
  labels:
    nodepool: nodepool-0
spec:
  maxUnhealthy: 1
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
```

This resulted in the error `The MachineHealthCheck "worker-unready-5m" is invalid: spec.selector: Required value` when I tried to create it. So there is validation that `selector` has to exist. However, since the fields within it (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071) are optional, we can't require one of them to be set via the validation (I also don't think we should, since users might want to use either).

And:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-unready-5m
  namespace: openshift-machine-api
  labels:
    nodepool: nodepool-0
spec:
  selector: {}
  maxUnhealthy: 1
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
```

This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071, where it says "An empty label selector matches all objects.").

The controller started logging `"worker-unready-5m" machineHealthCheck has empty selector` as part of watching (https://github.com/openshift/machine-api-operator/blob/70d185ab215d032004ab5df7bf4997d578ea31a9/pkg/controller/machinehealthcheck/machinehealthcheck_controller.go#L664-L668) the Machine and Node objects. So while a MachineHealthCheck with an empty selector is currently reconciled, we don't have any events triggering reconciliation from Machines and Nodes, so the behaviour will be broken.

I think we have two options going forward:

1) Explicitly don't reconcile an MHC with an empty selector and put a condition on the status to tell the user why it's not being accepted
2) Allow the empty selector and let an MHC cover all Machines within a cluster (this would involve changing the emptiness check to return true for empty rather than false)
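As an illustration of why both options are plausible, here is a minimal, standalone Go sketch (not the operator's actual code; the machine label is made up for the example) of how apimachinery treats an empty `selector: {}`: the converted selector reports `Empty() == true`, which option 1 could key off, while also matching every label set, which is all option 2 needs.

```
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// `selector: {}` in the MHC manifest decodes to an empty, non-nil LabelSelector.
	empty := &metav1.LabelSelector{}

	// LabelSelectorAsSelector turns an empty LabelSelector into labels.Everything(),
	// i.e. a selector that matches any set of labels.
	sel, err := metav1.LabelSelectorAsSelector(empty)
	if err != nil {
		panic(err)
	}

	// Hypothetical machine labels, purely for the demo.
	machineLabels := labels.Set{"machine.openshift.io/cluster-api-machineset": "worker-a"}

	fmt.Println("Empty():", sel.Empty())                  // true  -> option 1 could refuse to reconcile
	fmt.Println("Matches():", sel.Matches(machineLabels)) // true  -> option 2 could select all machines
}
```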
--- Additional comment from Michael Gugino on 2020-03-23 13:31:03 UTC ---

> This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071, where it says "An empty label selector matches all objects.").

What about master machines? Is this going to automatically select master machines or machines not otherwise in a machineset? If so, this is a defect.

> A user creating an invalid machine health check has no indication that the controller is ignoring it.

What specific conditions are we considering invalid? This is more or less part of the problem I have with making this thing too configurable. It's going to be easy to over- or under-select machines due to easy misconfiguration.

--- Additional comment from Joel Speed on 2020-03-23 13:59:05 UTC ---

> What about master machines? Is this going to automatically select master machines or machines not otherwise in a machineset? If so, this is a defect.

Yes, it will select all machines, including control plane machines. We have nothing preventing users from selecting over control plane machines as it is. There is logic within the controller which prevents the MHC from remediating a control plane machine, but there is nothing currently to say you can't create an MHC that includes control plane machines; we just recommend against it as it isn't helpful.

> What specific conditions are we considering invalid?

I was interpreting this as the suggestion that an empty selector `selector: {}` is invalid, since it appears to cause the controller to not reconcile the MHC.

--- Additional comment from Alberto on 2020-03-23 14:10:01 UTC ---

> This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071, where it says "An empty label selector matches all objects.").

Monitoring all machines with one MHC is a valid use case, and an empty selector is how you declaratively express that. We programmatically ensure this is safe by skipping remediation for control plane machines and machines which are not owned by a MachineSet.

> 2) Allow the empty selector and let an MHC cover all Machines within a cluster (this would involve changing the emptiness check to return true for empty rather than false)

So I'd be in favour of option 2.

--- Additional comment from Michael Gugino on 2020-03-23 14:26:10 UTC ---

Okay, as long as we're ensuring that a machine is owned by a MachineSet, covering all machines is fine by me.
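To make the safety net described above concrete, here is a hedged Go sketch of the kind of guard that skips remediation for control plane machines and for machines without a MachineSet owner. The function name and the role label used to identify masters are assumptions for illustration; this is not the actual machine-api-operator implementation.

```
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// machineRoleLabel and the "master" value are assumptions for this sketch; the
// real controller may identify control plane machines differently.
const machineRoleLabel = "machine.openshift.io/cluster-api-machine-role"

// remediationAllowed illustrates the guard described above: even when an empty
// selector matches every machine, control plane machines and machines without
// a MachineSet owner are skipped rather than remediated.
func remediationAllowed(machine metav1.ObjectMeta) bool {
	if machine.Labels[machineRoleLabel] == "master" {
		return false
	}
	for _, ref := range machine.OwnerReferences {
		if ref.Kind == "MachineSet" {
			return true
		}
	}
	return false
}

func main() {
	worker := metav1.ObjectMeta{
		Name:            "worker-a-xyz",
		Labels:          map[string]string{machineRoleLabel: "worker"},
		OwnerReferences: []metav1.OwnerReference{{Kind: "MachineSet", Name: "worker-a"}},
	}
	master := metav1.ObjectMeta{
		Name:   "master-0",
		Labels: map[string]string{machineRoleLabel: "master"},
	}

	fmt.Println("worker remediation allowed:", remediationAllowed(worker)) // true
	fmt.Println("master remediation allowed:", remediationAllowed(master)) // false
}
```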
PR: https://github.com/openshift/machine-api-operator/pull/538. Will update the BZ once the 4.4.z branch opens.
Discussed with @JoelSpeed and agreed on dropping the backport for this, since it would be intrusive to include a fix that changes behaviour for existing resources in a z-stream. Instead, we'll add the proper release note for 4.5.