Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1817008

Summary: No validation error or indication in status when machine health check has empty selector
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: Other Providers
Reporter: Joel Speed <jspeed>
Assignee: Joel Speed <jspeed>
QA Contact: Milind Yadav <miyadav>
Status: CLOSED NEXTRELEASE
Severity: high
Priority: unspecified
CC: agarcial, ccoleman, jhou, mgugino
Version: 4.4
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Clone Of: 1815680
Bug Depends On: 1815680
Last Closed: 2020-06-19 09:13:45 UTC

Description Joel Speed 2020-03-25 11:52:56 UTC
+++ This bug was initially created as a clone of Bug #1815680 +++

A user creating an invalid machine health check has no indication that the controller is ignoring it.

1. If possible, the CRD schema for MachineHealthCheck should make the selector a required field
2. After that, the controller should be writing conditions to machine health check status
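Suggestion 2 could look something like the following minimal sketch. The types here are simplified stand-ins for the real machine.openshift.io/v1beta1 API, and the condition type and reason names are illustrative assumptions, not the controller's actual values:

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for a status condition; the real
// API carries more fields.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// MachineHealthCheckStatus is a simplified stand-in for the MHC status.
type MachineHealthCheckStatus struct {
	Conditions []Condition
}

// setCondition upserts a condition by Type, as a status-writing
// controller typically would: replace an existing condition of the
// same type, otherwise append.
func setCondition(status *MachineHealthCheckStatus, c Condition) {
	for i := range status.Conditions {
		if status.Conditions[i].Type == c.Type {
			status.Conditions[i] = c
			return
		}
	}
	status.Conditions = append(status.Conditions, c)
}

func main() {
	var status MachineHealthCheckStatus
	// Hypothetical condition the controller could write when it
	// decides to ignore an MHC.
	setCondition(&status, Condition{
		Type:               "RemediationAllowed",
		Status:             "False",
		Reason:             "InvalidSelector",
		Message:            "spec.selector is empty; this MachineHealthCheck is being ignored",
		LastTransitionTime: time.Now(),
	})
	fmt.Println(status.Conditions[0].Reason)
}
```

A user inspecting the MHC with `oc describe` would then see why the controller is ignoring it, instead of silence.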

--- Additional comment from Joel Speed on 2020-03-23 13:01:16 UTC ---

I just spun up a 4.4 nightly cluster and tried the following two MHCs:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-unready-5m
  namespace: openshift-machine-api
  labels:
    nodepool: nodepool-0
spec:
  maxUnhealthy: 1
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
```
This resulted in the error `The MachineHealthCheck "worker-unready-5m" is invalid: spec.selector: Required value` when I tried to create it, so there is already validation that `selector` must exist. However, since the fields within it (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071) are optional, we can't require one of them to be set via validation (and I don't think we should, since users might want to use either)

and

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-unready-5m
  namespace: openshift-machine-api
  labels:
    nodepool: nodepool-0
spec:
  selector: {}
  maxUnhealthy: 1
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
```

This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector per https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071, which says "An empty label selector matches all objects."

The controller started logging `"worker-unready-5m" machineHealthCheck has empty selector` as part of its watch mapping (https://github.com/openshift/machine-api-operator/blob/70d185ab215d032004ab5df7bf4997d578ea31a9/pkg/controller/machinehealthcheck/machinehealthcheck_controller.go#L664-L668) over the Machine and Node objects.
So while a MachineHealthCheck with an empty selector is currently reconciled, no events from Machines or Nodes will trigger reconciliation, so the behaviour will be broken.
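The apimachinery semantics referenced above can be illustrated with a stand-alone sketch of matchLabels matching. The real controller uses k8s.io/apimachinery's labels.Selector; this helper is a simplified assumption for illustration only:

```go
package main

import "fmt"

// selectorMatches sketches matchLabels semantics: a selector matches
// when every key/value pair it specifies is present on the object's
// labels. With no pairs to check, an empty selector matches everything,
// which is why `selector: {}` selects all Machines.
func selectorMatches(selector, objLabels map[string]string) bool {
	for k, v := range selector {
		if objLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	machine := map[string]string{"nodepool": "nodepool-0"}

	// Empty selector: matches every machine.
	fmt.Println(selectorMatches(map[string]string{}, machine))

	// Non-matching selector: matches nothing with these labels.
	fmt.Println(selectorMatches(map[string]string{"nodepool": "other"}, machine))
}
```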

I think we have two options going forward:
1) Explicitly don't reconcile an MHC with an empty selector, and set a condition on its status to tell the user why it isn't being accepted
2) Allow the empty selector and let an MHC cover all Machines within a cluster (this would involve changing the emptiness check to return true for an empty selector rather than false)
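The two options amount to a single decision point in the controller's target matching. The function and constant names below are hypothetical, not the controller's actual code:

```go
package main

import "fmt"

// allowEmptySelector toggles between the two options: false models
// option 1 (reject and surface a reason for a status condition),
// true models option 2 (empty selector covers all Machines).
const allowEmptySelector = true

// shouldReconcile is a hypothetical guard: it reports whether an MHC
// with the given matchLabels-style selector should be reconciled, and
// if not, a reason suitable for a status condition.
func shouldReconcile(selector map[string]string) (bool, string) {
	if len(selector) == 0 {
		if allowEmptySelector {
			return true, "" // option 2: match everything
		}
		return false, "EmptySelector" // option 1: ignore, but say why
	}
	return true, ""
}

func main() {
	ok, reason := shouldReconcile(map[string]string{})
	fmt.Println(ok, reason)
}
```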

--- Additional comment from Michael Gugino on 2020-03-23 13:31:03 UTC ---

> This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector per https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071, which says "An empty label selector matches all objects."

What about master machines?  Is this going to automatically select master machines or machines not otherwise in a machineset?  If so, this is a defect.

> A user creating an invalid machine health check has no indication that the controller is ignoring it.

What specific conditions are we considering invalid?

This is more or less part of the problem I have with making this thing too configurable: it's going to be easy to over- or under-select machines through simple misconfiguration.

--- Additional comment from Joel Speed on 2020-03-23 13:59:05 UTC ---

> What about master machines?  Is this going to automatically select master machines or machines not otherwise in a machineset?  If so, this is a defect.

Yes, it will select all machines, including control plane machines. Nothing currently prevents users from selecting control plane machines as it is. There is logic within the controller that prevents the MHC from remediating a control plane machine, but nothing says you can't create an MHC that includes control plane machines; we just recommend against it, as it isn't helpful.

> What specific conditions are we considering invalid?

I was interpreting this as suggesting that an empty selector (`selector: {}`) is invalid, since it appears to cause the controller not to reconcile the MHC

--- Additional comment from Alberto on 2020-03-23 14:10:01 UTC ---

> This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector per https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071, which says "An empty label selector matches all objects."

Monitoring all machines with one MHC is a valid use case, and an empty selector is how you declaratively express that.
We programmatically ensure this is safe by skipping remediation for control plane machines and for machines that are not owned by a MachineSet.
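The safety checks described above could be sketched as follows. The label key and owner-reference kind follow upstream conventions, but they are assumptions here and not verified against the controller source:

```go
package main

import "fmt"

// OwnerRef is a simplified stand-in for metav1.OwnerReference.
type OwnerRef struct {
	Kind string
}

// Machine is a simplified stand-in for a Machine object.
type Machine struct {
	Labels    map[string]string
	OwnerRefs []OwnerRef
}

// remediationAllowed sketches the two guards: never remediate control
// plane machines, and never remediate machines that are not owned by a
// MachineSet (e.g. standalone machines).
func remediationAllowed(m Machine) bool {
	// Skip control plane machines (a missing label reads as "").
	if m.Labels["machine.openshift.io/cluster-api-machine-role"] == "master" {
		return false
	}
	// Only remediate machines owned by a MachineSet, which can
	// recreate a replacement after deletion.
	for _, ref := range m.OwnerRefs {
		if ref.Kind == "MachineSet" {
			return true
		}
	}
	return false
}

func main() {
	worker := Machine{
		Labels:    map[string]string{"machine.openshift.io/cluster-api-machine-role": "worker"},
		OwnerRefs: []OwnerRef{{Kind: "MachineSet"}},
	}
	fmt.Println(remediationAllowed(worker))
}
```

With guards like these, an empty selector matching every machine is harmless: control plane and unowned machines are selected but never remediated.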

>2) Allow the empty selector and let an MHC cover all Machines within a cluster (this would involve changing the emptiness check to return true for empty rather than false)

So I'd be in favour of 2.

--- Additional comment from Michael Gugino on 2020-03-23 14:26:10 UTC ---

Okay, as long as we're ensuring that a machine is owned by a MachineSet, covering all machines is fine by me.

Comment 1 Joel Speed 2020-04-20 11:16:41 UTC
PR: https://github.com/openshift/machine-api-operator/pull/538 — will update the BZ once the 4.4.z branch opens

Comment 2 Alberto 2020-06-19 09:13:45 UTC
Discussed with @JoelSpeed and agreed to drop the backport for this, since including a fix that changes behaviour for existing resources would be too intrusive for the z stream. We'll instead add the proper release note for 4.5.