A user creating an invalid machine health check has no indication that the controller is ignoring it.

1. If possible, the CRD schema for MachineHealthCheck should make the selector required.
2. After that, the controller should write conditions to the MachineHealthCheck status.
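To illustrate point 2, the controller could record something like the condition below when it decides to ignore an MHC. This is purely hypothetical: the MachineHealthCheck status has no Conditions field today, so the type, field, and condition names here are all assumptions.

```go
import corev1 "k8s.io/api/core/v1"

// mhcCondition is a hypothetical condition shape; it is not part of the
// machine.openshift.io/v1beta1 API.
type mhcCondition struct {
	Type    string
	Status  corev1.ConditionStatus
	Reason  string
	Message string
}

// markNotReconciled is a hypothetical helper the controller could use instead
// of silently ignoring an invalid MachineHealthCheck.
func markNotReconciled(conditions []mhcCondition, reason, message string) []mhcCondition {
	return append(conditions, mhcCondition{
		Type:    "RemediationAllowed",
		Status:  corev1.ConditionFalse,
		Reason:  reason,
		Message: message,
	})
}
```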
I just spun up a 4.4 nightly cluster and tried the following two MHCs:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-unready-5m
  namespace: openshift-machine-api
  labels:
    nodepool: nodepool-0
spec:
  maxUnhealthy: 1
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
```

This resulted in the error `The MachineHealthCheck "worker-unready-5m" is invalid: spec.selector: Required value` when I tried to create it, so there is validation that `selector` has to exist. However, since the fields within it (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071) are optional, we can't require one of them to be set via validation (and I don't think we should, since users might want to use either one).

And:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-unready-5m
  namespace: openshift-machine-api
  labels:
    nodepool: nodepool-0
spec:
  selector: {}
  maxUnhealthy: 1
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
```

This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071), where it says "An empty label selector matches all objects."

The controller started logging `"worker-unready-5m" machineHealthCheck has empty selector` as part of watching the Machine and Node objects (https://github.com/openshift/machine-api-operator/blob/70d185ab215d032004ab5df7bf4997d578ea31a9/pkg/controller/machinehealthcheck/machinehealthcheck_controller.go#L664-L668). So while a MachineHealthCheck with an empty selector is currently reconciled, we don't get any events triggering reconciliation from Machines and Nodes, so the behaviour will be broken.

I think we have two options going forward:

1) Explicitly don't reconcile an MHC with an empty selector, and put a condition on the status to tell the user why it's not being accepted.
2) Allow the empty selector and let an MHC cover all Machines within a cluster (this would involve changing the emptiness check to return true for empty rather than false; see the sketch below).
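For reference, the emptiness check mentioned in option 2 is along these lines. This is a simplified sketch rather than the exact code at the linked revision; the function name, variable names, and import paths are assumptions for illustration.

```go
import (
	mapiv1beta1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/klog"
)

// hasMatchingLabels (illustrative name) decides whether a Machine event should
// trigger reconciliation of a given MachineHealthCheck.
func hasMatchingLabels(mhc *mapiv1beta1.MachineHealthCheck, machine *mapiv1beta1.Machine) bool {
	selector, err := metav1.LabelSelectorAsSelector(&mhc.Spec.Selector)
	if err != nil {
		klog.Warningf("%q machineHealthCheck has invalid selector: %v", mhc.Name, err)
		return false
	}
	if selector.Empty() {
		// Current behaviour: an empty selector is treated as matching nothing,
		// so Machine/Node events never map back to this MHC. Option 2 above
		// would return true here instead, letting one MHC cover all Machines.
		klog.V(2).Infof("%q machineHealthCheck has empty selector", mhc.Name)
		return false
	}
	return selector.Matches(labels.Set(machine.Labels))
}
```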
> This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071), where it says "An empty label selector matches all objects."

What about master machines? Is this going to automatically select master machines, or machines not otherwise in a MachineSet? If so, this is a defect.

> A user creating an invalid machine health check has no indication that the controller is ignoring it.

What specific conditions are we considering invalid? This is more or less part of the problem I have with making this thing too configurable. It's going to be easy to over- or under-select machines because misconfiguration is easy.
> What about master machines? Is this going to automatically select master machines, or machines not otherwise in a MachineSet? If so, this is a defect.

Yes, it will select all machines, including control plane machines. As it is, we have nothing preventing users from selecting control plane machines. There is logic within the controller that prevents the MHC from remediating a control plane machine, but there is currently nothing to say you can't create an MHC that includes control plane machines; we just recommend against it because it isn't helpful.

> What specific conditions are we considering invalid?

I was interpreting this as the suggestion that an empty selector (`selector: {}`) is invalid, since it appears to cause the controller to not reconcile the MHC.
> This was accepted, and the controller reconciled it as a selector over all machines. I believe this is a valid selector (https://github.com/kubernetes/apimachinery/blob/48159c651603a061d16fa1dbab2cfe32eceba27a/pkg/apis/meta/v1/types.go#L1059-L1071), where it says "An empty label selector matches all objects."

Monitoring all machines with one MHC is a valid use case, and an empty selector is how you declaratively express that. We programmatically ensure this is safe by skipping remediation for control plane machines and for machines which are not owned by a MachineSet.

> 2) Allow the empty selector and let an MHC cover all Machines within a cluster (this would involve changing the emptiness check to return true for empty rather than false)

So I'd be in favour of 2.
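To make the safety argument concrete, the guard described above looks roughly like the sketch below. It is illustrative only: it assumes control plane machines are identified by the `machine.openshift.io/cluster-api-machine-role: master` label and that remediable machines carry a MachineSet owner reference (same assumed `mapiv1beta1` import as the earlier sketch); the real controller logic may differ in detail.

```go
// remediationAllowed (illustrative name): even when an MHC's selector matches
// a machine, remediation is skipped for control plane machines and for
// machines that are not owned by a MachineSet.
func remediationAllowed(machine *mapiv1beta1.Machine) bool {
	// Assumed convention: control plane machines carry the "master" role label.
	if machine.Labels["machine.openshift.io/cluster-api-machine-role"] == "master" {
		return false
	}
	// Only machines owned by a MachineSet can safely be deleted and recreated.
	for _, ref := range machine.OwnerReferences {
		if ref.Kind == "MachineSet" {
			return true
		}
	}
	return false
}
```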
Okay, as long as we're ensuring that a machine is owned by a MachineSet, covering all machines is fine by me.
Validated on:

NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-01-204204    True        False         75m     Cluster version is 4.5.0-0.nightly-2020-04-01-204204

Steps:

1. Create an MHC with an empty selector; use the YAML below for reference:

---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  creationTimestamp: "2020-02-14T09:47:08Z"
  generation: 1
  name: mhc1
  namespace: openshift-machine-api
  resourceVersion: "71059"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc-miyadav-1402-drlvf-worker-us-east-2c
  uid: ef74b735-e58e-4c24-aa69-015d90998b77
spec:
  selector: {}
  maxUnhealthy: "2"
  unhealthyConditions:
  - status: "False"
    timeout: 300s
    type: Ready
  - status: Unknown
    timeout: 300s
    type: Ready

The MHC gets created successfully:

[miyadav@miyadav ManualRun]$ oc create -f mhc_bzapr04.yml
machinehealthcheck.machine.openshift.io/mhc1 created

[miyadav@miyadav verification-tests]$ oc get mhc
NAME   MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
mhc1   2              6                  6

2. Delete any machine from the IaaS provider (other than a control plane machine or a machine that is not part of any MachineSet).

The machine gets deleted successfully and remediation starts. Monitor the logs with: oc logs -f <machine-api-controller> -c machine-healthcheck-controller

I0403 03:25:33.909158 1 machinehealthcheck_controller.go:277] Reconciling openshift-machine-api/mhc1/miyadav-0304-9bwlt-worker-us-east-2c-8f8tv/ip-10-0-174-47.us-east-2.compute.internal: health checking
I0403 03:25:33.909174 1 machinehealthcheck_controller.go:277] Reconciling openshift-machine-api/mhc1/miyadav-0304-9bwlt-worker-us-east-2a-wpzrn/ip-10-0-142-166.us-east-2.compute.internal: health checking
I0403 03:25:33.909188 1 machinehealthcheck_controller.go:277] Reconciling openshift-machine-api/mhc1/miyadav-0304-9bwlt-master-0/ip-10-0-128-101.us-east-2.compute.internal: health checking
I0403 03:25:33.915349 1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/mhc1: monitoring MHC: total targets: 6, maxUnhealthy: 2, unhealthy: 1. Remediations are allowed
I0403 03:25:33.915422 1 machinehealthcheck_controller.go:228] Reconciling openshift-machine-api/mhc1: some targets might go unhealthy. Ensuring a requeue happens in 9m20.090880335s

3. The machine should get remediated:

[miyadav@miyadav verification-tests]$ oc get machines -o wide
NAME                                         PHASE         TYPE        REGION      ZONE         AGE     NODE                                         PROVIDERID                              STATE
miyadav-0304-9bwlt-master-0                  Running       m4.xlarge   us-east-2   us-east-2a   57m     ip-10-0-128-101.us-east-2.compute.internal   aws:///us-east-2a/i-0e90d43018b1c2184   running
miyadav-0304-9bwlt-master-1                  Running       m4.xlarge   us-east-2   us-east-2b   57m     ip-10-0-147-227.us-east-2.compute.internal   aws:///us-east-2b/i-0f08160d86a3d7326   running
miyadav-0304-9bwlt-master-2                  Running       m4.xlarge   us-east-2   us-east-2c   57m     ip-10-0-167-78.us-east-2.compute.internal    aws:///us-east-2c/i-043d3f455ce2e7b03   running
miyadav-0304-9bwlt-worker-us-east-2a-wpzrn   Running       m4.large    us-east-2   us-east-2a   44m     ip-10-0-142-166.us-east-2.compute.internal   aws:///us-east-2a/i-033e04f56e5b1eb23   running
miyadav-0304-9bwlt-worker-us-east-2b-l2885   Provisioned   m4.large    us-east-2   us-east-2b   2m29s                                                aws:///us-east-2b/i-00599f1a6d0676d9b   running
miyadav-0304-9bwlt-worker-us-east-2c-8f8tv   Running       m4.large    us-east-2   us-east-2c   44m     ip-10-0-174-47.us-east-2.compute.internal    aws:///us-east-2c/i-01cb89af56fedeae3   running
...
miyadav-0304-9bwlt-worker-us-east-2b-l2885 Running m4.large us-east-2 us-east-2b 3m17s ip-10-0-145-115.us-east-2.compute.internal aws:///us-east-2b/i-00599f1a6d0676d9b running
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409