Description of problem:
The SkippedMaster event (EventSkippedMaster string = "SkippedMaster") is not emitted under a normal mHC when a master node is unhealthy.

Version-Release number of selected component (if applicable):
ocp-release:4.6.0-fc.3-x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install OCP 4.6.
2. Create mHc_Ready_Master_Normal.yaml (see below) and deploy a normal machine health check for the Unknown condition (oc apply -f mHc_Ready_Master_Normal.yaml).
3. In a virtual environment, suspend the master node (virsh suspend <node>) to simulate an unhealthy node.

    cat > mHc_Ready_Master_Normal.yaml << EOF
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: masters
      namespace: openshift-machine-api
    spec:
      selector:
        matchLabels:
          machine.openshift.io/cluster-api-machine-role: master
      unhealthyConditions:
      - type: Ready
        status: Unknown
        timeout: 60s
    EOF

Actual results:
The SkippedMaster event is never emitted.

Expected results:
A Normal SkippedMaster event.

Additional info:
https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machinehealthcheck/machinehealthcheck_controller.go
We lost this event when we switched to checking for controller owners rather than checking for "master" machines. I've raised a PR to add a new event that will be sent any time remediation is skipped because the Machine has no owner.
It is not 100% clear to me whether this event should only be emitted for master nodes. I originally thought it was for workers. Today I tested both master and worker, and I only see it with master.

Using this mHC:

    apiVersion: machine.openshift.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: test-master-events
      namespace: openshift-machine-api
    spec:
      selector:
        matchLabels:
          machine.openshift.io/cluster-api-machine-role: master
      unhealthyConditions:
      - type: Ready
        status: Unknown
        timeout: 60s

    virsh suspend master-0-0   # to generate Unknown status

We found the event by looking at:

    oc describe machine ocp-edge-cluster-rdu1-lkjmw-master-0 -n openshift-machine-api | grep -A5 Events

    Events:
      Type    Reason               Age                    From                           Message
      ----    ------               ----                   ----                           -------
      Normal  SkippedNoController  101m (x53 over 3h56m)  machinehealthcheck-controller  Machine openshift-machine-api/test-master-events/ocp-edge-cluster-rdu1-lkjmw-master-0/master-0-0 has no controller owner, skipping remediation

Question: According to https://github.com/openshift/machine-api-operator/pull/694

"To improve the user experience of the MHC, we should always emit an event when we are skipping remediation. We used to have this for master Machines, though this was removed when we migrated to the controller owner check (#543). This restores the behaviour, but with a new event that is more agnostic of the type of Machine."

Can you elaborate on "type of Machine"? Are you suggesting it should also emit the event for worker machines?
Apologies for the terse description on the PR. It is correct to only see this for master nodes now. The behaviour we are expecting is that any Machine that doesn't have an owner (e.g. a MachineSet) would have this event emitted. Normally, all worker Machines are created by a MachineSet and as such will have an owner reference attached. Master Machines, however, are not created by MachineSets and as such have no owner reference. If you were to create a Machine on its own, directly, without using a MachineSet, it should also not be remediated and should have the event emitted.
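For anyone verifying this, the rule above can be sketched roughly as follows. This is only an illustration, not the controller's actual Go code: it assumes the check amounts to looking for an entry in metadata.ownerReferences with controller set to true, and the helper names and sample Machine names are hypothetical.

```python
import json


def controller_owner(machine):
    """Return the ownerReference marked controller=true, or None if there is none."""
    refs = machine.get("metadata", {}).get("ownerReferences") or []
    for ref in refs:
        if ref.get("controller"):
            return ref
    return None


def machines_without_controller(machine_list):
    """Names of Machines expected to be skipped (and to get the skip event)."""
    return [
        m["metadata"]["name"]
        for m in machine_list.get("items", [])
        if controller_owner(m) is None
    ]


# Hypothetical sample, shaped like a Machine list in JSON form.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "example-master-0"}},
    {"metadata": {"name": "example-worker-0",
                  "ownerReferences": [
                    {"kind": "MachineSet", "name": "example-worker", "controller": true}
                  ]}}
  ]
}
""")

if __name__ == "__main__":
    # Only the master has no controller owner in this sample.
    print(machines_without_controller(SAMPLE))
```

In a real cluster the same functions could be fed the output of oc get machines -n openshift-machine-api -o json instead of the sample; the Machines it lists are the ones where a SkippedNoController event would be expected when they become unhealthy.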
Thanks Joel, we can then verify this now!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196