Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1874861

Summary: No equivalent of "SkippedMaster" event emitted when hasControllerOwner() is false for an unhealthy node
Product: OpenShift Container Platform
Reporter: mlammon
Component: Cloud Compute
Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers
QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: gharden, jspeed
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:37:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description mlammon 2020-09-02 12:54:23 UTC
Description of problem:
The event defined as EventSkippedMaster string = "SkippedMaster" is not emitted by a normal MHC when a master node is unhealthy.


Version-Release number of selected component (if applicable):
ocp-release:4.6.0-fc.3-x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install OCP 4.6
2. Create mHc_Ready_Master_Normal.yaml (see below) and deploy a normal machine health check for the Unknown condition (oc apply -f mHc_Ready_Master_Normal.yaml)
3. In a virtual environment, suspend a master node (virsh suspend <node>) to simulate an unhealthy node


cat > mHc_Ready_Master_Normal.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: masters
 namespace: openshift-machine-api
spec:
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: master
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 60s
EOF


Actual results:
SkippedMaster event is never emitted

Expected results:
Normal SkippedMaster event 

Additional info:

https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machinehealthcheck/machinehealthcheck_controller.go

Comment 1 Joel Speed 2020-09-07 13:05:53 UTC
We lost this event when we switched to checking for controller owners rather than checking for "master" machines. I've raised a PR to add a new event that will be sent any time remediation is skipped because the Machine has no owner.

Comment 4 mlammon 2020-09-11 19:40:59 UTC
It is not 100% clear to me whether this event should be emitted only for master nodes. I originally thought it applied to workers.
Today, I tested both master and worker. I only see it emitted for master.

Using this mHC:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: test-master-events
 namespace: openshift-machine-api
spec:
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: master
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 60s

virsh suspend master-0-0 # to generate Unknown status

We found the event by looking at:
oc describe machine ocp-edge-cluster-rdu1-lkjmw-master-0 -n openshift-machine-api | grep -A5 Events
Events:
  Type    Reason               Age                    From                           Message
  ----    ------               ----                   ----                           -------
  Normal  SkippedNoController  101m (x53 over 3h56m)  machinehealthcheck-controller  Machine openshift-machine-api/test-master-events/ocp-edge-cluster-rdu1-lkjmw-master-0/master-0-0 has no controller owner, skipping remediation




Question:
According to https://github.com/openshift/machine-api-operator/pull/694
"To improve the user experience of the MHC, we should always emit an event when we are skipping remediation.

We used to have this for master Machines, though this was removed when we migrated to the controller owner check (#543). This restores the behaviour, but with a new event that is more agnostic of the type of Machine."

Can you elaborate on "type of Machine"? Are you suggesting the event should also be emitted for worker machines?

Comment 5 Joel Speed 2020-09-14 08:26:22 UTC
Apologies for the terse description on the PR.

It is correct to only see this for master nodes now.

The behaviour we are expecting is that any Machine that doesn't have an owner (e.g. a MachineSet) would have this event emitted.

Normally, all worker Machines are created by a MachineSet and as such will have an owner reference attached.
Master Machines however are not created by MachineSets and as such have no owner reference.

If you were to create a Machine on its own, directly, without using a MachineSet, it should also not be remediated and should have the event emitted.
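
The three cases described in this comment (MachineSet-owned worker, master, standalone Machine) can be sketched in Go as a check over owner references. This is an illustrative sketch only: OwnerReference and Machine are simplified stand-ins for the Kubernetes types, ownedByController is a hypothetical helper rather than the operator's hasControllerOwner(), and all machine names are made up.

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for the Kubernetes metadata type.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller *bool
}

// Machine is a simplified stand-in carrying only what the check needs.
type Machine struct {
	Name            string
	OwnerReferences []OwnerReference
}

// ownedByController reports whether any owner reference is marked as
// the controlling owner of the machine.
func ownedByController(m Machine) bool {
	for _, ref := range m.OwnerReferences {
		if ref.Controller != nil && *ref.Controller {
			return true
		}
	}
	return false
}

func main() {
	isController := true
	machines := []Machine{
		// Worker created by a MachineSet: carries a controller owner reference.
		{Name: "worker-0", OwnerReferences: []OwnerReference{
			{Kind: "MachineSet", Name: "worker-set", Controller: &isController},
		}},
		// Master: not created by a MachineSet, so no owner reference.
		{Name: "master-0"},
		// Machine created directly, without a MachineSet.
		{Name: "standalone-0"},
	}
	for _, m := range machines {
		if ownedByController(m) {
			fmt.Printf("%s: eligible for remediation\n", m.Name)
		} else {
			fmt.Printf("%s: skip remediation, emit SkippedNoController\n", m.Name)
		}
	}
}
```

Under these assumptions only worker-0 is eligible for remediation; master-0 and standalone-0 both take the skip path and would get the event.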

Comment 6 mlammon 2020-09-14 13:15:41 UTC
Thanks Joel, we can verify this now!

Comment 8 errata-xmlrpc 2020-10-27 16:37:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196