Bug 1874861 - No equivalent of "SkippedMaster" event emitted when hasControllerOwner() is false for an unhealthy node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-02 12:54 UTC by mlammon
Modified: 2020-10-27 16:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:37:10 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 694 0 None closed BUG 1874861: Ensure an event is emitted when remediation is skipped 2020-11-12 05:45:10 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:37:26 UTC

Description mlammon 2020-09-02 12:54:23 UTC
Description of problem:
The event defined as EventSkippedMaster string = "SkippedMaster" is not emitted under a normal MHC when a master node is unhealthy.


Version-Release number of selected component (if applicable):
ocp-release:4.6.0-fc.3-x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install OCP 4.6
2. Create mHc_Ready_Master_Normal.yaml (see below) and deploy a normal MachineHealthCheck for the Unknown condition (oc apply -f mHc_Ready_Master_Normal.yaml)
3. In a virtual environment, run virsh suspend <node> to simulate an unhealthy node (Ready status Unknown)


cat > mHc_Ready_Master_Normal.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: masters
 namespace: openshift-machine-api
spec:
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: master
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 60s
EOF


Actual results:
SkippedMaster event is never emitted

Expected results:
A Normal SkippedMaster event is emitted

Additional info:

https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machinehealthcheck/machinehealthcheck_controller.go

Comment 1 Joel Speed 2020-09-07 13:05:53 UTC
We lost this event when we switched to checking for controller owners rather than checking for "master" machines. I've raised a PR to add a new event that will be sent any time remediation is skipped because the Machine has no owner.

Comment 4 mlammon 2020-09-11 19:40:59 UTC
It is not 100% clear to me whether this event should only be emitted for master nodes. I originally thought it applied to workers.
Today I tested both master and worker nodes, and I only see it emitted for the master.

Using this mHC:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: test-master-events
 namespace: openshift-machine-api
spec:
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: master
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 60s

virsh suspend master-0-0 # to generate Unknown status

We found the event by looking at:
oc describe machine ocp-edge-cluster-rdu1-lkjmw-master-0 -n openshift-machine-api | grep -A5 Events
Events:
  Type    Reason               Age                    From                           Message
  ----    ------               ----                   ----                           -------
  Normal  SkippedNoController  101m (x53 over 3h56m)  machinehealthcheck-controller  Machine openshift-machine-api/test-master-events/ocp-edge-cluster-rdu1-lkjmw-master-0/master-0-0 has no controller owner, skipping remediation




Question:
According to https://github.com/openshift/machine-api-operator/pull/694
"To improve the user experience of the MHC, we should always emit an event when we are skipping remediation.

We used to have this for master Machines, though this was removed when we migrated to the controller owner check (#543). This restores the behaviour, but with a new event that is more agnostic of the type of Machine."

Can you elaborate on what "type of Machine" means here? Are you suggesting the event should also be emitted for worker machines?

Comment 5 Joel Speed 2020-09-14 08:26:22 UTC
Apologies for the terse description on the PR.

It is correct to only see this for master nodes now.

The behaviour we are expecting is that any Machine that doesn't have an owner (e.g. a MachineSet) would have this event emitted.

Normally, all worker Machines are created by a MachineSet and as such will have an owner reference attached.
Master Machines however are not created by MachineSets and as such have no owner reference.

If you were to create a Machine on its own, directly, without using a MachineSet, it should also not be remediated and should have the event emitted.
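
The owner-reference rule described above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the operator's actual code: the OwnerReference and Machine types are simplified, hypothetical stand-ins for the real Kubernetes and Machine API types, and skipReason is a hypothetical helper. Only the hasControllerOwner name and the SkippedNoController reason come from this report.

```go
package main

import "fmt"

// OwnerReference is a simplified, hypothetical stand-in for the Kubernetes
// owner reference type; only the fields needed for this sketch are kept.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller *bool // non-nil and true when this owner is the managing controller
}

// Machine is a simplified, hypothetical stand-in for a Machine object.
type Machine struct {
	Name            string
	OwnerReferences []OwnerReference
}

// hasControllerOwner reports whether any owner reference on the Machine is
// marked as its controller (for example, a MachineSet).
func hasControllerOwner(m Machine) bool {
	for _, ref := range m.OwnerReferences {
		if ref.Controller != nil && *ref.Controller {
			return true
		}
	}
	return false
}

// skipReason (hypothetical helper) returns the event reason to record when
// remediation of an unhealthy Machine is skipped, or "" when the Machine has
// a controller owner and remediation may proceed.
func skipReason(m Machine) string {
	if !hasControllerOwner(m) {
		return "SkippedNoController"
	}
	return ""
}

func main() {
	isController := true
	// A worker created by a MachineSet carries a controller owner reference.
	worker := Machine{
		Name: "worker-0",
		OwnerReferences: []OwnerReference{
			{Kind: "MachineSet", Name: "workers", Controller: &isController},
		},
	}
	// A master, or any Machine created directly, has no owner reference.
	master := Machine{Name: "master-0"}

	for _, m := range []Machine{worker, master} {
		fmt.Printf("%s: hasControllerOwner=%t skipReason=%q\n",
			m.Name, hasControllerOwner(m), skipReason(m))
	}
}
```

In this sketch the decision depends only on the owner reference, not on the Machine's role, which matches the explanation that a directly created worker Machine would get the same event as a master.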

Comment 6 mlammon 2020-09-14 13:15:41 UTC
Thanks Joel,  we can then verify this now!

Comment 8 errata-xmlrpc 2020-10-27 16:37:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

