1765427 – [MHC] Remediation does not trigger when machine has nodeRef but node is deleted

Summary: [MHC] Remediation does not trigger when machine has nodeRef but node is deleted

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.3.0
Assignee:	Alberto
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-10-25 04:27 UTC by Jianwei Hou
Modified:	2020-01-23 11:09 UTC
CC List:	0 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-01-23 11:09:08 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-api-operator pull 436	0	'None'	closed	bug 1765427: Add index for getting machines from node.	2021-01-28 11:25:17 UTC
Red Hat Product Errata	RHBA-2020:0062	0	None	None	None	2020-01-23 11:09:35 UTC

Description Jianwei Hou 2019-10-25 04:27:44 UTC

Description of problem:
After a node is deleted, the machine still has a nodeRef pointing to that node, but remediation is not triggered.

Version-Release number of selected component (if applicable):
4.3.0-0.ci-2019-10-24-213642

How reproducible:
Always

Steps to Reproduce:
1. Create a machinehealthcheck

apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: test
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: jhou-m9xln
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: jhou-m9xln-w-a
  unhealthyConditions:
  - type: "Ready"
    status: "False"
    timeout: "300s"
  - type: "Ready"
    status: "Unknown"
    timeout: "300s"

2. Delete a node that is linked to a machine in the machineset selected above

3. Monitor mhc logs

Actual results:
I1025 03:34:26.017131       1 machinehealthcheck_controller.go:200] Reconciling openshift-machine-api/test/jhou-m9xln-w-a-v8w98/jhou-m9xln-w-a-v8w98.c.openshift-gce-devel.internal: health checking
I1025 03:34:26.024572       1 machinehealthcheck_controller.go:135] Reconciling openshift-machine-api/test: monitoring MHC: total targets: 1,  maxUnhealthy: <nil>, unhealthy: 0. Remediations are allowed
I1025 03:34:26.024866       1 machinehealthcheck_controller.go:163] Reconciling openshift-machine-api/test: no more targets meet unhealthy criteria
E1025 03:34:28.453846       1 machinehealthcheck_controller.go:303] No-op: Unable to retrieve node "/jhou-m9xln-w-a-v8w98.c.openshift-gce-devel.internal" from store: Node "jhou-m9xln-w-a-v8w98.c.openshift-gce-devel.internal" not found


Expected results:
The machine is remediated. Instead, MHC logged a no-op.
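
Going by the bug title and the verification comment, the behavior being asked for is that a machine whose nodeRef points at a deleted node counts as unhealthy. Below is a minimal sketch of that decision, not the machine-api-operator code. The names `UnhealthyCondition`, `NodeCondition`, `Target` and `needs_remediation` are hypothetical and exist only for illustration, and the timeout comparison is an assumption about how the `unhealthyConditions` entries in the manifest above are meant to be read.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class UnhealthyCondition:
    """One entry of spec.unhealthyConditions (hypothetical model)."""
    type: str
    status: str
    timeout: timedelta


@dataclass
class NodeCondition:
    """A node status condition and when it last changed (hypothetical model)."""
    type: str
    status: str
    last_transition: datetime


@dataclass
class Target:
    """A machine paired with whatever is known about its node (hypothetical model)."""
    machine_name: str
    node_ref: Optional[str]                         # node name recorded on the machine
    node_conditions: Optional[List[NodeCondition]]  # None: node object not found


def needs_remediation(target: Target,
                      rules: List[UnhealthyCondition],
                      now: datetime) -> bool:
    # Case from this report: nodeRef is set but the node no longer exists.
    if target.node_ref is not None and target.node_conditions is None:
        return True
    # No nodeRef and no node: assumed here to mean the machine is still provisioning.
    if target.node_conditions is None:
        return False
    # Otherwise, unhealthy if any rule has matched for at least its timeout.
    for rule in rules:
        for cond in target.node_conditions:
            if (cond.type == rule.type
                    and cond.status == rule.status
                    and now - cond.last_transition >= rule.timeout):
                return True
    return False


if __name__ == "__main__":
    now = datetime(2019, 10, 25, 3, 34, 26, tzinfo=timezone.utc)
    rules = [
        UnhealthyCondition("Ready", "False", timedelta(seconds=300)),
        UnhealthyCondition("Ready", "Unknown", timedelta(seconds=300)),
    ]
    deleted_node = Target(
        machine_name="jhou-m9xln-w-a-v8w98",
        node_ref="jhou-m9xln-w-a-v8w98.c.openshift-gce-devel.internal",
        node_conditions=None,
    )
    print(needs_remediation(deleted_node, rules, now))
```

In this sketch the missing-node check runs before any condition matching, so a deleted node cannot fall through to the "no targets meet unhealthy criteria" path seen in the logs above.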

Additional info:

Comment 2 Jianwei Hou 2019-11-18 06:11:00 UTC

Verified in 4.3.0-0.nightly-2019-11-17-224250: the machine is remediated when it has a nodeRef but the node is deleted.

Comment 4 errata-xmlrpc 2020-01-23 11:09:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062
