2009475 – Deleting Machine Node object throws reconciliation error after WMCO restart

Summary: Deleting Machine Node object throws reconciliation error after WMCO restart

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Windows Containers
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.z
Assignee:	Mohammad Saif Shaikh
QA Contact:	gaoshang
Docs Contact:
URL:
Whiteboard:
Depends On:	2009474
Blocks:

Reported:	2021-09-30 18:09 UTC by OpenShift BugZilla Robot
Modified:	2021-12-08 22:07 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Previously, deleting the Node associated with a Windows Machine object threw a reconciliation error upon restart of the operator. This fix opts not to react or reconcile when the node referenced by a Windows Machine in Running state is not found within the cluster, preventing any error loop and standardizing functionality with Linux Machine objects.
Clone Of:
Environment:
Last Closed:	2021-12-08 22:07:43 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift windows-machine-config-operator pull 724	0	None	open	[release-4.8] Bug 2009475: Do not react to Windows Node deletion	2021-11-18 14:03:04 UTC
Red Hat Product Errata	RHBA-2021:4710	0	None	None	None	2021-12-08 22:07:49 UTC

Description OpenShift BugZilla Robot 2021-09-30 18:09:08 UTC

+++ This bug was initially created as a clone of Bug #1992841 +++

Description of problem:

After adding a Windows node that was created through a MachineSet and carries the label 'machine.openshift.io/os-id:Windows', if the node is deleted with the `oc delete node xxx` command and the operator pod is then restarted, the windowsmachine_controller throws a reconciliation error:
"could not get node associated with machine <machine-name>"

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-02-145924

How reproducible:
Always.

Steps to Reproduce:
1. Install WMCO.
2. Create a MachineSet using the Windows label.
3. After the Windows instance is fully configured as a node, run the command `oc delete node xxx`.
4. Restart the operator.

logs:

2021-08-11T19:38:41.402Z	ERROR	controller-runtime.manager.controller.machine	Reconciler error	{"reconciler group": "machine.openshift.io", "reconciler kind": "Machine", "name": "mankulka-04-5r472-windows-worker-us-east-2a-dtw2n", "namespace": "openshift-machine-api", "error": "could not get node associated with machine mankulka-04-5r472-windows-worker-us-east-2a-dtw2n: Node \"ip-10-0-154-73.us-east-2.compute.internal\" not found", "errorVerbose": "Node \"ip-10-0-154-73.us-east-2.compute.internal\" not found\ncould not get node associated with machine mankulka-04-5r472-windows-worker-us-east-2a-dtw2n\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:251\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214

Actual results:
The operator throws a reconciliation error when trying to find a matching node.

Expected results:
Operator should run without error.

Additional info:

windowsmachine_controller runs into this issue because it assumed the nodeRef would be nil whenever the corresponding node is not present in the cluster. However, the nodelink_controller never resets the nodeRef to nil after a node is deleted. The reasoning is that a new node is not expected to be reverse mapped to an existing machine; instead, the machine is expected to be deleted and re-created to configure a new node. Given this, here are a few proposed ways to solve the issue:
1. WMCO does not requeue if no node corresponding to a machine is present (not found).
2. If a node is not found, WMCO deletes the machine (assuming the deletion will lead to re-creation and configuration of a new node).

--- Additional comment from mohashai on 2021-09-29 20:21:43 UTC ---

After discussing with the team, the decision was made to not react to Windows node deletion events.

This approach was chosen over deleting and re-creating the unassociated Machine because optimizing machine management is outside the scope of WMCO's responsibilities. In addition, not reacting is in line with the current behavior for Linux Machines (MCO), which standardizes OpenShift functionality across OSs. Also, the Machine cannot be reconfigured to create a new Node object, since the machine-api's nodelink_controller will not update any Machine's Node reference, either after a Node is deleted or after a Machine is reconfigured.

A fix has been tested and the PR is under review: https://github.com/openshift/windows-machine-config-operator/pull/675

Comment 3 Ronnie Rasouli 2021-11-23 10:38:39 UTC

Verified on WMCO 3.1.1+309c49d

Comment 5 errata-xmlrpc 2021-12-08 22:07:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Windows Container Support for Red Hat OpenShift 3.1.1 product release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4710