1866719 – Nodelink controller should not remove machine.nodeRef when node is being deleted

Summary: Nodelink controller should not remove machine.nodeRef when node is being deleted

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Beth White
QA Contact:	Shelly Miron
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1862180

Reported:	2020-08-06 08:24 UTC by Nir
Modified:	2020-10-27 16:25 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:25:22 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-api-operator pull 669	0	None	closed	Bug 1866719: Keep machine.nodeRef even if node was marked for deletion	2020-11-19 08:29:09 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:25:45 UTC

Description Nir 2020-08-06 08:24:52 UTC

Description of problem:
If a node is deleted, the nodelink controller removes the nodeRef from the machine.
This happens only if something delays the deletion, such as a node finalizer. Otherwise, the nodelink controller is not notified that the node was deleted.

This means the outcome differs depending on how long the node takes to be deleted:
If the node is deleted immediately, nodeRef stays on the machine.
If there is some delay (e.g. due to a finalizer), nodeRef is deleted.

Apart from that, it creates a race with CAPBM.
CAPBM places a finalizer on each node in order to store its annotations and labels before the node is deleted, and to restore them after the node comes up again (the context is remediation, where we delete the node to release its workload, then power-cycle the host).

So we may hit this flow:
1. CAPBM puts a finalizer on a node
2. the node is being deleted
3. nodelink controller removes machine.nodeRef
4. CAPBM reconciles that node change, but finds that machine.nodeRef is nil, and thus can't do anything

Moreover, a Machine should have only one node during its lifecycle. Thus, there's no reason to remove that nodeRef.

Version-Release number of selected component (if applicable): 4.6.0-0.ci-2020-07-21-114552


How reproducible: always 


Steps to Reproduce:
1. Delete a node that has a finalizer preventing its deletion

Actual results:
nodeRef was deleted

Expected results:
nodeRef is still present on the machine

Comment 4 Shelly Miron 2020-09-02 09:41:50 UTC

Verified.

Steps:
--------------
1. Took a node with a finalizer and deleted it:
  
$ oc edit node worker-0-1

apiVersion: v1
kind: Node
metadata:
  annotations:
    k8s.ovn.org/l3-gateway-config: '{"default":{"mode":"shared","interface-id":"br-ex_worker-0-1","mac-address":"52:54:00:bf:c6:88","ip-addresses":["192.168.123.115/24"],"ip-address":"192.168.123.115/24","next-hops":["192.168.123.1"],"next-hop":"192.168.123.1","node-port-enable":"true","vlan-id":"0"}}'
    k8s.ovn.org/node-chassis-id: bd9f98b3-7880-4973-89f4-0864a5019d51
    k8s.ovn.org/node-join-subnets: '{"default":"100.64.3.0/29"}'
    k8s.ovn.org/node-local-nat-ip: '{"default":["169.254.6.11"]}'
    k8s.ovn.org/node-mgmt-port-mac-address: 02:ed:bd:86:93:04
    k8s.ovn.org/node-primary-ifaddr: '{"ipv4":"192.168.123.115/24"}'
    k8s.ovn.org/node-subnets: '{"default":"10.131.0.0/23"}'
    machine.openshift.io/machine: openshift-machine-api/ocp-edge-cluster-0-p8jnq-worker-0-pjxjq
    machineconfiguration.openshift.io/currentConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-09-01T12:57:17Z"
  finalizers:
  - metal3.io/capbm <<<<<---------
  labels:
  ..........
  .....................
  ...................................

[kni@provisionhost-0-0 ~]$ oc delete node worker-0-1
node "worker-0-1" deleted

2. Checked the machine linked to the deleted node; nodeRef is still present on the machine:

..........................................
..................................
................
Status:
  Addresses:
    Address:     192.168.123.115
    Type:        InternalIP
    Address:     fd00:1101::a7f7:def2:65ab:6c43
    Type:        InternalIP
    Address:     worker-0-1
    Type:        Hostname
    Address:     worker-0-1
    Type:        InternalDNS
  Last Updated:  2020-09-01T18:18:12Z
  Node Ref:    <<<<<<<---------------------------------
    Kind:  Node
    Name:  worker-0-1
    UID:   85799dba-8a16-4950-b5c2-ac76a0b70644
  Phase:   Running
Events:    <none>

Comment 7 errata-xmlrpc 2020-10-27 16:25:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
