Bug 1866719 - Nodelink controller should not remove machine.nodeRef when node is being deleted
Summary: Nodelink controller should not remove machine.nodeRef when node is being deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Beth White
QA Contact: Shelly Miron
URL:
Whiteboard:
Depends On:
Blocks: 1862180
 
Reported: 2020-08-06 08:24 UTC by Nir
Modified: 2020-10-27 16:25 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:25:22 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 669 0 None closed Bug 1866719: Keep machine.nodeRef even if node was marked for deletion 2020-11-19 08:29:09 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:25:45 UTC

Description Nir 2020-08-06 08:24:52 UTC
Description of problem:
If a node gets deleted, the nodelink controller removes the nodeRef from the machine.
This happens only if something delays the deletion, such as a node finalizer. Otherwise, the nodelink controller is not notified that the node was deleted.

This means the outcome differs depending on how long the node takes to be deleted:
If the node is deleted immediately - nodeRef stays on the machine.
If there's some delay (e.g. due to a finalizer) - nodeRef is removed.
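
The two outcomes can be reproduced with a small Go model. This is only a sketch of the reported behaviour, not the machine-api-operator code: the types `Node` and `Machine` and the functions `deleteNode` and `reconcileOld` are hypothetical, and the `Deleting` flag stands in for a non-nil deletionTimestamp on a real Node object.

```go
package main

import "fmt"

// Node is a stripped-down, hypothetical stand-in for a Kubernetes Node.
type Node struct {
	Name       string
	Finalizers []string
	Deleting   bool // models a non-nil metadata.deletionTimestamp
}

// Machine is a stripped-down, hypothetical stand-in for a Machine object.
type Machine struct {
	Name    string
	NodeRef *string
}

// deleteNode models a delete request. Without finalizers the object is
// gone at once and the controller observes nothing (nil). With finalizers
// the object stays, marked for deletion, and the controller observes it.
func deleteNode(n Node) *Node {
	if len(n.Finalizers) == 0 {
		return nil
	}
	n.Deleting = true
	return &n
}

// reconcileOld models the behaviour reported in this bug: when the
// controller observes a node marked for deletion, it clears nodeRef.
func reconcileOld(m *Machine, observed *Node) {
	if observed != nil && observed.Deleting {
		m.NodeRef = nil
	}
}

func main() {
	for _, finalizers := range [][]string{nil, {"metal3.io/capbm"}} {
		ref := "worker-0-1"
		m := &Machine{Name: "machine-a", NodeRef: &ref}
		reconcileOld(m, deleteNode(Node{Name: "worker-0-1", Finalizers: finalizers}))
		fmt.Printf("finalizers=%v nodeRef kept=%v\n", finalizers, m.NodeRef != nil)
	}
}
```

In this model the only difference between the two runs is the presence of a finalizer, which is enough to flip whether nodeRef survives.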

Apart from that, it creates a race with CAPBM.
CAPBM places a finalizer on each node in order to store its annotations and labels before the node is deleted, and to restore them after the node comes up again (the context is remediation, where we delete the node to release its workload, then power-cycle the host).

So we may hit this flow:
1. CAPBM puts a finalizer on a node
2. the node is marked for deletion
3. nodelink controller removes machine.nodeRef
4. CAPBM reconciles that node change, but finds that machine.nodeRef is nil and thus can't do anything

Moreover, a Machine should have only one node during its lifecycle. Thus, there's no reason to remove the nodeRef.
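
The flow above, and the behaviour this report asks for, can be sketched in Go. This is an illustration only, not the nodelink controller or CAPBM code: `Machine`, `Node`, `NodeRef`, `linkNode`, `canRemediate` and the `keepOnDelete` switch are hypothetical names, and `Deleting` stands in for a non-nil deletionTimestamp.

```go
package main

import "fmt"

// NodeRef, Machine and Node are hypothetical, stripped-down stand-ins.
type NodeRef struct{ Name string }

type Machine struct {
	Name    string
	NodeRef *NodeRef
}

type Node struct {
	Name     string
	Deleting bool // models a non-nil metadata.deletionTimestamp
}

// linkNode models the nodelink step. With keepOnDelete=false it clears
// nodeRef for a node marked for deletion (the reported behaviour); with
// keepOnDelete=true it keeps the link (the requested behaviour).
func linkNode(m *Machine, n Node, keepOnDelete bool) {
	if n.Deleting && !keepOnDelete {
		m.NodeRef = nil
		return
	}
	m.NodeRef = &NodeRef{Name: n.Name}
}

// canRemediate models step 4: the CAPBM side needs machine.nodeRef to
// find the node whose annotations and labels it has to store.
func canRemediate(m Machine) bool {
	return m.NodeRef != nil
}

func main() {
	// Steps 1-2: the node carries a finalizer and is marked for deletion.
	node := Node{Name: "worker-0-1", Deleting: true}
	for _, keep := range []bool{false, true} {
		m := Machine{Name: "machine-a", NodeRef: &NodeRef{Name: node.Name}}
		linkNode(&m, node, keep) // step 3
		fmt.Printf("keepOnDelete=%v canRemediate=%v\n", keep, canRemediate(m)) // step 4
	}
}
```

Keeping the link makes step 4 independent of which controller happens to reconcile first.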

Version-Release number of selected component (if applicable): 4.6.0-0.ci-2020-07-21-114552


How reproducible: always 


Steps to Reproduce:
1. Delete a node that has a finalizer which prevents deletion

Actual results:
nodeRef was deleted

Expected results:
nodeRef is still present on the machine

Comment 4 Shelly Miron 2020-09-02 09:41:50 UTC
Verified.

Steps:
--------------
1. Took a node with a finalizer and deleted it:
  
$ oc edit node worker-0-1

apiVersion: v1
kind: Node
metadata:
  annotations:
    k8s.ovn.org/l3-gateway-config: '{"default":{"mode":"shared","interface-id":"br-ex_worker-0-1","mac-address":"52:54:00:bf:c6:88","ip-addresses":["192.168.123.115/24"],"ip-address":"192.168.123.115/24","next-hops":["192.168.123.1"],"next-hop":"192.168.123.1","node-port-enable":"true","vlan-id":"0"}}'
    k8s.ovn.org/node-chassis-id: bd9f98b3-7880-4973-89f4-0864a5019d51
    k8s.ovn.org/node-join-subnets: '{"default":"100.64.3.0/29"}'
    k8s.ovn.org/node-local-nat-ip: '{"default":["169.254.6.11"]}'
    k8s.ovn.org/node-mgmt-port-mac-address: 02:ed:bd:86:93:04
    k8s.ovn.org/node-primary-ifaddr: '{"ipv4":"192.168.123.115/24"}'
    k8s.ovn.org/node-subnets: '{"default":"10.131.0.0/23"}'
    machine.openshift.io/machine: openshift-machine-api/ocp-edge-cluster-0-p8jnq-worker-0-pjxjq
    machineconfiguration.openshift.io/currentConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-09-01T12:57:17Z"
  finalizers:
  - metal3.io/capbm <<<<<---------
  labels:
  ..........
  .....................
  ...................................

[kni@provisionhost-0-0 ~]$ oc delete node worker-0-1
node "worker-0-1" deleted

2. Checked the machine linked to the deleted node; nodeRef is still present on the machine:

..........................................
..................................
................
Status:
  Addresses:
    Address:     192.168.123.115
    Type:        InternalIP
    Address:     fd00:1101::a7f7:def2:65ab:6c43
    Type:        InternalIP
    Address:     worker-0-1
    Type:        Hostname
    Address:     worker-0-1
    Type:        InternalDNS
  Last Updated:  2020-09-01T18:18:12Z
  Node Ref:    <<<<<<<---------------------------------
    Kind:  Node
    Name:  worker-0-1
    UID:   85799dba-8a16-4950-b5c2-ac76a0b70644
  Phase:   Running
Events:    <none>

Comment 7 errata-xmlrpc 2020-10-27 16:25:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

