Bug 1866719

Summary: Nodelink controller should not remove machine.nodeRef when node is being deleted
Product: OpenShift Container Platform Reporter: Nir <nyehia>
Component: Cloud Compute Assignee: Beth White <beth.white>
Cloud Compute sub component: BareMetal Provider QA Contact: Shelly Miron <smiron>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: low CC: smiron, stbenjam, zbitter
Version: 4.6 Keywords: Triaged, UpcomingSprint
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:25:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1862180    

Description Nir 2020-08-06 08:24:52 UTC
Description of problem:
If a node gets deleted, the nodelink controller removes the nodeRef from the machine.
This happens only if something delays the deletion, such as a node finalizer. Otherwise, the nodelink controller is not notified that the node was deleted.

This means the outcome differs depending on how long the node takes to be deleted.
If the node is deleted immediately, nodeRef stays on the machine.
If there is some delay (e.g. due to a finalizer), nodeRef is removed.

Apart from that, it creates a race with CAPBM.
CAPBM places a finalizer on each node in order to store the node's annotations and labels before the node is deleted, and to restore them after the node comes up again (the context is remediation, where we delete the node to release its workload and then power-cycle the host).

So we may hit this flow:
1. CAPBM puts a finalizer on a node
2. The node is being deleted
3. The nodelink controller removes machine.nodeRef
4. CAPBM reconciles that node change, but finds that machine.nodeRef is nil and thus can't do anything

Moreover, a Machine should have only one node during its lifecycle. Thus, there's no reason to remove that nodeRef.

Version-Release number of selected component (if applicable): 4.6.0-0.ci-2020-07-21-114552


How reproducible: always 


Steps to Reproduce:
1. Delete a node that has a finalizer which prevents deletion

Actual results:
nodeRef was deleted

Expected results:
nodeRef is still present on the machine

Comment 4 Shelly Miron 2020-09-02 09:41:50 UTC
Verified.

Steps:
--------------
1. Took a node with a finalizer and deleted it:
  
$ oc edit node worker-0-1

apiVersion: v1
kind: Node
metadata:
  annotations:
    k8s.ovn.org/l3-gateway-config: '{"default":{"mode":"shared","interface-id":"br-ex_worker-0-1","mac-address":"52:54:00:bf:c6:88","ip-addresses":["192.168.123.115/24"],"ip-address":"192.168.123.115/24","next-hops":["192.168.123.1"],"next-hop":"192.168.123.1","node-port-enable":"true","vlan-id":"0"}}'
    k8s.ovn.org/node-chassis-id: bd9f98b3-7880-4973-89f4-0864a5019d51
    k8s.ovn.org/node-join-subnets: '{"default":"100.64.3.0/29"}'
    k8s.ovn.org/node-local-nat-ip: '{"default":["169.254.6.11"]}'
    k8s.ovn.org/node-mgmt-port-mac-address: 02:ed:bd:86:93:04
    k8s.ovn.org/node-primary-ifaddr: '{"ipv4":"192.168.123.115/24"}'
    k8s.ovn.org/node-subnets: '{"default":"10.131.0.0/23"}'
    machine.openshift.io/machine: openshift-machine-api/ocp-edge-cluster-0-p8jnq-worker-0-pjxjq
    machineconfiguration.openshift.io/currentConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-09-01T12:57:17Z"
  finalizers:
  - metal3.io/capbm <<<<<---------
  labels:
  ..........
  .....................
  ...................................

[kni@provisionhost-0-0 ~]$ oc delete node worker-0-1
node "worker-0-1" deleted

2. Checked the machine linked to the deleted node; nodeRef is still present on the machine:

..........................................
..................................
................
Status:
  Addresses:
    Address:     192.168.123.115
    Type:        InternalIP
    Address:     fd00:1101::a7f7:def2:65ab:6c43
    Type:        InternalIP
    Address:     worker-0-1
    Type:        Hostname
    Address:     worker-0-1
    Type:        InternalDNS
  Last Updated:  2020-09-01T18:18:12Z
  Node Ref:    <<<<<<<---------------------------------
    Kind:  Node
    Name:  worker-0-1
    UID:   85799dba-8a16-4950-b5c2-ac76a0b70644
  Phase:   Running
Events:    <none>

Comment 7 errata-xmlrpc 2020-10-27 16:25:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196