Bug 1866719 - Nodelink controller should not remove machine.nodeRef when node is being deleted
Summary: Nodelink controller should not remove machine.nodeRef when node is being deleted
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.6.0
Assignee: Beth White
QA Contact: Shelly Miron
Depends On:
Blocks: 1862180
Reported: 2020-08-06 08:24 UTC by Nir
Modified: 2020-10-27 16:25 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2020-10-27 16:25:22 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 669 0 None closed Bug 1866719: Keep machine.nodeRef even if node was marked for deletion 2020-11-19 08:29:09 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:25:45 UTC

Description Nir 2020-08-06 08:24:52 UTC
Description of problem:
If a node is deleted, the nodelink controller removes the nodeRef from the machine.
This happens only if something delays the deletion, such as a node finalizer. Otherwise, the nodelink controller is not notified that the node was deleted.

This means the outcome depends on how long the node takes to be deleted:
If the node is deleted immediately, nodeRef stays on the machine.
If there is some delay (e.g. due to a finalizer), nodeRef is removed.

Apart from that, this creates a race with CAPBM.
CAPBM places a finalizer on each node so that it can store the node's annotations and labels before the node is deleted and restore them after the node comes back up (the context is remediation, where we delete the node to release its workload and then power-cycle the host).

So we may hit this flow:
1. CAPBM puts a finalizer on a node
2. The node is deleted
3. The nodelink controller removes machine.nodeRef
4. CAPBM reconciles the node change, but finds that machine.nodeRef is nil and thus can't do anything

Moreover, a Machine should have only one node during its lifecycle, so there is no reason to remove the nodeRef.
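
To make the race concrete, here is a minimal Python sketch of the flow above. It is a simplified model, not the nodelink controller's actual code: the Node and Machine classes, reconcile_node_ref, capbm_can_handle and the keep_ref_on_deleting_node switch are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a Kubernetes Node object."""
    name: str
    finalizers: List[str] = field(default_factory=list)
    deletion_requested: bool = False  # models a set deletionTimestamp


@dataclass
class Machine:
    """Hypothetical stand-in for a Machine; node_ref models machine.nodeRef."""
    name: str
    node_ref: Optional[str] = None


def reconcile_node_ref(machine: Machine, node: Node,
                       keep_ref_on_deleting_node: bool) -> None:
    """Model of one nodelink reconcile triggered by a node update.

    With keep_ref_on_deleting_node=False this mimics the reported behaviour:
    a node that is pending deletion (held back by a finalizer) causes the
    reference to be cleared. With True it mimics the expected behaviour.
    """
    if node.deletion_requested and not keep_ref_on_deleting_node:
        machine.node_ref = None
        return
    machine.node_ref = node.name


def capbm_can_handle(machine: Machine) -> bool:
    """Assumption: CAPBM needs machine.nodeRef to find the node it must save."""
    return machine.node_ref is not None


def run_flow(keep_ref_on_deleting_node: bool) -> Machine:
    node = Node("worker-0-1")
    machine = Machine("worker-0-machine", node_ref=node.name)
    node.finalizers.append("metal3.io/capbm")   # 1. CAPBM puts a finalizer
    node.deletion_requested = True              # 2. node is being deleted
    reconcile_node_ref(machine, node, keep_ref_on_deleting_node)  # 3. nodelink
    return machine                              # 4. CAPBM reconciles next
```

Calling run_flow(False) leaves the machine without a node_ref, so the CAPBM step has nothing to work with; run_flow(True) keeps the reference, which is the outcome this bug asks for.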

Version-Release number of selected component (if applicable): 4.6.0-0.ci-2020-07-21-114552

How reproducible: always 

Steps to Reproduce:
1. Delete a node that has a finalizer which prevents deletion

Actual results:
nodeRef was deleted

Expected results:
nodeRef is still present on the machine

Comment 4 Shelly Miron 2020-09-02 09:41:50 UTC

1. Took a node with a finalizer and deleted it:
$ oc edit node worker-0-1

apiVersion: v1
kind: Node
    k8s.ovn.org/l3-gateway-config: '{"default":{"mode":"shared","interface-id":"br-ex_worker-0-1","mac-address":"52:54:00:bf:c6:88","ip-addresses":[""],"ip-address":"","next-hops":[""],"next-hop":"","node-port-enable":"true","vlan-id":"0"}}'
    k8s.ovn.org/node-chassis-id: bd9f98b3-7880-4973-89f4-0864a5019d51
    k8s.ovn.org/node-join-subnets: '{"default":""}'
    k8s.ovn.org/node-local-nat-ip: '{"default":[""]}'
    k8s.ovn.org/node-mgmt-port-mac-address: 02:ed:bd:86:93:04
    k8s.ovn.org/node-primary-ifaddr: '{"ipv4":""}'
    k8s.ovn.org/node-subnets: '{"default":""}'
    machine.openshift.io/machine: openshift-machine-api/ocp-edge-cluster-0-p8jnq-worker-0-pjxjq
    machineconfiguration.openshift.io/currentConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-09-01T12:57:17Z"
  - metal3.io/capbm <<<<<---------

[kni@provisionhost-0-0 ~]$ oc delete node worker-0-1
node "worker-0-1" deleted

2. Checked the machine linked to the deleted node; nodeRef is still present on the machine:

    Type:        InternalIP
    Address:     fd00:1101::a7f7:def2:65ab:6c43
    Type:        InternalIP
    Address:     worker-0-1
    Type:        Hostname
    Address:     worker-0-1
    Type:        InternalDNS
  Last Updated:  2020-09-01T18:18:12Z
  Node Ref:    <<<<<<<---------------------------------
    Kind:  Node
    Name:  worker-0-1
    UID:   85799dba-8a16-4950-b5c2-ac76a0b70644
  Phase:   Running
Events:    <none>

Comment 7 errata-xmlrpc 2020-10-27 16:25:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

