Description of problem:
If a node gets deleted, the nodelink controller removes the nodeRef from the machine. This happens only if something delays the deletion, such as a node finalizer; otherwise the nodelink controller is never notified that the node was deleted. As a result, we can reach different outcomes depending on how long the node takes to be deleted: if the node is deleted immediately, the nodeRef stays on the machine; if there is some delay (e.g. due to a finalizer), the nodeRef is removed.

Apart from that, it creates a race with CAPBM. CAPBM places a finalizer on each node in order to store annotations and labels before the node is deleted, and to restore them after the node comes up again (the context is remediation, where we delete the node to release its workload and then power-cycle the host). So we may hit this flow:
1. CAPBM puts a finalizer on a node
2. the node is deleted
3. the nodelink controller removes machine.nodeRef
4. CAPBM reconciles that node change, but finds that machine.nodeRef is nil and thus can't do anything

Moreover, a Machine should have only one node in its lifecycle, so there is no reason to remove the nodeRef at all.

Version-Release number of selected component (if applicable):
4.6.0-0.ci-2020-07-21-114552

How reproducible:
always

Steps to Reproduce:
1. delete a node that has a finalizer which prevents deletion

Actual results:
nodeRef was deleted

Expected results:
nodeRef is still present on the machine
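For concreteness, a minimal reproduction sketch using oc. The finalizer name "example.com/block-deletion" is an illustrative placeholder (any finalizer that blocks deletion will do), and the node/machine names are example values:

# add a placeholder finalizer (note: a merge patch replaces the whole finalizers list)
$ oc patch node worker-0-1 --type=merge -p '{"metadata":{"finalizers":["example.com/block-deletion"]}}'

# delete the node; the finalizer keeps it in Terminating long enough for the nodelink controller to observe the deletion
$ oc delete node worker-0-1 --wait=false

# inspect the owning machine's nodeRef; before the fix it would be cleared here
$ oc get machine -n openshift-machine-api ocp-edge-cluster-0-p8jnq-worker-0-pjxjq -o jsonpath='{.status.nodeRef}'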
Verified.

Steps:
--------------
1. Took a node with a finalizer and deleted it:

$ oc edit node worker-0-1

apiVersion: v1
kind: Node
metadata:
  annotations:
    k8s.ovn.org/l3-gateway-config: '{"default":{"mode":"shared","interface-id":"br-ex_worker-0-1","mac-address":"52:54:00:bf:c6:88","ip-addresses":["192.168.123.115/24"],"ip-address":"192.168.123.115/24","next-hops":["192.168.123.1"],"next-hop":"192.168.123.1","node-port-enable":"true","vlan-id":"0"}}'
    k8s.ovn.org/node-chassis-id: bd9f98b3-7880-4973-89f4-0864a5019d51
    k8s.ovn.org/node-join-subnets: '{"default":"100.64.3.0/29"}'
    k8s.ovn.org/node-local-nat-ip: '{"default":["169.254.6.11"]}'
    k8s.ovn.org/node-mgmt-port-mac-address: 02:ed:bd:86:93:04
    k8s.ovn.org/node-primary-ifaddr: '{"ipv4":"192.168.123.115/24"}'
    k8s.ovn.org/node-subnets: '{"default":"10.131.0.0/23"}'
    machine.openshift.io/machine: openshift-machine-api/ocp-edge-cluster-0-p8jnq-worker-0-pjxjq
    machineconfiguration.openshift.io/currentConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-9d6385f86a6d42a91d49bc184fc42bab
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-09-01T12:57:17Z"
  finalizers:
  - metal3.io/capbm     <<<<<---------
  labels:
    ..........
  .....................
...................................

[kni@provisionhost-0-0 ~]$ oc delete node worker-0-1
node "worker-0-1" deleted

2. Checked the machine connected to the node we deleted; nodeRef is still present on the machine:

..........................................
..................................
................
Status:
  Addresses:
    Address:      192.168.123.115
    Type:         InternalIP
    Address:      fd00:1101::a7f7:def2:65ab:6c43
    Type:         InternalIP
    Address:      worker-0-1
    Type:         Hostname
    Address:      worker-0-1
    Type:         InternalDNS
  Last Updated:   2020-09-01T18:18:12Z
  Node Ref:       <<<<<<<---------------------------------
    Kind:         Node
    Name:         worker-0-1
    UID:          85799dba-8a16-4950-b5c2-ac76a0b70644
  Phase:          Running
Events:           <none>
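An equivalent direct check of the field, instead of reading the full describe output:

$ oc get machine -n openshift-machine-api ocp-edge-cluster-0-p8jnq-worker-0-pjxjq -o jsonpath='{.status.nodeRef.name}'
worker-0-1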
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196