1827589 – Machine controller: Failed to remove finalizer from machine

Bug 1827589 - Machine controller: Failed to remove finalizer from machine

Summary: Machine controller: Failed to remove finalizer from machine

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Joel Speed
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-04-24 09:33 UTC by Jianwei Hou
Modified:	2020-04-29 12:00 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-04-29 12:00:11 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Jianwei Hou 2020-04-24 09:33:40 UTC

Description of problem:
On deleting a machine, machine and node are deleted successfully, but machine controller logged an error:
```
Failed to remove finalizer from machine "jhou-49x9c-worker-1-8j84h": Operation cannot be fulfilled on machines.machine.openshift.io "jhou-49x9c-worker-1-8j84h": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machine.openshift.io/machines/openshift-machine-api/jhou-49x9c-worker-1-8j84h, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 361ad86f-54c0-4ad2-a2bf-336b0bed55f3, UID in object meta:

```

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-21-103613

How reproducible:
Always

Steps to Reproduce:
1. Delete a machine, or scale down a machinset
2. Monitor machine controller log

Actual results:
Machine and node are successfully deleted, but there is an error in machine controller:
```
I0424 08:23:34.499466       1 controller.go:428] Machine "jhou-49x9c-worker-1-8j84h" going into phase "Deleting"
I0424 08:23:34.517717       1 controller.go:227] Reconciling machine "jhou-49x9c-worker-1-8j84h" triggers delete
I0424 08:23:34.542105       1 info.go:20] cordoned node "jhou-49x9c-worker-1-8j84h"
I0424 08:23:34.668164       1 info.go:16] ignoring DaemonSet-managed pods: tuned-sdj49, dns-default-cfsb7, node-ca-kpxhv, machine-config-daemon-tq79j, node-exporter-hd552, multus-2g82b, ovs-7cz2l, sdn-ql2wv
I0424 08:23:34.668196       1 info.go:20] drained node "jhou-49x9c-worker-1-8j84h"
I0424 08:23:34.668212       1 controller.go:368] drain successful for machine "jhou-49x9c-worker-1-8j84h"
I0424 08:23:34.680932       1 machineservice.go:230] Cloud provider CA cert not provided, using system trust bundle
I0424 08:23:34.804892       1 machineservice.go:230] Cloud provider CA cert not provided, using system trust bundle
I0424 08:23:50.088208       1 controller.go:254] Deleting node "jhou-49x9c-worker-1-8j84h" for machine "jhou-49x9c-worker-1-8j84h"
I0424 08:23:50.144537       1 controller.go:268] Machine "jhou-49x9c-worker-1-8j84h" deletion successful
I0424 08:23:50.144622       1 controller.go:164] Reconciling Machine "jhou-49x9c-worker-1-8j84h"
I0424 08:23:50.144631       1 controller.go:376] Machine "jhou-49x9c-worker-1-8j84h" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0424 08:23:50.144641       1 controller.go:227] Reconciling machine "jhou-49x9c-worker-1-8j84h" triggers delete
I0424 08:23:50.179136       1 controller.go:343] Could not find node from noderef, it may have already been deleted: jhou-49x9c-worker-1-8j84h
I0424 08:23:50.191768       1 machineservice.go:230] Cloud provider CA cert not provided, using system trust bundle
I0424 08:23:55.298947       1 machineservice.go:230] Cloud provider CA cert not provided, using system trust bundle
I0424 08:24:00.558353       1 actuator.go:314] Skipped deleting jhou-49x9c-worker-1-8j84h that is already deleted.
I0424 08:24:00.558514       1 controller.go:254] Deleting node "jhou-49x9c-worker-1-8j84h" for machine "jhou-49x9c-worker-1-8j84h"
I0424 08:24:00.558576       1 controller.go:397] Node "jhou-49x9c-worker-1-8j84h" not found
E0424 08:24:00.686310       1 controller.go:264] Failed to remove finalizer from machine "jhou-49x9c-worker-1-8j84h": Operation cannot be fulfilled on machines.machine.openshift.io "jhou-49x9c-worker-1-8j84h": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machine.openshift.io/machines/openshift-machine-api/jhou-49x9c-worker-1-8j84h, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 361ad86f-54c0-4ad2-a2bf-336b0bed55f3, UID in object meta:
```

Expected results:
Should not expect error in the log


Additional info:

Comment 1 Joel Speed 2020-04-24 13:09:14 UTC

@Jianwei Could you confirm for me whether or not the Machine doesn't actually get deleted? It looks to me like the cache is going stale and the controller is reconciling a second time after already removing the finalizer, so it gets a conflict when it tries to update the finalizer for a second time. If that's the case, I'm not really sure what we can do in that case apart from either not using a cached client or detecting this error in particular and ignoring it/not logging it

Comment 2 Jianwei Hou 2020-04-26 08:20:34 UTC

Hi Joel, as I checked from the openshift-machine-api namespace, they were deleted, but I didn't check whether they were deleted from the cloud provider side(openstack console). I'm trying to reproduce this today, sorry but I can not reproduce it until now. 

Lowering the priority and if we can't reproduce this in next few days, we can close this one.

Comment 3 Alberto 2020-04-29 12:00:11 UTC

Closing as per https://bugzilla.redhat.com/show_bug.cgi?id=1827589#c2

Please reopen if still relevant

Note You need to log in before you can comment on or make changes to this bug.