Description of problem:
The AWS machine actuator gets stuck in an error loop if a machine is deleted and can no longer be found.
In particular it gets stuck here: https://github.com/openshift/cluster-api-provider-aws/blob/a815e7e7e6f7e2241e3c9de66793cc9154945c1c/pkg/actuators/machine/reconciler.go#L260-L263
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Delete a `machine` object whose `.spec.providerID` refers to an instance that doesn't exist
The controller returns an error on reconcile, which causes the reconcile to be requeued. However, there is no breakout condition (like a timeout or retry counter), so the loop continues perpetually.
The machine actuator should recognize that this is a delete operation, give up on finding the instance, and eventually clean up the machine object.
*** Bug 2011089 has been marked as a duplicate of this bug. ***
Verified: the machine can be deleted after its instance has been removed.
$ oc get machine -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
zhsunaws1018-58vff-worker-us-east-2c-sfj2g Running m5.large us-east-2 us-east-2c 14h ip-10-0-219-50.us-east-2.compute.internal aws:///us-east-2c/i-01f4559d15b9d2abc shutting-down
$ oc delete machine zhsunaws1018-58vff-worker-us-east-2c-sfj2g
machine.machine.openshift.io "zhsunaws1018-58vff-worker-us-east-2c-sfj2g" deleted
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.