Bug 2007802 - AWS machine actuator gets stuck if machine is completely missing
Summary: AWS machine actuator gets stuck if machine is completely missing
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.10.0
Assignee: Mike Fedosin
QA Contact: sunzhaohua
Duplicates: 2011089
Depends On:
Blocks: 2015605
Reported: 2021-09-24 23:54 UTC by Christoph Blecker
Modified: 2022-03-12 04:38 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The check that ensures the machine has not been updated before requeueing was accidentally removed. Consequence: When a machine's VM has been removed but the machine object still exists, the actuator requeues the machine in an infinite loop, preventing it from being deleted or updated. Fix: The check was restored. Result: The machine is no longer requeued if it has been updated.
Clone Of:
Last Closed: 2022-03-12 04:38:40 UTC
Target Upstream Version:

System                  ID                                           Private  Priority  Status  Summary                                                      Last Updated
Github                  openshift cluster-api-provider-aws pull 424  0        None      Merged  Bug 2007802: do not requeue if the machine has been updated  2022-02-03 00:09:52 UTC
Red Hat Product Errata  RHSA-2022:0056                               0        None      None    None                                                         2022-03-12 04:38:55 UTC

Description Christoph Blecker 2021-09-24 23:54:34 UTC
Description of problem:
The AWS machine actuator gets stuck in an error loop if a machine is deleted and can no longer be found.

In particular it gets stuck here: https://github.com/openshift/cluster-api-provider-aws/blob/a815e7e7e6f7e2241e3c9de66793cc9154945c1c/pkg/actuators/machine/reconciler.go#L260-L263

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Delete a `machine` object whose `.spec.providerID` refers to an instance that doesn't exist

Actual results:
The controller returns an error on reconcile, which causes the reconcile to be requeued. However, there is no breakout condition (like a timeout or retry counter), so the loop continues perpetually.
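
To make the missing breakout condition concrete, here is a minimal Go sketch of a retry counter around a reconcile step. It only illustrates the kind of limit mentioned above and is not how this bug was fixed (per the Doc Text, the fix restored a check). `errInstanceNotFound`, `reconcileFunc`, and `runWithRetryLimit` are hypothetical names, not part of the actuator.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound is a hypothetical error standing in for the
// "instance can no longer be found" failure.
var errInstanceNotFound = errors.New("instance not found")

// reconcileFunc is a hypothetical reconcile step. A non-nil error asks
// the caller to try again.
type reconcileFunc func() error

// runWithRetryLimit calls reconcile until it succeeds or maxAttempts is
// reached. The attempt limit is the breakout condition.
func runWithRetryLimit(reconcile reconcileFunc, maxAttempts int) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var err error
	attempts := 0
	for attempts < maxAttempts {
		attempts++
		if err = reconcile(); err == nil {
			return attempts, nil
		}
	}
	return attempts, fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	// The instance never comes back, so every attempt fails.
	alwaysMissing := func() error { return errInstanceNotFound }

	attempts, err := runWithRetryLimit(alwaysMissing, 3)
	fmt.Println("attempts:", attempts)
	fmt.Println("error:", err)
	fmt.Println("caused by missing instance:", errors.Is(err, errInstanceNotFound))
}
```

Without the `maxAttempts` bound, the loop in this sketch would never end for `alwaysMissing`, which mirrors the perpetual requeue described above.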

Expected results:
The machine actuator should eventually recognize that it's a delete operation, give up, and clean up the machine object.

Additional info:

Comment 2 Joel Speed 2021-10-08 12:04:51 UTC
*** Bug 2011089 has been marked as a duplicate of this bug. ***

Comment 5 sunzhaohua 2021-10-19 03:05:48 UTC
clusterversion: 4.10.0-0.nightly-2021-10-16-173656

The machine could be deleted after its instance was removed.

$ oc get machine -o wide
NAME                                         PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
zhsunaws1018-58vff-worker-us-east-2c-sfj2g   Running   m5.large    us-east-2   us-east-2c   14h   ip-10-0-219-50.us-east-2.compute.internal    aws:///us-east-2c/i-01f4559d15b9d2abc   shutting-down
$ oc delete machine zhsunaws1018-58vff-worker-us-east-2c-sfj2g
machine.machine.openshift.io "zhsunaws1018-58vff-worker-us-east-2c-sfj2g" deleted

Comment 8 errata-xmlrpc 2022-03-12 04:38:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

