2007802 – AWS machine actuator get stuck if machine is completely missing

Bug 2007802 - AWS machine actuator get stuck if machine is completely missing

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Mike Fedosin
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	2011089
Depends On:
Blocks:	2015605

Reported:	2021-09-24 23:54 UTC by Christoph Blecker
Modified:	2022-03-12 04:38 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The check that ensures the machine has not been updated before requeueing was accidentally removed. Consequence: When a machine's VM has been removed but the machine object still exists, the controller requeues the machine in an infinite loop, which prevents it from being deleted or updated. Fix: Restore the check. Result: The machine is no longer requeued if it has been updated.
Clone Of:
Environment:
Last Closed:	2022-03-12 04:38:40 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-aws pull 424	0	None	Merged	Bug 2007802: do not requeue if the machine has been updated	2022-02-03 00:09:52 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-12 04:38:55 UTC

Description Christoph Blecker 2021-09-24 23:54:34 UTC

Description of problem:
The AWS machine actuator gets stuck in an error loop if a machine is deleted and can no longer be found.

In particular it gets stuck here: https://github.com/openshift/cluster-api-provider-aws/blob/a815e7e7e6f7e2241e3c9de66793cc9154945c1c/pkg/actuators/machine/reconciler.go#L260-L263

Version-Release number of selected component (if applicable):
4.9.0-rc.1


How reproducible:
Consistent


Steps to Reproduce:
1. Delete a `machine` object whose `.spec.providerID` refers to an instance that doesn't exist

Actual results:
The controller returns an error on reconcile, which causes the reconcile to be requeued. However, there is no breakout condition (like a timeout or retry counter), so the loop continues perpetually.


Expected results:
The machine actuator should eventually recognize that this is a delete operation, give up, and clean up the machine object.


Additional info:

Comment 2 Joel Speed 2021-10-08 12:04:51 UTC

*** Bug 2011089 has been marked as a duplicate of this bug. ***

Comment 5 sunzhaohua 2021-10-19 03:05:48 UTC

verified
clusterversion: 4.10.0-0.nightly-2021-10-16-173656

machine could be deleted if instance was removed.

$ oc get machine -o wide
NAME                                         PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
zhsunaws1018-58vff-worker-us-east-2c-sfj2g   Running   m5.large    us-east-2   us-east-2c   14h   ip-10-0-219-50.us-east-2.compute.internal    aws:///us-east-2c/i-01f4559d15b9d2abc   shutting-down
$ oc delete machine zhsunaws1018-58vff-worker-us-east-2c-sfj2g
machine.machine.openshift.io "zhsunaws1018-58vff-worker-us-east-2c-sfj2g" deleted

Comment 8 errata-xmlrpc 2022-03-12 04:38:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
