Bug 2007802

Summary: AWS machine actuator gets stuck if machine is completely missing
Product: OpenShift Container Platform Reporter: Christoph Blecker <cblecker>
Component: Cloud Compute Assignee: Mike Fedosin <mfedosin>
Cloud Compute sub component: Other Providers QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: low CC: mbargenq, travi
Version: 4.9 Keywords: ServiceDeliveryImpact
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The check that ensures the machine has not been updated before requeueing was accidentally removed. Consequence: When a machine's VM has been removed but the machine object still exists, the controller requeues the machine in an infinite loop, preventing it from being deleted or updated. Fix: Restore the check. Result: The machine is no longer requeued if it has been updated.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-12 04:38:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2015605    

Description Christoph Blecker 2021-09-24 23:54:34 UTC
Description of problem:
The AWS machine actuator gets stuck in an error loop if a machine is deleted and can no longer be found.

In particular it gets stuck here: https://github.com/openshift/cluster-api-provider-aws/blob/a815e7e7e6f7e2241e3c9de66793cc9154945c1c/pkg/actuators/machine/reconciler.go#L260-L263

Version-Release number of selected component (if applicable):
4.9.0-rc.1


How reproducible:
Consistent


Steps to Reproduce:
1. Delete a `machine` object whose `.spec.providerID` refers to an instance that doesn't exist

Actual results:
The controller returns an error on reconcile, which causes the reconcile to be requeued. However, there is no breakout condition (like a timeout or retry counter), so the loop continues perpetually.


Expected results:
The machine actuator would eventually recognize that this is a delete operation, give up, and clean up the machine object.


Additional info:

Comment 2 Joel Speed 2021-10-08 12:04:51 UTC
*** Bug 2011089 has been marked as a duplicate of this bug. ***

Comment 5 sunzhaohua 2021-10-19 03:05:48 UTC
verified
clusterversion: 4.10.0-0.nightly-2021-10-16-173656

machine could be deleted if instance was removed.

$ oc get machine -o wide
NAME                                         PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
zhsunaws1018-58vff-worker-us-east-2c-sfj2g   Running   m5.large    us-east-2   us-east-2c   14h   ip-10-0-219-50.us-east-2.compute.internal    aws:///us-east-2c/i-01f4559d15b9d2abc   shutting-down
$ oc delete machine zhsunaws1018-58vff-worker-us-east-2c-sfj2g
machine.machine.openshift.io "zhsunaws1018-58vff-worker-us-east-2c-sfj2g" deleted

Comment 8 errata-xmlrpc 2022-03-12 04:38:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056