Bug 2065160
Summary: | Possible leak of load balancer targets on AWS Machine API Provider | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Joel Speed <jspeed>
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed>
Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav>
Status: | CLOSED ERRATA | Severity: | medium
Priority: | medium | Version: | 4.10
Target Milestone: | --- | Target Release: | 4.11.0
Hardware: | Unspecified | OS: | Unspecified
Doc Type: | Bug Fix | Type: | Bug
Story Points: | --- | Regression: | ---
Last Closed: | 2022-08-10 10:54:38 UTC | |

Doc Text:

Cause: Failure to deregister IP-based load balancer attachments did not result in the Machine being blocked from removal.
Consequence: Spurious IP-based load balancer attachments could remain within the load balancer registration when replacing control plane machines.
Fix: Ensure that IP-based attachments are removed from the load balancer before we remove the EC2 instance on AWS.
Result: IP-based load balancer attachments are no longer spuriously left behind when replacing machines.
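The doc text above describes an ordering constraint: deregistration must succeed before the instance is terminated, and a deregistration failure must block Machine removal. Below is a minimal Go sketch of that ordering against the aws-sdk-go elbv2 client; the function names (deregisterIPTarget, reconcileDelete) and the terminate callback are illustrative assumptions, not the actual machine-api-provider-aws code.

```go
package awslb

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/elbv2"
	"github.com/aws/aws-sdk-go/service/elbv2/elbv2iface"
)

// deregisterIPTarget removes a single IP-based target from one target group.
func deregisterIPTarget(client elbv2iface.ELBV2API, targetGroupARN, instanceIP string) error {
	_, err := client.DeregisterTargets(&elbv2.DeregisterTargetsInput{
		TargetGroupArn: aws.String(targetGroupARN),
		Targets: []*elbv2.TargetDescription{
			{Id: aws.String(instanceIP)},
		},
	})
	return err
}

// reconcileDelete sketches the fixed ordering: every target group must be
// cleaned up before the EC2 instance is terminated.
func reconcileDelete(client elbv2iface.ELBV2API, targetGroupARNs []string, instanceIP string, terminate func() error) error {
	for _, arn := range targetGroupARNs {
		if err := deregisterIPTarget(client, arn, instanceIP); err != nil {
			// Before the fix, a failure here did not stop deletion, so the
			// stale IP target stayed registered after the instance was gone.
			return fmt.Errorf("failed to deregister %s from %s: %w", instanceIP, arn, err)
		}
	}
	return terminate()
}
```

Propagating the deregistration error is the behavioural change: the controller retries the deletion instead of terminating the instance and leaving a stale IP target behind.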
Description
Joel Speed
2022-03-17 12:18:15 UTC
Milind Yadav:

Hi Joel, I tried this on 4.10 by deleting a master that was behind the NLB. The target was deregistered successfully, so I was not able to reproduce the leak. Do you suggest any other way to reproduce it?

Cluster version is 4.10.0-0.nightly-2022-03-19-230512

Steps:
1. Install a cluster with masters registered behind an NLB.
2. Delete a master.

The master was deleted successfully and was also deregistered from the NLB.

Additional info:

[miyadav@miyadav ~]$ oc get machine -o wide
NAME                                         PHASE     TYPE         REGION      ZONE         AGE    NODE                                          PROVIDERID                              STATE
miyadav-2203-kfq7h-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   173m   ip-10-0-139-34.us-east-2.compute.internal     aws:///us-east-2a/i-0d546d8294fc09c32   running
miyadav-2203-kfq7h-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   173m   ip-10-0-178-95.us-east-2.compute.internal     aws:///us-east-2b/i-01737aec6d9a658be   running
miyadav-2203-kfq7h-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   173m   ip-10-0-222-91.us-east-2.compute.internal     aws:///us-east-2c/i-0a3e4af7b4cd8dde0   running
miyadav-2203-kfq7h-worker-us-east-2a-n4qvc   Running   m6i.large    us-east-2   us-east-2a   170m   ip-10-0-153-203.us-east-2.compute.internal    aws:///us-east-2a/i-0dea7c5ca1d3a07e0   running
miyadav-2203-kfq7h-worker-us-east-2b-6tmzt   Running   m6i.large    us-east-2   us-east-2b   170m   ip-10-0-191-98.us-east-2.compute.internal     aws:///us-east-2b/i-06f9985f81bea5732   running
miyadav-2203-kfq7h-worker-us-east-2c-s9hj9   Running   m6i.large    us-east-2   us-east-2c   170m   ip-10-0-212-182.us-east-2.compute.internal    aws:///us-east-2c/i-06c5ca9a10cf79554   running

[miyadav@miyadav ~]$ oc delete machine miyadav-2203-kfq7h-master-0
machine.machine.openshift.io "miyadav-2203-kfq7h-master-0" deleted

Milind Yadav:

Deleted the 10.0.139.34 instance. Its target is still registered and needs to be deregistered, so I will create a case for this (attaching a snap which makes it clearer). Thanks, Joel.

Milind Yadav:

Validated at: Cluster version is 4.11.0-0.nightly-2022-03-20-160505

Steps:
1. Create a cluster with masters behind an NLB. The cluster was created successfully.
2. Take note of the registered targets.
3. Delete one of the registered targets (delete its Machine).
   Expected and actual: the Machine was deleted successfully after some time (2-3 minutes).
4. Validate that the target is deregistered.
   Expected and actual: the target was deregistered successfully after draining.

Additional info: Snaps attached to the test case in Polarion.

Moved to VERIFIED.

Joel Speed:

The only way I could think to force this would be to remove the `DeregisterTargets` permission from the Machine API credentials request (if you do this, you need to disable the CVO first). Once the credentials have synced, what you should see is that in the old case, the Machine would still go away. In the new case, the Machine will not go away until the target is removed (either via the AWS console, or by fixing the permissions issue). A sketch of this permission change follows after the closing comment below.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
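As a hedged sketch of the reproduction Joel describes above: drop the `DeregisterTargets` permission so that deregistration fails, then delete a master. The resource names here (the cluster-version-operator deployment, the openshift-machine-api-aws CredentialsRequest, and the elasticloadbalancing:DeregisterTargets action) match common OpenShift 4.x defaults but should be verified against your cluster; do this only on a disposable test cluster.

```console
# Stop the CVO so it does not revert the CredentialsRequest edit.
$ oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0

# Remove the elasticloadbalancing:DeregisterTargets action from the
# Machine API statement entries in the CredentialsRequest.
$ oc edit credentialsrequest openshift-machine-api-aws -n openshift-cloud-credential-operator

# After the credentials sync, delete a master Machine. With the fix, the
# Machine should stay in Deleting until the target is removed via the AWS
# console or the permission is restored. Stale targets can be spotted with:
$ aws elbv2 describe-target-health --target-group-arn <target-group-arn>
```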