2065160 – Possible leak of load balancer targets on AWS Machine API Provider

Summary: Possible leak of load balancer targets on AWS Machine API Provider

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Joel Speed
QA Contact:	Milind Yadav
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-03-17 12:18 UTC by Joel Speed
Modified:	2022-08-10 10:55 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Failure to deregister IP based load balancer attachments did not result in the Machine being blocked from removal. Consequence: Spurious IP based load balancer attachments could remain within the load balancer registration when replacing control plane machines. Fix: Ensure that IP based attachments are removed from the load balancer before we remove the EC2 instance on AWS. Result: IP based load balancer attachments are no longer spuriously left behind when replacing machines.
Clone Of:
Environment:
Last Closed:	2022-08-10 10:54:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-api-provider-aws pull 28	0	None	open	Bug 2065160: Ensure IP based NLB targets are deregistered before Machines are removed	2022-03-17 14:29:30 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 10:55:04 UTC

Description Joel Speed 2022-03-17 12:18:15 UTC

Description of problem:
When an AWS Machine uses NLB load balancer registration via MAPI, the registration type can be either IP based or instance ID based. The type is set when the target group is created, so for the control plane it is IP based, due to limitations of the installer.

When we terminate a node, we first check whether the instance exists, then terminate the instance, and only THEN remove it from the load balancer if it is IP based (instance ID based registrations are removed automatically).

If the IP based deregistration fails, we may leak the target registration, because the instance can go away before we manage to deregister it successfully.
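The ordering problem described above can be sketched as follows. This is an illustrative Python sketch, not the provider's actual Go code: `delete_machine` and `DeregistrationError` are hypothetical names, and the two client arguments are assumed to expose the boto3-style `deregister_targets` and `terminate_instances` calls. The point it shows is that deregistration happens first and a failure stops the instance from being terminated.

```python
class DeregistrationError(Exception):
    """Raised when an IP based target could not be removed from a target group."""


def delete_machine(elb_client, ec2_client, instance_id, private_ip, ip_target_group_arns):
    """Deregister IP based targets first; terminate the instance only if that succeeds.

    elb_client and ec2_client are assumed to behave like boto3 "elbv2" and "ec2"
    clients. All names here are hypothetical and chosen for illustration.
    """
    for arn in ip_target_group_arns:
        try:
            elb_client.deregister_targets(
                TargetGroupArn=arn,
                Targets=[{"Id": private_ip}],
            )
        except Exception as err:
            # Stop here: the instance (and so the Machine) stays around,
            # which lets a later reconcile retry the deregistration.
            raise DeregistrationError(
                f"failed to deregister {private_ip} from {arn}"
            ) from err

    # Only reached when every IP based target was deregistered.
    ec2_client.terminate_instances(InstanceIds=[instance_id])
```

With this ordering, a permissions failure on deregistration surfaces as an error on the delete path instead of leaving an orphaned IP target behind.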

Version-Release number of selected component (if applicable):


How reproducible:
100% (theoretical, haven't actually tried)


Steps to Reproduce:
1. Remove the DeregisterTargets permission from the Machine API AWS Credentials Request (CVO needs to be turned off for this)
2. Create a Machine which uses an IP based target group NLB
3. Delete the Machine. Deregistration should fail, but this doesn't block the Machine from being removed from the cluster
4. Once the Machine is gone, its IP will still be listed as a target for the NLB

Actual results:
The IP address is left behind as an NLB target even after the Machine has gone away.

Expected results:
Machine should not go away until the NLB target is successfully deregistered.
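One way to check for the leak after running the steps above is to compare the targets still registered in the target group with the IPs of the Machines that still exist. The sketch below is a hypothetical helper (`leaked_ip_targets` is not part of any tool mentioned in this bug). It assumes its first argument has the shape of the `TargetHealthDescriptions` list returned by the ELBv2 `DescribeTargetHealth` API, and it skips targets in the `draining` state because those are already being deregistered.

```python
def leaked_ip_targets(target_health_descriptions, live_machine_ips):
    """Return registered IP targets that no longer belong to any live Machine.

    target_health_descriptions: list shaped like the "TargetHealthDescriptions"
    field of a DescribeTargetHealth response (assumed input shape).
    live_machine_ips: iterable of private IPs of Machines that still exist.
    """
    live = set(live_machine_ips)
    leaked = []
    for desc in target_health_descriptions:
        ip = desc["Target"]["Id"]
        state = desc.get("TargetHealth", {}).get("State")
        if state == "draining":
            # Deregistration is already in progress for this target.
            continue
        if ip not in live:
            leaked.append(ip)
    return sorted(leaked)
```

An empty result means every registered IP still maps to an existing Machine; any IP returned is a candidate for manual deregistration.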

Additional info:

Comment 3 Milind Yadav 2022-03-22 06:49:29 UTC

Hi Joel, I tried it on 4.10 by deleting a master that was behind the NLB. The target got deregistered successfully, so I was not able to reproduce it.
Do you suggest any other way to reproduce?

Cluster version is 4.10.0-0.nightly-2022-03-19-230512

Steps:
Cluster installed with masters registered behind an NLB.
Deleted a master.
Master deleted successfully and also deregistered from the NLB.


Additional info:
[miyadav@miyadav ~]$ oc get machine -o wide
NAME                                         PHASE     TYPE         REGION      ZONE         AGE    NODE                                         PROVIDERID                              STATE
miyadav-2203-kfq7h-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   173m   ip-10-0-139-34.us-east-2.compute.internal    aws:///us-east-2a/i-0d546d8294fc09c32   running
miyadav-2203-kfq7h-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   173m   ip-10-0-178-95.us-east-2.compute.internal    aws:///us-east-2b/i-01737aec6d9a658be   running
miyadav-2203-kfq7h-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   173m   ip-10-0-222-91.us-east-2.compute.internal    aws:///us-east-2c/i-0a3e4af7b4cd8dde0   running
miyadav-2203-kfq7h-worker-us-east-2a-n4qvc   Running   m6i.large    us-east-2   us-east-2a   170m   ip-10-0-153-203.us-east-2.compute.internal   aws:///us-east-2a/i-0dea7c5ca1d3a07e0   running
miyadav-2203-kfq7h-worker-us-east-2b-6tmzt   Running   m6i.large    us-east-2   us-east-2b   170m   ip-10-0-191-98.us-east-2.compute.internal    aws:///us-east-2b/i-06f9985f81bea5732   running
miyadav-2203-kfq7h-worker-us-east-2c-s9hj9   Running   m6i.large    us-east-2   us-east-2c   170m   ip-10-0-212-182.us-east-2.compute.internal   aws:///us-east-2c/i-06c5ca9a10cf79554   running
[miyadav@miyadav ~]$ oc delete machine miyadav-2203-kfq7h-master-0
machine.machine.openshift.io "miyadav-2203-kfq7h-master-0" deleted
[miyadav@miyadav ~]$ 

Deleted the 10.0.139.34 instance.

Comment 5 Milind Yadav 2022-03-22 06:53:58 UTC

It is still registered, so a deregister is needed. I will create a case for this (attaching a snap which makes it clearer). Thanks, Joel.

Comment 7 Milind Yadav 2022-03-22 07:52:24 UTC

Validated on cluster version 4.11.0-0.nightly-2022-03-20-160505

Steps:
1. Create a cluster with masters behind an NLB.
Cluster created successfully.

2. Take a note of the registered targets.

3. Delete one of the registered targets.
Expected and actual: Machine deleted successfully after some time (2-3 mins).


4. Validate that the target is deregistered.

Actual and expected: Target deregistered successfully after draining.


Additional info:
Snaps attached to the test case in Polarion.
Moved to VERIFIED

Comment 8 Joel Speed 2022-03-22 10:12:40 UTC

The only way I could think of to force this would be to remove the `DeregisterTargets` permission from the Machine API credentials request (if you do this, you need to disable CVO first). Once the credentials have synced, you should see that in the old case the Machine would still go away. In the new case, the Machine will not go away until the target is removed (either via the AWS console or by fixing the permissions issue).
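Before deleting the Machine in this test, it can help to confirm that the permission really is gone from the synced policy. The helper below is a hypothetical sketch (`allows_deregister` is an illustrative name). It assumes you have already collected the list of allowed IAM action strings from the policy, and it only looks at action names: it ignores Deny statements and resource scoping. The IAM action for the ELBv2 call is `elasticloadbalancing:DeregisterTargets`.

```python
from fnmatch import fnmatchcase

# IAM action that authorizes the ELBv2 DeregisterTargets call.
DEREGISTER_ACTION = "elasticloadbalancing:DeregisterTargets"


def allows_deregister(allowed_actions):
    """Return True if any allowed action string covers DeregisterTargets.

    allowed_actions: iterable of IAM action strings, possibly with "*" wildcards.
    Simplification (assumption): only Allow actions are considered; Deny
    statements and resource restrictions are not evaluated.
    """
    wanted = DEREGISTER_ACTION.lower()
    # IAM action names are case-insensitive, so compare in lower case.
    return any(fnmatchcase(wanted, pattern.lower()) for pattern in allowed_actions)
```

If this returns False for the synced credentials, deleting the Machine should exercise the failure path described in this comment.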

Comment 10 errata-xmlrpc 2022-08-10 10:54:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
