1880757 – AWS: master not removed from LB/target group when machine deleted

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Danil Grigorev
QA Contact:	Milind Yadav
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-19 17:22 UTC by Michael Gugino
Modified:	2021-07-27 22:33 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: On AWS, a master machine was not removed from the load balancer target groups when the machine was deleted. Consequence: The load balancer continued to send requests to removed master machines, even though their IP addresses no longer pointed anywhere. Fix: A machine is now de-registered from the load balancer if it has an IP registration. Result: Removing a machine through machine-api-operator updates the load balancer attachments accordingly.
Clone Of:
Environment:
Last Closed:	2021-07-27 22:33:27 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-aws pull 389	None	open	Bug 1880757: Unset target groups from LB on deletion	2021-03-01 09:58:01 UTC
Github	openshift machine-api-operator pull 835	None	open	Bug 1880757: Add missing permission for target group de-registration	2021-03-26 12:30:33 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 22:33:45 UTC

Description Michael Gugino 2020-09-19 17:22:02 UTC

Description of problem:
Deleted a master machine via machine-api on AWS. The machine is not removed from the appropriate load balancers.

New machines are added to the load balancers, though.

Version-Release number of selected component (if applicable):


How reproducible:
TBD.

Steps to Reproduce:
1. Delete an existing master that was created by the installer.

Actual results:
In the EC2 console, the master is still listed as a backend for both the internal and external LBs.

Expected results:
It should not be present after deletion.

Additional info:
Master machine config
      loadBalancers:
      - name: mgugino-deva6-zcchn-int
        type: network
      - name: mgugino-deva6-zcchn-ext
        type: network

Comment 1 Joel Speed 2020-09-30 11:15:53 UTC

We will need to investigate or reassign this during the next sprint. It's not clear to me at present which component is responsible for the load balancer attachment of VMs. Is this a problem with Machine API and the way we are creating machines? As far as I was aware, we do not touch load balancer attachments.

Comment 2 Michael McCune 2020-12-04 18:50:14 UTC

Adding the UpcomingSprint tag; the team is still investigating this issue.

Comment 3 Joel Speed 2020-12-16 12:09:26 UTC

In AWS, if you register an instance to the target group using its instance ID, as is done in the Machine API provider, then you do not need to remove it, because it is automatically removed when the instance is terminated. Hence, the Machine API provider does not have code to remove load balancer attachments; it assumes all instances are attached using instance ID.

However, the installer registers the instances by IP address [1], so manual removal is needed.

There are limitations to using instance IDs, such as not being able to use certain instance types [2]:

> You cannot register instances by instance ID if they use one of the following instance types: C1, CC1, CC2, CG1, CG2, CR1, G1, G2, HI1, HS1, M1, M2, M3, or T1.

It would be good to understand whether registering by IP was a conscious decision on the installer team side or whether they might consider using instance IDs going forward?

If there's a technical reason the installer can't use instance IDs, then the Machine API should implement deregister logic for the load balancer attachments. Given that clusters have already been installed using the IP-based attachments, we may want to do this anyway to catch instances installed in this way.

[1]: https://github.com/openshift/installer/blob/c0f508287415fd3bf489b214b0132f75e3c03c9f/data/data/aws/master/main.tf#L171
[2]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-register-targets.html
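
The distinction drawn in this comment can be sketched as a small check: a target group whose targets are registered by IP needs explicit cleanup, while one registered by instance ID is, per the comment, cleaned up by AWS on termination. This is only an illustration, not the provider's code. The function name is hypothetical; the `TargetType` key is assumed to come from one entry of an ELBv2 DescribeTargetGroups response.

```python
def needs_manual_deregistration(target_group: dict) -> bool:
    """Return True when the group's targets are registered by IP address.

    `target_group` is assumed to be one entry of the "TargetGroups" list
    from an ELBv2 DescribeTargetGroups response.
    """
    return target_group.get("TargetType") == "ip"


# Hypothetical entries: one group registered by IP (as the installer does),
# one registered by instance ID (as the Machine API provider does).
installer_group = {"TargetGroupName": "example-aint", "TargetType": "ip"}
provider_group = {"TargetGroupName": "example-other", "TargetType": "instance"}

print(needs_manual_deregistration(installer_group))
print(needs_manual_deregistration(provider_group))
```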

Comment 4 Michael Gugino 2020-12-16 14:06:01 UTC

It should be simple enough to add the removal logic to the deletion flow of a machine. For existing clusters, my preference is to send out some kind of advisory to clean up anything that might be lingering from before we added this feature. The reason is that we don't know what users might have added behind this LB since creation time, and we can't go removing just anything that doesn't match a machine object. This is especially true for a UPI cluster.

Comment 5 Joel Speed 2020-12-16 14:13:20 UTC

Agreed, we don't want to affect any manual changes. We should only remove the load balancer attachment if it matches the instance ID of the Machine we are deleting, or it matches the IP of the Machine we are deleting.
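
A minimal sketch of the matching rule agreed on here, assuming the registered target IDs of a target group are already known: only targets equal to the deleted Machine's instance ID or one of its IPs are selected, and anything else (for example, a target a user added by hand) is left alone. The function name and the sample values are illustrative, not taken from the provider's code.

```python
from typing import Iterable, List


def targets_to_deregister(
    registered_ids: Iterable[str],
    machine_instance_id: str,
    machine_ips: Iterable[str],
) -> List[str]:
    """Return only the registered target IDs that belong to the deleted machine."""
    owned = {machine_instance_id, *machine_ips}
    return [target_id for target_id in registered_ids if target_id in owned]


# Hypothetical target group contents: two masters registered by IP, one
# registration by instance ID, and an address added manually by a user.
registered = ["10.0.145.39", "10.0.178.16", "i-0f7c4d1e58a97e8c8", "192.168.7.7"]
matches = targets_to_deregister(
    registered,
    machine_instance_id="i-0f7c4d1e58a97e8c8",
    machine_ips=["10.0.145.39"],
)
print(matches)
```

Selecting by exact match against the Machine's own identifiers keeps manual changes behind the load balancer untouched, which is the concern raised for UPI clusters.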

Comment 7 Joel Speed 2021-02-08 10:23:05 UTC

I think we have an agreed plan of action here, setting target for 4.8

Comment 11 Danil Grigorev 2021-03-26 09:07:55 UTC

@miyadav This is not what I observed in my testing. The target group itself should start deprovisioning and eventually go away. I had to refresh the console for that change to appear, but it worked for me.

Comment 12 Milind Yadav 2021-03-26 10:04:08 UTC

Hi Danil, I don't see that, even after refreshing many times. I will share details on chat.

Comment 13 Danil Grigorev 2021-03-26 12:22:46 UTC

I see that we are missing permissions for that kind of operation. This didn't come up in my testing, where targets went away from the target groups as they should.

Here is a log snippet from the QE-provided cluster:

E0326 12:16:56.376058       1 loadbalancers.go:117] Failed to unregister instance "i-06828db8c6de8ac53" from target group "arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-aws-26-sq5k5-aext/e7b92a1a0249694a": AccessDenied: User: arn:aws:iam::301721915996:user/miyadav-aws-26-sq5k5-openshift-machine-api-aws-c5bws is not authorized to perform: elasticloadbalancing:DeregisterTargets on resource: arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-aws-26-sq5k5-aext/e7b92a1a0249694a
	status code: 403, request id: 931605e3-1b43-457f-858b-8f896cfc33fd
E0326 12:16:56.376099       1 reconciler.go:342] miyadav-aws-26-sq5k5-master-2: Failed to register network load balancers: [arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-aws-26-sq5k5-aint/fe879ca54c0b3cc0: AccessDenied: User: arn:aws:iam::301721915996:user/miyadav-aws-26-sq5k5-openshift-machine-api-aws-c5bws is not authorized to perform: elasticloadbalancing:DeregisterTargets on resource: arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-aws-26-sq5k5-aint/fe879ca54c0b3cc0
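
The AccessDenied errors above name the missing action, elasticloadbalancing:DeregisterTargets. As a rough sketch, a check like the following could confirm whether an IAM policy document grants that action. It assumes the standard IAM JSON policy shape with a list under "Statement", matches action names literally (no wildcard handling), and uses made-up example policies; it is not how machine-api-operator manages its credentials.

```python
REQUIRED_ACTION = "elasticloadbalancing:DeregisterTargets"


def allows_action(policy: dict, action: str = REQUIRED_ACTION) -> bool:
    """Return True if any Allow statement lists `action` literally.

    Simplification: wildcards such as "elasticloadbalancing:*" and
    Deny statements are not evaluated.
    """
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if action in actions:
            return True
    return False


# Hypothetical policies: one that can only register targets, one that can
# also deregister them.
before = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["elasticloadbalancing:RegisterTargets"],
            "Resource": "*",
        }
    ],
}
after = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticloadbalancing:RegisterTargets",
                "elasticloadbalancing:DeregisterTargets",
            ],
            "Resource": "*",
        }
    ],
}

print(allows_action(before))
print(allows_action(after))
```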

Comment 15 Milind Yadav 2021-04-19 06:05:57 UTC

Validated on:
4.8.0-0.nightly-2021-04-18-101412

Steps:

1. Pick the master machine to delete with oc delete machine <machine-name> and note its IP (master-0 in this case):
[miyadav@miyadav ~]$ oc get machines -o wide
NAME                                       PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
miyadav-19-88c9g-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   58m   ip-10-0-145-39.us-east-2.compute.internal    aws:///us-east-2a/i-0f7c4d1e58a97e8c8   running
miyadav-19-88c9g-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   58m   ip-10-0-178-16.us-east-2.compute.internal    aws:///us-east-2b/i-0050d064a1128d912   running
miyadav-19-88c9g-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   58m   ip-10-0-209-36.us-east-2.compute.internal    aws:///us-east-2c/i-024b93076a9f78046   running
miyadav-19-88c9g-worker-us-east-2a-w4sz4   Running   m5.large    us-east-2   us-east-2a   52m   ip-10-0-130-210.us-east-2.compute.internal   aws:///us-east-2a/i-024bde0fdf3b3f4b8   running
miyadav-19-88c9g-worker-us-east-2b-mfnn7   Running   m5.large    us-east-2   us-east-2b   52m   ip-10-0-185-195.us-east-2.compute.internal   aws:///us-east-2b/i-099432161351bb086   running
miyadav-19-88c9g-worker-us-east-2c-zwp8v   Running   m5.large    us-east-2   us-east-2c   52m   ip-10-0-215-246.us-east-2.compute.internal   aws:///us-east-2c/i-04c30eeb00dae7231   running


Check the target groups in the AWS console; all three masters will be present.

2. Delete the machine:
oc delete machine miyadav-19-88c9g-master-0

3. Navigate to the AWS console and check that the IP has been removed from the target groups of both the internal and external LBs.

Actual & Expected:
The deleted IP is removed and the other two masters are marked as healthy (the target groups now have two targets instead of three).

[miyadav@miyadav ~]$ aws elbv2 describe-target-groups --target-group arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-sint/f1f467f77f4eb381
TARGETGROUPS	True	10	/healthz	22623	HTTPS	10	2	22623	TCP	arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-sint/f1f467f77f4eb381	miyadav-19-88c9g-sint	ip	2	vpc-02382a021db4e9c2b
LOADBALANCERARNS	arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/miyadav-19-88c9g-int/13f10ea6185c1aae
MATCHER	200-399
[miyadav@miyadav ~]$ aws elbv2 describe-target-groups --target-group arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-aint/bde8eb6f9a86d180
TARGETGROUPS	True	10	/readyz	6443	HTTPS	10	2	6443	TCP	arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-aint/bde8eb6f9a86d180	miyadav-19-88c9g-aint	ip	2	vpc-02382a021db4e9c2b
LOADBALANCERARNS	arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/miyadav-19-88c9g-int/13f10ea6185c1aae
MATCHER	200-399
[miyadav@miyadav ~]$ aws elbv2 describe-target-groups --target-group arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-aext/8b0575780d9cac07
TARGETGROUPS	True	10	/readyz	6443	HTTPS	10	2	6443	TCP	arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-aext/8b0575780d9cac07	miyadav-19-88c9g-aext	ip	2	vpc-02382a021db4e9c2b
LOADBALANCERARNS	arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/miyadav-19-88c9g-ext/c7d5f8d030f846e4
MATCHER	200-399
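
The describe-target-groups output above shows each group's configuration rather than its registered targets, so the console check in step 3 could also be scripted against a DescribeTargetHealth response. The sketch below assumes the response has already been fetched and has the usual `TargetHealthDescriptions` shape; the function name and the sample response are illustrative, with IPs borrowed from the machine listing in step 1.

```python
from typing import Dict


def registered_targets(response: dict) -> Dict[str, str]:
    """Map each registered target ID to its health state."""
    return {
        description["Target"]["Id"]: description["TargetHealth"]["State"]
        for description in response.get("TargetHealthDescriptions", [])
    }


# Hypothetical response after master-0 (10.0.145.39) has been deleted.
sample_response = {
    "TargetHealthDescriptions": [
        {"Target": {"Id": "10.0.178.16", "Port": 6443}, "TargetHealth": {"State": "healthy"}},
        {"Target": {"Id": "10.0.209.36", "Port": 6443}, "TargetHealth": {"State": "healthy"}},
    ]
}

targets = registered_targets(sample_response)
deleted_ip = "10.0.145.39"
print(deleted_ip in targets)
print(sorted(targets))
```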


Additional Info:
Moved to VERIFIED

Comment 18 errata-xmlrpc 2021-07-27 22:33:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
