Bug 1880757

Summary: AWS: master not removed from LB/target group when machine deleted
Product: OpenShift Container Platform
Reporter: Michael Gugino <mgugino>
Component: Cloud Compute
Assignee: Danil Grigorev <dgrigore>
Cloud Compute sub component: Other Providers
QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: high
CC: dgrigore, jspeed, mimccune, miyadav
Version: 4.5
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On AWS, a master machine was not removed from its load balancer target groups when the machine was deleted. Consequence: The load balancers continued to route requests to removed master machines, even though the registered IP address no longer pointed to an instance. Fix: Machines are now de-registered from the load balancer when an IP-based registration exists. Result: The machine removal procedure performed by the machine-api-operator now updates load balancer attachments accordingly.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:33:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Michael Gugino 2020-09-19 17:22:02 UTC
Description of problem:
Deleted a master machine via machine-api on AWS.  Machine is not removed from appropriate load balancers.

New machines are added to the load balancers, though.

Version-Release number of selected component (if applicable):


How reproducible:
TBD.

Steps to Reproduce:
1. delete an existing master that was created by the installer

Actual results:
In ec2 console, the master is still listed as a backend for both internal and external LBs.

Expected results:
It should not be present after deletion.

Additional info:
Master machine config
      loadBalancers:
      - name: mgugino-deva6-zcchn-int
        type: network
      - name: mgugino-deva6-zcchn-ext
        type: network

Comment 1 Joel Speed 2020-09-30 11:15:53 UTC
We will need to investigate or reassign this during the next sprint. It's not clear to me at present which component is responsible for the load balancer attachment of VMs. Is this a problem with Machine API and the way we are creating machines? As far as I was aware, we do not touch load balancer attachments.

Comment 2 Michael McCune 2020-12-04 18:50:14 UTC
Adding the UpcomingSprint tag; the team is still investigating this issue.

Comment 3 Joel Speed 2020-12-16 12:09:26 UTC
In AWS, if you register an instance to the target group using instance ID, as is done in the Machine API provider, then you do not need to remove it, as it is automatically removed when the instance is terminated. Hence, the machine API provider does not have code to remove load balancer attachments, it assumes all instances are attached using instance ID.

However, the installer uses [1] IP addresses to register the instances, therefore manual removal is needed instead.

There are limitations to using instance IDs, such as not being able to use certain instance types [2]:

> You cannot register instances by instance ID if they use one of the following instance types: C1, CC1, CC2, CG1, CG2, CR1, G1, G2, HI1, HS1, M1, M2, M3, or T1.

It would be good to understand whether registering by IP was a conscious decision on the installer team side or whether they might consider using instance IDs going forward?

If there's a technical reason the installer can't use instance IDs, then the Machine API should implement deregister logic for the load balancer attachments. Given clusters have been installed using the IP based attachments, we may want to do this anyway to catch instances installed in this way.

[1]: https://github.com/openshift/installer/blob/c0f508287415fd3bf489b214b0132f75e3c03c9f/data/data/aws/master/main.tf#L171
[2]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-register-targets.html
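The instance-type restriction quoted above is easy to check up front. A minimal sketch (the family list comes from the AWS documentation quoted in this comment; the helper name is hypothetical, not actual provider code):

```python
# Instance families that cannot be registered to a target group by
# instance ID, per the AWS NLB target registration documentation.
UNSUPPORTED_FAMILIES = {
    "c1", "cc1", "cc2", "cg1", "cg2", "cr1",
    "g1", "g2", "hi1", "hs1", "m1", "m2", "m3", "t1",
}

def supports_instance_id_registration(instance_type: str) -> bool:
    """Return True if an EC2 instance type can be registered by instance ID.

    The family is the part before the first dot, e.g. "m5" in "m5.xlarge".
    """
    family = instance_type.lower().split(".", 1)[0]
    return family not in UNSUPPORTED_FAMILIES
```

A provider could use a check like this to decide between instance-ID and IP registration, rather than assuming one style for all clusters.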

Comment 4 Michael Gugino 2020-12-16 14:06:01 UTC
It should be simple enough to add the removal logic to the deletion flow of a machine.  For existing clusters, my preference is to send out some kind of advisory to clean up anything that might be lingering from before we added this feature.  The reason is that we don't know what users might have added behind this LB since creation time, and we can't go removing just anything that doesn't match a Machine object.  This is especially true for a UPI cluster.

Comment 5 Joel Speed 2020-12-16 14:13:20 UTC
Agreed, we don't want to affect any manual changes. We should only remove the load balancer attachment if it matches the instance ID of the Machine we are deleting, or it matches the IP of the Machine we are deleting.
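The matching rule described here can be expressed as a pure selection function. A minimal sketch, assuming each target-group registration is identified by a single target ID string that is either an instance ID or an IP address (as in ELBv2 target descriptions); the function name and shape are hypothetical, not the actual provider code:

```python
def targets_to_deregister(registered_target_ids, instance_id, private_ip):
    """Select registrations that belong to the Machine being deleted.

    Only targets whose ID matches the Machine's instance ID, or whose ID
    matches the Machine's private IP (for IP-registered clusters), are
    returned. Anything else, e.g. targets added manually by the user,
    is left alone.
    """
    owned = {instance_id, private_ip}
    return [t for t in registered_target_ids if t in owned]
```

Filtering this way keeps the deregister call scoped to the deleted Machine, which is the safety property this comment asks for.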

Comment 7 Joel Speed 2021-02-08 10:23:05 UTC
I think we have an agreed plan of action here, setting target for 4.8

Comment 11 Danil Grigorev 2021-03-26 09:07:55 UTC
@miyadav This is not what I observed in my testing. The target group itself should start deprovisioning and eventually go away. I had to refresh the console for that change to appear, but it worked for me.

Comment 12 Milind Yadav 2021-03-26 10:04:08 UTC
Hi Danil, I don't see that; I refreshed many times. Will share details on chat.

Comment 13 Danil Grigorev 2021-03-26 12:22:46 UTC
It looks like we are missing permissions for that kind of operation. This didn't come up in my testing; targets were removed from the target groups as they should be.

Here is a log snippet from QE provided cluster:

E0326 12:16:56.376058       1 loadbalancers.go:117] Failed to unregister instance "i-06828db8c6de8ac53" from target group "arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-aws-26-sq5k5-aext/e7b92a1a0249694a": AccessDenied: User: arn:aws:iam::301721915996:user/miyadav-aws-26-sq5k5-openshift-machine-api-aws-c5bws is not authorized to perform: elasticloadbalancing:DeregisterTargets on resource: arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-aws-26-sq5k5-aext/e7b92a1a0249694a
	status code: 403, request id: 931605e3-1b43-457f-858b-8f896cfc33fd
E0326 12:16:56.376099       1 reconciler.go:342] miyadav-aws-26-sq5k5-master-2: Failed to register network load balancers: [arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-aws-26-sq5k5-aint/fe879ca54c0b3cc0: AccessDenied: User: arn:aws:iam::301721915996:user/miyadav-aws-26-sq5k5-openshift-machine-api-aws-c5bws is not authorized to perform: elasticloadbalancing:DeregisterTargets on resource: arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-aws-26-sq5k5-aint/fe879ca54c0b3cc0
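The AccessDenied errors above show the machine-api credentials lack the elasticloadbalancing:DeregisterTargets action, so the new deregistration code path also needs a permissions update. A sketch of the relevant IAM policy statement (the wildcard resource ARN is illustrative; a real credentials request may scope it more tightly):

```json
{
  "Effect": "Allow",
  "Action": [
    "elasticloadbalancing:RegisterTargets",
    "elasticloadbalancing:DeregisterTargets"
  ],
  "Resource": "arn:aws:elasticloadbalancing:*:*:targetgroup/*"
}
```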

Comment 15 Milind Yadav 2021-04-19 06:05:57 UTC
Validated on:
4.8.0-0.nightly-2021-04-18-101412

Steps:

1. List the machines and keep a note of the IP of the machine to be deleted (master-0 in this case):
[miyadav@miyadav ~]$ oc get machines -o wide
NAME                                       PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
miyadav-19-88c9g-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   58m   ip-10-0-145-39.us-east-2.compute.internal    aws:///us-east-2a/i-0f7c4d1e58a97e8c8   running
miyadav-19-88c9g-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   58m   ip-10-0-178-16.us-east-2.compute.internal    aws:///us-east-2b/i-0050d064a1128d912   running
miyadav-19-88c9g-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   58m   ip-10-0-209-36.us-east-2.compute.internal    aws:///us-east-2c/i-024b93076a9f78046   running
miyadav-19-88c9g-worker-us-east-2a-w4sz4   Running   m5.large    us-east-2   us-east-2a   52m   ip-10-0-130-210.us-east-2.compute.internal   aws:///us-east-2a/i-024bde0fdf3b3f4b8   running
miyadav-19-88c9g-worker-us-east-2b-mfnn7   Running   m5.large    us-east-2   us-east-2b   52m   ip-10-0-185-195.us-east-2.compute.internal   aws:///us-east-2b/i-099432161351bb086   running
miyadav-19-88c9g-worker-us-east-2c-zwp8v   Running   m5.large    us-east-2   us-east-2c   52m   ip-10-0-215-246.us-east-2.compute.internal   aws:///us-east-2c/i-04c30eeb00dae7231   running


Check the target groups in the AWS console; all three masters will be present.

2. Delete the machine:
oc delete machine miyadav-19-88c9g-master-0

3. Navigate to the AWS console to check whether the IP is removed from the target groups of both the internal and external LBs.

Actual & Expected:
The deleted IP is removed and the other two masters are marked as healthy (the target groups now have two targets instead of three).

[miyadav@miyadav ~]$ aws elbv2 describe-target-groups --target-group arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-sint/f1f467f77f4eb381
TARGETGROUPS	True	10	/healthz	22623	HTTPS	10	2	22623	TCP	arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-sint/f1f467f77f4eb381	miyadav-19-88c9g-sint	ip	2	vpc-02382a021db4e9c2b
LOADBALANCERARNS	arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/miyadav-19-88c9g-int/13f10ea6185c1aae
MATCHER	200-399
[miyadav@miyadav ~]$ aws elbv2 describe-target-groups --target-group arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-aint/bde8eb6f9a86d180
TARGETGROUPS	True	10	/readyz	6443	HTTPS	10	2	6443	TCP	arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-aint/bde8eb6f9a86d180	miyadav-19-88c9g-aint	ip	2	vpc-02382a021db4e9c2b
LOADBALANCERARNS	arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/miyadav-19-88c9g-int/13f10ea6185c1aae
MATCHER	200-399
[miyadav@miyadav ~]$ aws elbv2 describe-target-groups --target-group arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-aext/8b0575780d9cac07
TARGETGROUPS	True	10	/readyz	6443	HTTPS	10	2	6443	TCP	arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/miyadav-19-88c9g-aext/8b0575780d9cac07	miyadav-19-88c9g-aext	ip	2	vpc-02382a021db4e9c2b
LOADBALANCERARNS	arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/miyadav-19-88c9g-ext/c7d5f8d030f846e4
MATCHER	200-399


Additional Info:
Moved to VERIFIED

Comment 18 errata-xmlrpc 2021-07-27 22:33:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438