Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2094923

Summary: service loadbalancer uses subnet(s) not owned by the OpenShift cluster
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-installer
Reporter: Patrick Dillon <padillon>
Assignee: sdasu
QA Contact: Yunfei Jiang <yunjiang>
CC: wking
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Version: 4.8
Target Release: 4.12.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2022-08-26 21:07:27 UTC
Bug Blocks: 2027137

Description Patrick Dillon 2022-06-08 15:53:00 UTC
This bug was initially created as a copy of Bug #2027137

I am copying this bug because:

There are two distinct problems:
1. We need to handle newly installed clusters.
2. We need to handle existing clusters, i.e., apply the tags during upgrades.

I am creating this BZ and assigning it to the installer to handle case 1. The original BZ should handle case 2.


*********** ORIGINAL BZ ***********
Description of problem:
This issue occurs specifically in clusters installed into an existing customer VPC on AWS. The loadbalancer service tries to use a subnet that does not belong to the OpenShift cluster, i.e., a subnet that was not provided during installation of the cluster. One consequence is that when the loadbalancer tries to use a subnet with fewer than 8 free IP addresses, the kube controller throws the following error:

>controller.go:307] error processing service openshift-ingress/router-app-ceabr-io (will retry): failed to ensure load balancer: InvalidSubnet: Not enough IP space available in subnet-<ID>. ELB requires at least 8 free IP addresses in each subnet.

Meanwhile, the subnets actually provided by the customer have 200+ free IP addresses available.

The loadbalancer should only use (and only be concerned with) the subnets provided at install time. These subnets are also tagged with:

kubernetes.io/cluster/<cluster> : shared

From the code, it looks like kube-controller picks any private subnet in the VPC for an internal loadbalancer: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3682
This is not ideal: the controller does not filter for the subnets that belong to the OpenShift cluster and may pick an unrelated subnet in the customer's VPC.

The customer created a multi-AZ cluster, providing private and public subnets for us-east-1a, us-east-1b, and us-east-1d; however, the controller is trying to use a subnet from us-east-1c. It should really only use the AZs provided by the customer.

Similarly, for a public loadbalancer service, kube-controller picks every subnet that has an internet gateway in its route table. That is the right definition of a public subnet, but it should only pick the ones belonging to the cluster.
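The tag-based filtering being asked for here can be sketched as follows. This is an illustration only; the Subnet type and function names are hypothetical, not the actual cloud-provider-aws code:

```go
package main

import "fmt"

// Subnet is a simplified stand-in for the EC2 subnet description the
// cloud provider works with (illustrative, not the real type).
type Subnet struct {
	ID   string
	Tags map[string]string
}

// filterClusterSubnets keeps only subnets carrying the cluster tag
// kubernetes.io/cluster/<clusterName> with value "owned" or "shared",
// i.e. the filtering the controller should apply before choosing
// load-balancer subnets, instead of taking any subnet in the VPC.
func filterClusterSubnets(subnets []Subnet, clusterName string) []Subnet {
	key := "kubernetes.io/cluster/" + clusterName
	var out []Subnet
	for _, s := range subnets {
		if v, ok := s.Tags[key]; ok && (v == "owned" || v == "shared") {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	subnets := []Subnet{
		{ID: "subnet-a", Tags: map[string]string{"kubernetes.io/cluster/demo": "shared"}},
		{ID: "subnet-b", Tags: map[string]string{}}, // unrelated subnet in the same VPC
	}
	for _, s := range filterClusterSubnets(subnets, "demo") {
		fmt.Println(s.ID) // only subnet-a survives the filter
	}
}
```

With this filter in place, an unrelated us-east-1c subnet with too few free IPs would simply never be considered.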

OpenShift release version:
4.8.17

Cluster Platform:
AWS

How reproducible:
Consistently reproducible 


Steps to Reproduce (in detail):
1. Create a cluster in an existing VPC that contains existing private subnets with fewer than 8 free IP addresses.
2. Create an ingresscontroller with spec.endpointPublishingStrategy.loadBalancer.scope=Internal.
3. This creates a loadbalancer service for the ingresscontroller in the openshift-ingress namespace; the service gets stuck in a pending state because one of the existing subnets does not have enough free IPs.


Actual results:
The loadbalancer service tries to attach itself to subnets not owned by the Kubernetes/OpenShift cluster and fails if those un-owned subnets do not have enough free IP addresses.

Expected results:
The loadbalancer should not attach itself to subnets that do not belong to the Kubernetes/OpenShift cluster. Only the subnets provided at install time should be used.

Impact of the problem:
The Ingress Operator is degraded, failing to create the service for the private ingresscontroller.

Comment 1 Patrick Dillon 2022-06-08 16:02:05 UTC
It looks like the kubernetes.io/role/internal-elb tag was removed a long time ago, apparently by accident as copypasta: https://github.com/openshift/installer/commit/9448afff1a2cb9909fc29b2ad7c7b8583763c9cc

To resolve this bug, we should:

1. When creating new subnets, add the "kubernetes.io/role/internal-elb" tag back to the worker subnets
2. When using existing subnets, add the tag as well. When creating it, we may need to set the value to "shared" (or apply similar logic) to make sure the tag can be destroyed safely. The code for that is here: https://github.com/openshift/installer/blob/master/pkg/asset/cluster/aws/aws.go#L43
3. Ensure the new tags are destroyed properly in both cases 1 & 2. The relevant destroy code is here: https://github.com/openshift/installer/blob/release-4.10/pkg/destroy/aws/shared.go

Comment 2 Patrick Dillon 2022-06-08 22:53:42 UTC
Oops. My previous comment should have had the tag kubernetes.io/role/elb -- not internal-elb. Thanks Trevor for catching that!

Comment 3 sdasu 2022-08-26 21:07:27 UTC
Upon further investigation, we have concluded that the analysis provided above does not contribute towards the solution of the problem reported in https://bugzilla.redhat.com/show_bug.cgi?id=2027137. So, for that reason, closing this BZ as not-a-bug.