Description of problem:

This issue occurs specifically in a cluster that is installed into an existing customer VPC on AWS. The load-balancer service tries to use a subnet that does not belong to the OpenShift cluster, i.e. a subnet that was not provided during installation of the cluster. One consequence is that when the load balancer tries to use a subnet with fewer than 8 free IP addresses, the kube controller logs the following error:

> controller.go:307] error processing service openshift-ingress/router-app-ceabr-io (will retry): failed to ensure load balancer: InvalidSubnet: Not enough IP space available in subnet-<ID>. ELB requires at least 8 free IP addresses in each subnet.

Meanwhile, the subnets provided by the customer have 200+ free IP addresses available. The load balancer should only use (and be concerned about) the subnets that were provided at install time. These subnets are also tagged with kubernetes.io/cluster/<cluster> : shared.

From the code, it looks like the kube controller picks any private subnet in the VPC for an internal load balancer: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3682

This is not ideal: the controller is not smart enough to filter down to the subnets that belong to the OpenShift cluster and instead uses an unrelated subnet in the customer's VPC. The customer created a multi-AZ cluster, providing private and public subnets for us-east-1a, us-east-1b, and us-east-1d; however, the controller is trying to use a subnet from us-east-1c. It should really only use the AZs provided by the customer. Similarly, for a public load-balancer service, the kube controller picks all subnets that have an internet gateway in their route table, which is the right definition of a public subnet, but it should only pick the ones belonging to the cluster.

OpenShift release version: 4.8.17

Cluster Platform: AWS

How reproducible: Consistently reproducible

Steps to Reproduce (in detail):
1. Create a cluster in an existing VPC that also contains existing private subnets, not provided at install time, with fewer than 8 free IP addresses.
2. Create an ingresscontroller with spec.endpointPublishingStrategy.loadBalancer.scope=Internal.
3. This creates a load-balancer service for the ingresscontroller in the openshift-ingress namespace, and the service gets stuck in a pending state because one of the existing subnets doesn't have enough free IPs.

Actual results:
The load-balancer service tries to attach itself to subnets not owned by the Kubernetes/OpenShift cluster and fails if those un-owned subnets do not have enough free IP addresses.

Expected results:
The load balancer should not attach itself to subnets that do not belong to the Kubernetes/OpenShift cluster. Only the subnets provided at install time should be used.

Impact of the problem:
Degraded ingress operator, failing to create a service successfully for the private ingresscontroller.
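For reference, the ingresscontroller in step 2 can be created with a manifest along these lines (the name, domain, and namespace of the ingresscontroller object here are illustrative, not the actual values from the customer cluster):

oc apply -f - <<'EOF'
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: internal-example
  namespace: openshift-ingress-operator
spec:
  domain: internal-apps.example.com
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
EOF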
Setting blocker- because this doesn't look like a regression. I'll look into the issue today.
Can you provide a list of subnets in the VPC along with their tags?

As a workaround, can you specify the desired subnets using the "service.beta.kubernetes.io/aws-load-balancer-subnets" annotation on the ingresscontroller's load-balancer service? For example:

oc -n openshift-ingress annotate svc/router-default service.beta.kubernetes.io/aws-load-balancer-subnets=<subnet id>,<subnet id>,<subnet id>

Note that users generally should not annotate or otherwise modify operator-managed resources. Setting this annotation manually is only suggested here as a workaround for an urgent customer issue.

If there is an issue that needs to be fixed in the cloud provider implementation, then this may take some time to fix as it involves upstream code, and changes to the subnet selection logic need to be made with care to avoid breaking existing clusters with unusual configurations. Is this issue truly urgent, meaning it takes priority over other BZs and features?
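After annotating, you can confirm the annotation is present and that the service gets an address with something like the following (router-default is the default ingresscontroller's service; adjust the name if you annotated a different one, and expect the EXTERNAL-IP column to move out of <pending> once the ELB is provisioned):

oc -n openshift-ingress get svc router-default -o yaml | grep aws-load-balancer-subnets
oc -n openshift-ingress get svc router-default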
I believe the buggy behavior described in comment 0 was caused by https://github.com/kubernetes/kubernetes/pull/97431, which changed the subnet discovery to include both subnets that have a cluster tag matching the cluster and subnets that have no cluster tag at all. This change is even called out in the description of #97431:

> subnets without any cluster tag also gets auto-discovered

So it's not a bug, it's a feature! I'm at a loss to explain why upstream would merge a breaking change like that, but at least I think we know the cause of the issue.

The same PR also added the "service.beta.kubernetes.io/aws-load-balancer-subnets" annotation that I suggested as a workaround in comment 4.

As an alternative workaround, it appears that tagging the subnets that don't belong to a cluster with a tag whose key is "kubernetes.io/cluster/<something>" and whose value is "shared" should also prevent subnet discovery from picking up subnets that don't belong to the cluster. Any unique key should work as long as it has the prefix "kubernetes.io/cluster/".

Please try one of the proposed workarounds and let us know whether it works.
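For example, setting that exclusion tag on an unrelated subnet with the AWS CLI could look like this (the key suffix "unmanaged" and the subnet ID are just placeholders; any unique key with the "kubernetes.io/cluster/" prefix should do):

aws ec2 create-tags --resources subnet-0123456789abcdef0 --tags Key=kubernetes.io/cluster/unmanaged,Value=shared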
Customer has implemented workaround and is observing cluster behaviour in the short-term.
Which workaround is the customer using? Annotating the service, or tagging the subnets?
I see two options for resolving this BZ:

* Revert (part of) the upstream change that causes the AWS cloud provider implementation to use subnets that don't belong to the cluster.

* Add logic to the ingress operator or installer to set the "service.beta.kubernetes.io/aws-load-balancer-subnets" service annotation based on the configured subnets.

These options are not mutually exclusive; we could do the latter option as an expedient fix in OpenShift while we push for the former option upstream.

The first option would require upstream work and would need a plan to minimize disruption for users who actually want the new behavior, but it is probably the more correct option and would benefit not just people encountering the issue with operator-managed services but also people encountering the issue with user-created services.

The second option would be the expedient option and would make it easier to narrow the scope. For example, it would only apply to services managed by the ingress operator, and we could choose to apply the service annotation only when creating a new ingresscontroller to minimize the risk of disruption to existing clusters. However, we'll need to be careful not to overwrite the annotation if users have already set it, and we'll need to have a migration plan in case the upstream change does get reverted or amended.

Would a change that applied only to newly installed clusters be sufficient for Service Delivery, at least in the short term? Does the set of subnets with which an ingresscontroller should be associated change after installation, or can we set the annotation when the ingresscontroller is created and not worry about updating it?
@mmasters The upstream change (done as a feature) talks about the following two use cases:

1. Ability to choose a different or specific set of subnets per LB to fulfil security requirements. This will be supported via annotation.
2. Simplify VPC setup requirements, since the cluster name is no longer required in the tag key for subnet resources. This also helps easily provision k8s on an existing VPC setup.

I am assuming the first is handled via the annotation, so that should not impact us? The second one seems to be the issue, but I also see the following from AWS for their EKS: https://aws.amazon.com/premiumsupport/knowledge-center/eks-vpc-subnet-discovery/

In that sense, an upstream change or having the customer appropriately tag the subnets seem to be the best approaches.
hi @apaladagu - how do we properly annotate or tag in AWS?
@jrist The actual tag you want to use is the one suggested in comment #8 above. As for how to tag, refer to the two screenshots I have attached: identify the subnets in your VPC, and from the Tags tab, click "Manage tags" and add/remove tags.
Thank you, Anand. We figured out a workaround. I don't have any other info to supply.
*** Bug 2069782 has been marked as a duplicate of this bug. ***
https://github.com/openshift/installer/pull/6038 - Adds the "kubernetes.io/role/elb": "" and "kubernetes.io/role/internal-elb": "" tags to public and private subnets respectively. These subnets could be either provided to the installer via install-config or created by the installer. That should take care of the case where the selection of the subnet falls back to the "Subnet Discovery" documented here: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/subnet_discovery/

The second part of the fix should involve the load-balancer service having the annotation service.beta.kubernetes.io/aws-load-balancer-subnets: <comma-separated list of subnets>. This was already provided as the workaround but needs to be added by OpenShift as part of the fix. IIUC, this should be added by https://github.com/openshift/cluster-ingress-operator and not the installer.

@mmasters Please let me know if you agree with the two-part fix.
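As a sanity check on an existing cluster, the subnets that subnet discovery will consider can be listed by tag key with the AWS CLI, for example (swap in kubernetes.io/role/elb for public subnets; the --query expression just trims the output to subnet IDs):

aws ec2 describe-subnets --filters Name=tag-key,Values=kubernetes.io/role/internal-elb --query 'Subnets[].SubnetId' --output text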
The issue should ultimately be resolved by RFE-1198[1] and OCPCLOUD-1104[2], which will provide a way for the cluster admin to specify which subnets OpenShift should use for ELBs that it provisions for LoadBalancer-type services.

In the meantime, the issue can be worked around as discussed in previous comments on this BZ:

* Ensure that subnets that belong to OpenShift are properly tagged in AWS with the "kubernetes.io/cluster/<cluster_infra_id>" tag (with the value set to either "shared" or "owned", depending on whether the subnet is shared with other clusters or infrastructure) and either the "kubernetes.io/role/elb" tag or the "kubernetes.io/role/internal-elb" tag (depending on whether the subnet is intended for public/external LBs or for private/internal LBs).

* If necessary, use the "service.beta.kubernetes.io/aws-load-balancer-subnets" annotation to specify the subnets with which the ELB for a LoadBalancer-type service should be associated.

As I understand it, the specific scenario in which it may be necessary to use the service annotation is when the VPC has subnets that are not owned by OpenShift and those subnets are associated with availability zones (AZs) for which there is no OpenShift-owned subnet. Otherwise, if an AZ has some OpenShift-owned subnet, then OpenShift should already prefer that subnet over any non-OpenShift-owned subnet for the same AZ.

1. https://issues.redhat.com/browse/RFE-1198
2. https://issues.redhat.com/browse/OCPCLOUD-1104
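To illustrate the first bullet above, tagging a cluster-owned private subnet could look like the following with the AWS CLI (the infra ID "mycluster-abc12" and the subnet ID are placeholders; use the "kubernetes.io/role/elb" tag instead for subnets intended for public LBs, and the role tag's value can be left empty as in the installer PR):

aws ec2 create-tags --resources subnet-0123456789abcdef0 --tags Key=kubernetes.io/cluster/mycluster-abc12,Value=shared Key=kubernetes.io/role/internal-elb,Value=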
Small correction: RFE-2816[1] may be more relevant here than RFE-1198. We are still working out the details. 1. https://issues.redhat.com/browse/RFE-2816
I'm closing this BZ as we're tracking the issue in Jira: https://issues.redhat.com/browse/RFE-2816