Bug 2105351 - The AWS ELB Operator is failing to start when the VPC tag `kubernetes.io/cluster/<infraID>=.*` is not set
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Andrey Lebedev
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-07-08 15:57 UTC by Marco Braga
Modified: 2022-12-19 21:11 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-19 21:11:42 UTC
Target Upstream Version:
Embargoed:



Description Marco Braga 2022-07-08 15:57:36 UTC
- **Description of problem:**

The operator pod fails to start when the VPC tag `kubernetes.io/cluster/<infraID>=.*` is not set on clusters installed into an existing VPC (IPI).

According to the documentation[0]:
~~~
- The VPC must not use the kubernetes.io/cluster/.*: owned tag.
	The installation program modifies your subnets to add the kubernetes.io/cluster/.*: shared tag, so your subnets must have at least one free tag slot available for it. See Tag Restrictions in the AWS documentation to confirm that the installation program can add a tag to each subnet that you specify.
~~~
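The tag state described above can be verified before installation. A minimal sketch with the AWS CLI (the VPC ID is a placeholder, and the command is echoed so it can be reviewed before running for real):

```shell
# Sketch: list any pre-existing kubernetes.io/cluster/* tags on the VPC.
# The VPC ID is a placeholder; drop the leading "echo" to actually run it.
VPC_ID="vpc-0123456789abcdef0"
echo aws ec2 describe-tags \
  --filters "Name=resource-id,Values=${VPC_ID}" \
  --query "Tags[?starts_with(Key, 'kubernetes.io/cluster/')]"
```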

So the VPC was created without this tag. When the ELB Operator was installed [1], the following error was raised in the logs:

~~~
$ oc logs pod/aws-load-balancer-operator-controller-manager-7d6c65fcc8-rh64h -n aws-load-balancer-operator
I0708 15:14:22.268164       1 request.go:601] Waited for 1.041569483s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/cloud.network.openshift.io/v1?timeout=32s
1.6572932639217587e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
1.6572932640037425e+09	ERROR	setup	failed to get VPC ID	{"error": "no VPC with tag \"kubernetes.io/cluster/lzdemo-7k427\" found"}
main.main
	/workspace/main.go:133
runtime.main
	/usr/local/go/src/runtime/proc.go:255
~~~

After I set the required tag and recycled the pod, the installation finished successfully.
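For reference, that workaround can be sketched as follows. The infra ID matches the one in the error logs above; the VPC ID and the pod label selector are assumptions, and the commands are echoed so they can be reviewed before running:

```shell
# Tag the VPC with the key the operator looks up, then recycle the pod.
# VPC_ID and the label selector are assumptions; drop "echo" to actually run.
INFRA_ID="lzdemo-7k427"   # e.g. oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}'
VPC_ID="vpc-0123456789abcdef0"
TAG_KEY="kubernetes.io/cluster/${INFRA_ID}"
echo aws ec2 create-tags --resources "${VPC_ID}" --tags "Key=${TAG_KEY},Value=shared"
echo oc -n aws-load-balancer-operator delete pod -l control-plane=controller-manager
```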

- **OpenShift release version:**

4.11.0-rc.1

- **Cluster Platform:**

AWS (IPI)

- **How reproducible:**

Always

- **Steps to Reproduce (in detail):**

1. Create the VPC without the tag `kubernetes.io/cluster/<infraID>=shared`. Create the network dependencies (subnets, route tables, NAT gateways, etc.)
2. Create the install-config.yaml, adding the subnets previously created
3. Create the cluster
4. Set up the Operator following the [Local Development](https://github.com/openshift/aws-load-balancer-operator#local-development) steps
5. Check the operator logs
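For step 2, the previously created subnets are passed to the installer via `platform.aws.subnets`; a minimal install-config.yaml fragment (region and subnet IDs below are placeholders, not values from this cluster):

```yaml
# install-config.yaml fragment (illustrative; IDs are placeholders)
platform:
  aws:
    region: us-east-1
    subnets:
    - subnet-0aaaaaaaaaaaaaaaa
    - subnet-0bbbbbbbbbbbbbbbb
```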

- **Actual results:**

The error shown above appears in the operator logs.

- **Expected results:**

Clear guidance on how to create a cluster in an existing VPC without impacting the installation of the ELB Operator. Either:

* the Operator starts without requiring the VPC cluster tag, in line with our documentation,
* or the approach in our documentation is revised

- **Impact of the problem:**

There is work in progress to provide guidance on how to install OpenShift clusters in existing VPCs with subnets in Local Zones [2]. The ELB Operator is a key component for Local Zones use cases, since only ALBs are supported there, so clear guidance will help users succeed in this scenario.

- **Additional info:**


[0] Installing a cluster on AWS into an existing VPC / Requirements for using your VPC:
https://docs.openshift.com/container-platform/4.10/installing/installing_aws/installing-aws-vpc.html#installation-custom-aws-vpc-requirements_installing-aws-vpc

[1] Steps used to install from source: [Local Development](https://github.com/openshift/aws-load-balancer-operator#local-development)
~~~
# Building the Operand
git clone https://github.com/openshift/aws-load-balancer-controller.git
IMG=quay.io/$USER/aws-load-balancer-controller
podman build -t $IMG -f Dockerfile.openshift
podman push $IMG

# Update the Operand image (RELATED_IMAGE_CONTROLLER) on `config/manager/manager.yaml`

# Building the Operator
export IMG=quay.io/$USER/aws-load-balancer-operator:latest
make image-build image-push

# Running the Operator
oc new-project aws-load-balancer-operator
oc apply -f hack/operator-credentials-request.yaml
export IMG=quay.io/$USER/aws-load-balancer-operator:latest
make deploy
oc get all -n aws-load-balancer-operator
~~~

[2] Current work to use AWS Local Zones in OCP:
https://issues.redhat.com/browse/RFE-2782

Comment 1 Miciah Dashiel Butler Masters 2022-07-12 13:56:03 UTC
This is the expected behavior.  The ALB operator requires that the VPC be tagged.  Arjun will work on making this clearer in the documentation.

Comment 2 mfisher 2022-12-19 21:11:42 UTC
This issue is stale and has been closed because it has been open 90 days or more with no noted activity/comments in the last 60 days.  If this issue is crucial and still needs resolution, please open a new jira issue and the engineering team will triage and prioritize accordingly.

