Bug 2105351

Summary: The AWS ELB Operator is failing to start when the VPC tag `kubernetes.io/cluster/<infraID>=.*` is not set
Product: OpenShift Container Platform
Reporter: Marco Braga <mrbraga>
Component: Networking
Sub component: router
Assignee: Andrey Lebedev <alebedev>
QA Contact: Hongan Li <hongli>
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
CC: mfisher, mmasters
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-12-19 21:11:42 UTC
Type: Bug

Description Marco Braga 2022-07-08 15:57:36 UTC
- **Description of problem:**

The operator pod fails to start when the VPC tag `kubernetes.io/cluster/<infraID>=.*` is not set on clusters installed into an existing VPC (IPI).

According to the documentation[0]:
~~~
- The VPC must not use the kubernetes.io/cluster/.*: owned tag.
	The installation program modifies your subnets to add the kubernetes.io/cluster/.*: shared tag, so your subnets must have at least one free tag slot available for it. See Tag Restrictions in the AWS documentation to confirm that the installation program can add a tag to each subnet that you specify.
~~~
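
For reference, the cluster-scoped tags already present on a VPC can be checked with the AWS CLI before installing; a minimal sketch, assuming a placeholder VPC ID:

~~~
# Sketch: list any kubernetes.io/cluster/* tags on the VPC (vpc-0abc123 is a placeholder)
aws ec2 describe-tags \
  --filters "Name=resource-id,Values=vpc-0abc123" \
  --query "Tags[?starts_with(Key, 'kubernetes.io/cluster/')]"
~~~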

So the VPC was created without this tag. When the ELB Operator was installed[1], the following error was raised in the logs:

~~~
$ oc logs pod/aws-load-balancer-operator-controller-manager-7d6c65fcc8-rh64h -n aws-load-balancer-operator
I0708 15:14:22.268164       1 request.go:601] Waited for 1.041569483s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/cloud.network.openshift.io/v1?timeout=32s
1.6572932639217587e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
1.6572932640037425e+09	ERROR	setup	failed to get VPC ID	{"error": "no VPC with tag \"kubernetes.io/cluster/lzdemo-7k427\" found"}
main.main
	/workspace/main.go:133
runtime.main
	/usr/local/go/src/runtime/proc.go:255
~~~
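
The lookup that fails here can be reproduced outside the operator; a minimal sketch that approximates it with the AWS CLI, using the infra ID from the log above:

~~~
# Sketch: approximate the operator's VPC-by-tag lookup with the AWS CLI
INFRA_ID=lzdemo-7k427
aws ec2 describe-vpcs \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/${INFRA_ID}" \
  --query "Vpcs[].VpcId"
# An empty result corresponds to the "no VPC with tag ... found" error above
~~~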

When I set the required tag and recycled the pod, the installation finished successfully.
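
For completeness, the workaround boils down to something like the following sketch (the VPC ID is a placeholder, and the pod selector assumes the operator's default controller-manager labeling):

~~~
# Sketch of the workaround: tag the VPC, then recycle the operator pod
INFRA_ID=$(oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}')
aws ec2 create-tags \
  --resources vpc-0abc123 \
  --tags Key=kubernetes.io/cluster/${INFRA_ID},Value=shared
oc delete pod -n aws-load-balancer-operator -l control-plane=controller-manager
~~~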

- **OpenShift release version:**

4.11.0-rc.1

- **Cluster Platform:**

AWS (IPI)

- **How reproducible:**

Always

- **Steps to Reproduce (in detail):**

1. Create the VPC without the tag `kubernetes.io/cluster/<infraID>=shared`. Create the network dependencies (subnets, route tables, NAT gateways, etc.)
2. Create the install-config.yaml, adding the previously created subnets (see the sketch after this list)
3. Create the cluster
4. Set up the operator following [Local Development](https://github.com/openshift/aws-load-balancer-operator#local-development)
5. Check the operator logs
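
A sketch of the relevant install-config.yaml excerpt for step 2, with placeholder region and subnet IDs:

~~~
# install-config.yaml excerpt (sketch; region and subnet IDs are placeholders)
platform:
  aws:
    region: us-east-1
    subnets:
    - subnet-0example1
    - subnet-0example2
~~~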

- **Actual results:**

The error shown above is raised in the operator logs.

- **Expected results:**

Clear guidance on how to create a cluster in an existing VPC without impacting the installation of the ELB Operator:

* The operator starts without requiring the VPC cluster tag, in line with our documentation
* Or the approach described in our documentation is revised

- **Impact of the problem:**

There is work in progress to provide guidance on how to install OpenShift clusters in existing VPCs with subnets in Local Zones[2]. The ELB Operator is a key component for Local Zones use cases, since only ALB is supported there, so clear guidance will help users succeed in this scenario.

- **Additional info:**


[0] Installing a cluster on AWS into an existing VPC / Requirements for using your VPC:
https://docs.openshift.com/container-platform/4.10/installing/installing_aws/installing-aws-vpc.html#installation-custom-aws-vpc-requirements_installing-aws-vpc

[1] Steps used to install from source: [Local Development](https://github.com/openshift/aws-load-balancer-operator#local-development)
~~~
# Building the Operand
git clone https://github.com/openshift/aws-load-balancer-controller.git
cd aws-load-balancer-controller
IMG=quay.io/$USER/aws-load-balancer-controller
podman build -t $IMG -f Dockerfile.openshift .
podman push $IMG

# Update the Operand image (RELATED_IMAGE_CONTROLLER) on `config/manager/manager.yaml`
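# e.g. with GNU sed (sketch; assumes the `value:` line immediately follows the env var name):
sed -i "/name: RELATED_IMAGE_CONTROLLER/{n;s|value:.*|value: $IMG|}" config/manager/manager.yaml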

# Building the Operator
export IMG=quay.io/$USER/aws-load-balancer-operator:latest
make image-build image-push

# Running the Operator
oc new-project aws-load-balancer-operator
oc apply -f hack/operator-credentials-request.yaml
export IMG=quay.io/$USER/aws-load-balancer-operator:latest
make deploy
oc get all -n aws-load-balancer-operator
~~~

[2] Current work to use AWS Local Zones in OCP:
https://issues.redhat.com/browse/RFE-2782

Comment 1 Miciah Dashiel Butler Masters 2022-07-12 13:56:03 UTC
This is the expected behavior.  The ALB operator requires that the VPC be tagged.  Arjun will work on making this clearer in the documentation.

Comment 2 mfisher 2022-12-19 21:11:42 UTC
This issue is stale and has been closed because it has been open 90 days or more with no noted activity/comments in the last 60 days.  If this issue is crucial and still needs resolution, please open a new jira issue and the engineering team will triage and prioritize accordingly.