Bug 2094716 - Unable to install a fully air gapped OCP 4.10 cluster in AWS using IPI
Summary: Unable to install a fully air gapped OCP 4.10 cluster in AWS using IPI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.12.0
Assignee: Rafael Fonseca
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-06-08 08:54 UTC by Jose Ignacio Jerez
Modified: 2023-01-17 19:50 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Doc updates were done in https://bugzilla.redhat.com/show_bug.cgi?id=2100534
Clone Of:
Environment:
Last Closed: 2023-01-17 19:49:58 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 6076 | 0 | None | open | Bug 2094716: docs: fully air-gapped AWS IPI install | 2022-07-04 14:11:54 UTC
Red Hat Bugzilla | 2100534 | 0 | unspecified | CLOSED | Update AWS restricted installation doc for Internet req for route 53 service | 2022-10-07 20:08:55 UTC
Red Hat Product Errata | RHSA-2022:7399 | 0 | None | None | None | 2023-01-17 19:50:17 UTC

Internal Links: 2100534

Description Jose Ignacio Jerez 2022-06-08 08:54:08 UTC
Version: OCP 4.10.10

Platform: AWS IPI

The documentation on "Installing a cluster on AWS in a restricted network" [1] hints that you can install OCP on a fully disconnected (air-gapped) network in AWS without a proxy, provided that you have a mirror registry in the VPC and manually create and provide the IAM credentials. However, this is not enough: the ingress controller operator fails to deploy because it tries to reach the Route 53 API at a public IP address.

In the ingress operator logs, you can see:

ERROR   operator.init.controller.dns_controller controller/controller.go:266    Reconciler error        {"name": "default-wildcard", "namespace": "openshift-ingress-operator", "error": "failed to create DNS provider: failed to create AWS DNS manager: failed to validate aws provider service endpoints: [failed to list route53 hosted zones: RequestError: send request failed\ncaused by: Get \"https://route53.amazonaws.com/2013-04-01/hostedzone?maxitems=1\": dial tcp 54.239.31.187:443: i/o timeout, failed to get group tagging resources: RequestError: send request failed\ncaused by: Post \"https://tagging.us-east-1.amazonaws.com/\": dial tcp 52.94.233.76:443: i/o timeout]"}
 
The VPC has no connection to the Internet and there is no proxy either, so it is not possible to reach public IP addresses such as 52.94.233.76.
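As a rough diagnostic aid, the sketch below pulls the unreachable endpoints out of an ingress operator log line like the one above, and offers a TCP probe that could be run from a host inside the VPC. The names failed_endpoints and can_connect are made up for this illustration, and the regular expression assumes the log keeps the escaped-quote format shown in this report.

```python
import re
import socket

# Excerpt of the ingress operator error quoted in this report (quotes escaped as logged).
SAMPLE = (
    r'failed to list route53 hosted zones: RequestError: send request failed\ncaused by: '
    r'Get \"https://route53.amazonaws.com/2013-04-01/hostedzone?maxitems=1\": '
    r'dial tcp 54.239.31.187:443: i/o timeout, failed to get group tagging resources: '
    r'RequestError: send request failed\ncaused by: '
    r'Post \"https://tagging.us-east-1.amazonaws.com/\": '
    r'dial tcp 52.94.233.76:443: i/o timeout'
)

# Matches: Get|Post "https://<host>/<path>": dial tcp <ip>:<port>: i/o timeout
# The quotes may or may not be preceded by a backslash.
_TIMEOUT = re.compile(
    r'(?:Get|Post) \\?"https://([^/"\\]+)[^"\\]*\\?": dial tcp ([\d.]+):(\d+): i/o timeout'
)


def failed_endpoints(log_text):
    """Return (hostname, ip, port) for every request that timed out in the log text."""
    return [(host, ip, int(port)) for host, ip, port in _TIMEOUT.findall(log_text)]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host, ip, port in failed_endpoints(SAMPLE):
        print(f"{host} ({ip}:{port}) timed out in the operator log")
```

Calling can_connect(host, port) for each extracted endpoint from inside the disconnected VPC would be expected to return False for both hosts in this scenario.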

There seems to be no way either to access the Route 53 API using a private IP (an AWS endpoint) or to tell the installer not to manage the DNS private hosted zone.

Even if a hostedZone directive is added to the install-config.yaml file, the installer will not create the private hosted zone but will still try to add the *.apps record, using the previously mentioned public IP:

platform:
  aws:
    region: eu-west-3
    subnets:
        - subnet-024eb3f95d230d440
        - subnet-095756b98f1576e22
        - subnet-00f61e017451444b4
    hostedZone: Z053339613WKB097X6QDX

We need a method to tell the IPI installer not to manage DNS in a disconnected installation, as can be done with UPI, or the documentation should clearly state that AWS + IPI + air-gapped is not an option.

This bug can be easily reproduced by installing an OCP 4.10 cluster on AWS with the IPI installer on a disconnected VPC with no proxy, with a mirror registry and the S3, ELB and EC2 endpoints created, and with manually created IAM credentials. The installation fails because the ingress cluster operator does not fully deploy.

Comment 1 Rafael Fonseca 2022-06-10 13:17:54 UTC
Can you try configuring OpenShift not to manage DNS records as exemplified in [1]?

[1] https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#remove-dns-zones

If that works, I'll update the IPI docs.

Comment 2 Jose Ignacio Jerez 2022-06-13 09:12:10 UTC
I have run the following test:

I added the hostedZone directive to the install-config.yaml file, like this:
platform:
  aws:
    region: eu-west-3
    subnets:
        - subnet-061759d2148f90cb6
        - subnet-07c2090dc4fee87ce
        - subnet-02153044d54b99901
    hostedZone: Z02632002O1H1WY40CUR

I then removed the private zone reference from the manifest file manifests/cluster-dns-02-config.yml:

apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  creationTimestamp: null
  name: cluster
spec:
  baseDomain: buxom.kali.emeatam.support
REMOVED -->  privateZone:
REMOVED -->    id: Z0263xxxxxxxxCURG
status: {}
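The manual edit of the manifest could be scripted roughly as follows. This is only a sketch: the helper name drop_private_zone is hypothetical, it works on plain text lines rather than a YAML parser, and it assumes that privateZone appears on a line of its own with its children indented more deeply, as in the manifest above.

```python
MANIFEST = """\
apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  creationTimestamp: null
  name: cluster
spec:
  baseDomain: buxom.kali.emeatam.support
  privateZone:
    id: Z0263xxxxxxxxCURG
status: {}
"""


def drop_private_zone(manifest_text):
    """Remove the privateZone key and its more-indented child lines."""
    kept = []
    skip_indent = None  # indentation of the privateZone key while its children are skipped
    for line in manifest_text.splitlines():
        indent = len(line) - len(line.lstrip())
        if skip_indent is not None:
            if line.strip() and indent > skip_indent:
                continue  # still inside the privateZone mapping
            skip_indent = None  # back at the parent level (or a blank line)
        if line.strip() == "privateZone:":
            skip_indent = indent
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    print(drop_private_zone(MANIFEST), end="")
```

In practice the function would be applied to the contents of manifests/cluster-dns-02-config.yml after the installer has generated the manifests and before the cluster is created.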

Results in:
The installer does not try to create the private hosted zone
The installer creates the api and api-int DNS records in the private hosted zone. It can do that because it runs from a bastion host that does have access to the Internet.
After the bootstrap process is completed, the ingress controller does not try to create any records in the private hosted zone, in particular *.apps, but it checks for the resolution of canary-openshift-ingress-canary.apps.buxom.kali.emeatam.support, which is not possible because *.apps does not exist.
To unblock this situation, I manually created the *.apps record pointing to the classic load balancer created by the installer (note that this is an IPI installation). However, the classic load balancer is only known after the installer has created it.
I would say that the installation is possible as long as the installer has access to the Internet, but the procedure is unlikely to be supported since it implies creating dependent resources in the middle of the installation.
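
For illustration, the manual *.apps record described above could be expressed as a Route 53 change batch built like this. It is a sketch under assumptions, not the procedure actually used in this test: the function name, the load balancer DNS name and the load balancer hosted zone ID are hypothetical placeholders, and an alias A record is assumed (a CNAME would be another option).

```python
import json


def wildcard_apps_change_batch(cluster_domain, elb_dns_name, elb_zone_id):
    """Build a change batch that upserts *.apps.<cluster_domain> as an alias to the ELB."""
    return {
        "Comment": "Manually created wildcard record for the default ingress",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": f"*.apps.{cluster_domain}",
                    "Type": "A",
                    "AliasTarget": {
                        # Hosted zone ID of the load balancer itself, not of the private zone.
                        "HostedZoneId": elb_zone_id,
                        "DNSName": elb_dns_name,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    batch = wildcard_apps_change_batch(
        "buxom.kali.emeatam.support",
        "example-elb.eu-west-3.elb.amazonaws.com",  # hypothetical placeholder
        "ZEXAMPLEELBZONE",  # hypothetical placeholder
    )
    print(json.dumps(batch, indent=2))
```

The printed JSON could be saved to a file and passed to aws route53 change-resource-record-sets with --hosted-zone-id set to the private hosted zone and --change-batch pointing at the file. As with the installer itself, this has to run from a host that can reach the Route 53 API, such as the bastion.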

My conclusion is that the docs should state that it is not possible to install OCP 4 on AWS in a fully air-gapped design, because the installation process requires Internet access to use the Route 53 API, as stated in the AWS Route 53 FAQ:
Do I need connectivity to the outside Internet in order to use Private DNS?
You can resolve internal DNS names from resources within your VPC that do not have Internet connectivity. However, to update the configuration for your Private DNS hosted zone, you need Internet connectivity to access the Route 53 API endpoint, which is outside of VPC.

Comment 3 Jose Ignacio Jerez 2022-06-13 10:13:41 UTC
I have also tested the simpler alternative of letting the OpenShift installer create the private hosted zone and add the api and api-int records to it, and then removing the private zone references from cluster-dns-02-config.yml.
This configuration behaves like the previous one with regard to the ingress controller. The *.apps record still needs to be created manually, but the procedure is simpler because the private hosted zone doesn't need to be created manually and the install-config.yaml does not contain the hostedZone directive.

Comment 4 Mike Pytlak 2022-06-13 16:13:28 UTC
@jjerezro The doc for an IPI installation on an AWS restricted network currently states that you still require access to cloud APIs [1]. WDYT? I believe this addresses your conclusion about the docs stating it is not possible to install in a fully air-gapped design.

[1] About installations in restricted networks (https://docs.openshift.com/container-platform/4.10/installing/installing_aws/installing-restricted-networks-aws-installer-provisioned.html#installation-about-restricted-networks_installing-restricted-networks-aws-installer-provisioned)

Comment 5 Jose Ignacio Jerez 2022-06-14 07:33:03 UTC
@mpytlak The paragraph below appears at the beginning of the document you mention, but its wording is a bit ambiguous and could be more precise. For example, "you might require internet access" turns out to mean "you definitely need internet access", in particular to access the Route 53 API:
  
"If you choose to perform a restricted network installation on a cloud platform, you still require access to its cloud APIs. Some cloud functions, like Amazon Web Service’s IAM service, require internet access, so you might still require internet access. Depending on your network, you might require less internet access for an installation on bare metal hardware or on VMware vSphere."

Other sections of the document give more precise and quite useful instructions on how to access required services with no Internet access:

The mirror registry:

"You mirrored the images for a disconnected installation to your registry and obtained the imageContentSources data for your version of OpenShift Container Platform."

The EC2, ELB and S3 APIs:

"If you are working in a disconnected environment, you are unable to reach the public IP addresses for EC2 and ELB endpoints. To resolve this, you must create a VPC endpoint and attach it to the subnet that the clusters are using. The endpoints should be named as follows:"

The IAM APIs:

"If the cloud identity and access management (IAM) APIs are not accessible in your environment, or if you do not want to store an administrator-level credential secret in the kube-system namespace, you can manually create and maintain IAM credentials."

The section "Internet access for OpenShift Container Platform" provides even more useful tips: https://docs.openshift.com/container-platform/4.10/installing/installing_aws/installing-restricted-networks-aws-installer-provisioned.html#cluster-entitlements_installing-restricted-networks-aws-installer-provisioned

But nowhere in the document is it mentioned that Route 53 needs Internet access, and there seems to be no supported workaround for this in a disconnected environment. This is what I would like you to add, preferably at the beginning of the document.

Comment 6 Mike Pytlak 2022-06-14 13:16:48 UTC
Thanks for the feedback, Jose. @stevsmit as I understand it, you wrote this content. Please see Jose's comment on 6-14-22. Possible candidate for doc improvement.

Comment 7 Steven Smith 2022-06-14 17:49:35 UTC
@mpytlak I wrote documentation on OMR and some docs on disconnected installs (https://docs.openshift.com/container-platform/4.10/installing/disconnected_install/index.html), but not on AWS.

Comment 8 Steven Smith 2022-06-14 17:50:29 UTC
I might be wrong, but I think Cody owned most of these docs (as per Git blame)

Comment 9 Mike Pytlak 2022-06-23 16:01:39 UTC
@jjerezro Doc ticket created.

Comment 10 Jose Ignacio Jerez 2022-06-27 06:55:16 UTC
Excellent news.
If you need any further input from me, just let me know.

Comment 15 errata-xmlrpc 2023-01-17 19:49:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

