Bug 2048451

Summary:	Custom serviceEndpoints in install-config are reported to be unreachable when environment uses a proxy
Product:	OpenShift Container Platform	Reporter:	Apoorva Jagtap <apjagtap>
Component:	Installer	Assignee:	Aditya Narayanaswamy <anarayan>
Installer sub component:	openshift-installer	QA Contact:	Yunfei Jiang <yunjiang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	padillon, yunjiang
Version:	4.8
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Custom service endpoints in restricted environments were unreachable by the installer. Consequence: The installer failed because the service endpoints were not visible. Fix: Check the service endpoints using the system proxy information set by the user. Result: Service endpoints behind a proxy should now be visible during checks.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-10 10:45:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Apoorva Jagtap 2022-01-31 09:43:30 UTC

Version: 4.8.24

Platform: AWS

Please specify:
[*] IPI (restricted) with manual mode STS

What happened?
[*] When the custom serviceEndpoints (for ec2 & sts) are specified in the install-config, the installer reports them as invalid because no connection can be established:
~~~
 Install Config": platform.aws.serviceEndpoints[0].url: Invalid value: "https://sts.<region>.amazonaws.com": dial tcp x.x.x.x:443: connect: connection timed out.
~~~
- A curl test from the bastion to the same endpoint reports a successful connection (200 status code).

[*] Later, as a check, we generated manifests after removing the `platform.aws.serviceEndpoints` details from the install-config, and specified the serviceEndpoints only in the infrastructure manifest (cluster-infrastructure-02-config.yml).
- With this, the cluster deployment proceeds, but the image-registry operator reports a Degraded state, as it still reaches out to the global STS endpoint rather than the custom regional one:
~~~
- apiVersion: config.openshift.io/v1
  kind: ClusterOperator
  metadata:
    ...
    name: image-registry
  spec: {}
  status:
    conditions:
    ...
    - lastTransitionTime: "2022-01-28T09:22:24Z"
      message: |-
        Progressing: Unable to apply resources: unable to sync storage configuration: WebIdentityErr: failed to retrieve credentials
        Progressing: caused by: RequestError: send request failed
        Progressing: caused by: Post "https://sts.amazonaws.com/": dial tcp x.x.x.x:443: i/o timeout
      reason: Error
      status: "True"
~~~

What did you expect to happen?
[*] No connectivity issues when the installer tries to reach the custom serviceEndpoint.

Anything else we need to know?
[*] The S3 bucket used with the STS configuration is in another AWS account because, due to security concerns, public access cannot be configured for a bucket in the current AWS account.

Comment 7 Matthew Staebler 2022-02-01 01:22:03 UTC

I think that the issue here is that the installer is not considering the proxy when attempting to validate the accessibility of the service endpoints.

Instead of using net.Dial for validation [1], we should be using proxy.Dial from golang.org/x/net [2].

[1] https://github.com/openshift/installer/blob/f6ea846f7a8a2357191dd2e2c4cec5b73023d0f0/pkg/asset/installconfig/aws/validation.go#L331
[2] https://pkg.go.dev/golang.org/x/net/proxy#Dial

Comment 9 Apoorva Jagtap 2022-02-01 09:45:27 UTC

Hello Matthew, 
That seems to be a valid point. For the time being, is there any way to make the installer skip the validations (to test)?

Thanks,
ApoorvaJ

Comment 10 Matthew Staebler 2022-02-01 12:24:45 UTC

The installer does not have a way to skip the validations. The only thing that I can offer is configuring your machine so that it resolves the IP address of the service endpoints to the proxy.

However, let's take a step back to where you were able to get the infrastructure resources created without specifying the service endpoints in the install-config.yaml. In one install-config.yaml that you posted, the s3 endpoint is specified as well as the sts endpoint. In the other, only the sts endpoint is specified. Is the custom endpoint for s3 needed? If not, then you should be able to manually edit the installer manifests--like you attempted--to add the sts endpoint, as the installer does not use the sts endpoint. Did you specify the sts endpoint in both the spec and status of the infrastructure manifest?

Comment 13 Apoorva Jagtap 2022-02-01 13:29:05 UTC

Thanks for confirming.

Regarding the install-configs, previously there was a need for the s3 endpoint, but with the latest configuration changes at the customer's end, we only need the sts endpoint. I mistakenly attached both install-configs with endpoints (corrected now: install-config-sts.yaml with the endpoint & install-config.yaml without the endpoint).

So, in the latest deployment (where the cluster came up with a few operators degraded), we did not pass any serviceEndpoint in the install-config, and just specified the sts endpoint in the infrastructure manifest's spec section, which led to failure for the image-registry operator.


However, just to try out a workaround, the team must have tried to specify it in the spec as well as the status section.
If we do not need to have the sts specifically in the install-config, I can try to perform a fresh check again (with sts endpoint in just the spec section of the infrastructure manifest). Let me know if we should keep a check on anything else too.

Thanks,
ApoorvaJ

Comment 14 Matthew Staebler 2022-02-01 13:42:14 UTC

> So, in the latest deployment (where the cluster came up with a few operators degraded), we did not pass any serviceEndpoint in the install-config, and just specified the sts endpoint in the infrastructure manifest's spec section, which led to failure for the image-registry operator.

The kube cloud config is populated with the service endpoints from the status of the infrastructure. So it makes sense that it would not work if you only filled out the spec.
See https://github.com/openshift/cluster-config-operator/blob/a726e3ee93ee0058b90aef3ec37106a2411b7216/pkg/operator/kube_cloud_config/aws.go#L62.
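
To illustrate the point, here is a simplified sketch of a consumer that only reads the status side. It is not the cluster-config-operator code. The types Infrastructure and ServiceEndpoint and the function cloudConfigOverrides are hypothetical, trimmed-down stand-ins.

```go
package main

import "fmt"

// ServiceEndpoint is a hypothetical stand-in for a custom AWS service
// endpoint entry (service name plus URL).
type ServiceEndpoint struct {
	Name string
	URL  string
}

// Infrastructure is a hypothetical, flattened view of the infrastructure
// manifest: one endpoint list from spec, one from status.
type Infrastructure struct {
	SpecEndpoints   []ServiceEndpoint
	StatusEndpoints []ServiceEndpoint
}

// cloudConfigOverrides mimics a consumer that builds its endpoint
// overrides from status only and never looks at spec.
func cloudConfigOverrides(infra Infrastructure) []string {
	var overrides []string
	for _, ep := range infra.StatusEndpoints {
		overrides = append(overrides, fmt.Sprintf("%s=%s", ep.Name, ep.URL))
	}
	return overrides
}

func main() {
	sts := ServiceEndpoint{Name: "sts", URL: "https://sts.us-east-2.amazonaws.com"}

	// Endpoint present in spec only: the consumer sees nothing.
	specOnly := Infrastructure{SpecEndpoints: []ServiceEndpoint{sts}}
	fmt.Println("spec only:", len(cloudConfigOverrides(specOnly)), "override(s)")

	// Endpoint present in both spec and status: the override is picked up.
	both := Infrastructure{
		SpecEndpoints:   []ServiceEndpoint{sts},
		StatusEndpoints: []ServiceEndpoint{sts},
	}
	fmt.Println("spec and status:", cloudConfigOverrides(both))
}
```

With the endpoint in spec only, a consumer like this produces no override, which matches the component falling back to its default endpoint.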

Comment 17 Yunfei Jiang 2022-02-10 07:52:12 UTC

verification failed.

OCP version: 4.11.0-0.nightly-2022-02-10-031822

Note:

## Setting up host A which can not access ec2.us-east-2.amazonaws.com

> curl -kvv https://ec2.us-east-2.amazonaws.com

* About to connect() to ec2.us-east-2.amazonaws.com port 443 (#0)
*   Trying 52.95.16.2...
* Connection timed out
* Failed connect to ec2.us-east-2.amazonaws.com:443; Connection timed out
* Closing connection 0
curl: (7) Failed connect to ec2.us-east-2.amazonaws.com:443; Connection timed out


## the command `curl -kvv https://ec2.us-east-2.amazonaws.com` returns the same results from the bastion (proxy) host and from host A with the proxy setting

On host A with proxy setting.

> curl -kvv https://ec2.us-east-2.amazonaws.com
* About to connect() to proxy ec2-3-138-34-112.us-east-2.compute.amazonaws.com port 3128 (#0)
*   Trying 10.0.0.95...
* Connected to ec2-3-138-34-112.us-east-2.compute.amazonaws.com (10.0.0.95) port 3128 (#0)
* Establish HTTP proxy tunnel to ec2.us-east-2.amazonaws.com:443
* Proxy auth using Basic with user 'proxy-user1'
> CONNECT ec2.us-east-2.amazonaws.com:443 HTTP/1.1
> Host: ec2.us-east-2.amazonaws.com:443
> Proxy-Authorization: ...
> User-Agent: curl/7.29.0
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 Connection established
<
* Proxy replied OK to CONNECT request
...
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: ec2.us-east-2.amazonaws.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: https://aws.amazon.com/ec2
...

On proxy host (bastion host)

> curl -kvv https://ec2.us-east-2.amazonaws.com
* About to connect() to ec2.us-east-2.amazonaws.com port 443 (#0)
*   Trying 52.95.20.2...
* Connected to ec2.us-east-2.amazonaws.com (52.95.20.2) port 443 (#0)
...
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: ec2.us-east-2.amazonaws.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: https://aws.amazon.com/ec2
...

## but it fails when using the installer.

> ./411/openshift-install create manifests --dir sts5c
FATAL failed to fetch Master Machines: failed to load asset "Install Config": [platform.aws.serviceEndpoints[0].url: Invalid value: "https://ec2.us-east-2.amazonaws.com": dial tcp x.x.x.x:443: connect: connection timed out, platform.aws.serviceEndpoints[1].url: Invalid value: "https://sts.us-east-2.amazonaws.com": dial tcp x.x.x.x:443: connect: connection timed out]

Comment 18 Yunfei Jiang 2022-02-10 08:05:10 UTC

> Progressing: Unable to apply resources: unable to sync storage configuration: WebIdentityErr: failed to retrieve credentials

+apjagtap this is a known issue, please see Bug 1939842 (Image registry Degraded caused by requesting to aws sts global endpoint timeout when installing sts cluster in a disconnected network)

Comment 19 Apoorva Jagtap 2022-02-11 09:13:53 UTC

Hello @mstaeble, we performed the installation again with no custom endpoints in the install-config, but specifying the sts endpoint in the infrastructure CR's spec as well as its status section. The installation failed again with image-registry reporting a degraded state. Let me know if you'd like to take a look at the latest logs, and I can share them.

However, I think the image-registry operator is in a degraded state because of Bug 1939842, as highlighted by Yunfei. Thanks @yunjiang for sharing that.
I'll check further on the same.

Thank you for the help so far!

Comment 26 Yunfei Jiang 2022-04-28 06:49:03 UTC

verified. PASS.
verify process: see comment 17
OCP version: 4.11.0-0.nightly-2022-04-26-181148

Comment 28 errata-xmlrpc 2022-08-10 10:45:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069