Bug 2048451
| Summary: | Custom serviceEndpoints in install-config are reported to be unreachable when environment uses a proxy | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Apoorva Jagtap <apjagtap> |
| Component: | Installer | Assignee: | Aditya Narayanaswamy <anarayan> |
| Installer sub component: | openshift-installer | QA Contact: | Yunfei Jiang <yunjiang> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | padillon, yunjiang |
| Version: | 4.8 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.11.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: Custom service endpoints behind restricted environments were unreachable by the installer
Consequence: installer failure due to service endpoints being invisible
Fix: Check the service endpoints with the system proxy information set by the user
Result: Service endpoints behind a proxy should now be visible during checks
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-08-10 10:45:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Apoorva Jagtap
2022-01-31 09:43:30 UTC
I think that the issue here is that the installer is not considering the proxy when attempting to validate the accessibility of the service endpoints. Instead of using net.Dial for validation [1], we should be using proxy.Dial from golang.org/x/net [2].

[1] https://github.com/openshift/installer/blob/f6ea846f7a8a2357191dd2e2c4cec5b73023d0f0/pkg/asset/installconfig/aws/validation.go#L331
[2] https://pkg.go.dev/golang.org/x/net/proxy#Dial

Hello Matthew,

That seems to be a valid point. For the time being, do we have any possibility to let the installer skip the validations (to test)?

Thanks, ApoorvaJ

The installer does not have a way to skip the validations. The only thing that I can offer is configuring your machine so that it resolves the IP address of the service endpoints to the proxy.

However, let's take a step back to where you were able to get the infrastructure resources created without specifying the service endpoints in the install-config.yaml. In one install-config.yaml that you posted, the s3 endpoint is specified as well as the sts endpoint. In the other, only the sts endpoint is specified. Is the custom endpoint for s3 needed? If not, then you should be able to manually edit the installer manifests--like you attempted--to add the sts endpoint, as the installer does not use the sts endpoint. Did you specify the sts endpoint in both the spec and status of the infrastructure manifest?

Thanks for confirming. Regarding the install-configs, previously there was a need for s3 endpoint, but with the latest configuration changes at cu's end, we would just need the sts endpoint. I mistakenly attached both the install-configs with endpoints (corrected now: install-config-sts.yaml with endpoint & install-config.yaml without endpoint).
So, in the latest deployment, (where the cluster came up with a few operators degraded) we did not pass any serviceEndpoint in the install-config, and just specified the sts endpoint in the infrastructure manifest's spec section, which led to failure for the image-registry operator. However, just to try out some workaround, the team must have tried to specify it in spec as well as status section too. If we do not need to have the sts specifically in the install-config, I can try to perform a fresh check again (with sts endpoint in just the spec section of the infrastructure manifest). Let me know if we should keep a check on anything else too.

Thanks, ApoorvaJ

> So, in the latest deployment, (where the cluster came up with a few operators degraded) we did not pass any serviceEndpoint in the install-config, and just specified the sts endpoint in the infrastructure manifest's spec section, which led to failure for the image-registry operator.

The kube cloud config is populated with the service endpoints from the status of the infrastructure. So it makes sense that it would not work if you only filled out the spec. See https://github.com/openshift/cluster-config-operator/blob/a726e3ee93ee0058b90aef3ec37106a2411b7216/pkg/operator/kube_cloud_config/aws.go#L62.

verification failed.

OCP version: 4.11.0-0.nightly-2022-02-10-031822

Note:

## Setting up host A which cannot access ec2.us-east-2.amazonaws.com

    > curl -kvv https://ec2.us-east-2.amazonaws.com
    * About to connect() to ec2.us-east-2.amazonaws.com port 443 (#0)
    * Trying 52.95.16.2...
    * Connection timed out
    * Failed connect to ec2.us-east-2.amazonaws.com:443; Connection timed out
    * Closing connection 0
    curl: (7) Failed connect to ec2.us-east-2.amazonaws.com:443; Connection timed out

## the command `curl -kvv https://ec2.us-east-2.amazonaws.com` returns the same results from the bastion (proxy) host and from host A with the proxy setting

On host A with the proxy setting:

    > curl -kvv https://ec2.us-east-2.amazonaws.com
    * About to connect() to proxy ec2-3-138-34-112.us-east-2.compute.amazonaws.com port 3128 (#0)
    * Trying 10.0.0.95...
    * Connected to ec2-3-138-34-112.us-east-2.compute.amazonaws.com (10.0.0.95) port 3128 (#0)
    * Establish HTTP proxy tunnel to ec2.us-east-2.amazonaws.com:443
    * Proxy auth using Basic with user 'proxy-user1'
    > CONNECT ec2.us-east-2.amazonaws.com:443 HTTP/1.1
    > Host: ec2.us-east-2.amazonaws.com:443
    > Proxy-Authorization: ...
    > User-Agent: curl/7.29.0
    > Proxy-Connection: Keep-Alive
    >
    < HTTP/1.1 200 Connection established
    <
    * Proxy replied OK to CONNECT request
    ...
    > GET / HTTP/1.1
    > User-Agent: curl/7.29.0
    > Host: ec2.us-east-2.amazonaws.com
    > Accept: */*
    >
    < HTTP/1.1 301 Moved Permanently
    < Location: https://aws.amazon.com/ec2
    ...

On proxy host (bastion host)

    > curl -kvv https://ec2.us-east-2.amazonaws.com
    * About to connect() to ec2.us-east-2.amazonaws.com port 443 (#0)
    * Trying 52.95.20.2...
    * Connected to ec2.us-east-2.amazonaws.com (52.95.20.2) port 443 (#0)
    ...
    > GET / HTTP/1.1
    > User-Agent: curl/7.29.0
    > Host: ec2.us-east-2.amazonaws.com
    > Accept: */*
    >
    < HTTP/1.1 301 Moved Permanently
    < Location: https://aws.amazon.com/ec2
    ...

## but failed while using installer.

    > ./411/openshift-install create manifests --dir sts5c
    FATAL failed to fetch Master Machines: failed to load asset "Install Config": [platform.aws.serviceEndpoints[0].url: Invalid value: "https://ec2.us-east-2.amazonaws.com": dial tcp x.x.x.x:443: connect: connection timed out, platform.aws.serviceEndpoints[1].url: Invalid value: "https://sts.us-east-2.amazonaws.com": dial tcp x.x.x.x:443: connect: connection timed out]

> Progressing: Unable to apply resources: unable to sync storage configuration: WebIdentityErr: failed to retrieve credentials

+apjagtap this is a known issue, pls see Bug 1939842 "Image registry Degraded caused by requesting to aws sts global endpoint timeout when installing sts cluster in a disconnected network"

Hello @mstaeble, we performed the installation again with no custom endpoints in the install-config, but specifying the sts endpoint in the infrastructure CR's spec as well as status section. The installation failed again with image-registry reporting degraded state. Let me know if you'd like to take a look into the latest logs, and I can share the same. However, I think the image-registry operator is in a degraded state because of bug 1939842, as highlighted by Yunfei.

Thanks @yunjiang for sharing that. I'll check further on the same. Thank you for the help so far!

verified. PASS.

verify process: see comment 17

OCP version: 4.11.0-0.nightly-2022-04-26-181148

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069 |