Bug 1743483
| Summary: | [disconnected] a disconnected upi-on-aws install need proxy to access aws api when your instance totally drop internet capacity. | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Johnny Liu <jialiu> |
| Component: | Documentation | Assignee: | Kathryn Alexander <kalexand> |
| Status: | CLOSED UPSTREAM | QA Contact: | Johnny Liu <jialiu> |
| Severity: | urgent | Docs Contact: | Vikram Goyal <vigoyal> |
| Priority: | urgent | ||
| Version: | 4.2.0 | CC: | ahoffer, aos-bugs, bleanhar, decarr, dgoodwin, dmoessne, jnordell, jokerman, kalexand, mfuruta, nosue, rh-container, scuppett, sdodson, suchaudh, umohnani, wking |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.3.0 | Flags: | scuppett: needinfo- |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Release Note | |
| Doc Text: | Disconnected installations of OCP4 on cloud environments where interoperability with the cloud environment is desired may require a proxy be configured to the cloud environment infrastructure endpoints. Examples of this include AWS Route53 & IAM endpoints. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-09-25 19:50:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Johnny Liu
2019-08-20 05:59:04 UTC
Johnny,

We discussed this during the group G architecture call and decided that what we'd suggest is configuring the VPC to have access to standard AWS APIs. Future work will enable the CCO to be disabled, and that will be tracked in https://jira.coreos.com/browse/CO-537

Can you try that? Once confirmed, we can move this to docs to make sure that we highlight that in the disconnected documentation.

(In reply to Scott Dodson from comment #4)
> Johnny,
>
> we'd suggest is configuring the VPC to have access to standard AWS APIs.

Does that mean the VPC can only access standard AWS APIs but cannot access the internet? How do we achieve that? We need some guide here.

(In reply to Johnny Liu from comment #7)
> Need some guide here.

Here the "guide" means some draft document that we will give to customers once QE has confirmed it.

The only process we can come up with is to provide an Internet gateway during the installation and then remove it after installation and verify the upgrade process works as expected.

(In reply to Scott Dodson from comment #9)
> The only process we can come up with is to provide an Internet gateway
> during the installation and then remove it after installation and verify the
> upgrade process works as expected.

That sounds like running a common upi-on-aws install; "remove the internet gateway" seems like a post-install action. Do we really suggest that customers do such a strange disconnected install? From a customer's perspective, I do not feel this is a reasonable resolution. The root fix should be "Future work will enable the CCO to be disabled"; if we cannot complete that in 4.2, it sounds like disconnected install on AWS should be a 4.3 feature.

One more point: per my understanding, the cloud-credential operator keeps accessing the AWS API after installation is completed, so once the internet gateway is removed, the CCO would go into a degraded state again. Can anyone confirm that? If I am wrong, please correct me.

@Devan, could you help confirm my understanding?
> per my understanding, the cloud-credential operator is always keeping to access aws API, after installation is completed, once remove internet gateway, CCO would go to degrade state again.
I have not verified this, but if the cred operator was able to successfully mint all the credentials it needs before being taken "offline", it will be erroring in the control loop but not marked degraded. I think we check whether credentials need minting to report degraded/failing status.

Following Scott's suggestion, I installed a cluster and then removed the NAT gateway from the VPC. After a few minutes, the authentication and cloud-credential operators went into a degraded state, as I assumed.
$ oc describe co authentication
Name:         authentication
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-09-03T10:57:11Z
  Generation:          1
  Resource Version:    427264
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:                 949a0b48-ce39-11e9-b3c7-06329ea4d760
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-09-04T06:40:56Z
    Message:               RouteHealthDegraded: failed to GET route: dial tcp 18.222.35.247:443: i/o timeout
    Reason:                RouteHealthDegradedFailedGet
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-09-03T11:08:19Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-09-03T11:06:49Z
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-09-03T10:57:14Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
$ oc describe co cloud-credential
Name:         cloud-credential
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-09-03T10:53:45Z
  Generation:          1
  Resource Version:    395619
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/cloud-credential
  UID:                 19b7cfa4-ce39-11e9-a07d-024c4b7846b0
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-09-04T07:09:50Z
    Message:               4 of 4 credentials requests are failing to sync.
    Reason:                CredentialsFailing
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-09-04T07:09:50Z
    Message:               0 of 4 credentials requests provisioned, 4 reporting errors.
    Reason:                Reconciling
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-09-03T10:53:45Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-09-03T10:53:45Z
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
For authentication, it is failing to connect to the apps LB, which is provisioned by the ingress router and is an internet-facing LB. Users have no way to change it in the CF template.
For cloud-credential, it shows exactly the same error log, saying it fails to access https://iam.amazonaws.com/
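As an illustration only, the check done by eye on the two outputs above could be scripted. The sketch below assumes the operator data is available as the JSON list that `oc get clusteroperators -o json` returns; the function name `degraded_operators` and the trimmed `SAMPLE` data are hypothetical, with the condition values copied from the outputs in this comment.

```python
from typing import Dict, Tuple


def degraded_operators(cluster_operators: dict) -> Dict[str, Tuple[str, str]]:
    """Return {operator name: (reason, message)} for operators with Degraded=True.

    `cluster_operators` is assumed to be the parsed JSON list object of
    ClusterOperator resources (an "items" array).
    """
    result = {}
    for item in cluster_operators.get("items", []):
        name = item.get("metadata", {}).get("name", "<unknown>")
        for cond in item.get("status", {}).get("conditions", []):
            # Condition status is the string "True"/"False", not a boolean.
            if cond.get("type") == "Degraded" and cond.get("status") == "True":
                result[name] = (cond.get("reason", ""), cond.get("message", ""))
    return result


# Hypothetical sample, trimmed to the conditions shown in this comment.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "authentication"},
            "status": {"conditions": [
                {"type": "Degraded", "status": "True",
                 "reason": "RouteHealthDegradedFailedGet",
                 "message": "RouteHealthDegraded: failed to GET route: "
                            "dial tcp 18.222.35.247:443: i/o timeout"},
                {"type": "Available", "status": "True", "reason": "AsExpected"},
            ]},
        },
        {
            "metadata": {"name": "cloud-credential"},
            "status": {"conditions": [
                {"type": "Degraded", "status": "True",
                 "reason": "CredentialsFailing",
                 "message": "4 of 4 credentials requests are failing to sync."},
            ]},
        },
    ]
}

if __name__ == "__main__":
    for name, (reason, message) in degraded_operators(SAMPLE).items():
        print(f"{name}: {reason}: {message}")
```

To run this against a live cluster, the same function could be fed `json.load()` of the saved `oc` output instead of `SAMPLE`.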
Derek, any thoughts on what we should do here? We know you've mentioned that cluster components should be able to assume access to cloud APIs without a proxy. However, no one has been able to provide QE with a documented way to configure a cluster that is disconnected but able to access AWS APIs. My understanding is that QE has tried several things without luck. One attempt was a proxy, which led to bug #1747366, where the cred operator does not support a proxy, but QE manually hacked around it and this appears to solve the issue for a disconnected cluster. Should we add support for the proxy vars to the cred operator?

From my understanding, a cluster behind a proxy and a disconnected cluster are two different test scenarios. The only thing they have in common is that neither has direct internet connectivity.

Cluster behind proxy:
1. Instances have no direct internet connectivity.
2. A proxy must be set up in the internal network as the only egress to the internet.
3. There is no need to mirror the release image into an internal registry; the cluster still pulls images from quay.io, but has to go through the proxy.
4. All access to internal URLs from cluster instances goes out through the proxy.

Disconnected cluster:
1. Instances have no internet connectivity at all.
2. A registry must be set up in the internal network; the registry also has egress internet connectivity.
3. The release image is mirrored to the internal registry for the installation that follows.
4. All instances in the cluster have no internet connectivity and pull images from the internal registry.

> But the problem is apps load balancer, the apps load balancer would be provisioned by ingress operator, which is in public subnet. User have no way change it in CF template.

[1] landed installer docs about disabling DNS. You should also be able to set endpointPublishingStrategy on the IngressController spec [2] to disable load-balancer provisioning, although I haven't tried that myself.

> Manual updates to the templates (and VPC) are required to do a disconnected + proxy installation.
Agreed, and we have those in flight for AWS [3]. I'm not sure we intend to ever document the setup in the installer repo, but folks are obviously welcome to lean on the CI approach if they want some guidance setting this up, and that may eventually end up in openshift-docs instructions.

But this bug is assigned to the cred operator, and I'm not seeing anything there that bug 1747366 does not already cover. If we are going to leave this open, we need to be more clear about what action we're expecting to close it.

[1]: https://github.com/openshift/installer/pull/2221
[2]: https://github.com/openshift/api/blob/37678ff76af25c454dce37f78973d47d6ec23125/operator/v1/types_ingress.go#L67-L83
[3]: https://github.com/openshift/release/pull/4719

(In reply to W. Trevor King from comment #20)
> But this bug is assigned to the cred operator, and I'm not seeing anything
> there that bug 1747366 does not already cover. If we are going to leave
> this open, we need to be more clear about what action we're expecting to
> close it.

In my initial report, the cred operator could not work without a proxy, which was blocking QE's testing of disconnected install on AWS. From the discussion above, QE concluded that a disconnected install needs PrivateLink + proxy in the disconnected VPC, so we are moving on with our testing on that basis (verification is needed after bug 1747366 is fixed). Once the cred operator is fixed to work with a proxy enabled, we will continue to validate the "disable load-balancer provisioning" approach.

In a word, this issue at least deserves some documentation tracking, or else customers will hit the same issue. I will change the component to "Documentation" and change the title to reflect our conclusion.

I think we need to just be very clear in our documentation what we're referring to as 'Disconnected' and perhaps for now consider something more specific like 'Mirrored content'.
The epic brief and product epic in Jira clearly outline what's in scope; however, the goal is a bit ambiguous, as it describes being disconnected from the Internet, which I guess could be interpreted as not having access to cloud APIs, though I personally wouldn't interpret it that way.

https://docs.google.com/document/d/1m-dSdP6NaHuAptNEgt0pGYq3x-74cKT2wQydtXDg4Jw/edit#
https://jira.coreos.com/browse/PROD-614

Reading through this, if the bootstrap node is failing to come up, we will not be able to help on the cred minting side. This should either be reassigned to installer, or perhaps filed as a separate bug.

(In reply to Devan Goodwin from comment #25)
> Reading through this if the bootstrap node is failing to come up, we will
> not be able to help on the cred minting side. This should either be
> reassigned to installer, or perhaps filed as a separate bug.

Because AWS did not have a supported PrivateLink for the IAM endpoint, QE was advised in comment 19 to enable a proxy for the credential operator. Unfortunately, enabling the proxy in install-config.yaml enables a *global* proxy for the whole cluster, which causes other parts of the installation to not work as expected. So is there a way to enable the proxy setting for the credential operator only?

No, there is no way to enable a proxy only for the credentials operator, other than what you discovered earlier by manually adding the proxy settings.

Is there some customer need for us to enable proxy for individual components and not the whole cluster?

(In reply to Devan Goodwin from comment #29)
> No there is no way to enable proxy only for credentials operator, other than
> what you discovered earlier by manually adding the proxy settings.
>
> Is there some customer need for us to enable proxy for individual components
> and not the whole cluster?

In this disconnected install scenario (not the proxy install scenario), I think the answer would be 'yes'; otherwise, the credentials would never become ready.
User-provided "disconnected" AWS docs are in flight, and they're just using the vanilla CloudFormation templates [1]. They also emphasize that you still need access to the AWS APIs [2]. There's nothing about "if you block the usual route to the IAM APIs, you can use these proxy settings to recover...". This is all a lot like the "mirrored release" emphasis from comment 22. Can we close this NOTABUG, or at least punt to 4.3+, unless we can reproduce the cred issue under those constraints?

[1]: https://github.com/openshift/openshift-docs/pull/16696/files#diff-b0013daefc0ff70440c46c0e994c9249R65-R89
[2]: https://github.com/openshift/openshift-docs/pull/16696/files#diff-b0013daefc0ff70440c46c0e994c9249R15

Today I successfully set up a disconnected cluster (dropping the internet gateway for private subnets + enabling the proxy in install-config.yaml) after applying some workarounds. During testing, we found a proxy bug: Bug 1753467 - [proxy] no proxy is set for kube-controller-manager.

Once 1753467 is fixed, QE's conclusion from testing disconnected install on AWS would be: if the user drops all internet traffic capacity (so there is no way to access the AWS APIs), the user needs to enable a proxy to allow access to those AWS APIs and add the API endpoints to the proxy's whitelist. There is no need to mix in PrivateLink any more.

In QE's testing, the whitelist in the proxy would be something like:
ec2.us-east-2.amazonaws.com
iam.amazonaws.com
<MIRROR-REGISTRY>
.s3.us-east-2.amazonaws.com
.apps.<CLUSTER-NAME>.<DOMAIN>

The user also needs to create the apps record following https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#add-the-ingress-dns-records

This information needs to be clearly explained to customers. So this is not a test blocker any more; moving this bug to the Documentation component.

After that proxy bug is fixed, I'll update the AWS docs per the information in #c40.
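For readers setting up a similar proxy, here is a rough sketch of how the whitelist from QE's testing might be evaluated. It is not taken from any particular proxy: the function `host_allowed` is hypothetical, and it assumes that an entry with a leading dot matches any subdomain of that suffix while other entries must match exactly (real proxies may treat a leading dot differently). The registry host `registry.example.internal` and the `mycluster.example.com` domain are made-up stand-ins for the `<MIRROR-REGISTRY>`, `<CLUSTER-NAME>`, and `<DOMAIN>` placeholders.

```python
from typing import Iterable

# Whitelist from QE's testing, with hypothetical values substituted
# for the <MIRROR-REGISTRY>, <CLUSTER-NAME>, and <DOMAIN> placeholders.
ALLOWLIST = [
    "ec2.us-east-2.amazonaws.com",
    "iam.amazonaws.com",
    "registry.example.internal",    # assumption: stands in for <MIRROR-REGISTRY>
    ".s3.us-east-2.amazonaws.com",
    ".apps.mycluster.example.com",  # assumption: .apps.<CLUSTER-NAME>.<DOMAIN>
]


def host_allowed(host: str, allowlist: Iterable[str]) -> bool:
    """Return True if `host` matches an allowlist entry.

    Assumed semantics: an entry starting with "." matches any subdomain
    of that suffix; any other entry must equal the host exactly.
    """
    host = host.lower().rstrip(".")
    for entry in allowlist:
        entry = entry.lower()
        if entry.startswith("."):
            if host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False


if __name__ == "__main__":
    for candidate in ("iam.amazonaws.com",
                      "my-bucket.s3.us-east-2.amazonaws.com",
                      "quay.io"):
        verdict = "allowed" if host_allowed(candidate, ALLOWLIST) else "blocked"
        print(f"{candidate}: {verdict}")
```

Under these assumptions the IAM endpoint and S3 bucket hosts pass, while quay.io is blocked, which is consistent with a disconnected cluster that pulls images only from the mirror registry.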
I'm adding the information about creating your own DNS records entries per https://github.com/openshift/installer/pull/2221/files#diff-a2ee8aa448a0244512469c9c7126465f\ on https://github.com/openshift/openshift-docs/pull/17190.

I've merged the information about creating your own DNS records entries, and it should go live soon. I'm going to keep this bug open pending the resolution of Bug 1753467 - [proxy] no proxy is set for kube-controller-manager.

(In reply to Kathryn Alexander from comment #42)
> I'm adding the information about creating your own DNS records entries per
> https://github.com/openshift/installer/pull/2221/files#diff-a2ee8aa448a0244512469c9c7126465f\ on
> https://github.com/openshift/openshift-docs/pull/17190.

Hello Kathryn Alexander,

Do you mean the following section?

- Creating the Ingress DNS Records - Installing a cluster on AWS in a restricted network - Installing on AWS | Installing | OpenShift Container Platform 4.5
https://docs.openshift.com/container-platform/4.5/installing/installing_aws/installing-restricted-networks-aws.html#installation-create-ingress-dns-records_installing-restricted-networks-aws

/Masaki

@Masaki, yes, that section contains the intended change for this bug, so I'm going to close it. Please reopen it with more data if you think that there is additional work to do.