Description of problem:

Version-Release number of the following components:
4.2.0-0.nightly-2019-08-18-222019

How reproducible:
Always

Steps to Reproduce:
1. Create a disconnected VPC. Refer to https://docs.openshift.com/container-platform/4.1/installing/installing_aws_user_infra/installing-aws-user-infra.html#installation-cloudformation-vpc_installing-aws-user-infra, remove the NAT/EIP/Route from the private subnets, and create ec2/elasticloadbalancing endpoints to reach the AWS services that the kubelet requires for the cloud provider.
2. Create a mirror registry, and mirror the release image to this internal registry.
3. Trigger a UPI install on AWS.

Actual results:
Installation failed because some operators do not become ready.

$ ./openshift-install wait-for install-complete --dir '/home/installer2/workspace/Launch Environment Flexy/workdir/install-dir'
level=info msg="Waiting up to 30m0s for the cluster at https://api.aws-jialiu1.qe.devcluster.openshift.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Working towards 4.2.0-0.nightly-2019-08-18-222019: 99% complete"

Expected results:
Disconnected install should complete successfully.

Additional info:
The following operators do not become ready:

$ oc get clusteroperator
NAME               VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                         Unknown     Unknown       True       20h
cloud-credential   4.2.0-0.nightly-2019-08-18-222019   True        True          True       20h
image-registry                                         False       True          False      20h

$ oc describe clusteroperator image-registry
...
Status:
  Conditions:
    Last Transition Time:  2019-08-19T09:01:56Z
    Message:               The deployment does not exist
    Reason:                DeploymentNotFound
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-08-19T09:01:56Z
    Message:               Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials "openshift-image-registry/installer-cloud-credentials": secret "installer-cloud-credentials" not found

$ oc get po -n openshift-ingress-operator
...
Containers:
  ingress-operator:
    Container ID:  cri-o://8c507f9d8956f22d77963dca8079ff51139c8bdb9892cdb33f49c8cbcc6b94a6
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca33686c0da5dc410c3c388ee4c10458c7283c3033430ab7ae9e6029dfe0f98b
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca33686c0da5dc410c3c388ee4c10458c7283c3033430ab7ae9e6029dfe0f98b
    Port:          <none>
    Host Port:     <none>
    Command:
      ingress-operator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Message:
        2019-08-20T05:56:55.355Z INFO operator log/log.go:26 started zapr logger
        2019-08-20T05:57:00.772Z INFO operator.entrypoint ingress-operator/main.go:62 using operator namespace {"namespace": "openshift-ingress-operator"}
        2019-08-20T05:57:00.783Z ERROR operator.entrypoint ingress-operator/main.go:105 failed to create DNS manager {"error": "failed to get cloud credentials from secret /: secrets \"cloud-credentials\" not found"}
      Exit Code:    1
      Started:      Tue, 20 Aug 2019 13:56:55 +0800
      Finished:     Tue, 20 Aug 2019 13:57:00 +0800
    Ready:          False
    Restart Count:  248
...

It seems that most of the unavailable operators depend on cloud-credential.
$ oc logs -f cloud-credential-operator-57dbb67984-cpf9l -n openshift-cloud-credential-operator |grep "cloud credentials insufficient to satisfy credentials request"
time="2019-08-20T05:18:45Z" level=error msg="cloud credentials insufficient to satisfy credentials request" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress
time="2019-08-20T05:18:45Z" level=error msg="error syncing credentials: cloud credentials insufficient to satisfy credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2019-08-20T05:35:20Z" level=error msg="cloud credentials insufficient to satisfy credentials request" actuator=aws cr=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro
time="2019-08-20T05:35:20Z" level=error msg="error syncing credentials: cloud credentials insufficient to satisfy credentials request" controller=credreq cr=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro secret=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro-creds
time="2019-08-20T05:35:22Z" level=error msg="cloud credentials insufficient to satisfy credentials request" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-08-20T05:35:22Z" level=error msg="error syncing credentials: cloud credentials insufficient to satisfy credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2019-08-20T05:35:24Z" level=error msg="cloud credentials insufficient to satisfy credentials request" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws
time="2019-08-20T05:35:24Z" level=error msg="error syncing credentials: cloud credentials insufficient to satisfy credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-aws secret=openshift-machine-api/aws-cloud-credentials

$ oc logs cloud-credential-operator-57dbb67984-cpf9l -n openshift-cloud-credential-operator|grep "error while validating cloud credentials"
time="2019-08-19T08:59:42Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: error gathering AWS credentials details: error querying username: RequestError: send request failed\ncaused by: Post https://iam.amazonaws.com/: dial tcp 54.239.22.207:443: i/o timeout" controller=secretannotator
time="2019-08-19T09:01:43Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: error gathering AWS credentials details: error querying username: RequestError: send request failed\ncaused by: Post https://iam.amazonaws.com/: dial tcp 205.251.242.222:443: i/o timeout" controller=secretannotator

It seems that the cloud-credential operator is trying to access https://iam.amazonaws.com/, but after going through https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html, it looks like there is no officially supported IAM endpoint.
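For anyone triaging a similar failure, the `oc get clusteroperator` table in the report above can be filtered mechanically. The following is only a minimal sketch: the function name `not_ready_operators` is hypothetical, and it assumes that "ready" means Available=True, Progressing=False, Degraded=False. Because the VERSION column can be empty, the status columns are read from the right-hand side of each row.

```python
from typing import List


def not_ready_operators(table: str) -> List[str]:
    """Return operator names whose status is not Available=True,
    Progressing=False, Degraded=False (an assumed definition of "ready")."""
    not_ready = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # not a complete operator row
        name = fields[0]
        # VERSION may be missing, so take the three status columns that
        # precede the trailing SINCE column.
        available, progressing, degraded = fields[-4:-1]
        if (available, progressing, degraded) != ("True", "False", "False"):
            not_ready.append(name)
    return not_ready


# Rows copied from the report above.
REPORTED = """\
NAME               VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                         Unknown     Unknown       True       20h
cloud-credential   4.2.0-0.nightly-2019-08-18-222019   True        True          True       20h
image-registry                                         False       True          False      20h
"""

if __name__ == "__main__":
    for operator in not_ready_operators(REPORTED):
        print(operator)
```

Piping the real command output into such a filter would be one way to watch for operators that stay unhealthy after the VPC is changed.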
Johnny, We discussed this during the group G architecture call and decided that what we'd suggest is configuring the VPC to have access to standard AWS APIs. Future work will enable the CCO to be disabled, and that will be tracked in https://jira.coreos.com/browse/CO-537. Can you try that? Once confirmed, we can move this to docs to make sure that we highlight it in the disconnected documentation.
(In reply to Scott Dodson from comment #4)
> Johnny,
>
> we'd suggest is configuring the VPC to have access to standard AWS APIs.
Does that mean the VPC would only be able to access the standard AWS APIs, but not the internet? How can that be achieved? Need some guide here.
(In reply to Johnny Liu from comment #7)
> Need some guide here.
Here, "guide" means a draft document that we will give to customers once QE has confirmed it.
The only process we can come up with is to provide an Internet gateway during the installation and then remove it after installation and verify the upgrade process works as expected.
(In reply to Scott Dodson from comment #9)
> The only process we can come up with is to provide an Internet gateway
> during the installation and then remove it after installation and verify the
> upgrade process works as expected.
That sounds like running a common UPI-on-AWS install, where "remove the internet gateway" is just a post-install action. Do we really suggest that customers do such a strange disconnected install? From the customer's perspective, I do not think this is a reasonable resolution. The root fix should be "Future work will enable the CCO to be disabled"; if we cannot complete that in 4.2, it sounds like disconnected install on AWS should be a 4.3 feature. One more point: per my understanding, the cloud-credential operator keeps accessing the AWS API after installation is completed, so once the internet gateway is removed, the CCO would go into the Degraded state again. Can anyone confirm that? If I am wrong, please correct me.
@Devan, could you help confirm my understanding?
> per my understanding, the cloud-credential operator keeps accessing the AWS API after installation is completed, so once the internet gateway is removed, the CCO would go into the Degraded state again.
I have not verified this but if the cred operator was able to successfully mint all the required credentials it needs before being taken "offline", it will be erroring in the control loop but not marked degraded. I think we check if credentials need minting or not to report degraded/failing status.
Following Scott's suggestion, I installed a cluster and then removed the NAT gateway from the VPC. After a few minutes, the authentication and cloud-credential operators went into the Degraded state, as I assumed.

$ oc describe co authentication
Name:         authentication
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-09-03T10:57:11Z
  Generation:          1
  Resource Version:    427264
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:                 949a0b48-ce39-11e9-b3c7-06329ea4d760
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-09-04T06:40:56Z
    Message:               RouteHealthDegraded: failed to GET route: dial tcp 18.222.35.247:443: i/o timeout
    Reason:                RouteHealthDegradedFailedGet
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-09-03T11:08:19Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-09-03T11:06:49Z
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-09-03T10:57:14Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>

$ oc describe co cloud-credential
Name:         cloud-credential
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-09-03T10:53:45Z
  Generation:          1
  Resource Version:    395619
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/cloud-credential
  UID:                 19b7cfa4-ce39-11e9-a07d-024c4b7846b0
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-09-04T07:09:50Z
    Message:               4 of 4 credentials requests are failing to sync.
    Reason:                CredentialsFailing
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-09-04T07:09:50Z
    Message:               0 of 4 credentials requests provisioned, 4 reporting errors.
    Reason:                Reconciling
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-09-03T10:53:45Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-09-03T10:53:45Z
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>

The authentication operator is failing to connect to the apps LB, which is provisioned by the ingress router and is an internet-facing LB. Users have no way to change it in the CF template. The cloud-credential operator shows exactly the same error log as before, saying that it fails to access https://iam.amazonaws.com/
Derek any thoughts on what we should do here? We know you've mentioned that cluster components should be able to assume access to cloud APIs without a proxy. However no one has been able to provide QE with a documented way to configure a cluster that is disconnected but able to access AWS APIs. My understanding is QE has tried several things without luck. One attempt was a proxy, which led to bug #1747366, where Cred Operator does not support a proxy but QE manually hacked around it and this appears to solve the issue for a disconnected cluster. Should we add support for the proxy vars to cred operator?
From my understanding, a cluster behind a proxy and a disconnected cluster are two different test scenarios; the only thing they have in common is that neither has direct internet connectivity.
Cluster behind a proxy:
1. Instances have no direct internet connectivity.
2. A proxy must be set up in the internal network, and it is the only egress to the internet.
3. There is no need to mirror the release image into an internal registry; the cluster is still pulling images from quay.io, but has to go through the proxy.
4. All access to internal URLs from the cluster instances would go out through the proxy.
Disconnected cluster:
1. Instances have no internet connectivity at all.
2. A registry must be set up in the internal network, and it also needs egress internet connectivity.
3. The release image is mirrored to the internal registry for the subsequent installation.
4. All instances in the cluster have no internet connectivity at all and pull images from the internal registry.
> But the problem is apps load balancer, the apps load balancer would be provisioned by ingress operator, which is in public subnet. User have no way change it in CF template.
[1] landed installer docs about disabling DNS. You should also be able to set endpointPublishingStrategy on the IngressController spec [2] to disable load-balancer provisioning, although I haven't tried that myself.
> Manual updates to the templates (and VPC) are required to do a disconnected + proxy installation.
Agreed, and we have those in flight for AWS [3]. I'm not sure we intend to ever document the setup in the installer repo, but folks are obviously welcome to lean on the CI approach if they want some guidance setting this up, and that may eventually end up in openshift-docs instructions. But this bug is assigned to the cred operator, and I'm not seeing anything there that bug 1747366 does not already cover. If we are going to leave this open, we need to be more clear about what action we're expecting to close it.
[1]: https://github.com/openshift/installer/pull/2221
[2]: https://github.com/openshift/api/blob/37678ff76af25c454dce37f78973d47d6ec23125/operator/v1/types_ingress.go#L67-L83
[3]: https://github.com/openshift/release/pull/4719
(In reply to W. Trevor King from comment #20)
> But this bug is assigned to the cred operator, and I'm not seeing anything
> there that bug 1747366 does not already cover. If we are going to leave
> this open, we need to be more clear about what action we're expecting to
> close it.
In my initial report, the cred operator could not work without a proxy, and that was blocking QE's testing of disconnected install on AWS. After the rounds of discussion above, QE concluded that a disconnected install needs PrivateLink + proxy in the disconnected VPC, so our testing is moving on with that setup (it needs verification after bug 1747366 is fixed). Once the cred operator is fixed to work with the proxy enabled, we will continue to validate the "disable load-balancer provisioning" approach. In short, this issue at least deserves documentation tracking; otherwise, customers would hit the same issue. I will change the component to "Document" and change the title to reflect our conclusion.
I think we need to just be very clear in our documentation what we're referring to as 'Disconnected' and perhaps for now considering something more specific like 'Mirrored content'. The epic brief and product epic in Jira clearly outline what's in scope however the goal is a bit ambiguous as it describes being disconnected from Internet which I guess could be interpreted as not having access to cloud APIs, though I personally wouldn't interpret it that way. https://docs.google.com/document/d/1m-dSdP6NaHuAptNEgt0pGYq3x-74cKT2wQydtXDg4Jw/edit# https://jira.coreos.com/browse/PROD-614
Reading through this if the bootstrap node is failing to come up, we will not be able to help on the cred minting side. This should either be reassigned to installer, or perhaps filed as a separate bug.
(In reply to Devan Goodwin from comment #25)
> Reading through this if the bootstrap node is failing to come up, we will
> not be able to help on the cred minting side. This should either be
> reassigned to installer, or perhaps filed as a separate bug.
Because AWS does not have a supported PrivateLink for the IAM endpoint, comment 19 suggested that QE enable a proxy for the credential operator. Unfortunately, enabling the proxy in install-config.yaml enables a *global* proxy for the whole cluster, which causes other parts of the installation to not work as expected. So is there a way to configure the proxy setting only for the credential operator?
No, there is no way to enable proxy only for the credentials operator, other than what you discovered earlier by manually adding the proxy settings. Is there some customer need for us to enable proxy for individual components and not the whole cluster?
(In reply to Devan Goodwin from comment #29)
> No, there is no way to enable proxy only for the credentials operator, other than
> what you discovered earlier by manually adding the proxy settings.
>
> Is there some customer need for us to enable proxy for individual components
> and not the whole cluster?
In this disconnected install scenario (not the proxy install scenario), I think the answer would be 'yes'; otherwise, the credential operator would never become ready.
User-provided "disconnected" AWS docs are in flight, and they're just using the vanilla CloudFormation templates [1]. They also emphasize that you still need access to the AWS APIs [2]. There's nothing about "if you block the usual route to the IAM APIs, you can use these proxy settings to recover...". This is all a lot like the "mirrored release" emphasis from comment 22. Can we close this NOTABUG, or at least punt to 4.3+, unless we can reproduce the cred issue under those constraints?
[1]: https://github.com/openshift/openshift-docs/pull/16696/files#diff-b0013daefc0ff70440c46c0e994c9249R65-R89
[2]: https://github.com/openshift/openshift-docs/pull/16696/files#diff-b0013daefc0ff70440c46c0e994c9249R15
Today I successfully set up a disconnected cluster (dropping the internet gateway for the private subnets + enabling the proxy in install-config.yaml) after applying some workarounds. During testing, we found a proxy bug: Bug 1753467 - [proxy] no proxy is set for kube-controller-manager. Once 1753467 is fixed, the conclusion from QE's testing of disconnected install on AWS would be: if users drop all internet traffic (so there is no way to access the AWS APIs), they need to enable a proxy to allow access to those AWS APIs and add those API endpoints to the proxy's whitelist. There is no need to mix in PrivateLink any more. In QE's testing, the whitelist in the proxy was something like:
ec2.us-east-2.amazonaws.com
iam.amazonaws.com
<MIRROR-REGISTRY>
.s3.us-east-2.amazonaws.com
.apps.<CLUSTER-NAME>.<DOMAIN>
Users also need to create the apps record following https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#add-the-ingress-dns-records
This information needs to be clearly explained to customers. So this is not a test blocker any more; moving this bug to the Document component.
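To make the whitelist semantics concrete, here is a minimal sketch of how such a list might be evaluated. It is not how any particular proxy implements matching: it assumes that an entry with a leading dot matches the bare domain and every subdomain, while any other entry matches the exact host only. The function name `is_allowed` and the registry and cluster host names are hypothetical stand-ins for the placeholders in the list above.

```python
from typing import Iterable


def is_allowed(host: str, allowlist: Iterable[str]) -> bool:
    """Return True if host matches an allowlist entry.

    Assumed semantics: ".example.com" matches "example.com" and any
    subdomain; "example.com" matches that exact host only.
    """
    host = host.lower().rstrip(".")
    for entry in allowlist:
        entry = entry.lower()
        if entry.startswith("."):
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False


ALLOWLIST = [
    "ec2.us-east-2.amazonaws.com",
    "iam.amazonaws.com",
    "registry.example.internal",    # hypothetical value for <MIRROR-REGISTRY>
    ".s3.us-east-2.amazonaws.com",
    ".apps.mycluster.example.com",  # hypothetical value for .apps.<CLUSTER-NAME>.<DOMAIN>
]

if __name__ == "__main__":
    for candidate in ("iam.amazonaws.com",
                      "mybucket.s3.us-east-2.amazonaws.com",
                      "quay.io"):
        print(candidate, is_allowed(candidate, ALLOWLIST))
```

Under these assumptions the IAM endpoint and any S3 bucket host in us-east-2 pass, while quay.io is rejected, which is consistent with the cluster pulling images from the mirror registry instead.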
After that proxy bug is fixed, I'll update the AWS docs per the information in #c40.
I'm adding the information about creating your own DNS record entries per https://github.com/openshift/installer/pull/2221/files#diff-a2ee8aa448a0244512469c9c7126465f on https://github.com/openshift/openshift-docs/pull/17190.
I've merged the information about creating your own DNS records entries, and it should go live soon. I'm going to keep this bug open pending the resolution of Bug 1753467 - [proxy] no proxy is set for kube-controller-manager.
(In reply to Kathryn Alexander from comment #42)
> I'm adding the information about creating your own DNS records entries per
> https://github.com/openshift/installer/pull/2221/files#diff-a2ee8aa448a0244512469c9c7126465f on
> https://github.com/openshift/openshift-docs/pull/17190.
Hello Kathryn Alexander,
Do you mean the following section?
- Creating the Ingress DNS Records - Installing a cluster on AWS in a restricted network - Installing on AWS | Installing | OpenShift Container Platform 4.5
https://docs.openshift.com/container-platform/4.5/installing/installing_aws/installing-restricted-networks-aws.html#installation-create-ingress-dns-records_installing-restricted-networks-aws
/Masaki
@Masaki, yes, that section contains the intended change for this bug, so I'm going to close it. Please reopen it with more data if you think that there is additional work to do.