Bug 1743483

Summary:	[disconnected] A disconnected UPI-on-AWS install needs a proxy to access the AWS API when the instances have no internet connectivity at all.
Product:	OpenShift Container Platform	Reporter:	Johnny Liu <jialiu>
Component:	Documentation	Assignee:	Kathryn Alexander <kalexand>
Status:	CLOSED UPSTREAM	QA Contact:	Johnny Liu <jialiu>
Severity:	urgent	Docs Contact:	Vikram Goyal <vigoyal>
Priority:	urgent
Version:	4.2.0	CC:	ahoffer, aos-bugs, bleanhar, decarr, dgoodwin, dmoessne, jnordell, jokerman, kalexand, mfuruta, nosue, rh-container, scuppett, sdodson, suchaudh, umohnani, wking
Target Milestone:	---	Keywords:	Reopened
Target Release:	4.3.0	Flags:	scuppett: needinfo-
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Release Note
Doc Text:	Disconnected installations of OCP4 on cloud environments where interoperability with the cloud environment is desired may require a proxy be configured to the cloud environment infrastructure endpoints. Examples of this include AWS Route53 & IAM endpoints.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-09-25 19:50:13 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Johnny Liu 2019-08-20 05:59:04 UTC

Description of problem:

Version-Release number of the following components:
4.2.0-0.nightly-2019-08-18-222019

How reproducible:
Always

Steps to Reproduce:
1. Create a disconnected VPC.
Refer to https://docs.openshift.com/container-platform/4.1/installing/installing_aws_user_infra/installing-aws-user-infra.html#installation-cloudformation-vpc_installing-aws-user-infra, remove the NAT/EIP/Route resources from the private subnets, and create ec2/elasticloadbalancing endpoints so that the kubelet can reach the AWS services that the cloud provider requires.
2. Create a mirror registry, and mirror the release image to this internal registry.
3. Trigger a UPI install on AWS.

Actual results:
Installation failed because some operators never became ready.
$ ./openshift-install wait-for install-complete --dir '/home/installer2/workspace/Launch Environment Flexy/workdir/install-dir'
level=info msg="Waiting up to 30m0s for the cluster at https://api.aws-jialiu1.qe.devcluster.openshift.com:6443 to initialize..."

level=fatal msg="failed to initialize the cluster: Working towards 4.2.0-0.nightly-2019-08-18-222019: 99% complete"




Expected results:
Disconnected install should be completed successfully.

Additional info:
The following operators do not become ready:
$ oc get clusteroperator
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                 Unknown     Unknown       True       20h
cloud-credential                           4.2.0-0.nightly-2019-08-18-222019   True        True          True       20h
image-registry                                                                 False       True          False      20h

$ oc describe clusteroperator image-registry
...
Status:
  Conditions:
    Last Transition Time:  2019-08-19T09:01:56Z
    Message:               The deployment does not exist
    Reason:                DeploymentNotFound
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-08-19T09:01:56Z
    Message:               Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials "openshift-image-registry/installer-cloud-credentials": secret "installer-cloud-credentials" not found

$ oc get po -n openshift-ingress-operator
...
Containers:
  ingress-operator:
    Container ID:  cri-o://8c507f9d8956f22d77963dca8079ff51139c8bdb9892cdb33f49c8cbcc6b94a6
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca33686c0da5dc410c3c388ee4c10458c7283c3033430ab7ae9e6029dfe0f98b
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca33686c0da5dc410c3c388ee4c10458c7283c3033430ab7ae9e6029dfe0f98b
    Port:          <none>
    Host Port:     <none>
    Command:
      ingress-operator
    State:                Waiting
      Reason:             CrashLoopBackOff
    Last State:           Terminated
      Reason:             Error
      Message:            2019-08-20T05:56:55.355Z  INFO                 operator                      log/log.go:26                 started zapr logger
2019-08-20T05:57:00.772Z  INFO                      operator.entrypoint  ingress-operator/main.go:62   using operator namespace      {"namespace": "openshift-ingress-operator"}
2019-08-20T05:57:00.783Z  ERROR                     operator.entrypoint  ingress-operator/main.go:105  failed to create DNS manager  {"error": "failed to get cloud credentials from secret /: secrets \"cloud-credentials\" not found"}

      Exit Code:    1
      Started:      Tue, 20 Aug 2019 13:56:55 +0800
      Finished:     Tue, 20 Aug 2019 13:57:00 +0800
    Ready:          False
    Restart Count:  248
...

It seems that most of the unavailable operators depend on cloud-credential.
$ oc logs -f cloud-credential-operator-57dbb67984-cpf9l -n openshift-cloud-credential-operator |grep "cloud credentials insufficient to satisfy credentials request"
time="2019-08-20T05:18:45Z" level=error msg="cloud credentials insufficient to satisfy credentials request" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress
time="2019-08-20T05:18:45Z" level=error msg="error syncing credentials: cloud credentials insufficient to satisfy credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2019-08-20T05:35:20Z" level=error msg="cloud credentials insufficient to satisfy credentials request" actuator=aws cr=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro
time="2019-08-20T05:35:20Z" level=error msg="error syncing credentials: cloud credentials insufficient to satisfy credentials request" controller=credreq cr=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro secret=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro-creds
time="2019-08-20T05:35:22Z" level=error msg="cloud credentials insufficient to satisfy credentials request" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-08-20T05:35:22Z" level=error msg="error syncing credentials: cloud credentials insufficient to satisfy credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2019-08-20T05:35:24Z" level=error msg="cloud credentials insufficient to satisfy credentials request" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws
time="2019-08-20T05:35:24Z" level=error msg="error syncing credentials: cloud credentials insufficient to satisfy credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-aws secret=openshift-machine-api/aws-cloud-credentials

$ oc logs cloud-credential-operator-57dbb67984-cpf9l -n openshift-cloud-credential-operator|grep "error while validating cloud credentials"
time="2019-08-19T08:59:42Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: error gathering AWS credentials details: error querying username: RequestError: send request failed\ncaused by: Post https://iam.amazonaws.com/: dial tcp 54.239.22.207:443: i/o timeout" controller=secretannotator
time="2019-08-19T09:01:43Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: error gathering AWS credentials details: error querying username: RequestError: send request failed\ncaused by: Post https://iam.amazonaws.com/: dial tcp 205.251.242.222:443: i/o timeout" controller=secretannotator


It seems that the cloud-credential operator is trying to access https://iam.amazonaws.com/, but after going through https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html, I found no officially supported IAM endpoint.
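
To see at a glance which external hosts the operator cannot reach, the timeout errors above can be summarized per host. The following is a minimal Python sketch, not part of any OpenShift tooling. The regular expression and the helper name `unreachable_hosts` are illustrative assumptions, and the sample lines are shortened copies of the log output quoted above.

```python
import re
from collections import Counter

# Matches the Go HTTP client error fragment seen in the operator log, e.g.
#   Post https://iam.amazonaws.com/: dial tcp 54.239.22.207:443: i/o timeout
# (assumed pattern; adjust it if your log format differs)
DIAL_TIMEOUT = re.compile(
    r"(?:Get|Post|Put|Head|Delete) https?://(?P<host>[^/:\s]+)\S*: "
    r"dial tcp (?P<addr>\S+?):(?P<port>\d+): i/o timeout"
)

def unreachable_hosts(log_lines):
    """Count dial timeouts per destination host name."""
    hits = Counter()
    for line in log_lines:
        match = DIAL_TIMEOUT.search(line)
        if match:
            hits[match.group("host")] += 1
    return hits

# Shortened copies of the two cloud-credential-operator log lines above.
SAMPLE = [
    r'level=error msg="error querying username: RequestError: send request failed\ncaused by: Post https://iam.amazonaws.com/: dial tcp 54.239.22.207:443: i/o timeout"',
    r'level=error msg="error querying username: RequestError: send request failed\ncaused by: Post https://iam.amazonaws.com/: dial tcp 205.251.242.222:443: i/o timeout"',
]

if __name__ == "__main__":
    for host, count in unreachable_hosts(SAMPLE).items():
        print(f"{host}: {count} timeout(s)")
```

In practice the input would be the saved output of the `oc logs` command shown above; every host reported this way needs either a VPC endpoint or some other route out of the private subnets.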

Comment 4 Scott Dodson 2019-08-23 13:27:04 UTC

Johnny,

We discussed this during group G architecture call and decided that what we'd suggest is configuring the VPC to have access to standard AWS APIs. Future work will enable the CCO to be disabled and that will be tracked in https://jira.coreos.com/browse/CO-537

Can you try that? Once confirmed we can move this to docs to make sure that we highlight that in the disconnected documentation.

Comment 7 Johnny Liu 2019-08-26 02:14:27 UTC

(In reply to Scott Dodson from comment #4)
> Johnny,
> 
> we'd suggest is configuring the VPC to have access to standard AWS APIs.

Does that mean the VPC can access only the standard AWS APIs but cannot access the internet? How do we achieve that? We need some guidance here.

Comment 8 Johnny Liu 2019-08-26 04:49:29 UTC

(In reply to Johnny Liu from comment #7)
> Need some guide here.
Here, "guide" means a draft document that we can give to customers once QE has confirmed it.

Comment 9 Scott Dodson 2019-08-28 12:23:03 UTC

The only process we can come up with is to provide an Internet gateway during the installation and then remove it after installation and verify the upgrade process works as expected.

Comment 10 Johnny Liu 2019-08-29 03:06:24 UTC

(In reply to Scott Dodson from comment #9)
> The only process we can come up with is to provide an Internet gateway
> during the installation and then remove it after installation and verify the
> upgrade process works as expected.

That sounds like running a common UPI-on-AWS install; "remove the Internet gateway" seems like a post-install action.
Do we really suggest that customers do such a strange disconnected install?
From a customer's perspective, I do not think this is a reasonable resolution. The root fix should be "Future work will enable the CCO to be disabled". If we cannot complete that in 4.2, then disconnected install on AWS should be a 4.3 feature.

One more point: per my understanding, the cloud-credential operator keeps accessing the AWS API after installation completes, so once the Internet gateway is removed, CCO would go into a degraded state again. Can anyone confirm that? If I am wrong, please correct me.

Comment 11 Johnny Liu 2019-08-30 01:35:53 UTC

@Devan, could you help confirm my understanding?
> per my understanding, the cloud-credential operator keeps accessing the AWS API after installation completes, so once the Internet gateway is removed, CCO would go into a degraded state again.

Comment 12 Devan Goodwin 2019-09-03 11:07:16 UTC

I have not verified this but if the cred operator was able to successfully mint all the required credentials it needs before being taken "offline", it will be erroring in the control loop but not marked degraded. I think we check if credentials need minting or not to report degraded/failing status.

Comment 13 Johnny Liu 2019-09-04 08:56:41 UTC

Following Scott's suggestion, I installed a cluster and then removed the NAT gateway from the VPC. After a few minutes, the authentication and cloud-credential operators became degraded, as I had assumed.


$ oc describe co authentication
Name:         authentication
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-09-03T10:57:11Z
  Generation:          1
  Resource Version:    427264
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:                 949a0b48-ce39-11e9-b3c7-06329ea4d760
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-09-04T06:40:56Z
    Message:               RouteHealthDegraded: failed to GET route: dial tcp 18.222.35.247:443: i/o timeout
    Reason:                RouteHealthDegradedFailedGet
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-09-03T11:08:19Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-09-03T11:06:49Z
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-09-03T10:57:14Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>


$ oc describe co cloud-credential
Name:         cloud-credential
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-09-03T10:53:45Z
  Generation:          1
  Resource Version:    395619
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/cloud-credential
  UID:                 19b7cfa4-ce39-11e9-a07d-024c4b7846b0
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-09-04T07:09:50Z
    Message:               4 of 4 credentials requests are failing to sync.
    Reason:                CredentialsFailing
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-09-04T07:09:50Z
    Message:               0 of 4 credentials requests provisioned, 4 reporting errors.
    Reason:                Reconciling
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-09-03T10:53:45Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-09-03T10:53:45Z
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>


For authentication, it fails to connect to the apps LB, which is provisioned by the ingress router and is an internet-facing LB. Users have no way to change it in the CF template.
For cloud-credential, the error log is exactly the same: it fails to access https://iam.amazonaws.com/.

Comment 14 Devan Goodwin 2019-09-04 15:02:28 UTC

Derek any thoughts on what we should do here? We know you've mentioned that cluster components should be able to assume access to cloud APIs without a proxy. However no one has been able to provide QE with a documented way to configure a cluster that is disconnected but able to access AWS APIs. My understanding is QE has tried several things without luck. One attempt was a proxy, which led to bug #1747366, where Cred Operator does not support a proxy but QE manually hacked around it and this appears to solve the issue for a disconnected cluster. Should we add support for the proxy vars to cred operator?

Comment 15 Johnny Liu 2019-09-05 00:52:33 UTC

From my understanding, a cluster behind a proxy and a disconnected cluster are two different test scenarios; the only thing they have in common is that neither has direct internet connectivity.

Cluster behind a proxy:
1. Instances have no direct internet connectivity.
2. A proxy must be set up in the internal network as the only egress to the internet.
3. There is no need to mirror the release image into an internal registry; the cluster still pulls images from quay.io, but has to go through the proxy.
4. All access to internet URLs from the cluster instances goes out through the proxy.

Disconnected cluster:
1. Instances have no internet connectivity at all.
2. A registry must be set up in the internal network, and that registry also has egress internet connectivity.
3. The release image is mirrored to the internal registry for the subsequent installation.
4. No instance in the cluster has any internet connectivity; images are pulled from the internal registry.

Comment 20 W. Trevor King 2019-09-10 21:47:37 UTC

> But the problem is apps load balancer, the apps load balancer would be provisioned by ingress operator, which is in public subnet. User have no way change it in CF template. 

[1] landed installer docs about disabling DNS.  You should also be able to set endpointPublishingStrategy on the IngressController spec [2] to disable load-balancer provisioning, although I haven't tried that myself.

> Manual updates to the templates (and VPC) are required to do a disconnected + proxy installation.

Agreed, and we have those in flight for AWS [3].  I'm not sure we intend to ever document the setup in the installer repo, but folks are obviously welcome to lean on the CI approach if they want some guidance setting this up, and that may eventually end up in openshift-docs instructions.

But this bug is assigned to the cred operator, and I'm not seeing anything there that bug 1747366 does not already cover.  If we are going to leave this open, we need to be more clear about what action we're expecting to close it.

[1]: https://github.com/openshift/installer/pull/2221
[2]: https://github.com/openshift/api/blob/37678ff76af25c454dce37f78973d47d6ec23125/operator/v1/types_ingress.go#L67-L83
[3]: https://github.com/openshift/release/pull/4719
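
For illustration only, the endpointPublishingStrategy suggestion above might look like the manifest built below. This is an untested sketch: it assumes the default IngressController (named `default` in the `openshift-ingress-operator` namespace) and the `LoadBalancerService`, `HostNetwork`, and `Private` strategy types from the linked API definition. The helper name `ingress_controller_manifest` is hypothetical, and this thread does not establish which strategy, if any, suits a disconnected AWS cluster or whether the field can be changed on a running cluster.

```python
import json

# Strategy types taken from the linked types_ingress.go; treat the list as an
# assumption and check it against the API version you actually run.
KNOWN_STRATEGIES = ("LoadBalancerService", "HostNetwork", "Private")

def ingress_controller_manifest(strategy):
    """Build an IngressController manifest with the given publishing strategy."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown endpoint publishing strategy: {strategy!r}")
    return {
        "apiVersion": "operator.openshift.io/v1",
        "kind": "IngressController",
        "metadata": {
            "name": "default",
            "namespace": "openshift-ingress-operator",
        },
        "spec": {
            # Anything other than LoadBalancerService asks the ingress
            # operator not to provision a cloud load balancer.
            "endpointPublishingStrategy": {"type": strategy},
        },
    }

if __name__ == "__main__":
    print(json.dumps(ingress_controller_manifest("HostNetwork"), indent=2))
```

The printed JSON is only a starting point for experimentation; as noted above, this path had not been tried at the time of the comment.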

Comment 21 Johnny Liu 2019-09-11 10:16:05 UTC

(In reply to W. Trevor King from comment #20)
> But this bug is assigned to the cred operator, and I'm not seeing anything
> there that bug 1747366 does not already cover.  If we are going to leave
> this open, we need to be more clear about what action we're expecting to
> close it.
In my initial report, the cred operator could not work without a proxy, which was blocking QE's testing of disconnected install on AWS.

After the rounds of discussion above, QE concluded that a disconnected install needs PrivateLink plus a proxy in the disconnected VPC, so we can move on with our testing (verification is needed after bug 1747366 is fixed). Once the cred operator is fixed to work with the proxy enabled, we will continue by validating the "disable load-balancer provisioning" approach.

In short, this issue deserves at least some documentation tracking; otherwise customers will hit the same issue.

I will change the component to "Documentation" and change the title to reflect our conclusion.

Comment 22 Scott Dodson 2019-09-11 12:31:59 UTC

I think we need to just be very clear in our documentation what we're referring to as 'Disconnected' and perhaps for now considering something more specific like 'Mirrored content'.

The epic brief and product epic in Jira clearly outline what's in scope however the goal is a bit ambiguous as it describes being disconnected from Internet which I guess could be interpreted as not having access to cloud APIs, though I personally wouldn't interpret it that way.

https://docs.google.com/document/d/1m-dSdP6NaHuAptNEgt0pGYq3x-74cKT2wQydtXDg4Jw/edit#
https://jira.coreos.com/browse/PROD-614

Comment 25 Devan Goodwin 2019-09-16 14:32:29 UTC

Reading through this if the bootstrap node is failing to come up, we will not be able to help on the cred minting side. This should either be reassigned to installer, or perhaps filed as a separate bug.

Comment 28 Johnny Liu 2019-09-17 01:15:34 UTC

(In reply to Devan Goodwin from comment #25)
> Reading through this if the bootstrap node is failing to come up, we will
> not be able to help on the cred minting side. This should either be
> reassigned to installer, or perhaps filed as a separate bug.

Because AWS does not have a supported PrivateLink for the IAM endpoint, QE was advised in comment 19 to enable a proxy for the credential operator. Unfortunately, enabling the proxy in install-config.yaml enables a *global* proxy for the whole cluster, which causes other parts of the installation to not work as expected. So is there a way to enable the proxy setting only for the credential operator?

Comment 29 Devan Goodwin 2019-09-17 10:59:56 UTC

No, there is no way to enable the proxy only for the credentials operator, other than what you discovered earlier by manually adding the proxy settings.

Is there some customer need for us to enable proxy for individual components and not the whole cluster?

Comment 31 Johnny Liu 2019-09-17 12:09:21 UTC

(In reply to Devan Goodwin from comment #29)
> No there is no way to enable proxy only for credentials operator, other than
> what you discovered earlier by manually adding the proxy settings.
> 
> Is there some customer need for us to enable proxy for individual components
> and not the whole cluster?

In this disconnected install scenario (not the proxy install scenario), I think the answer is 'yes'; otherwise the credential operator would never become ready.

Comment 39 W. Trevor King 2019-09-19 04:13:34 UTC

User-provided "disconnected" AWS docs are in flight, and they're just using the vanilla CloudFormation templates [1].  They also emphasize that you still need access to the AWS APIs [2].  There's nothing about "if you block the usual route to the IAM APIs, you can use these proxy settings to recover...".  This is all a lot like the "mirrored release" emphasis from comment 22.  Can we close this NOTABUG, or at least punt to 4.3+, unless we can reproduce the cred issue under those constraints?
  
[1]: https://github.com/openshift/openshift-docs/pull/16696/files#diff-b0013daefc0ff70440c46c0e994c9249R65-R89
[2]: https://github.com/openshift/openshift-docs/pull/16696/files#diff-b0013daefc0ff70440c46c0e994c9249R15

Comment 40 Johnny Liu 2019-09-19 06:58:11 UTC

Today I successfully set up a disconnected cluster (dropping the Internet gateway for the private subnets and enabling the proxy in install-config.yaml) after applying some workarounds.

During testing, we found a proxy bug:
Bug 1753467 - [proxy] no proxy is set for kube-controller-manager.

Once 1753467 is fixed, QE's conclusion for disconnected install on AWS is this: if users drop all internet connectivity (leaving no way to access the AWS APIs), they need to enable a proxy to allow access to those AWS APIs and add the API endpoints to the proxy's whitelist. There is no need to mix in PrivateLink any more.

In QE's testing, the proxy whitelist looks something like:
ec2.us-east-2.amazonaws.com iam.amazonaws.com <MIRROR-REGISTRY> .s3.us-east-2.amazonaws.com .apps.<CLUSTER-NAME>.<DOMAIN>
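
As a rough illustration of how such a whitelist might be evaluated, here is a small Python sketch. It assumes the common proxy convention that an entry with a leading dot matches the domain itself and all of its subdomains, while other entries match exactly; the thread does not say which proxy QE used, so the real matching rules may differ. The registry host and cluster domain below are hypothetical stand-ins for the placeholders above.

```python
def is_allowed(host, allowlist):
    """Return True if host matches an allowlist entry.

    Assumed semantics: ".example.com" matches example.com and any
    subdomain of it; "example.com" matches only that exact host.
    """
    host = host.lower().rstrip(".")
    for entry in allowlist:
        entry = entry.lower()
        if entry.startswith("."):
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False

# Entries from the comment above, with hypothetical values filled in for
# <MIRROR-REGISTRY>, <CLUSTER-NAME>, and <DOMAIN>.
ALLOWLIST = [
    "ec2.us-east-2.amazonaws.com",
    "iam.amazonaws.com",
    "registry.example.internal",
    ".s3.us-east-2.amazonaws.com",
    ".apps.mycluster.example.com",
]

if __name__ == "__main__":
    for candidate in ("iam.amazonaws.com", "quay.io"):
        verdict = "allowed" if is_allowed(candidate, ALLOWLIST) else "blocked"
        print(f"{candidate}: {verdict}")
```

Under these assumptions, a request to quay.io is blocked while requests to the IAM endpoint and to any S3 bucket host in us-east-2 pass, which matches the intent of allowing only AWS API access from the disconnected VPC.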

Users also need to create the apps record, following https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#add-the-ingress-dns-records

This information needs to be clearly explained to customers. This is no longer a test blocker, so I am moving this bug to the Documentation component.

Comment 41 Kathryn Alexander 2019-09-20 15:16:18 UTC

After that proxy bug is fixed, I'll update the AWS docs per the information in #c40.

Comment 42 Kathryn Alexander 2019-10-10 18:08:12 UTC

I'm adding the information about creating your own DNS record entries per https://github.com/openshift/installer/pull/2221/files#diff-a2ee8aa448a0244512469c9c7126465f on https://github.com/openshift/openshift-docs/pull/17190.

Comment 44 Kathryn Alexander 2019-11-06 21:16:02 UTC

I've merged the information about creating your own DNS record entries, and it should go live soon. I'm going to keep this bug open pending the resolution of Bug 1753467 - [proxy] no proxy is set for kube-controller-manager.

Comment 46 Masaki Furuta ( RH ) 2020-09-01 02:22:37 UTC

(In reply to Kathryn Alexander from comment #42)
> I'm adding the information about creating your own DNS records entries per
> https://github.com/openshift/installer/pull/2221/files#diff-
> a2ee8aa448a0244512469c9c7126465f\ on
> https://github.com/openshift/openshift-docs/pull/17190.

Hello Kathryn Alexander,

Do you mean this section?

  - Creating the Ingress DNS Records - Installing a cluster on AWS in a restricted network - Installing on AWS | Installing | OpenShift Container Platform 4.5
    https://docs.openshift.com/container-platform/4.5/installing/installing_aws/installing-restricted-networks-aws.html#installation-create-ingress-dns-records_installing-restricted-networks-aws

/Masaki

Comment 47 Kathryn Alexander 2020-09-25 19:50:13 UTC

@Masaki, yes, that section contains the intended change for this bug, so I'm going to close it. Please reopen it with more data if you think that there is additional work to do.