Bug 1644084 - RFE: K8S AWS Cloud Provider Missing Support for Custom Endpoints and Air-Gapped Regions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Jack Ottofaro
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-30 00:58 UTC by Matt
Modified: 2019-10-28 09:51 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Enhance the AWS cloud provider, in both core Kubernetes and the cluster autoscaler, to parse additional endpoint configuration.
Reason: AWS now offers custom and private regions whose endpoints do not follow the conventions of its public cloud. OpenShift deployments were limited to public AWS regions, which limited adoption of the product in these scenarios.
Result: Additional configuration elements can be added to aws.conf and are honored by OpenShift and by the cluster-autoscaler, so that the right cloud endpoints are used to automatically provision EBS volumes, load balancers, and EC2 instances.
Clone Of:
Environment:
Last Closed: 2019-04-11 05:38:23 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3670641 0 None None None 2018-10-30 15:20:36 UTC
Red Hat Product Errata RHBA-2019:0636 0 None None None 2019-04-11 05:38:38 UTC

Description Matt 2018-10-30 00:58:03 UTC
Description of problem:
The AWS DNS suffix (amazonaws.com) is currently hardcoded [1] in aws-sdk-go, which the AWS autoscaler feature in OCP 3.11 uses.


Version-Release number of selected component (if applicable):
OCP 3.11

How reproducible:
Any disconnected OCP deployment on an AWS cloud that is not hosted on amazonaws.com fails when deployed with the autoscaler.

Steps to Reproduce:
1. Configure the variables for AWS auto-scaling.
2. Deploy on normal (commercial) AWS; the deployment succeeds.
3. Deploy the same configuration on a disconnected (non-amazonaws.com) domain; the deployer fails.

Actual results:
Failed to start kubelet

Expected results:
Successful deployment

Additional info:
Before the move to aws-sdk-go, this was easily overridden with a boto.cfg option and a custom endpoints.json file. Now the domain is hardcoded in aws-sdk-go.

[1] https://github.com/aws/aws-sdk-go/blob/master/aws/endpoints/defaults.go#L211
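To illustrate the failure mode described in this report, here is a minimal Go sketch of endpoint resolution with a hardcoded DNS suffix and an optional override table. It deliberately does not use aws-sdk-go. The names `endpointOverride` and `resolveEndpoint`, the region `private-east-1`, and the host `cloud.example` are all hypothetical and chosen only for illustration; they are not taken from the SDK or from any fix.

```go
package main

import "fmt"

// defaultDNSSuffix mirrors the public-cloud suffix that the report says is hardcoded.
const defaultDNSSuffix = "amazonaws.com"

// endpointOverride is a hypothetical per-service, per-region override entry.
type endpointOverride struct {
	Service string
	Region  string
	URL     string
}

// resolveEndpoint returns the override URL when one matches the service and
// region. Otherwise it falls back to the public-cloud naming convention,
// which is the only behavior available when the suffix is hardcoded.
func resolveEndpoint(service, region string, overrides []endpointOverride) string {
	for _, o := range overrides {
		if o.Service == service && o.Region == region {
			return o.URL
		}
	}
	return fmt.Sprintf("https://%s.%s.%s", service, region, defaultDNSSuffix)
}

func main() {
	overrides := []endpointOverride{
		{Service: "ec2", Region: "private-east-1", URL: "https://ec2.private-east-1.cloud.example"},
	}
	// Without an override, a private region resolves to an unreachable public host.
	fmt.Println(resolveEndpoint("ec2", "private-east-1", nil))
	// With an override, the private endpoint is used instead.
	fmt.Println(resolveEndpoint("ec2", "private-east-1", overrides))
}
```

In an air-gapped region the first call yields a host under amazonaws.com that cannot be reached, which matches the symptom reported here. The second call shows the kind of configurable lookup this RFE asks for.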



Comment 1 Matt 2018-10-30 00:59:47 UTC
Unable to attach logs etc.; this is an air-gapped environment.

Comment 3 Matt 2018-10-30 13:32:52 UTC
Folks, this is a bit bigger than initially reported. The kubelet is calling out to the statically defined AWS endpoint, which causes a deployment failure for any OpenShift 3.11 deployment on a non-public AWS cloud.

Comment 12 Stephen Cuppett 2018-11-02 15:36:18 UTC
There are a few differences in AWS air-gapped regions from the public regions:

https://digitalageexperts.com/how-are-aws-private-air-gapped-regions-different/

We have identified the spot in the cluster-autoscaler where additional options will be needed to allow configuring for private regions:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L97

In addition, there are multiple spots in upstream Kubernetes cloudprovider where these options would be needed:

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L645-L745

The AWS SDK provides facilities for using custom CAs and endpoints:

https://docs.aws.amazon.com/sdk-for-go/api/aws/endpoints/

plus the AWS_CA_BUNDLE environment variable.

These mechanisms need to be leveraged.

We will raise the upstream issues needed next.

Comment 13 Jan Chaloupka 2018-11-02 15:47:17 UTC
Upstream issue for kubernetes/kubernetes: https://github.com/kubernetes/kubernetes/issues/70588. It is a quick summary of the current issue. I will extend the description after a more detailed analysis of the underlying code.

Comment 14 Stephen Cuppett 2018-11-02 18:02:42 UTC
There is also a spot in the AWS service broker:

https://github.com/awslabs/aws-servicebroker/blob/master/pkg/broker/aws_sdk.go#L24-L38

Comment 15 Stephen Cuppett 2018-11-02 20:11:45 UTC
Here are the specifics on the custom CA in the AWS SDK for Go:

https://github.com/aws/aws-sdk-go/blob/master/aws/session/session.go#L222-L225

Comment 18 Chuck Svoboda 2019-03-01 18:58:58 UTC
AWS GovCloud is specifically affected.

Comment 19 Stephen Cuppett 2019-03-01 19:01:18 UTC
Done:

(1.14) Upstream Kubernetes PR: https://github.com/kubernetes/kubernetes/pull/72245

In-flight:

(1.15) Upstream autoscaler PR: https://github.com/kubernetes/autoscaler/pull/1717
(3.11) Origin PR: https://github.com/openshift/origin/pull/22138

Todo:

(3.11) Ansible support for AWS_CA_BUNDLE environment variable: https://github.com/openshift/openshift-ansible
(3.11) Origin autoscaler PR: https://github.com/openshift/kubernetes-autoscaler

Comment 21 Stephen Cuppett 2019-03-01 22:41:48 UTC
Ansible PR now in-flight: https://github.com/openshift/openshift-ansible/pull/11277/

Comment 23 Stephen Cuppett 2019-03-05 11:55:07 UTC
Update:

Done:

(1.14) Upstream Kubernetes PR: https://github.com/kubernetes/kubernetes/pull/72245
(3.11) Origin PR: https://github.com/openshift/origin/pull/22138
(3.11) Ansible support for AWS_CA_BUNDLE environment variable: https://github.com/openshift/openshift-ansible/pull/11277

In-flight:

(1.15) Upstream autoscaler PR: https://github.com/kubernetes/autoscaler/pull/1717

Todo:

(3.11) Origin autoscaler PR: https://github.com/openshift/kubernetes-autoscaler

Comment 24 Chuck Svoboda 2019-03-05 12:04:03 UTC
(In reply to Stephen Cuppett from comment #23)

Looks like the reviewer didn't like the code quality and closed it unmerged.

Comment 25 Jan Chaloupka 2019-03-05 12:21:05 UTC
(1.15) Upstream autoscaler PR: https://github.com/kubernetes/autoscaler/pull/1745

Comment 28 Stephen Cuppett 2019-03-19 11:41:17 UTC
Update:
 
Done:

(1.14) Upstream Kubernetes PR: https://github.com/kubernetes/kubernetes/pull/72245
(3.11) Origin PR: https://github.com/openshift/origin/pull/22138
(3.11) Ansible support for AWS_CA_BUNDLE environment variable: https://github.com/openshift/openshift-ansible/pull/11277
(1.15) Upstream autoscaler PR: https://github.com/kubernetes/autoscaler/pull/1745
(3.11) Origin autoscaler PR: https://github.com/openshift/kubernetes-autoscaler/pull/56

Comment 32 Matt 2019-03-27 13:08:50 UTC
jhou:
What info are we looking for here? I see I'm tagged on the needinfo flag, but not seeing what info is being requested.

Comment 34 sunzhaohua 2019-04-02 06:02:08 UTC
Verified.

We ran regression tests for the playbook and the autoscaler features. No problems were found.

"atomic-openshift version: v3.11.98"

Comment 36 errata-xmlrpc 2019-04-11 05:38:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0636

