Bug 1881262

Summary: Failed to install disconnected+private cluster in us-gov-east-1, error: secrets "router-certs-default" not found
Product: OpenShift Container Platform
Reporter: Yunfei Jiang <yunjiang>
Component: Documentation
Assignee: Vikram Goyal <vigoyal>
Status: CLOSED NOTABUG
QA Contact: Yunfei Jiang <yunjiang>
Severity: high
Docs Contact: Vikram Goyal <vigoyal>
Priority: high
Version: 4.6
CC: adahiya, amcdermo, aos-bugs, bmcelvee, dgoodwin, jokerman, mfisher, mmasters
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-05 08:31:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                              Flags
comment 14 install log                   none
comment 14 VPC cloudformation template   none

Description Yunfei Jiang 2020-09-22 02:14:30 UTC
With CCO configured in manual mode, installing a private cluster in the us-gov-east-1 region fails with:
level=fatal msg="failed to fetch Master Machines: failed to generate asset \"Master Machines\": creating AWS session: fetching availability zones: AuthFailure: AWS was not able to validate the provided access credentials\n\tstatus code: 401, request id: 5e8be396-a673-4e6d-9225-d0a08ded83fe"

The cluster with same configuration can be installed in us-gov-west-1 successfully. 

Version-Release number of the following components: 
4.6.0-0.nightly-2020-09-21-030155 

How reproducible: 
Always 

Steps to Reproduce: 
1. Create 4.6 cluster with CCO in manual mode in us-gov-east-1, refer to https://github.com/openshift/cloud-credential-operator/blob/master/docs/mode-manual-creds.md

Actual results: 
AuthFailure: AWS was not able to validate the provided access credentials 

Expected results: 
Cluster can be installed successfully 

Additional info:


======= update on Sep. 30

For the latest description and errors, please refer to comment 14.

Comment 1 Abhinav Dahiya 2020-09-22 17:21:22 UTC
Manual creds for CCO doesn't mean the user doesn't need to provide credentials to the installer to communicate with AWS APIs. Manual mode is only for cluster operators.

```
level=fatal msg="failed to fetch Master Machines: failed to generate asset \"Master Machines\": creating AWS session: fetching availability zones: AuthFailure: AWS was not able to validate the provided access credentials\n\tstatus code: 401, request id: 5e8be396-a673-4e6d-9225-d0a08ded83fe"
```

Please make sure the installer has access to credentials with appropriate permissions to call AWS APIs. Moving to CCO team to triage/update the docs.
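
As a quick local check before running the installer, the following is a minimal sketch of how one might confirm that static AWS credentials are discoverable. It assumes the two most common sources only (the `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` environment variables, then a profile in the shared credentials file); the real SDK lookup chain consults more sources, and the function name `find_aws_credentials` is hypothetical, not part of the installer.

```python
import configparser
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_file=None):
    """Return a short description of where static credentials were found, or None.

    Hypothetical helper: it only looks at environment variables and the
    shared credentials file, not at every source an AWS SDK would try.
    """
    env = os.environ if env is None else env
    # 1. Environment variables come first in the standard lookup order.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"
    # 2. Otherwise look for the selected profile in the shared credentials file.
    path = Path(
        credentials_file
        or env.get("AWS_SHARED_CREDENTIALS_FILE")
        or Path.home() / ".aws" / "credentials"
    )
    profile = env.get("AWS_PROFILE", "default")
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if parser.has_section(profile) and parser.has_option(profile, "aws_access_key_id"):
        return f'profile "{profile}" in {path}'
    return None


if __name__ == "__main__":
    found = find_aws_credentials()
    print(found or "no static credentials found for the installer")
```

Finding a credential this way says nothing about whether it is valid or has sufficient permissions; it only rules out the case where the installer has nothing to send.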

Comment 2 Devan Goodwin 2020-09-22 17:29:07 UTC
Abhinav: CCO docs don't really cover what the installer needs or doesn't, is there something in particular that is inaccurate in the CCO docs? Is this just a request to clarify in the manual mode docs in our repo, and product documentation, that while you can put the CCO into manual mode, you still need to provide the installer with a credential?

Comment 3 Yunfei Jiang 2020-09-23 15:56:38 UTC
I followed this document [1] to configure CCO in manual mode and install the cluster. This works in the `us-east-2` and `us-gov-west-1` regions; as I mentioned in the description, a cluster with the same configuration can be installed in us-gov-west-1 successfully.

As I understand it, `us-gov-west-1` and `us-gov-east-1` use the same IAM service, so if it works in `us-gov-west-1`, it should also work in `us-gov-east-1`.

>> FYI, Secrets I provided:

openshift-cloud-credential-operator/cloud-credential-operator-iam-ro-creds
                "iam:GetUserPolicy",
                "iam:GetUser",
                "iam:ListAccessKeys"

openshift-ingress-operator/cloud-credentials
                "elasticloadbalancing:DescribeLoadBalancers",
                "route53:ListHostedZones",
                "route53:ChangeResourceRecordSets",
                "tag:GetResources"

openshift-machine-api/aws-cloud-credentials
                "ec2:CreateTags",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeImages",
                "ec2:DescribeInstances",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:RunInstances",
                "ec2:TerminateInstances",
                "elasticloadbalancing:DescribeLoadBalancers",
                "elasticloadbalancing:DescribeTargetGroups",
                "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
                "elasticloadbalancing:RegisterTargets",
                "iam:PassRole",
                "iam:CreateServiceLinkedRole"

openshift-image-registry/installer-cloud-credentials
                "s3:CreateBucket",
                "s3:DeleteBucket",
                "s3:PutBucketTagging",
                "s3:GetBucketTagging",
                "s3:PutBucketPublicAccessBlock",
                "s3:GetBucketPublicAccessBlock",
                "s3:PutEncryptionConfiguration",
                "s3:GetEncryptionConfiguration",
                "s3:PutLifecycleConfiguration",
                "s3:GetLifecycleConfiguration",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:HeadBucket",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload"

openshift-cloud-credential-operator/cloud-credential-operator-s3
                "s3:PutObject",
                "s3:PutBucketTagging",
                "s3:CreateBucket",
                "s3:PutObjectAcl"

openshift-cluster-csi-drivers/ebs-cloud-credentials
                "ec2:DetachVolume",
                "ec2:AttachVolume",
                "ec2:ModifyVolume",
                "ec2:DeleteSnapshot",
                "ec2:DescribeInstances",
                "ec2:DeleteTags",
                "ec2:DescribeTags",
                "ec2:CreateTags",
                "ec2:DescribeVolumesModifications",
                "ec2:DescribeSnapshots",
                "ec2:CreateVolume",
                "ec2:DeleteVolume",
                "ec2:DescribeVolumes",
                "ec2:CreateSnapshot"

[1] https://github.com/openshift/cloud-credential-operator/blob/master/docs/mode-manual-creds.md

Comment 4 Yunfei Jiang 2020-09-24 10:44:43 UTC
Correction to the secrets listed in comment 3; sorry for the confusing info.

>> openshift-cloud-credential-operator/cloud-credential-operator-iam-ro-creds
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "iam:GetUserPolicy",
                "iam:GetUser",
                "iam:ListAccessKeys"
            ],
            "Resource": "*"
        }
    ]
}

>> openshift-ingress-operator/cloud-credentials
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "elasticloadbalancing:DescribeLoadBalancers",
                "route53:ListHostedZones",
                "route53:ChangeResourceRecordSets",
                "tag:GetResources"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

>> openshift-machine-api/aws-cloud-credentials
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeImages",
                "ec2:DescribeInstances",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:RunInstances",
                "ec2:TerminateInstances",
                "elasticloadbalancing:DescribeLoadBalancers",
                "elasticloadbalancing:DescribeTargetGroups",
                "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
                "elasticloadbalancing:RegisterTargets",
                "iam:PassRole",
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:GenerateDataKey",
                "kms:GenerateDataKeyWithoutPlainText",
                "kms:DescribeKey"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:RevokeGrant",
                "kms:CreateGrant",
                "kms:ListGrants"
            ],
            "Resource": "*",
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": true
                }
            }
        }
    ]
}

>> openshift-image-registry/installer-cloud-credentials
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:CreateBucket",
                "s3:DeleteBucket",
                "s3:PutBucketTagging",
                "s3:GetBucketTagging",
                "s3:PutBucketPublicAccessBlock",
                "s3:GetBucketPublicAccessBlock",
                "s3:PutEncryptionConfiguration",
                "s3:GetEncryptionConfiguration",
                "s3:PutLifecycleConfiguration",
                "s3:GetLifecycleConfiguration",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:HeadBucket",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

>> openshift-cloud-credential-operator/cloud-credential-operator-s3
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutBucketTagging",
                "s3:CreateBucket",
                "s3:PutObjectAcl"
            ],
            "Resource": "*"
        }
    ]
}


>> openshift-cluster-csi-drivers/ebs-cloud-credentials
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:DetachVolume",
                "ec2:AttachVolume",
                "ec2:ModifyVolume",
                "ec2:DeleteSnapshot",
                "ec2:DescribeInstances",
                "ec2:DeleteTags",
                "ec2:DescribeTags",
                "ec2:CreateTags",
                "ec2:DescribeVolumesModifications",
                "ec2:DescribeSnapshots",
                "ec2:CreateVolume",
                "ec2:DeleteVolume",
                "ec2:DescribeVolumes",
                "ec2:CreateSnapshot"
            ],
            "Resource": "*"
        }
    ]
}
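
To see which of the policies above grants a given action (for example `ec2:DescribeAvailabilityZones`, the call that fails in the description), a minimal sketch such as the following could be used. It is deliberately simplified and is not how IAM evaluates policies: it ignores `Deny` statements, `NotAction`, `Resource`, and `Condition`. The function name `allows_action` is hypothetical, and the sample policy is an excerpt of the machine-api policy from this comment.

```python
import json
from fnmatch import fnmatchcase


def allows_action(policy_json, action):
    """Return True if any Allow statement in the policy lists the action.

    Simplified check: Deny, NotAction, Resource and Condition are ignored.
    """
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # Action names are matched case-insensitively and may contain * wildcards.
        if any(fnmatchcase(action.lower(), pattern.lower()) for pattern in actions):
            return True
    return False


# Excerpt of the openshift-machine-api/aws-cloud-credentials policy.
machine_api_policy = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DescribeAvailabilityZones",
                "ec2:RunInstances"
            ],
            "Resource": "*"
        }
    ]
}
"""

print(allows_action(machine_api_policy, "ec2:DescribeAvailabilityZones"))
```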

Comment 5 Devan Goodwin 2020-09-24 11:29:38 UTC
If you were to take these credentials and:

1. configure them on a local workstation
2. configure the aws CLI to use a govcloud endpoint, I've never done this but maybe with --endpoint-url on the command below?
3. use the aws CLI to list availability zones: aws ec2 describe-availability-zones

Does this work in us-gov-east-1, us-gov-west-1, or both?

Comment 6 Devan Goodwin 2020-09-24 12:25:10 UTC
My suspicion here is that there is absolutely nothing we can fix in CCO for this. Either: 

(1) We are mistaken and us-gov-east-1 credentials can't be used in us-gov-west-1
(2) Credentials can be used in both but the AWS IAM simulation API is broken and tells you you can't, even though you actually can.

Neither can be fixed by CCO. If it's (2) then we can use the new force mode functionality in the InstallConfig and CloudCredential config that is present in latest 4.5 releases.

I don't think this qualifies as a 4.6 blocker unless more information becomes available so I'm going to take it off the list for now, but we will keep trying to get to the bottom of what Yunfei is experiencing.

Comment 7 Yunfei Jiang 2020-09-29 10:59:16 UTC
Devan,

I configured the machine API credential as my default AWS profile (export AWS_PROFILE=xxx).

Whether or not I specify the endpoint, the credentials work well in both the east and west regions:

>> aws --region us-gov-east-1 --endpoint-url https://ec2.us-gov-east-1.amazonaws.com ec2 describe-availability-zones
>> aws --region us-gov-east-1 ec2 describe-availability-zones

{
    "AvailabilityZones": [
        {
            "State": "available",
            "OptInStatus": "opt-in-not-required",
            "Messages": [],
            "RegionName": "us-gov-east-1",
            "ZoneName": "us-gov-east-1a",
            "ZoneId": "usge1-az1",
            "GroupName": "us-gov-east-1",
            "NetworkBorderGroup": "us-gov-east-1",
            "ZoneType": "availability-zone"
        },
        {
            "State": "available",
            "OptInStatus": "opt-in-not-required",
            "Messages": [],
            "RegionName": "us-gov-east-1",
            "ZoneName": "us-gov-east-1b",
            "ZoneId": "usge1-az2",
            "GroupName": "us-gov-east-1",
            "NetworkBorderGroup": "us-gov-east-1",
            "ZoneType": "availability-zone"
        },
        {
            "State": "available",
            "OptInStatus": "opt-in-not-required",
            "Messages": [],
            "RegionName": "us-gov-east-1",
            "ZoneName": "us-gov-east-1c",
            "ZoneId": "usge1-az3",
            "GroupName": "us-gov-east-1",
            "NetworkBorderGroup": "us-gov-east-1",
            "ZoneType": "availability-zone"
        }
    ]
}

>> aws --region us-gov-west-1 --endpoint-url https://ec2.us-gov-west-1.amazonaws.com ec2 describe-availability-zones
>> aws --region us-gov-west-1 ec2 describe-availability-zones
{
    "AvailabilityZones": [
        {
            "State": "available",
            "OptInStatus": "opt-in-not-required",
            "Messages": [],
            "RegionName": "us-gov-west-1",
            "ZoneName": "us-gov-west-1a",
            "ZoneId": "usgw1-az1",
            "GroupName": "us-gov-west-1",
            "NetworkBorderGroup": "us-gov-west-1",
            "ZoneType": "availability-zone"
        },
        {
            "State": "available",
            "OptInStatus": "opt-in-not-required",
            "Messages": [],
            "RegionName": "us-gov-west-1",
            "ZoneName": "us-gov-west-1b",
            "ZoneId": "usgw1-az2",
            "GroupName": "us-gov-west-1",
            "NetworkBorderGroup": "us-gov-west-1",
            "ZoneType": "availability-zone"
        },
        {
            "State": "available",
            "OptInStatus": "opt-in-not-required",
            "Messages": [],
            "RegionName": "us-gov-west-1",
            "ZoneName": "us-gov-west-1c",
            "ZoneId": "usgw1-az3",
            "GroupName": "us-gov-west-1",
            "NetworkBorderGroup": "us-gov-west-1",
            "ZoneType": "availability-zone"
        }
    ]
}
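
When comparing outputs like the two above across regions, a small helper can reduce each response to the zone names that matter. This is a minimal sketch that assumes the JSON shape shown in this comment; the function name `available_zone_names` and the trimmed sample are for illustration only.

```python
import json


def available_zone_names(describe_output, region):
    """List the available zone names for a region from describe-availability-zones JSON."""
    data = json.loads(describe_output)
    return sorted(
        zone["ZoneName"]
        for zone in data.get("AvailabilityZones", [])
        if zone.get("RegionName") == region and zone.get("State") == "available"
    )


# Trimmed sample in the same shape as the CLI output above.
sample = """
{
    "AvailabilityZones": [
        {"State": "available", "RegionName": "us-gov-east-1", "ZoneName": "us-gov-east-1b"},
        {"State": "available", "RegionName": "us-gov-east-1", "ZoneName": "us-gov-east-1a"}
    ]
}
"""

print(available_zone_names(sample, "us-gov-east-1"))
```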

>> but if I specify an incorrect endpoint or region:
aws --region us-gov-east-1 --endpoint-url https://ec2.us-gov-west-1.amazonaws.com ec2 describe-availability-zones

An error occurred (AuthFailure) when calling the DescribeAvailabilityZones operation: AWS was not able to validate the provided access credentials


We configure CCO in manual mode when installing disconnected + private clusters, which is an important configuration for customers. It works in the us-gov-west-1 region but fails in the east region.
I think it would be better to fix this issue before 4.6 GA; otherwise, we need to document it in the release notes.

Comment 8 Devan Goodwin 2020-09-29 11:13:47 UTC
Yunfei when you said if you specified an incorrect endpoint, your command looks identical to the one that was working above: aws --region us-gov-east-1 --endpoint-url https://ec2.us-gov-west-1.amazonaws.com ec2 describe-availability-zones

But did you mean that if you change the region or the endpoint in this command above to something invalid, you get the same error message you're seeing in this attempt to install?

Could you walk through the precise steps you take to install a cluster into us-gov-east-1? I'm wondering if there could have been a typo in the region or endpoint you configured when installing the cluster?

Comment 9 Devan Goodwin 2020-09-29 11:39:38 UTC
Just noticed the commands you provided did differ slightly, first time I couldn't see it.

Working: aws --region us-gov-east-1 --endpoint-url https://ec2.us-gov-east-1.amazonaws.com ec2 describe-availability-zones
Failing: aws --region us-gov-east-1 --endpoint-url https://ec2.us-gov-west-1.amazonaws.com ec2 describe-availability-zones

So mismatching the region and the endpoint URL seems to simulate the failure you're getting in the install? Is it possible this was what happened in the installconfig?
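
The working and failing commands differ only in the region embedded in the endpoint URL, so a mismatch can be detected before making any API call. The following is a minimal sketch, assuming endpoints follow the `https://<service>.<region>.amazonaws.com` pattern used in the commands above; the function names are hypothetical and this is not how the installer or the AWS SDK resolves endpoints.

```python
from urllib.parse import urlparse


def endpoint_region(endpoint_url, service="ec2"):
    """Extract the region from a URL of the form https://<service>.<region>.amazonaws.com.

    Returns None when the host does not follow that pattern.
    """
    host = urlparse(endpoint_url).hostname or ""
    prefix, suffix = service + ".", ".amazonaws.com"
    if host.startswith(prefix) and host.endswith(suffix):
        # An empty slice (e.g. host "ec2.amazonaws.com") means no region is present.
        return host[len(prefix):-len(suffix)] or None
    return None


def region_matches_endpoint(region, endpoint_url, service="ec2"):
    """Return True when the requested region agrees with the one in the endpoint URL."""
    return endpoint_region(endpoint_url, service) == region


for url in (
    "https://ec2.us-gov-east-1.amazonaws.com",
    "https://ec2.us-gov-west-1.amazonaws.com",
):
    print(url, region_matches_endpoint("us-gov-east-1", url))
```

A request for us-gov-east-1 sent to the us-gov-west-1 endpoint is the combination that produced the AuthFailure in comment 7.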

Comment 10 Yunfei Jiang 2020-09-29 12:10:13 UTC
(In reply to Devan Goodwin from comment #9)
> Just noticed the commands you provided did differ slightly, first time I
> couldn't see it.
> 
> Working: aws --region us-gov-east-1 --endpoint-url
> https://ec2.us-gov-east-1.amazonaws.com ec2 describe-availability-zones
> Failing: aws --region us-gov-east-1 --endpoint-url
> https://ec2.us-gov-west-1.amazonaws.com ec2 describe-availability-zones
> 
> So mismatching the region and the endpoint URL seems to simulate the failure
> you're getting in the install? 

Yes, mismatching the endpoint and region can cause the above error.

>> Is it possible this was what happened in the installconfig?
I did not specify any endpoints in install-config; I only set the region to us-gov-east-1 or us-gov-west-1.

Comment 11 Devan Goodwin 2020-09-29 12:27:01 UTC
Ok possible clue, the Installer which vendors github.com/aws/aws-sdk-go v1.32.3 has some defaults for region to endpoint:

https://github.com/openshift/installer/blob/master/vendor/github.com/aws/aws-sdk-go/aws/endpoints/defaults.go#L7328

cloud-credential operator which vendors github.com/aws/aws-sdk-go v1.30.5 does not have this section: 

https://github.com/openshift/cloud-credential-operator/blob/master/vendor/github.com/aws/aws-sdk-go/aws/endpoints/defaults.go

Comment 12 Devan Goodwin 2020-09-29 12:55:57 UTC
Abhinav I think this was prematurely sent to us, with the use of manual mode it might initially look like Yunfei did not give the installer a valid credential, however I don't think this was the case. It looks like:

- Yunfei's credential is good and working in us-gov-west-1.
- A credential good for us-gov-west-1 should also work in us-gov-east-1.
- The failing code is the installer's AZ lookup probably for generating MachineSets.

I don't know if it's related for sure, but Yunfei can reproduce the error message by doing an aws describe AZs command and mismatching the region with the endpoint, i.e.: aws --region us-gov-east-1 --endpoint-url https://ec2.us-gov-west-1.amazonaws.com ec2 describe-availability-zones

This may be a red herring, but is it possible the sdk is somehow mapping the given region us-gov-east-1 to the wrong endpoint? It does look like this mapping is done internally in AWS sdk.

So it looks like there may be some kind of an issue here but at this point, it's surfacing in the installer, and we have not yet made it to any CCO code. Documentation doesn't seem to be the issue here at this point and I suspect it may have nothing to do with manual mode.

Comment 13 Abhinav Dahiya 2020-09-29 17:14:18 UTC
Here's how I created ignition configs for us-gov-east-1 to test the failure in creating machinesets.

1. Create an install config
```
$ yq m -CP -x aws-install-config.yaml elide-install-config.yaml
apiVersion: v1
baseDomain: devcluster.openshift.com
metadata:
  name: adahiya-2
platform:
  aws:
    region: us-gov-east-1
    amiID: ami-96c6f8f7
publish: Internal
pullSecret: ""
sshKey: ""

```

2. Create ignition configs

```
$ (rm -rf dev && mkdir -p dev && cp aws-install-config.yaml dev/install-config.yaml) && OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE= ./bin/openshift-install --dir dev create ignition-configs
INFO Consuming Install Config from target directory
INFO Credentials loaded from the "govcloud" profile in file "/home/adahiya/.aws/credentials"
INFO Ignition-Configs created in: dev and dev/auth
```

So there is no bug in the installer when using the default mode in us-gov-east-1.

I moved this to the CCO team because the reporter pointed to the docs in the CCO repo, and those docs do not make it clear that credentialsMode manual applies only to cluster operators: the installer still needs to be provided with appropriate credentials to communicate with the AWS API, as in any other workflow.

@yunjiang

- Can you provide reproducer steps?
- Also include the .openshift_install.log
  and the install-config.yaml that was used.

Comment 15 Yunfei Jiang 2020-09-30 09:40:53 UTC
Created attachment 1717799 [details]
comment 14 install log

Comment 16 Yunfei Jiang 2020-09-30 09:41:39 UTC
Created attachment 1717800 [details]
comment 14 VPC cloudformation template

Comment 23 Yunfei Jiang 2020-10-13 00:43:49 UTC
Hello Brandi,

Thanks for the update, LGTM.

Comment 24 Andrew McDermott 2020-10-23 16:07:13 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 25 Andrew McDermott 2020-11-03 17:36:48 UTC
Needinfo answered in comment #22.

Comment 26 Daneyon Hansen 2020-11-12 16:50:55 UTC
Closing since https://github.com/openshift/openshift-docs/pull/26300 has merged.

Comment 28 Daneyon Hansen 2020-12-07 17:48:37 UTC
I’m reassigning to the docs team to address https://bugzilla.redhat.com/show_bug.cgi?id=1881262#c27.

Comment 29 Yunfei Jiang 2021-02-05 08:31:37 UTC
After bug 1892129, bug 1921901, and bug 1903226 were fixed, I revisited this issue, went through the whole process, and fixed an automation issue. A disconnected and private cluster can now be installed successfully in us-gov-east-1.
Verified versions: 4.6.0-0.nightly-2021-02-04-203135 and 4.7.0-0.nightly-2021-02-05-005950