Bug 1935058

Summary: Can’t finish install sts clusters on aws government region
Product: OpenShift Container Platform Reporter: wang lin <lwan>
Component: Networking Assignee: Ryan Fredette <rfredette>
Networking sub component: DNS QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: amcdermo, aos-bugs, arane, jdiaz, jrouth, lwan, mmasters, yunjiang
Version: 4.7 Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Last Closed: 2021-07-27 22:49:27 UTC Type: Bug
Attachments:
Description Flags
must-gather none

Description wang lin 2021-03-04 10:27:51 UTC
Description of problem:
When installing an STS cluster in an AWS government (private) region such as us-gov-west-1, the installation fails during cluster initialization. Several cluster operators are Degraded. The ingress operator pod logs the errors below: it can't find the OpenIDConnect provider.

Logs from ingress-operator
": "failed to create DNS provider: failed to create AWS DNS manager: failed to validate aws provider service endpoints: [failed to list route53 hosted zones: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 4684b572-3416-4bbe-b725-b1a120548975, failed to describe elb load balancers: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 40bcd677-ac53-42b7-95a1-c737e6d2b52c, failed to describe elbv2 load balancers: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 496b9444-97ca-474d-a036-4e7a899d1543, failed to get group tagging resources: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 6f35fd63-9b7d-4e11-8a3d-f65982f5f121]"}

However, I checked in AWS, and the OpenIDConnect provider does exist.
### OpenIDConnect provider
$ aws iam get-open-id-connect-provider --open-id-connect-provider-arn arn:aws-us-gov:iam::225746144451:oidc-provider/lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com --region us-gov-west-1
{
    "Url": "lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com",
    "ClientIDList": [
        "openshift"
    ],
    "ThumbprintList": [
        "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    ],
    "CreateDate": "2021-03-03T12:14:08.162Z"
}
### Role for the ingress component
$ aws iam get-role --role-name lwanguvsts-h9gkf_ingress --region us-gov-west-1
{
    "Role": {
        "Path": "/",
        "RoleName": "lwanguvsts-h9gkf_ingress",
        "RoleId": "AROATJD4EZTB5QU4GHACC",
        "Arn": "arn:aws-us-gov:iam::225746144451:role/lwanguvsts-h9gkf_ingress",
        "CreateDate": "2021-03-03T12:14:48Z",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws-us-gov:iam::225746144451:oidc-provider/lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com:sub": "system:serviceaccount:openshift-ingress-operator:ingress-operator"
                        }
                    }
                }
            ]
        },
        "MaxSessionDuration": 3600,
        "Tags": [
            {
                "Key": "openshift_creationDate",
                "Value": "2021-03-03T12:35:24.009190+00:00"
            }
        ]
    }
}
### Secret for the ingress component
[default]
role_arn = arn:aws-us-gov:iam::225746144451:role/lwanguvsts-h9gkf_ingress
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
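
As a consistency check on the output above, the expected OIDC provider ARN can be derived from the role ARN and the issuer URL in the error message. This is a minimal sketch and not part of the original report: `oidc_provider_arn` is a hypothetical helper name, and it assumes the role ARN has the standard `arn:<partition>:iam::<account>:role/<name>` layout.

```shell
# Hypothetical helper (not from the report): build the OIDC provider ARN
# that a role's trust policy is expected to reference.
# $1 = role ARN, $2 = issuer URL taken from the error message.
oidc_provider_arn() (
    role_arn=$1
    issuer_url=$2
    # ARN fields are colon-separated: arn:<partition>:iam:<region>:<account>:<resource>
    partition=$(printf '%s\n' "$role_arn" | cut -d: -f2)
    account=$(printf '%s\n' "$role_arn" | cut -d: -f5)
    # The provider ARN uses the issuer host without the https:// scheme.
    host=${issuer_url#https://}
    printf 'arn:%s:iam::%s:oidc-provider/%s\n' "$partition" "$account" "$host"
)
```

With the role ARN and issuer URL from this report, the result matches the `Federated` principal in the trust policy, including the `aws-us-gov` partition, which suggests the IAM side is configured consistently.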

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-03-01-085007

How reproducible:
always

Steps to Reproduce:
1. Follow the docs to install an STS cluster: https://deploy-preview-29545--osdocs.netlify.app/openshift-enterprise/latest/authentication/managing_cloud_provider_credentials/cco-mode-sts.html#sts-mode-installing-manual-config
2. Choose a government region such as us-gov-west-1.

Actual results:
The installation failed on cluster initialization.

Expected results:
The installation should complete successfully.

Additional info:
STS service endpoint has been activated in us-gov-west-1.
The image registry operator can successfully assume the role of the government region.

Comment 3 wang lin 2021-05-07 09:09:13 UTC
An STS cluster can't be installed successfully in the gov region, so I added the TestBlocker keyword.

Comment 4 Joel Diaz 2021-05-19 14:59:12 UTC
From the BZ comment of "The image registry operator can successfully assume the role of the government region." it appears that creds are otherwise working, and this may be specific to what or how the ingress operator is interacting with AWS. Moving BZ.

Comment 5 wang lin 2021-05-20 08:24:31 UTC
Created attachment 1785068
must-gather

Uploaded the must-gather info.

Comment 6 Joel Diaz 2021-05-20 16:32:42 UTC
FWIW, I copied out the ServiceAccount token and was able to use the credentials to interact with AWS w/o issue. This worked as long as I explicitly set the region (which is unsurprising as the AWS gov endpoints are different than the global AWS endpoints).

Save the token contents locally (they expire every hour):
oc rsh -n openshift-ingress-operator deployment/ingress-operator cat /var/run/secrets/openshift/serviceaccount/token > ingresstoken

export AWS_WEB_IDENTITY_TOKEN_FILE=/path/to/ingresstoken

Set AWS_ROLE_ARN to the role_arn value from the output of:
oc get secret -n openshift-ingress-operator cloud-credentials -o json | jq -r .data.credentials | base64 -d

Now you can interact with AWS GovCloud.
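
The steps above can be sketched as a small script. This is an illustration under assumptions: `role_arn_from_credentials` is a hypothetical helper, it assumes the credentials file holds a single profile, and the commented usage relies on the `oc`, `jq`, and `aws` commands shown in this comment with a logged-in `oc` session.

```shell
# Hypothetical helper: read an AWS shared-credentials file on stdin and
# print the value of its role_arn key.
role_arn_from_credentials() {
    sed -n 's/^role_arn[[:space:]]*=[[:space:]]*//p'
}

# Usage (commented out because it needs cluster and AWS access):
# oc rsh -n openshift-ingress-operator deployment/ingress-operator \
#     cat /var/run/secrets/openshift/serviceaccount/token > ingresstoken
# export AWS_WEB_IDENTITY_TOKEN_FILE="$PWD/ingresstoken"
# AWS_ROLE_ARN=$(oc get secret -n openshift-ingress-operator cloud-credentials -o json \
#     | jq -r .data.credentials | base64 -d | role_arn_from_credentials)
# export AWS_ROLE_ARN
# aws route53 list-hosted-zones --region us-gov-west-1
```

The token expires every hour, so the first command has to be rerun before a later session.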

Here's my specific local output:
[jdiazrh@fedaio ~]$ env | grep AWS
AWS_ROLE_ARN=arn:aws-us-gov:iam::211567136888:role/jdiaz-gov2-openshift-ingress-operator-cloud-credentials
AWS_WEB_IDENTITY_TOKEN_FILE=/home/jdiazrh/ingresstoken
[jdiazrh@fedaio ~]$ aws route53 list-hosted-zones --region us-gov-west-1
{
    "HostedZones": [
        {
            "Id": "/hostedzone/Z05310543CY43718KAY6T",
            "Name": "jdiaz-gov.jdiaz.example.com.",
            "CallerReference": "terraform-20210519210348318100000003",
            "Config": {
                "Comment": "Managed by Terraform",
                "PrivateZone": true
            },
            "ResourceRecordSetCount": 4
        },
        {
            "Id": "/hostedzone/Z05566831WVXSZSXIWCGQ",
            "Name": "jdiaz-gov.jdiaz.example.com.",
            "CallerReference": "terraform-20210520153320594700000004",
            "Config": {
                "Comment": "Managed by Terraform",
                "PrivateZone": true
            },
            "ResourceRecordSetCount": 4
        }
    ]
}

But if you fail to set the region you get all kinds of problems:

[jdiazrh@fedaio ~]$ aws route53 list-hosted-zones 

An error occurred (InvalidClientTokenId) when calling the ListHostedZones operation: The security token included in the request is invalid
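
One likely reading of this failure: GovCloud is a separate AWS partition, so a request made without a GovCloud region may be sent to a commercial endpoint that does not recognize these credentials. The sketch below only illustrates the STS endpoint naming convention. `sts_endpoint` is a hypothetical helper, it covers the commercial, GovCloud, and China naming patterns as I understand them, and it does not handle other partitions.

```shell
# Hypothetical helper: print the STS hostname a client would be expected
# to use for a given region. An empty region falls back to the global
# (commercial) endpoint.
sts_endpoint() {
    case "$1" in
        "")   printf '%s\n' "sts.amazonaws.com" ;;
        cn-*) printf '%s\n' "sts.$1.amazonaws.com.cn" ;;
        *)    printf '%s\n' "sts.$1.amazonaws.com" ;;
    esac
}
```

For `us-gov-west-1` this yields a regional GovCloud hostname, while an empty region yields the global commercial one, which is consistent with the two outputs above.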

Comment 8 Ryan Fredette 2021-05-20 21:22:35 UTC
The ingress operator is definitely setting the region. This log message comes from the same codepath that's eventually erroring out:

> 2021-05-20T07:45:59.280695252Z 2021-05-20T07:45:59.280Z INFO    operator.dns    dns/controller.go:531   using region from operator config       {"region name": "us-gov-west-1"}

jdiaz, when you copied out the ServiceAccount token, did you also see this ingress operator issue? If we have the region correct and the credentials file is right, I'm not sure what else could be causing the operator's auth to fail.

Comment 9 Joel Diaz 2021-05-21 13:37:32 UTC
Yes, I did see the reported issues in the ingress-operator pod logs. Looking through the ingress-operator AWS client setup, I didn't catch anything that looked incorrect, but clearly something isn't working. You can look at the image-registry operator to compare how it sets up its AWS client: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/s3/s3.go#L181

Comment 10 Miciah Dashiel Butler Masters 2021-06-09 14:37:17 UTC
Removing blocker status as GovCloud+STS isn't considered a blocker.  The proposed fix is in the merge queue but hitting lots of CI flakes.  We may drop it from the release if we don't see progress today.

Comment 12 Hongan Li 2021-06-11 09:44:29 UTC
The PR has been merged into 4.8.0-0.nightly-2021-06-10-224448.

Tested with 4.8.0-0.nightly-2021-06-11-024306 and ingress is OK now, although some other operators are still abnormal after the installation.

$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.nightly-2021-06-11-024306   True        False         False      62m

$ oc -n openshift-ingress-operator get secret/cloud-credentials -o json | jq -r .data.credentials | base64 -d
[default]
role_arn = arn:aws-us-gov:iam::123456789:role/a-lwangov0611-xxxx-openshift-ingress-operator-cloud-c
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token

I'm moving it to verified. Please reopen if you still see the issue.

Comment 18 errata-xmlrpc 2021-07-27 22:49:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438