Bug 1935058 - Can’t finish install sts clusters on aws government region
Summary: Can’t finish install sts clusters on aws government region
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Ryan Fredette
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-04 10:27 UTC by wang lin
Modified: 2022-08-04 22:39 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:49:27 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather (19.17 MB, application/gzip)
2021-05-20 08:24 UTC, wang lin


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 623 0 None open Bug 1935058: Set AWS session region 2021-06-03 18:49:33 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:51:00 UTC

Description wang lin 2021-03-04 10:27:51 UTC
Description of problem:
When installing an STS cluster in an AWS government region (private) such as us-gov-west-1, the installation fails during cluster initialization and several cluster operators become Degraded. The ingress operator pod logs the errors below: it can't find the OpenIDConnect provider.

Logs from ingress-operator
": "failed to create DNS provider: failed to create AWS DNS manager: failed to validate aws provider service endpoints: [failed to list route53 hosted zones: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 4684b572-3416-4bbe-b725-b1a120548975, failed to describe elb load balancers: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 40bcd677-ac53-42b7-95a1-c737e6d2b52c, failed to describe elbv2 load balancers: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 496b9444-97ca-474d-a036-4e7a899d1543, failed to get group tagging resources: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 6f35fd63-9b7d-4e11-8a3d-f65982f5f121]"}

However, when I checked in AWS, the OpenIDConnect provider does exist.
###OpenIDConnect provider
$ aws iam get-open-id-connect-provider --open-id-connect-provider-arn arn:aws-us-gov:iam::225746144451:oidc-provider/lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com --region us-gov-west-1
{
    "Url": "lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com",
    "ClientIDList": [
        "openshift"
    ],
    "ThumbprintList": [
        "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    ],
    "CreateDate": "2021-03-03T12:14:08.162Z"
}
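
One way to rule out a simple naming mismatch is to compare the issuer URL quoted in the ingress-operator error with the provider ARN passed to `aws iam get-open-id-connect-provider`. The following is a minimal Python sketch under stated assumptions: the helper names `issuer_host` and `provider_host` are hypothetical, the code only compares strings, and it does not call AWS or consider any path component of the issuer URL.

```python
import re
from urllib.parse import urlparse


def issuer_host(error_text):
    """Return the host of the first OIDC issuer URL named in the error text."""
    # The log stores newlines as a literal backslash-n, so stop at a backslash too.
    match = re.search(
        r"No OpenIDConnect provider found in your account for (https://[^\s\\]+)",
        error_text,
    )
    if match is None:
        raise ValueError("no issuer URL found in error text")
    return urlparse(match.group(1)).netloc


def provider_host(provider_arn):
    """Return the text after 'oidc-provider/' in an IAM OIDC provider ARN."""
    marker = ":oidc-provider/"
    if marker not in provider_arn:
        raise ValueError(f"not an OIDC provider ARN: {provider_arn!r}")
    return provider_arn.split(marker, 1)[1]
```

If both helpers return the same host for the error and the ARN from this report, the provider name itself is not the problem and the lookup is presumably being made somewhere the provider is not registered.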
####roles for ingress component
$ aws iam get-role --role-name lwanguvsts-h9gkf_ingress --region us-gov-west-1
{
    "Role": {
        "Path": "/",
        "RoleName": "lwanguvsts-h9gkf_ingress",
        "RoleId": "AROATJD4EZTB5QU4GHACC",
        "Arn": "arn:aws-us-gov:iam::225746144451:role/lwanguvsts-h9gkf_ingress",
        "CreateDate": "2021-03-03T12:14:48Z",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws-us-gov:iam::225746144451:oidc-provider/lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com:sub": "system:serviceaccount:openshift-ingress-operator:ingress-operator"
                        }
                    }
                }
            ]
        },
        "MaxSessionDuration": 3600,
        "Tags": [
            {
                "Key": "openshift_creationDate",
                "Value": "2021-03-03T12:35:24.009190+00:00"
            }
        ]
    }
}
####secret for ingress component
[default]
role_arn = arn:aws-us-gov:iam::225746144451:role/lwanguvsts-h9gkf_ingress
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
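
The secret content is an INI-style AWS credentials file, and the role ARN in it carries the partition (`aws-us-gov` here) as its second field. As a rough illustration, the Python sketch below parses such a file and splits the ARN into its fields; the function name `role_arn_parts` is hypothetical and the code is not taken from any operator.

```python
import configparser


def role_arn_parts(credentials_text, profile="default"):
    """Parse an AWS credentials file and split its role_arn into ARN fields."""
    # Only '=' is treated as a delimiter so that the colons in the ARN are kept.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(credentials_text)
    arn = parser[profile]["role_arn"]

    # ARN layout: arn:partition:service:region:account-id:resource
    fields = arn.split(":", 5)
    if len(fields) != 6 or fields[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    _, partition, service, region, account, resource = fields
    return {
        "partition": partition,
        "service": service,
        "region": region,  # empty for IAM, which is not region-scoped
        "account": account,
        "resource": resource,
    }
```

For the secret shown above this yields the partition `aws-us-gov` and an empty region field, so the ARN alone does not tell a client which regional endpoint to use.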

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-03-01-085007

How reproducible:
always

Steps to Reproduce: 
1. Follow the docs to install an STS cluster: https://deploy-preview-29545--osdocs.netlify.app/openshift-enterprise/latest/authentication/managing_cloud_provider_credentials/cco-mode-sts.html#sts-mode-installing-manual-config
2. Choose a government region such as us-gov-west-1.

Actual results:
The installation failed on cluster initialization.

Expected results:
The installation should complete successfully.

Additional info:
STS service endpoint has been activated in us-gov-west-1.
The image registry operator can successfully assume the role of the government region.

Comment 3 wang lin 2021-05-07 09:09:13 UTC
Can't install an STS cluster successfully in a gov region, so I added the testblocker keyword.

Comment 4 Joel Diaz 2021-05-19 14:59:12 UTC
From the BZ comment of "The image registry operator can successfully assume the role of the government region." it appears that creds are otherwise working, and this may be specific to what or how the ingress operator is interacting with AWS. Moving BZ.

Comment 5 wang lin 2021-05-20 08:24:31 UTC
Created attachment 1785068 [details]
must-gather

Upload must-gather info

Comment 6 Joel Diaz 2021-05-20 16:32:42 UTC
FWIW, I copied out the ServiceAccount token and was able to use the credentials to interact with AWS w/o issue. This worked as long as I explicitly set the region (which is unsurprising as the AWS gov endpoints are different than the global AWS endpoints).

Save the token contents locally (they expire every hour):
oc rsh -n openshift-ingress-operator deployment/ingress-operator cat /var/run/secrets/openshift/serviceaccount/token > ingresstoken

export AWS_WEB_IDENTITY_TOKEN_FILE=/path/to/ingresstoken

Set AWS_ROLE_ARN to the ARN from 'oc get secret -n openshift-ingress-operator cloud-credentials -o json | jq -r .data.credentials | base64 -d'

Now you can interact with AWS Govcloud.

Here's my specific local output:
[jdiazrh@fedaio ~]$ env | grep AWS
AWS_ROLE_ARN=arn:aws-us-gov:iam::211567136888:role/jdiaz-gov2-openshift-ingress-operator-cloud-credentials
AWS_WEB_IDENTITY_TOKEN_FILE=/home/jdiazrh/ingresstoken
[jdiazrh@fedaio ~]$ aws route53 list-hosted-zones --region us-gov-west-1
{
    "HostedZones": [
        {
            "Id": "/hostedzone/Z05310543CY43718KAY6T",
            "Name": "jdiaz-gov.jdiaz.example.com.",
            "CallerReference": "terraform-20210519210348318100000003",
            "Config": {
                "Comment": "Managed by Terraform",
                "PrivateZone": true
            },
            "ResourceRecordSetCount": 4
        },
        {
            "Id": "/hostedzone/Z05566831WVXSZSXIWCGQ",
            "Name": "jdiaz-gov.jdiaz.example.com.",
            "CallerReference": "terraform-20210520153320594700000004",
            "Config": {
                "Comment": "Managed by Terraform",
                "PrivateZone": true
            },
            "ResourceRecordSetCount": 4
        }
    ]
}

But if you fail to set the region you get all kinds of problems:

[jdiazrh@fedaio ~]$ aws route53 list-hosted-zones 

An error occurred (InvalidClientTokenId) when calling the ListHostedZones operation: The security token included in the request is invalid
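
To illustrate why the region matters here, the Python sketch below maps a region name to its AWS partition and to the STS endpoint a client would typically target. This is a simplified assumption-laden illustration, not how the AWS CLI or SDKs resolve endpoints (they use their own endpoint metadata), and it only covers the `aws`, `aws-us-gov` and `aws-cn` partitions. The function names are hypothetical.

```python
def partition_for_region(region):
    """Guess the AWS partition from a region name (simplified prefix check)."""
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    if region.startswith("cn-"):
        return "aws-cn"
    return "aws"


def sts_endpoint(region=None):
    """Return the STS endpoint URL for a region, or the global one if unset."""
    if not region:
        # With no region, a client may fall back to the global endpoint,
        # which belongs to the commercial 'aws' partition.
        return "https://sts.amazonaws.com"
    if partition_for_region(region) == "aws-cn":
        suffix = "amazonaws.com.cn"
    else:
        suffix = "amazonaws.com"
    return f"https://sts.{region}.{suffix}"
```

With `us-gov-west-1` the sketch returns `https://sts.us-gov-west-1.amazonaws.com`; with no region it returns the global endpoint, where GovCloud roles and OIDC providers are not known. That would be consistent with the errors in this report, although the thread itself does not spell out the root cause beyond the linked pull request title, "Set AWS session region".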

Comment 8 Ryan Fredette 2021-05-20 21:22:35 UTC
The ingress operator is definitely setting the region. This log message comes from the same codepath that's eventually erroring out:

> 2021-05-20T07:45:59.280695252Z 2021-05-20T07:45:59.280Z INFO    operator.dns    dns/controller.go:531   using region from operator config       {"region name": "us-gov-west-1"}

jdiaz, when you copied out the serviceaccount token, did you also see this ingress operator issue? If we have the region correct, and the credentials file is right, I'm not sure what else could be causing the operator's auth to fail.

Comment 9 Joel Diaz 2021-05-21 13:37:32 UTC
Yes, I did see the reported issues in the ingress-operator pod logs. Looking through the ingress-operator AWS client setup, I didn't catch anything that looked incorrect, but clearly something isn't working. You can look at the image-registry to compare how they set up their AWS client https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/s3/s3.go#L181

Comment 10 Miciah Dashiel Butler Masters 2021-06-09 14:37:17 UTC
Removing blocker status as GovCloud+STS isn't considered a blocker.  The proposed fix is in the merge queue but hitting lots of CI flakes.  We may drop it from the release if we don't see progress today.

Comment 12 Hongan Li 2021-06-11 09:44:29 UTC
The PR has been merged into 4.8.0-0.nightly-2021-06-10-224448.

Tested with 4.8.0-0.nightly-2021-06-11-024306 and ingress is OK now, although some other operators are still abnormal after the installation.

$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.nightly-2021-06-11-024306   True        False         False      62m

$ oc -n openshift-ingress-operator get secret/cloud-credentials -o json | jq -r .data.credentials | base64 -d
[default]
role_arn = arn:aws-us-gov:iam::123456789:role/a-lwangov0611-xxxx-openshift-ingress-operator-cloud-c
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token

I'm moving it to verified. Please reopen if you still see the issue.

Comment 18 errata-xmlrpc 2021-07-27 22:49:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

