Description of problem:
When installing an STS cluster in an AWS government (private) region such as us-gov-west-1, the installation fails during cluster initialization and several cluster operators are Degraded. The ingress operator pod logs below show that it can't find the OpenIDConnect provider.

Logs from ingress-operator:
": "failed to create DNS provider: failed to create AWS DNS manager: failed to validate aws provider service endpoints: [failed to list route53 hosted zones: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 4684b572-3416-4bbe-b725-b1a120548975, failed to describe elb load balancers: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 40bcd677-ac53-42b7-95a1-c737e6d2b52c, failed to describe elbv2 load balancers: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 496b9444-97ca-474d-a036-4e7a899d1543, failed to get group tagging resources: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com\n\tstatus code: 400, request id: 6f35fd63-9b7d-4e11-8a3d-f65982f5f121]"}

However, I checked in AWS and the OpenIDConnect provider does exist.
###OpenIDConnect provider
$ aws iam get-open-id-connect-provider --open-id-connect-provider-arn arn:aws-us-gov:iam::225746144451:oidc-provider/lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com --region us-gov-west-1
{
    "Url": "lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com",
    "ClientIDList": [
        "openshift"
    ],
    "ThumbprintList": [
        "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    ],
    "CreateDate": "2021-03-03T12:14:08.162Z"
}

####roles for ingress component
$ aws iam get-role --role-name lwanguvsts-h9gkf_ingress --region us-gov-west-1
{
    "Role": {
        "Path": "/",
        "RoleName": "lwanguvsts-h9gkf_ingress",
        "RoleId": "AROATJD4EZTB5QU4GHACC",
        "Arn": "arn:aws-us-gov:iam::225746144451:role/lwanguvsts-h9gkf_ingress",
        "CreateDate": "2021-03-03T12:14:48Z",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws-us-gov:iam::225746144451:oidc-provider/lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "lwanguvsts-h9gkf-oidc.s3.us-gov-west-1.amazonaws.com:sub": "system:serviceaccount:openshift-ingress-operator:ingress-operator"
                        }
                    }
                }
            ]
        },
        "MaxSessionDuration": 3600,
        "Tags": [
            {
                "Key": "openshift_creationDate",
                "Value": "2021-03-03T12:35:24.009190+00:00"
            }
        ]
    }
}

####secret for ingress component
[default]
role_arn = arn:aws-us-gov:iam::225746144451:role/lwanguvsts-h9gkf_ingress
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-03-01-085007

How reproducible:
always

Steps to Reproduce:
1. Follow the docs to install an STS cluster: https://deploy-preview-29545--osdocs.netlify.app/openshift-enterprise/latest/authentication/managing_cloud_provider_credentials/cco-mode-sts.html#sts-mode-installing-manual-config
2. Choose a government region such as us-gov-west-1

Actual results:
The installation failed on cluster initialization.
Expected results:
The installation should succeed.

Additional info:
The STS service endpoint has been activated in us-gov-west-1. The image registry operator can successfully assume its role in the government region.
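A quick sanity check on the credentials secret shown in the description is whether the role ARN sits in the partition expected for the target region. The following is a minimal Python sketch of that check, not anything the operators themselves run. `partition_for_region` and `parse_credentials` are hypothetical helper names, and the region-prefix table is an assumption that only covers the GovCloud and China partitions.

```python
import configparser

# Assumption: only the non-default partitions listed here are considered;
# every other region is treated as the standard "aws" partition.
PARTITION_BY_REGION_PREFIX = {
    "us-gov-": "aws-us-gov",
    "cn-": "aws-cn",
}


def partition_for_region(region: str) -> str:
    """Return the ARN partition expected for a region name."""
    for prefix, partition in PARTITION_BY_REGION_PREFIX.items():
        if region.startswith(prefix):
            return partition
    return "aws"


def parse_credentials(credentials_text: str, profile: str = "default") -> dict:
    """Parse the INI-style cloud-credentials content and split the role ARN."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    section = parser[profile]
    arn = section["role_arn"]
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return {
        "partition": parts[1],
        "service": parts[2],
        "account": parts[4],
        "resource": parts[5],
        "token_file": section["web_identity_token_file"],
    }


# Credentials content copied from the ingress secret in the description.
CREDENTIALS = """\
[default]
role_arn = arn:aws-us-gov:iam::225746144451:role/lwanguvsts-h9gkf_ingress
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
"""

info = parse_credentials(CREDENTIALS)
expected = partition_for_region("us-gov-west-1")
print("ARN partition:", info["partition"], "| expected for region:", expected)
```

For the secret in this report the two values agree, which is consistent with the observation that the IAM objects themselves look correct.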
An STS cluster can't be installed successfully in a gov region, so I added the testblocker keyword.
From the BZ comment of "The image registry operator can successfully assume the role of the government region." it appears that creds are otherwise working, and this may be specific to what or how the ingress operator is interacting with AWS. Moving BZ.
Created attachment 1785068
must-gather

Uploaded must-gather info.
FWIW, I copied out the ServiceAccount token and was able to use the credentials to interact with AWS w/o issue. This worked as long as I explicitly set the region (which is unsurprising, as the AWS gov endpoints are different from the global AWS endpoints).

Save the token contents locally (they expire every hour):

oc rsh -n openshift-ingress-operator deployment/ingress-operator cat /var/run/secrets/openshift/serviceaccount/token > ingresstoken

export AWS_WEB_IDENTITY_TOKEN_FILE=/path/to/ingresstoken

Set AWS_ROLE_ARN to the ARN from:

oc get secret -n openshift-ingress-operator cloud-credentials -o json | jq -r .data.credentials | base64 -d

Now you can interact with AWS GovCloud. Here's my specific local output:

[jdiazrh@fedaio ~]$ env | grep AWS
AWS_ROLE_ARN=arn:aws-us-gov:iam::211567136888:role/jdiaz-gov2-openshift-ingress-operator-cloud-credentials
AWS_WEB_IDENTITY_TOKEN_FILE=/home/jdiazrh/ingresstoken
[jdiazrh@fedaio ~]$ aws route53 list-hosted-zones --region us-gov-west-1
{
    "HostedZones": [
        {
            "Id": "/hostedzone/Z05310543CY43718KAY6T",
            "Name": "jdiaz-gov.jdiaz.example.com.",
            "CallerReference": "terraform-20210519210348318100000003",
            "Config": {
                "Comment": "Managed by Terraform",
                "PrivateZone": true
            },
            "ResourceRecordSetCount": 4
        },
        {
            "Id": "/hostedzone/Z05566831WVXSZSXIWCGQ",
            "Name": "jdiaz-gov.jdiaz.example.com.",
            "CallerReference": "terraform-20210520153320594700000004",
            "Config": {
                "Comment": "Managed by Terraform",
                "PrivateZone": true
            },
            "ResourceRecordSetCount": 4
        }
    ]
}

But if you fail to set the region you get all kinds of problems:

[jdiazrh@fedaio ~]$ aws route53 list-hosted-zones

An error occurred (InvalidClientTokenId) when calling the ListHostedZones operation: The security token included in the request is invalid
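When working with a copied-out token like this, it can help to look at its claims before handing it to AWS. The sketch below is a hedged illustration in Python using only the standard library: it decodes the JWT payload without verifying the signature, so it is a debugging aid only. The function names `decode_jwt_claims` and `summarize_claims` are hypothetical. The working assumption is that the `iss` claim should match the OIDC provider URL, `aud` should include a client ID registered on the provider ("openshift" in this report), and `sub` should match the condition in the role's trust policy.

```python
import base64
import json
import sys
import time


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT. The signature is NOT verified."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url-encoded without padding; restore it.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def summarize_claims(claims: dict, now=None) -> dict:
    """Pick out the claims relevant to AssumeRoleWithWebIdentity debugging."""
    now = time.time() if now is None else now
    exp = claims.get("exp")
    return {
        "iss": claims.get("iss"),
        "sub": claims.get("sub"),
        "aud": claims.get("aud"),
        # Unknown (None) when the token carries no exp claim.
        "expired": None if exp is None else exp <= now,
    }


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python inspect_token.py /path/to/ingresstoken
    with open(sys.argv[1]) as token_file:
        summary = summarize_claims(decode_jwt_claims(token_file.read()))
    print(json.dumps(summary, indent=2))
```

If the token is reported as expired, copy a fresh one out of the pod before retrying the AWS CLI calls.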
The ingress operator is definitely setting the region. This log message comes from the same codepath that's eventually erroring out:

> 2021-05-20T07:45:59.280695252Z 2021-05-20T07:45:59.280Z INFO operator.dns dns/controller.go:531 using region from operator config {"region name": "us-gov-west-1"}

jdiaz, when you copied out the serviceaccount token, did you also see this ingress operator issue? If we have the region correct, and the credentials file is right, I'm not sure what else could be causing the operator's auth to fail.
Yes, I did see the reported issues in the ingress-operator pod logs. Looking through the ingress-operator AWS client setup, I didn't catch anything that looked incorrect, but clearly something isn't working. You can compare against how the image-registry operator sets up its AWS client: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/s3/s3.go#L181
Removing blocker status as GovCloud+STS isn't considered a blocker. The proposed fix is in the merge queue but hitting lots of CI flakes. We may drop it from the release if we don't see progress today.
The PR has been merged into 4.8.0-0.nightly-2021-06-10-224448.

Tested with 4.8.0-0.nightly-2021-06-11-024306 and ingress is OK now, although some other operators are still abnormal after the installation.

$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.nightly-2021-06-11-024306   True        False         False      62m

$ oc -n openshift-ingress-operator get secret/cloud-credentials -o json | jq -r .data.credentials | base64 -d
[default]
role_arn = arn:aws-us-gov:iam::123456789:role/a-lwangov0611-xxxx-openshift-ingress-operator-cloud-c
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token

I'm moving it to verified. Please reopen if you still see the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438