Bug 1935058
Summary: | Can’t finish install sts clusters on aws government region | ||||||
---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | wang lin <lwan> | ||||
Component: | Networking | Assignee: | Ryan Fredette <rfredette> | ||||
Networking sub component: | DNS | QA Contact: | Hongan Li <hongli> | ||||
Status: | CLOSED ERRATA | Docs Contact: | |||||
Severity: | high | ||||||
Priority: | high | CC: | amcdermo, aos-bugs, arane, jdiaz, jrouth, lwan, mmasters, yunjiang | ||||
Version: | 4.7 | Keywords: | TestBlocker | ||||
Target Milestone: | --- | ||||||
Target Release: | 4.8.0 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | No Doc Update | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2021-07-27 22:49:27 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
wang lin
2021-03-04 10:27:51 UTC
Can't install an STS cluster successfully in the gov region, so added the TestBlocker keyword.

From the BZ comment "The image registry operator can successfully assume the role of the government region.", it appears that creds are otherwise working, and this may be specific to what or how the ingress operator is interacting with AWS. Moving BZ.

Created attachment 1785068 [details]
must-gather
Upload must-gather info
FWIW, I copied out the ServiceAccount token and was able to use the credentials to interact with AWS without issue. This worked as long as I explicitly set the region (which is unsurprising, as the AWS gov endpoints are different from the global AWS endpoints).

Save the token contents locally (they expire every hour):

    oc rsh -n openshift-ingress-operator deployment/ingress-operator cat /var/run/secrets/openshift/serviceaccount/token > ingresstoken
    export AWS_WEB_IDENTITY_TOKEN_FILE=/path/to/ingresstoken

Set AWS_ROLE_ARN to the arn from:

    oc get secret -n openshift-ingress-operator cloud-credentials -o json | jq -r .data.credentials | base64 -d

Now you can interact with AWS GovCloud. Here's my specific local output:

    [jdiazrh@fedaio ~]$ env | grep AWS
    AWS_ROLE_ARN=arn:aws-us-gov:iam::211567136888:role/jdiaz-gov2-openshift-ingress-operator-cloud-credentials
    AWS_WEB_IDENTITY_TOKEN_FILE=/home/jdiazrh/ingresstoken
    [jdiazrh@fedaio ~]$ aws route53 list-hosted-zones --region us-gov-west-1
    {
        "HostedZones": [
            {
                "Id": "/hostedzone/Z05310543CY43718KAY6T",
                "Name": "jdiaz-gov.jdiaz.example.com.",
                "CallerReference": "terraform-20210519210348318100000003",
                "Config": {
                    "Comment": "Managed by Terraform",
                    "PrivateZone": true
                },
                "ResourceRecordSetCount": 4
            },
            {
                "Id": "/hostedzone/Z05566831WVXSZSXIWCGQ",
                "Name": "jdiaz-gov.jdiaz.example.com.",
                "CallerReference": "terraform-20210520153320594700000004",
                "Config": {
                    "Comment": "Managed by Terraform",
                    "PrivateZone": true
                },
                "ResourceRecordSetCount": 4
            }
        ]
    }

But if you fail to set the region you get all kinds of problems:

    [jdiazrh@fedaio ~]$ aws route53 list-hosted-zones

    An error occurred (InvalidClientTokenId) when calling the ListHostedZones operation: The security token included in the request is invalid

The ingress operator is definitely setting the region. This log message comes from the same codepath that's eventually erroring out:
> 2021-05-20T07:45:59.280695252Z 2021-05-20T07:45:59.280Z INFO operator.dns dns/controller.go:531 using region from operator config {"region name": "us-gov-west-1"}
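
The manual check described in this comment can be collected into a small shell sketch. It uses only the oc, jq, base64, and aws invocations quoted above; the function names role_arn_from_credentials and check_govcloud_route53 are hypothetical helpers, and the sketch assumes the credentials secret contains a single role_arn line.

```shell
# Sketch only: these function names are hypothetical helpers,
# not part of oc or the AWS CLI.

# Print the role_arn value from AWS shared-credentials text read on stdin.
role_arn_from_credentials() {
  sed -n -e 's/^[[:space:]]*role_arn[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p' \
    | head -n 1
}

# Usage: check_govcloud_route53 REGION TOKEN_FILE
# Copies the ingress operator's ServiceAccount token to TOKEN_FILE,
# exports the web-identity variables, and lists hosted zones with an
# explicit region (without --region the call fails, as noted above).
check_govcloud_route53() {
  # The token expires every hour, so fetch a fresh copy on each run.
  oc rsh -n openshift-ingress-operator deployment/ingress-operator \
    cat /var/run/secrets/openshift/serviceaccount/token > "$2"

  AWS_ROLE_ARN="$(oc get secret -n openshift-ingress-operator cloud-credentials -o json \
    | jq -r .data.credentials | base64 -d | role_arn_from_credentials)"
  export AWS_ROLE_ARN
  export AWS_WEB_IDENTITY_TOKEN_FILE="$2"

  aws route53 list-hosted-zones --region "$1"
}
```

Calling check_govcloud_route53 us-gov-west-1 "$PWD/ingresstoken" against a cluster would be expected to reproduce the successful listing above, provided the token has not yet expired.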
jdiaz, when you copied out the serviceaccount token, did you also see this ingress operator issue? If we have the region correct, and the credentials file is right, I'm not sure what else could be causing the operator's auth to fail.
Yes, I did see the reported issues in the ingress-operator pod logs. Looking through the ingress-operator AWS client setup, I didn't catch anything that looked incorrect, but clearly something isn't working. You can look at the image-registry to compare how they set up their AWS client:
https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/s3/s3.go#L181

Removing blocker status as GovCloud+STS isn't considered a blocker.

The proposed fix is in the merge queue but hitting lots of CI flakes. We may drop it from the release if we don't see progress today.

The PR has been merged into 4.8.0-0.nightly-2021-06-10-224448.

Tested with 4.8.0-0.nightly-2021-06-11-024306 and ingress is OK now, although some other operators are still abnormal after the installation.

    $ oc get co/ingress
    NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
    ingress   4.8.0-0.nightly-2021-06-11-024306   True        False         False      62m

    $ oc -n openshift-ingress-operator get secret/cloud-credentials -o json | jq -r .data.credentials | base64 -d
    [default]
    role_arn = arn:aws-us-gov:iam::123456789:role/a-lwangov0611-xxxx-openshift-ingress-operator-cloud-c
    web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token

I'm moving it to verified. Please reopen if you still see the issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:2438
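
When checking output like the cloud-credentials secret above, one quick sanity test is that the role ARN's partition agrees with the region the operator logs. The sketch below is an illustration only, not the fix that was merged: the function names are hypothetical, and the region-to-partition mapping is an assumption that covers just the aws, aws-cn, and aws-us-gov partitions.

```shell
# Sketch only: hypothetical helpers for checking that a role ARN and a
# region belong to the same AWS partition.

# Map a region name to its partition. Assumption: only the three
# partitions aws, aws-cn, and aws-us-gov are considered here.
partition_for_region() {
  case "$1" in
    us-gov-*) echo "aws-us-gov" ;;
    cn-*)     echo "aws-cn" ;;
    *)        echo "aws" ;;
  esac
}

# Print the partition of an ARN (its second colon-separated field).
partition_of_arn() {
  printf '%s\n' "$1" | cut -d: -f2
}

# Usage: arn_matches_region ARN REGION
# Succeeds only if the ARN's partition is the one the region belongs to.
arn_matches_region() {
  [ "$(partition_of_arn "$1")" = "$(partition_for_region "$2")" ]
}
```

For the values in this bug, arn_matches_region with an arn:aws-us-gov:... role and us-gov-west-1 succeeds, while the same role paired with a commercial region such as us-east-1 fails.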