Description of problem:

Lease acquired, installing...
Installing from initial release registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-10-31-230010
level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to fetch dependency of \"Bootstrap Ignition Config\": failed to fetch dependency of \"Master Machines\": failed to generate asset \"Platform Credentials Check\": validate AWS credentials: checking install permissions: error simulating policy: Throttling: Rate exceeded\n\tstatus code: 400, request id: 980aa6f7-67fb-437d-b725-5966fa8c9742"

Actual results:

Installs fail due to AWS API rate limiting.

Expected results:

Our CI infrastructure is sufficiently scaled to avoid these issues. I am aware there are ongoing efforts in this space, but we do not seem to have a tracking BZ for them.
Example failure: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10321
Nothing to say on this particular issue, but I thought I'd chime in with some context. The previous quasi-tracker was bug 1717604, which was closed when we bumped the LB timeout (which until that point had been the main throttling failure mode). It makes sense to have separate trackers for the various failure modes, though, because we might be able to work up narrow fixes for them. Abhinav has [1] on the broad/sledgehammer side of this ;). And we're also working on sharding AWS over multiple regions, which should also help reduce throttling issues for a given level of cluster throughput.

[1]: https://github.com/openshift/installer/pull/2611
> And we're also working on sharding AWS over multiple regions, which should also help reduce throttling issues for a given level of cluster throughput.

Yeah, the sharding is what I was intending to reference with respect to the ongoing efforts in this space.
[1] is a small step to set up for future AWS region sharding.

[1]: https://github.com/openshift/release/pull/5749
*** Bug 1794179 has been marked as a duplicate of this bug. ***
Since [1,2,3], we're now sharding AWS over four regions when testing installer-provisioned clusters. That helps with EC2 throttling, but doesn't impact Route 53 throttling, because all of those regions use us-east-1 Route 53.

[1]: https://github.com/openshift/release/pull/6833
[2]: https://github.com/openshift/release/pull/6845
[3]: https://github.com/openshift/release/pull/6949
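To illustrate what the sharding buys us (this is only a sketch; the real assignment happens through lease configuration in openshift/release, and the region names and function names here are invented): spreading CI clusters across several regional EC2 endpoints means each region's API rate limits see only a fraction of the total call volume, while Route 53, being a global service homed in us-east-1, still sees all of it.

package main

import (
	"fmt"
	"hash/fnv"
)

// Hypothetical shard list; the actual set of four regions lives in the
// openshift/release lease configuration, not in code like this.
var regionShards = []string{"us-east-1", "us-east-2", "us-west-1", "us-west-2"}

// pickRegion deterministically maps a cluster name onto one of the shards,
// so roughly 1/len(regionShards) of the EC2 API traffic lands in each region.
func pickRegion(clusterName string) string {
	h := fnv.New32a()
	h.Write([]byte(clusterName))
	return regionShards[h.Sum32()%uint32(len(regionShards))]
}

func main() {
	for _, name := range []string{"ci-op-1234", "ci-op-5678", "ci-op-9abc"} {
		fmt.Printf("%s -> %s\n", name, pickRegion(name))
	}
}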
Looks like we have the same issue today: http://post-office.corp.redhat.com/archives/aos-devel/2020-February/msg00519.html

Is there a workaround for this, or anything we can set in AWS at the moment, to prevent this error?
We see this issue again:

    log.debug(f"Command return code: {r.returncode}")
    if r.returncode and not ignore_error:
        raise CommandFailed(
>           f"Error during execution of command: {masked_cmd}."
            f"\nError is {masked_stderr}"
        )
E       ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: /home/jenkins/bin/openshift-install create cluster --dir /home/jenkins/current-cluster-dir/openshift-cluster-dir --log-level INFO.
E       Error is level=info msg="Consuming Install Config from target directory"
E       level=fatal msg="failed to fetch Cluster: failed to fetch dependency of \"Cluster\": failed to generate asset \"Platform Permissions Check\": validate AWS credentials: mint credentials check: error simulating policy: Throttling: Rate exceeded\n\tstatus code: 400, request id: 699bbb4c-c7fb-47b0-bb8c-e56735a93f14"

Jenkins job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/5358/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak-aws/fbalak-aws_20200311T064739/logs/
Installer log: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak-aws/fbalak-aws_20200311T064739/logs/openshift_install_create_cluster_1583910026.log

Do I understand correctly that this throttling error is caused by the load we put on AWS? Can the installer do more retries in case of such an error?
I understand the challenges involved in addressing this bug, but I am raising the severity because it is one of the top causes of our CI failures; any steps we can take to mitigate it (e.g. sharding accounts) would be very helpful.
Seeing this again today for "operator.Run template e2e-aws-fips - e2e-aws-fips container setup".

Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.4/1299
[1], in combination with the earlier [2], should make the installer more resilient in the face of policy-simulation throttling (while also increasing the overall load on AWS, because the "fix" is just "keep trying for longer and hope we get through eventually").

[1]: https://github.com/openshift/installer/pull/3295
[2]: https://github.com/openshift/installer/pull/3159
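To make the "keep trying for longer" idea concrete, here is a minimal sketch (not the code from [1] or [2]; the function name, retry counts, and delays are invented for illustration) of wrapping the IAM SimulatePrincipalPolicy call in an exponential backoff that retries only on throttling errors:

package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/iam"
)

// simulateWithRetry retries the policy simulation when AWS responds with a
// throttling error, backing off exponentially between attempts. All other
// errors are surfaced immediately.
func simulateWithRetry(client *iam.IAM, principalARN string, actions []string) (*iam.SimulatePrincipalPolicyOutput, error) {
	input := &iam.SimulatePrincipalPolicyInput{
		PolicySourceArn: aws.String(principalARN),
		ActionNames:     aws.StringSlice(actions),
	}

	delay := 2 * time.Second
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		out, err := client.SimulatePrincipalPolicy(input)
		if err == nil {
			return out, nil
		}
		lastErr = err
		// Only retry on throttling ("Throttling", "ThrottlingException", ...).
		if awsErr, ok := err.(awserr.Error); !ok || !strings.Contains(awsErr.Code(), "Throttling") {
			return nil, err
		}
		time.Sleep(delay)
		delay *= 2
	}
	return nil, fmt.Errorf("policy simulation still throttled after retries: %w", lastErr)
}

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	client := iam.New(sess)
	// The ARN and action list below are placeholders.
	_, err := simulateWithRetry(client, "arn:aws:iam::123456789012:user/ci-installer", []string{"ec2:RunInstances"})
	if err != nil {
		fmt.Println("permissions check failed:", err)
	}
}

Note the trade-off called out above: each retry is another SimulatePrincipalPolicy call, so this makes an individual install more likely to succeed while adding to the aggregate load that causes the throttling in the first place.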
Bumping the target release to 4.5, since these haven't been backported to 4.4. Might need to go back to MODIFIED and be dropped from a 4.4 errata?
*** Bug 1820361 has been marked as a duplicate of this bug. ***
Moving to the Installer component, since that's where the mitigations landed that moved this to MODIFIED.
Verified with the following search:

https://ci-search-ci-search-next.svc.ci.openshift.org/?search=validate+AWS+credentials%3A+mint+credentials+check%3A+error+simulating+policy%3A+Throttling&maxAge=336h&context=2&type=junit

I only see that error present in release-openshift-ocp-installer-e2e-aws* jobs for 4.4 or earlier.
Actually setting the status to Verified.
*** Bug 1850099 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409