Bug 1767936 - AWS api throttling fails jobs: error simulating policy: Throttling: Rate exceeded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abhinav Dahiya
QA Contact: Etienne Simard
URL:
Whiteboard:
Duplicates: 1794179 1820361 1850099
Depends On:
Blocks:
 
Reported: 2019-11-01 16:56 UTC by Ben Parees
Modified: 2020-07-13 17:12 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The library's independent AWS API clients were configured with very low exponential backoff. Consequence: Permission validation failed during installation, causing failed installs. Fix: The library is forced to use the installer-configured client. Result: No more high rates of failure due to rate limiting.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:12:05 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 3159 0 None closed Bug 1779312: pkg/asset/installconfig/aws/session.go: bump the retries to 25 for aws sdk 2021-01-19 08:57:06 UTC
Github openshift installer pull 3295 0 None closed permissions.go: configure crendential check with installer session 2021-01-19 08:57:06 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:12:27 UTC

Description Ben Parees 2019-11-01 16:56:48 UTC
Description of problem:
Lease acquired, installing...
Installing from initial release registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-10-31-230010
level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to fetch dependency of \"Bootstrap Ignition Config\": failed to fetch dependency of \"Master Machines\": failed to generate asset \"Platform Credentials Check\": validate AWS credentials: checking install permissions: error simulating policy: Throttling: Rate exceeded\n\tstatus code: 400, request id: 980aa6f7-67fb-437d-b725-5966fa8c9742"



Actual results:
Installs fail due to AWS api rate limiting

Expected results:
Our CI infrastructure is sufficiently scaled as to avoid these issues.


I am aware there are on-going efforts in this space, but we do not seem to have a tracking BZ for them.

Comment 2 W. Trevor King 2019-11-04 04:02:26 UTC
Nothing to say on this particular issue, but I thought I'd chime in with some context.  Previous quasi-tracker was bug 1717604, which was closed when we bumped the LB timeout (which until that point had been the main throttling failure mode).  Makes sense to have separate trackers for the various failure modes though, because we might be able to work up narrow fixes for them.  And Abhinav has [1] on the broad/sledgehammer side of this ;).  And we're also working on sharding AWS over multiple regions, which should also help reduce throttling issues for a given level of cluster throughput.

[1]: https://github.com/openshift/installer/pull/2611

Comment 3 Ben Parees 2019-11-04 04:05:36 UTC
> And we're also working on sharding AWS over multiple regions, which should also help reduce throttling issues for a given level of cluster throughput.


Yeah, the sharding is what I was intending to reference with respect to ongoing efforts in this space.

Comment 5 W. Trevor King 2019-11-04 06:05:41 UTC
[1] is a small step to set up for future AWS region sharding.

[1]: https://github.com/openshift/release/pull/5749

Comment 6 Vinay Kapalavai 2020-01-22 20:32:04 UTC
*** Bug 1794179 has been marked as a duplicate of this bug. ***

Comment 7 W. Trevor King 2020-02-22 05:23:21 UTC
Since [1,2,3] we're now sharding AWS over four regions when testing installer-provisioned clusters.  That helps with EC2 throttling, but doesn't impact Route 53 throttling, because all of those regions use us-east-1 Route 53.

[1]: https://github.com/openshift/release/pull/6833
[2]: https://github.com/openshift/release/pull/6845
[3]: https://github.com/openshift/release/pull/6949
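
For illustration only, the idea of spreading jobs over several regions can be sketched as below. This is not the code from the linked release-repo pulls; the `pick_region` helper, the job-ID hashing scheme, and the region list are all assumptions made for the example. As noted above, this kind of sharding spreads per-region load such as EC2 calls, but does not help with Route 53 throttling.

```python
import hashlib

# Assumption: four illustrative region names; the regions actually used by CI
# are not listed in this bug.
REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2"]


def pick_region(job_id, regions=REGIONS):
    """Map a job ID to one region, deterministically and roughly evenly.

    Hypothetical helper: hashing the job ID means the same job always lands
    in the same region, while different jobs spread across all of them.
    """
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[index]


if __name__ == "__main__":
    for job in ("e2e-aws-1299", "e2e-aws-1300", "e2e-aws-1301"):
        print(job, "->", pick_region(job))
```

A random choice per job would spread load equally well; hashing is used here only so that reruns of the same job are reproducible.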

Comment 8 Petr Balogh 2020-02-28 14:24:40 UTC
Looks like we have the same issue today:

http://post-office.corp.redhat.com/archives/aos-devel/2020-February/msg00519.html

Is there some workaround for this or anything we can set in AWS at the moment to prevent this error?

Comment 9 Petr Balogh 2020-03-11 09:23:11 UTC
We see this issue again:
        log.debug(f"Command return code: {r.returncode}")
        if r.returncode and not ignore_error:
            raise CommandFailed(
>               f"Error during execution of command: {masked_cmd}."
                f"\nError is {masked_stderr}"
            )
E           ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: /home/jenkins/bin/openshift-install create cluster --dir /home/jenkins/current-cluster-dir/openshift-cluster-dir --log-level INFO.
E           Error is level=info msg="Consuming Install Config from target directory"
E           level=fatal msg="failed to fetch Cluster: failed to fetch dependency of \"Cluster\": failed to generate asset \"Platform Permissions Check\": validate AWS credentials: mint credentials check: error simulating policy: Throttling: Rate exceeded\n\tstatus code: 400, request id: 699bbb4c-c7fb-47b0-bb8c-e56735a93f14"

Jenkins job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/5358/

Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak-aws/fbalak-aws_20200311T064739/logs/

Installer log:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak-aws/fbalak-aws_20200311T064739/logs/openshift_install_create_cluster_1583910026.log


Do I understand correctly that the install fails with this throttling error because of load on AWS?
Can the installer do more retries when it hits such an error?

Comment 10 Ben Parees 2020-03-11 21:19:16 UTC
I understand the challenges involved in addressing this bug, but I am raising the severity as it is one of the top causes of our CI failures, so any steps we can take to mitigate it (e.g. sharding accounts) would be very helpful.

Comment 11 Corey Daley 2020-03-12 16:51:37 UTC
Seeing this again today for operator.Run template e2e-aws-fips - e2e-aws-fips container setup

Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.4/1299

Comment 12 W. Trevor King 2020-03-17 18:42:18 UTC
[1], in combination with the earlier [2], should make the installer more resilient in the face of policy-simulation throttling (while also increasing the overall load on AWS, because the "fix" is just "keep trying for longer and hope we get through eventually").

[1]: https://github.com/openshift/installer/pull/3295
[2]: https://github.com/openshift/installer/pull/3159
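
For readers unfamiliar with the "keep trying for longer" approach, here is a minimal sketch of retrying a throttled call with capped exponential backoff and jitter. It is not the installer's implementation (the installer is not written in Python and relies on the AWS SDK's own retry handling); `ThrottlingError`, `call_with_backoff`, `make_flaky`, and the delay values are hypothetical names and numbers chosen for the example. Only the retry count of 25 comes from the title of installer pull 3159.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for an AWS 'Throttling: Rate exceeded' error."""


def call_with_backoff(fn, max_retries=25, base_delay=0.1, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on ThrottlingError with capped exponential backoff.

    max_retries=25 mirrors the count named in the title of installer pull
    3159; base_delay and max_delay are illustrative assumptions.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_retries:
                raise  # out of retries: surface the throttling error
            # Full jitter: wait a random fraction of the capped exponential
            # delay so that many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)


def make_flaky(failures):
    """Return a function that is throttled `failures` times, then succeeds."""
    state = {"left": failures}

    def simulate_policy():
        if state["left"] > 0:
            state["left"] -= 1
            raise ThrottlingError("Rate exceeded")
        return "allowed"

    return simulate_policy


if __name__ == "__main__":
    print(call_with_backoff(make_flaky(3)))
```

A larger retry budget makes an individual install more likely to get through, at the cost of each client sending more requests in total, which is the trade-off described in the comment above.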

Comment 15 W. Trevor King 2020-03-17 21:22:20 UTC
Bumping the target release to 4.5, since these haven't been backported to 4.4.  Might need to go back to MODIFIED and be dropped from a 4.4 errata?

Comment 19 W. Trevor King 2020-04-07 22:51:43 UTC
*** Bug 1820361 has been marked as a duplicate of this bug. ***

Comment 20 W. Trevor King 2020-04-07 22:52:40 UTC
Moving to the Installer component, since that's where the mitigations landed that moved this to MODIFIED.

Comment 21 Etienne Simard 2020-04-30 17:22:17 UTC
Verified with the following search: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=validate+AWS+credentials%3A+mint+credentials+check%3A+error+simulating+policy%3A+Throttling&maxAge=336h&context=2&type=junit

I only see that error present in release-openshift-ocp-installer-e2e-aws* jobs for 4.4 or earlier.

Comment 22 Etienne Simard 2020-04-30 17:22:57 UTC
Actually setting the status to Verified.

Comment 23 W. Trevor King 2020-06-24 15:09:29 UTC
*** Bug 1850099 has been marked as a duplicate of this bug. ***

Comment 25 errata-xmlrpc 2020-07-13 17:12:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

