Description of problem:

Lease acquired, installing...
Installing from initial release registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-10-31-230010
level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to fetch dependency of \"Bootstrap Ignition Config\": failed to fetch dependency of \"Master Machines\": failed to generate asset \"Platform Credentials Check\": validate AWS credentials: checking install permissions: error simulating policy: Throttling: Rate exceeded\n\tstatus code: 400, request id: 980aa6f7-67fb-437d-b725-5966fa8c9742"

Actual results:

Installs fail due to AWS API rate limiting.

Expected results:

Our CI infrastructure is sufficiently scaled to avoid these issues. I am aware there are ongoing efforts in this space, but we do not seem to have a tracking BZ for them.
Example failure: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10321
Nothing to say on this particular issue, but I thought I'd chime in with some context. The previous quasi-tracker was bug 1717604, which was closed when we bumped the LB timeout (which until that point had been the main throttling failure mode). It makes sense to have separate trackers for the various failure modes, though, because we might be able to work up narrow fixes for them. Abhinav has [1] on the broad/sledgehammer side of this ;). And we're also working on sharding AWS over multiple regions, which should also help reduce throttling issues for a given level of cluster throughput.

[1]: https://github.com/openshift/installer/pull/2611
> And we're also working on sharding AWS over multiple regions, which should also help reduce throttling issues for a given level of cluster throughput.

Yeah, the sharding is what I was intending to reference with respect to the ongoing efforts in this space.
[1] is a small step to set up for future AWS region sharding.

[1]: https://github.com/openshift/release/pull/5749
*** Bug 1794179 has been marked as a duplicate of this bug. ***
Since [1,2,3], we're now sharding AWS over four regions when testing installer-provisioned clusters. That helps with EC2 throttling, but doesn't impact Route 53 throttling, because all of those regions use us-east-1 Route 53.

[1]: https://github.com/openshift/release/pull/6833
[2]: https://github.com/openshift/release/pull/6845
[3]: https://github.com/openshift/release/pull/6949
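To illustrate what the sharding buys us (this is only a sketch; the real assignment happens through lease configuration in openshift/release, and the region names and function names here are invented): spreading CI clusters across several regional EC2 endpoints means each region's API rate limits see only a fraction of the total call volume, while Route 53, being a global service homed in us-east-1, still sees all of it.

package main

import (
	"fmt"
	"hash/fnv"
)

// Hypothetical shard list; the actual set of four regions lives in the
// openshift/release lease configuration, not in code like this.
var regionShards = []string{"us-east-1", "us-east-2", "us-west-1", "us-west-2"}

// pickRegion deterministically maps a cluster name onto one of the shards,
// so roughly 1/len(regionShards) of the EC2 API traffic lands in each region.
func pickRegion(clusterName string) string {
	h := fnv.New32a()
	h.Write([]byte(clusterName))
	return regionShards[h.Sum32()%uint32(len(regionShards))]
}

func main() {
	for _, name := range []string{"ci-op-1234", "ci-op-5678", "ci-op-9abc"} {
		fmt.Printf("%s -> %s\n", name, pickRegion(name))
	}
}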
Looks like we have the same issue today: http://post-office.corp.redhat.com/archives/aos-devel/2020-February/msg00519.html

Is there a workaround for this, or anything we can set in AWS at the moment, to prevent this error?
We see this issue again:

    log.debug(f"Command return code: {r.returncode}")
    if r.returncode and not ignore_error:
        raise CommandFailed(
>           f"Error during execution of command: {masked_cmd}."
            f"\nError is {masked_stderr}"
        )
E       ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: /home/jenkins/bin/openshift-install create cluster --dir /home/jenkins/current-cluster-dir/openshift-cluster-dir --log-level INFO.
E       Error is level=info msg="Consuming Install Config from target directory"
E       level=fatal msg="failed to fetch Cluster: failed to fetch dependency of \"Cluster\": failed to generate asset \"Platform Permissions Check\": validate AWS credentials: mint credentials check: error simulating policy: Throttling: Rate exceeded\n\tstatus code: 400, request id: 699bbb4c-c7fb-47b0-bb8c-e56735a93f14"

Jenkins job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/5358/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak-aws/fbalak-aws_20200311T064739/logs/
Installer log: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak-aws/fbalak-aws_20200311T064739/logs/openshift_install_create_cluster_1583910026.log

Do I understand correctly that this throttling error is caused by the load we put on AWS? Can the installer do more retries in case of such an error?
I understand the challenges involved in addressing this bug, but I am raising the severity because it is one of the top causes of our CI failures; any steps we can take to mitigate it (e.g. sharding accounts) would be very helpful.
Seeing this again today for "operator.Run template e2e-aws-fips - e2e-aws-fips container setup".

Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.4/1299
[1], in combination with the earlier [2], should make the installer more resilient in the face of policy-simulation throttling (while also increasing the overall load on AWS, because the "fix" is just "keep trying for longer and hope we get through eventually").

[1]: https://github.com/openshift/installer/pull/3295
[2]: https://github.com/openshift/installer/pull/3159
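To make the "keep trying for longer" idea concrete, here is a minimal sketch (not the code from [1] or [2]; the function name, retry counts, and delays are invented for illustration) of wrapping the IAM SimulatePrincipalPolicy call in an exponential backoff that retries only on throttling errors:

package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/iam"
)

// simulateWithRetry retries the policy simulation when AWS responds with a
// throttling error, backing off exponentially between attempts. All other
// errors are surfaced immediately.
func simulateWithRetry(client *iam.IAM, principalARN string, actions []string) (*iam.SimulatePrincipalPolicyOutput, error) {
	input := &iam.SimulatePrincipalPolicyInput{
		PolicySourceArn: aws.String(principalARN),
		ActionNames:     aws.StringSlice(actions),
	}

	delay := 2 * time.Second
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		out, err := client.SimulatePrincipalPolicy(input)
		if err == nil {
			return out, nil
		}
		lastErr = err
		// Only retry on throttling ("Throttling", "ThrottlingException", ...).
		if awsErr, ok := err.(awserr.Error); !ok || !strings.Contains(awsErr.Code(), "Throttling") {
			return nil, err
		}
		time.Sleep(delay)
		delay *= 2
	}
	return nil, fmt.Errorf("policy simulation still throttled after retries: %w", lastErr)
}

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	client := iam.New(sess)
	// The ARN and action list below are placeholders.
	_, err := simulateWithRetry(client, "arn:aws:iam::123456789012:user/ci-installer", []string{"ec2:RunInstances"})
	if err != nil {
		fmt.Println("permissions check failed:", err)
	}
}

Note the trade-off called out above: each retry is another SimulatePrincipalPolicy call, so this makes an individual install more likely to succeed while adding to the aggregate load that causes the throttling in the first place.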
Bumping the target release to 4.5, since these haven't been backported to 4.4. Might need to go back to MODIFIED and be dropped from a 4.4 errata?
*** Bug 1820361 has been marked as a duplicate of this bug. ***
Moving to the Installer component, since that's where the mitigations landed that moved this to MODIFIED.
Verified with the following search:

https://ci-search-ci-search-next.svc.ci.openshift.org/?search=validate+AWS+credentials%3A+mint+credentials+check%3A+error+simulating+policy%3A+Throttling&maxAge=336h&context=2&type=junit

I only see that error present in release-openshift-ocp-installer-e2e-aws* jobs for 4.4 or earlier.
Actually setting the status to Verified.
*** Bug 1850099 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409