Bug 1767936
| Summary: | AWS api throttling fails jobs: error simulating policy: Throttling: Rate exceeded | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | Installer | Assignee: | Abhinav Dahiya <adahiya> |
| Installer sub component: | openshift-installer | QA Contact: | Etienne Simard <esimard> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | esimard, jialiu, nmanos, pbalogh, qiwan, vkapalav, wking |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: an independent AWS API client used by the policy-simulation library was configured with a very low exponential backoff.<br>Consequence: permission validation during install failed, causing failed installs.<br>Fix: force the library to use the installer-configured client.<br>Result: no more high rates of failure due to rate limiting. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-13 17:12:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
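The "very low exponential backoff" cause in the Doc Text above can be made concrete with a small sketch. This is illustrative only, not the installer's actual code: the function name `backoff_delay` and the specific base/cap values are assumptions. The point is that with a very low delay cap, a throttled client re-issues requests almost immediately and keeps tripping the rate limit, while a sane cap spaces retries out exponentially.

```python
import random

def backoff_delay(attempt, base=0.5, cap=60.0):
    """Full-jitter exponential backoff: sleep somewhere in
    [0, min(cap, base * 2**attempt)] seconds before retry `attempt`."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# With a very low cap (analogous to the buggy configuration), every retry
# fires almost immediately, so a throttled client keeps getting throttled:
low_cap = [min(0.1, 0.5 * 2 ** a) for a in range(5)]    # 0.1s ceiling every time
# With a sane cap, the retry window grows exponentially:
sane_cap = [min(60.0, 0.5 * 2 ** a) for a in range(5)]  # 0.5, 1, 2, 4, 8 seconds
```

This is why the fix routes the library through the installer-configured client: that client already carries a reasonable retry/backoff policy, so the policy-simulation calls inherit it instead of hammering the API.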
Description
Ben Parees
2019-11-01 16:56:48 UTC
Example failure: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10321

Nothing to say on this particular issue, but I thought I'd chime in with some context. The previous quasi-tracker was bug 1717604, which was closed when we bumped the LB timeout (which until that point had been the main throttling failure mode). It makes sense to have separate trackers for the various failure modes, though, because we might be able to work up narrow fixes for them. And Abhinav has [1] on the broad/sledgehammer side of this ;). And we're also working on sharding AWS over multiple regions, which should also help reduce throttling issues for a given level of cluster throughput.

[1]: https://github.com/openshift/installer/pull/2611

> And we're also working on sharding AWS over multiple regions, which should also help reduce throttling issues for a given level of cluster throughput.

Yeah, the sharding is what I was intending to reference with respect to ongoing efforts in this space.
[1] is a small step to set up for future AWS region sharding.

[1]: https://github.com/openshift/release/pull/5749

*** Bug 1794179 has been marked as a duplicate of this bug. ***

Since [1,2,3] we're now sharding AWS over four regions when testing installer-provisioned clusters. That helps with EC2 throttling, but doesn't impact Route 53 throttling, because all of those regions use us-east-1 Route 53.

[1]: https://github.com/openshift/release/pull/6833
[2]: https://github.com/openshift/release/pull/6845
[3]: https://github.com/openshift/release/pull/6949

Looks like we have the same issue today: http://post-office.corp.redhat.com/archives/aos-devel/2020-February/msg00519.html

Is there some workaround for this, or anything we can set in AWS at the moment, to prevent this error?

We see this issue again:
```
log.debug(f"Command return code: {r.returncode}")
if r.returncode and not ignore_error:
    raise CommandFailed(
>       f"Error during execution of command: {masked_cmd}."
        f"\nError is {masked_stderr}"
    )
E ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: /home/jenkins/bin/openshift-install create cluster --dir /home/jenkins/current-cluster-dir/openshift-cluster-dir --log-level INFO.
E Error is level=info msg="Consuming Install Config from target directory"
E level=fatal msg="failed to fetch Cluster: failed to fetch dependency of \"Cluster\": failed to generate asset \"Platform Permissions Check\": validate AWS credentials: mint credentials check: error simulating policy: Throttling: Rate exceeded\n\tstatus code: 400, request id: 699bbb4c-c7fb-47b0-bb8c-e56735a93f14"
```
Jenkins job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/5358/
Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak-aws/fbalak-aws_20200311T064739/logs/
Installer log:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak-aws/fbalak-aws_20200311T064739/logs/openshift_install_create_cluster_1583910026.log
Do I understand correctly that this throttling error is caused by load on the AWS API?
Can the installer do more retries in case of such an error?
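The question of whether the installer can "do more retries" is essentially what the later mitigations did. A minimal sketch of the idea follows; note that `ThrottlingError`, `call_with_retries`, and the parameter values are hypothetical stand-ins for illustration, not real installer or AWS SDK names:

```python
import time

class ThrottlingError(Exception):
    """Stand-in for an AWS 'Throttling: Rate exceeded' (HTTP 400) response."""

def call_with_retries(fn, max_attempts=8, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn(), retrying on throttling with exponential backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the throttling error
            sleep(min(cap, base * (2 ** attempt)))
```

Wrapping the policy-simulation call this way trades install latency for resilience; the actual fix went further by making the permissions-check library reuse the installer's already-configured, retry-aware AWS client rather than its own low-backoff one.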
I understand the challenges involved in addressing this bug, but I'm raising the severity as it is one of the top causes of our CI failures, so any steps we can take to mitigate it (e.g. sharding accounts) would be very helpful.

Seeing this again today for operator.Run template e2e-aws-fips - e2e-aws-fips container setup. Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.4/1299

[1], in combination with the earlier [2], should make the installer more resilient in the face of policy-simulation throttling (while also increasing the overall load on AWS, because the "fix" is just "keep trying for longer and hope we get through eventually").

[1]: https://github.com/openshift/installer/pull/3295
[2]: https://github.com/openshift/installer/pull/3159

Bumping the target release to 4.5, since these haven't been backported to 4.4. Might need to go back to MODIFIED and be dropped from a 4.4 errata?

*** Bug 1820361 has been marked as a duplicate of this bug. ***

Moving to the Installer component, since that's where the mitigations landed that moved this to MODIFIED.

Verified with the following search: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=validate+AWS+credentials%3A+mint+credentials+check%3A+error+simulating+policy%3A+Throttling&maxAge=336h&context=2&type=junit

I only see that error present in release-openshift-ocp-installer-e2e-aws* jobs for 4.4 or earlier. Actually setting the status to Verified.

*** Bug 1850099 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409