Bug 1925740 - GCP jobs exhaust zone listing query quota sometimes due to too many initializations of cloud provider in tests
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Infrastructure
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.z
Assignee: Clayton Coleman
QA Contact: Jian Zhang
Duplicates: 1926258
Depends On: 1920221 1926258
Blocks: 1926262
Reported: 2021-02-06 03:40 UTC by Clayton Coleman
Modified: 2021-06-24 18:45 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1920221
Last Closed: 2021-03-10 11:24:01 UTC
Target Upstream Version:

System ID Private Priority Status Summary Last Updated
Github openshift origin pull 25866 0 None closed Bug 1926258: Don't initialize zone info repeatedly in e2e tests 2021-02-16 23:54:24 UTC
Red Hat Product Errata RHBA-2021:0678 0 None None None 2021-03-10 11:24:36 UTC

Description Clayton Coleman 2021-02-06 03:40:54 UTC
+++ This bug was initially created as a clone of Bug #1920221 +++

The e2e tests fork and run child tests in individual processes, which causes cloud provider initialization code to be run once per test, not once per suite as per upstream.  The GCP cloud provider makes several calls to initialize zones and other values that are constant over the life of a test run, and when lots of GCP tests are running at the same time we stand a chance of exceeding the burst quota on our account.  Every week or so we get a big chunk of failures as a result in our CI runs like:


W0125 18:41:32.843010    9252 gce.go:485] No network name or URL specified.
E0125 18:41:32.894199    9252 test_context.go:485] Failed to setup provider config for "gce": Error building GCE/GKE provider: unexpected response listing zones: googleapi: Error 403: Quota exceeded for quota group 'ListGroup' and limit 'List requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:1053217076791'., rateLimitExceeded


This fails about 20% of GCP jobs total every week in the conformance suite, which impacts both PRs and release periodics.  This is intermittent.

The ideal fix is to have the cloud provider data seeded via environment and avoid duplicate initialization, which will require us to carry a patch to initialization to extract and reuse the value (during initCloudProvider, probably).  Should be possible to get that upstream in some form, but mitigating the impact quickly is important.
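A minimal sketch of the "seed via environment" idea follows. It is not the actual origin patch: the variable name E2E_SEEDED_ZONES and the helpers seedZones, loadZones, and listZonesFromAPI are hypothetical, and the API call is replaced by a stand-in that just counts invocations. The parent suite process does the lookup once and exports the result; each forked child reads the exported value and only falls back to the API when nothing was seeded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// zoneEnvVar is a hypothetical variable name, not one used by origin or Kubernetes.
const zoneEnvVar = "E2E_SEEDED_ZONES"

// apiCalls counts how often the stand-in for the quota-limited call runs.
var apiCalls int

// listZonesFromAPI stands in for the real zone listing request (assumption:
// the real call is expensive and counted against the 'ListGroup' quota).
func listZonesFromAPI() []string {
	apiCalls++
	return []string{"us-east1-b", "us-east1-c"}
}

// seedZones is what the parent suite process would do once before forking.
func seedZones() error {
	data, err := json.Marshal(listZonesFromAPI())
	if err != nil {
		return err
	}
	return os.Setenv(zoneEnvVar, string(data))
}

// loadZones is what each child would do: prefer the seeded value and only
// query the API when the environment carries nothing.
func loadZones() ([]string, error) {
	raw, ok := os.LookupEnv(zoneEnvVar)
	if !ok || raw == "" {
		return listZonesFromAPI(), nil
	}
	var zones []string
	if err := json.Unmarshal([]byte(raw), &zones); err != nil {
		return nil, fmt.Errorf("decoding %s: %w", zoneEnvVar, err)
	}
	return zones, nil
}

func main() {
	if err := seedZones(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Simulate three child tests. In the real suite each would be a separate
	// process that inherits the parent's environment.
	for i := 0; i < 3; i++ {
		zones, err := loadZones()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Println("child", i, "zones:", zones)
	}
	fmt.Println("API calls:", apiCalls)
}
```

With this shape the number of list requests stays constant no matter how many child tests run, which is the property the quota problem needs.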

Comment 1 Scott Dodson 2021-02-10 14:49:40 UTC
*** Bug 1926258 has been marked as a duplicate of this bug. ***

Comment 4 Jianwei Hou 2021-02-24 06:42:23 UTC
The link https://search.ci.openshift.org/?search=Quota+exceeded+for+quota+group&maxAge=168h&context=1&type=junit&name=4%5C.7&maxMatches=5&maxBytes=20971520&groupBy=job still returns many rateLimitExceeded errors. I think this still needs to be looked into, so I am moving this back to ASSIGNED.

Comment 5 W. Trevor King 2021-02-25 02:06:20 UTC
We've had a bunch of rateLimitExceeded for other request types [1].  A better query for the zone-listing failure this bug is trying to address is [2].  That gives three hits in release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.6-stable-to-4.7-ci, and I'm not sure if that uses the unpatched 4.6 suite or the patched 4.7 suite.  But still, three 4.7 failures in this mode is a lot better than 4.6 [3].  Moving back to ON_QA in case this is a convincing case for verification ;).

[1]: https://github.com/openshift/release/pull/16256
[2]: https://search.ci.openshift.org/?search=listing+zones%3A+googleapi.*ListGroup.*rateLimitExceeded&maxAge=168h&context=1&type=junit&name=4%5C.7&maxMatches=5&maxBytes=20971520&groupBy=job
[3]: https://search.ci.openshift.org/?search=listing+zones%3A+googleapi.*ListGroup.*rateLimitExceeded&maxAge=168h&context=1&type=junit&name=4%5C.6&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 7 Jian Zhang 2021-03-04 08:20:51 UTC
A query for "Quota exceeded for quota group" also returns tests that fail with "Error reading InstanceGroup Members: googleapi: Error 403: Quota exceeded for quota group 'ListGroup' and limit 'List requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:1053217076791'., rateLimitExceeded". That failure is different from the original issue: "response listing zones".
So, I queried the 4.7 jobs with "listing zones: googleapi.*ListGroup.*rateLimitExceeded". Only some tests hit this failure in their test templates:


This is much better than 4.6, which does not have the fix PR: https://search.ci.openshift.org/?search=listing+zones%3A+googleapi.*ListGroup.*rateLimitExceeded&maxAge=168h&context=1&type=junit&name=4%5C.6&maxMatches=5&maxBytes=20971520&groupBy=job
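For anyone checking logs locally, here is a small sketch of how the narrower pattern separates the two failure modes. It assumes the search pattern behaves the same under Go's regexp package as it does in the CI search tool; the helper name isZoneListingQuotaError is made up for illustration, and the two sample messages are copied from this bug.

```go
package main

import (
	"fmt"
	"regexp"
)

// zoneListingQuota is the narrower query from this bug, compiled with Go's
// regexp package (assumption: the CI search tool interprets it the same way).
var zoneListingQuota = regexp.MustCompile(`listing zones: googleapi.*ListGroup.*rateLimitExceeded`)

// isZoneListingQuotaError is a hypothetical helper that reports whether a log
// line is the zone-listing quota failure rather than some other quota error.
func isZoneListingQuotaError(line string) bool {
	return zoneListingQuota.MatchString(line)
}

func main() {
	// The original failure from the description.
	zoneErr := `Failed to setup provider config for "gce": Error building GCE/GKE provider: unexpected response listing zones: googleapi: Error 403: Quota exceeded for quota group 'ListGroup' and limit 'List requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:1053217076791'., rateLimitExceeded`

	// The unrelated InstanceGroup failure that the broad query also returns.
	groupErr := `Error reading InstanceGroup Members: googleapi: Error 403: Quota exceeded for quota group 'ListGroup' and limit 'List requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:1053217076791'., rateLimitExceeded`

	fmt.Println("zone listing error matches:", isZoneListingQuotaError(zoneErr))
	fmt.Println("instance group error matches:", isZoneListingQuotaError(groupErr))
}
```

Both messages contain "Quota exceeded for quota group", so only the "listing zones:" prefix in the pattern tells them apart.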

LGTM, marking this as verified.

Comment 9 errata-xmlrpc 2021-03-10 11:24:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.1 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 10 W. Trevor King 2021-06-24 18:45:05 UTC
This is just an issue for folks who run hundreds of GCP e2e jobs in the same account each day, so probably not customer facing.  Setting as no-docs.