1920221 – GCP jobs exhaust zone listing query quota sometimes due to too many initializations of cloud provider in tests

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Test Infrastructure
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Clayton Coleman
QA Contact:	Jian Zhang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1925740 1926258

Reported:	2021-01-25 20:04 UTC by Clayton Coleman
Modified:	2021-07-27 22:37 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1925740 1926258
Environment:
Last Closed:	2021-07-27 22:36:44 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift kubernetes pull 552	None	closed	Bug 1920221: Prevent GCP e2e tests from triggering a rate limit on the listZone API	2021-02-16 05:40:00 UTC
Github	openshift origin pull 25861	None	closed	Bug 1920221: Don't initialize zone info repeatedly	2021-02-16 05:40:00 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 22:37:12 UTC

Description Clayton Coleman 2021-01-25 20:04:57 UTC

The e2e tests fork and run each child test in its own process, so cloud provider initialization code runs once per test rather than once per suite as it does upstream.  The GCP cloud provider makes several calls to initialize zones and other values that are constant over the life of a test run, and when many GCP tests run at the same time we risk exceeding the burst quota on our account.  Roughly once a week this causes a large batch of failures in our CI runs, such as:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.7/1353763451410845696

W0125 18:41:32.843010    9252 gce.go:485] No network name or URL specified.
E0125 18:41:32.894199    9252 test_context.go:485] Failed to setup provider config for "gce": Error building GCE/GKE provider: unexpected response listing zones: googleapi: Error 403: Quota exceeded for quota group 'ListGroup' and limit 'List requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:1053217076791'., rateLimitExceeded

https://search.ci.openshift.org/?search=Quota+exceeded+for+quota+group&maxAge=168h&context=1&type=junit&name=4%5C.7&maxMatches=5&maxBytes=20971520&groupBy=job

In total this fails about 20% of GCP jobs in the conformance suite every week, which affects both PRs and release periodics.  The failure is intermittent.

The ideal fix is to seed the cloud provider data via the environment and avoid duplicate initialization, which will require us to carry a patch to initialization that extracts and reuses the value (probably during initCloudProvider).  It should be possible to get that upstream in some form, but mitigating the impact quickly is important.

Comment 2 Jian Zhang 2021-03-04 07:54:45 UTC

15 payloads hit this failure in the last 7 days, and many test runs hit the error in each payload.
level=error msg=Error: Error reading InstanceGroup Members: googleapi: Error 403: Quota exceeded for quota group 'ListGroup' and limit 'List requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:1053217076791'., rateLimitExceeded

I searched for "Quota exceeded for quota group" in 4.8 jobs; see: https://search.ci.openshift.org/?search=Quota+exceeded+for+quota+group&maxAge=168h&context=1&type=junit&name=4%5C.8&maxMatches=5&maxBytes=20971520&groupBy=job

release-openshift-ocp-installer-e2e-gcp-rt-4.8
periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp 
rehearse-16434-periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt 
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp
rehearse-16262-periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-fips 
release-openshift-origin-installer-e2e-gcp-upgrade-4.7-stable-to-4.8-ci 
rehearse-16262-periodic-ci-openshift-release-master-ci-4.8-e2e-gcp 
rehearse-16262-periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-fips-serial 
release-openshift-ocp-installer-e2e-gcp-serial-4.8
release-openshift-origin-installer-e2e-gcp-upgrade-4.8 
rehearse-16391-pull-ci-openshift-cluster-authentication-operator-release-4.8-e2e-operator-encryption 
release-openshift-origin-installer-e2e-gcp-4.8
release-openshift-ocp-installer-e2e-gcp-ovn-4.8 
release-openshift-ocp-installer-e2e-gcp-4.8 
release-openshift-origin-installer-e2e-gcp-compact-4.8

Comment 3 Jian Zhang 2021-03-04 08:24:05 UTC

I should instead query "listing zones: googleapi.*ListGroup.*rateLimitExceeded" for 4.8 jobs to filter out unrelated failures.
For example, "reading InstanceGroup Members: googleapi:" is a different failure from "unexpected response listing zones:".
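
To illustrate the difference between the two queries, here is a small Go sketch that applies the narrower expression to the two log lines quoted in this bug. It uses Go's regexp package; whether the CI search tool applies exactly the same regular expression semantics is an assumption, and the helper name isZoneListingQuotaFailure is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// zoneQuotaPattern is the narrower search expression quoted in the comment.
var zoneQuotaPattern = regexp.MustCompile(`listing zones: googleapi.*ListGroup.*rateLimitExceeded`)

// Log lines copied from this bug report.
const (
	zoneListingLine = `E0125 18:41:32.894199    9252 test_context.go:485] Failed to setup provider config for "gce": Error building GCE/GKE provider: unexpected response listing zones: googleapi: Error 403: Quota exceeded for quota group 'ListGroup' and limit 'List requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:1053217076791'., rateLimitExceeded`
	instanceGroupLine = `level=error msg=Error: Error reading InstanceGroup Members: googleapi: Error 403: Quota exceeded for quota group 'ListGroup' and limit 'List requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:1053217076791'., rateLimitExceeded`
)

// isZoneListingQuotaFailure reports whether a log line is the zone listing
// quota error tracked by this bug, as opposed to other quota errors.
func isZoneListingQuotaFailure(line string) bool {
	return zoneQuotaPattern.MatchString(line)
}

func main() {
	fmt.Println("zone listing line matches:", isZoneListingQuotaFailure(zoneListingLine))
	fmt.Println("instance group line matches:", isZoneListingQuotaFailure(instanceGroupLine))
}
```

Both lines contain "Quota exceeded for quota group", which is why the broader search returned the unrelated InstanceGroup failures as well.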

No failures found, details:
https://search.ci.openshift.org/?search=listing+zones%3A+googleapi.*ListGroup.*rateLimitExceeded&maxAge=168h&context=1&type=junit&name=4%5C.8&maxMatches=5&maxBytes=20971520&groupBy=job

LGTM, marking it as verified.

Comment 5 W. Trevor King 2021-06-24 18:44:58 UTC

This is just an issue for folks who run hundreds of GCP e2e jobs in the same account each day, so probably not customer facing.  Setting as no-docs.

Comment 7 errata-xmlrpc 2021-07-27 22:36:44 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
