Bug 2028610

Summary: Installer doesn't retry on GCP rate limiting
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: Installer Assignee: aos-install
Installer sub component: openshift-installer QA Contact: Gaoyun Pei <gpei>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: wking
Version: 4.10 Keywords: OtherQA
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2028611 (view as bug list) Environment:
Last Closed: 2022-03-10 16:31:35 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 2028611    

Description Stephen Benjamin 2021-12-02 19:14:06 UTC
In CI, we're hitting errors like this:

```
level=error msg=Error: Error when reading or editing Target Pool
"ci-op-x5j99sbj-82914-2f74l-api": googleapi: Error 403: Quota exceeded
for quota group 'ReadGroup' and limit 'Read requests per 100 seconds' of
service 'compute.googleapis.com' for consumer
'project_number:711936183532'., rateLimitExceeded
```

This was fixed in terraform-provider-google v3.62.0; however, v3.62.0 uses v2 of the Terraform SDK. The installer should instead be pointed at the OpenShift fork, which contains v3.27.0 plus the retry patches.

Comment 2 Scott Dodson 2021-12-02 20:28:54 UTC
This may be difficult to reproduce. TRT has a wealth of data from which they can easily assess whether or not this fix has improved things. They should have enough data tomorrow to reach a conclusion on both effectiveness and whether or not these changes broke something. As such, I welcome them to mark the bug VERIFIED once they have that data, if QE hasn't been able to verify it independently.

Comment 3 Stephen Benjamin 2021-12-03 14:41:36 UTC
We got 108 runs in the 24 hours since my 4.10 installer PR merged. GCP is doing a little better than in the 24 hours before that:

After PR: https://sippy.ci.openshift.org/sippy-ng/jobs/4.10/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22not%22%3Afalse%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%224.10-e2e-gcp%22%7D%2C%7B%22columnField%22%3A%22name%22%2C%22not%22%3Atrue%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%224.9%22%7D%2C%7B%22columnField%22%3A%22timestamp%22%2C%22operatorValue%22%3A%22%3E%22%2C%22value%22%3A%221638423780000%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D

Before PR: https://sippy.ci.openshift.org/sippy-ng/jobs/4.10/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22not%22%3Afalse%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%224.10-e2e-gcp%22%7D%2C%7B%22columnField%22%3A%22name%22%2C%22not%22%3Atrue%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%224.9%22%7D%2C%7B%22columnField%22%3A%22timestamp%22%2C%22operatorValue%22%3A%22%3C%22%2C%22value%22%3A%221638423780000%22%7D%2C%7B%22columnField%22%3A%22timestamp%22%2C%22operatorValue%22%3A%22%3E%3D%22%2C%22value%22%3A%221638337380000%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D

search.ci also confirms we didn't get any terraform-generated read quota messages from GCP in the last 24 hours:

https://search.ci.openshift.org/chart?search=Error+when+reading.*403.*Quota+exceeded.*Read.*&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
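
The `search` parameter in that URL decodes to the regular expression `Error when reading.*403.*Quota exceeded.*Read.*`. As a rough local check, the same pattern can be applied to a log line in Go. This is a sketch under assumptions: `matchesReadQuota` is a hypothetical helper, Go's `regexp` package is used here rather than whatever engine search.ci runs, and the error from the description is joined into a single line because `.` does not match newlines by default.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// readQuotaPattern is the decoded value of the "search" query parameter.
var readQuotaPattern = regexp.MustCompile(`Error when reading.*403.*Quota exceeded.*Read.*`)

// matchesReadQuota reports whether a single log line matches the pattern.
func matchesReadQuota(logLine string) bool {
	return readQuotaPattern.MatchString(logLine)
}

func main() {
	// The error from the description, joined back into one line.
	line := strings.Join([]string{
		`level=error msg=Error: Error when reading or editing Target Pool`,
		`"ci-op-x5j99sbj-82914-2f74l-api": googleapi: Error 403: Quota exceeded`,
		`for quota group 'ReadGroup' and limit 'Read requests per 100 seconds' of`,
		`service 'compute.googleapis.com' for consumer`,
		`'project_number:711936183532'., rateLimitExceeded`,
	}, " ")

	fmt.Println(matchesReadQuota(line))                               // prints "true"
	fmt.Println(matchesReadQuota("level=info msg=Install complete!")) // prints "false"
}
```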

Marking this verified based on this data and comment #2.

Comment 6 errata-xmlrpc 2022-03-10 16:31:35 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056