1767161 – 4.3 GCP install fails with googleapi: Error 409: There were concurrent policy changes


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Jeremiah Stuever
QA Contact:	Yang Yang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-10-30 19:50 UTC by Chance Zibolski
Modified:	2020-07-13 17:12 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-13 17:12:05 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:12:27 UTC

Description Chance Zibolski 2019-10-30 19:50:51 UTC

Description of problem:

Getting a 409 error on OCP GCP installs:

level=error msg="Error: Error applying IAM policy to project \"openshift-gce-devel-ci\": Too many conflicts.  Latest error: Error setting IAM policy for project \"openshift-gce-devel-ci\": googleapi: Error 409: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff., aborted"

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/489#1:build-log.txt%3A23

How reproducible: 11 times over 24 hours, pretty common: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=googleapi%3A+Error+409%3A+There+were+concurrent+policy+changes&maxAge=336h&context=2&type=all

Steps to Reproduce:
1. Unknown

Actual results:


Expected results:

Additional info:

Comment 1 W. Trevor King 2019-10-30 20:00:56 UTC

Here's our backoff attempt from [1], which was apparently not deep enough:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/489/artifacts/e2e-gcp/installer/.openshift_install.log | grep '409 Conflict'
time="2019-10-30T18:16:43Z" level=debug msg="2019-10-30T18:16:43.985Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"
time="2019-10-30T18:16:45Z" level=debug msg="2019-10-30T18:16:45.900Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"
time="2019-10-30T18:16:48Z" level=debug msg="2019-10-30T18:16:48.845Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"
time="2019-10-30T18:16:53Z" level=debug msg="2019-10-30T18:16:53.657Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"
time="2019-10-30T18:17:03Z" level=debug msg="2019-10-30T18:17:03.170Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"
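
The spacing between the 409 responses above can be read off the millisecond timestamps inside each msg field. The following is a minimal sketch, assuming log lines in the same format as the output above. The function name conflict_gaps is hypothetical and is not part of the installer or of any tooling mentioned in this bug.

```shell
# conflict_gaps: read installer log lines on stdin and print the number of
# seconds between consecutive "409 Conflict" responses.
# Assumption: each line carries a msg="YYYY-MM-DDTHH:MM:SS.mmmZ ..." field,
# as in the output above. The date is ignored, so a run that crosses
# midnight would give a wrong gap.
conflict_gaps() {
  grep '409 Conflict' |
    grep -o 'msg="[0-9T:.-]*Z' |
    sed -e 's/^msg="//' -e 's/Z$//' |
    LC_ALL=C awk -F'[T:]' '{
      t = $2 * 3600 + $3 * 60 + $4      # seconds since midnight
      if (NR > 1) printf "%.3f\n", t - prev
      prev = t
    }'
}

# Hypothetical usage, with LOG_URL standing in for the log URL above:
#   curl -s "$LOG_URL" | conflict_gaps
```

Fed the five lines above, this prints gaps of 1.915, 2.945, 4.812 and 9.513 seconds, so the wait grows with every attempt before the provider gives up.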

More context from the error itself, for folks searching for this [1]:

level=error msg="Error: Error applying IAM policy to project \"openshift-gce-devel-ci\": Too many conflicts.  Latest error: Error setting IAM policy for project \"openshift-gce-devel-ci\": googleapi: Error 409: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff., aborted"
level=error
level=error msg="  on ../tmp/openshift-install-319784952/master/main.tf line 11, in resource \"google_project_iam_member\" \"master-network-admin\":"
level=error msg="  11: resource \"google_project_iam_member\" \"master-network-admin\" {"

That line is [2].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/489
[2]: https://github.com/openshift/installer/blame/2ec16e5336ffb15b90ac6a46eb41f43e34ec7f3f/data/data/gcp/master/main.tf#L11

Comment 2 Abhinav Dahiya 2019-11-04 21:37:53 UTC

The GCP API for updating project-level permissions is highly prone to conflicts. The terraform-backend already ensures that all IAM policy update operations are serialized.
But multiple GCP clusters being installed at once make the conflict back-off more likely to end in failure.
We currently already back off up to 30 seconds (5 steps).
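
For a sense of scale, the cumulative wait of a doubling back-off can be tallied with a few lines of shell. This is a sketch under an assumption the thread does not state: that the first delay is 1 second and each later delay is double the previous one. backoff_schedule is a hypothetical helper name.

```shell
# backoff_schedule INITIAL_DELAY STEPS
# Print each wait of a doubling back-off and the running total.
# Assumption: delays double each step; the real provider schedule may differ.
backoff_schedule() {
  local delay="$1" steps="$2" total=0 i=1
  while [ "$i" -le "$steps" ]; do
    total=$(( total + delay ))
    echo "step ${i}: wait ${delay}s (cumulative ${total}s)"
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
}

backoff_schedule 1 5
# last line printed: step 5: wait 16s (cumulative 31s)
```

Under that assumption, five steps add up to 31 seconds of waiting, which is in the same range as the 30 seconds mentioned above.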

The chance of the back-off ending in failure is higher because we perform 7 IAM policy updates for each cluster (all serialized within one cluster create): the control-plane and worker machines need 5 and 2 GCP user roles, respectively.

We don't use custom GCP user roles because, as of now:

a) Custom user roles are slow to materialize (1 to 10 minutes).
b) serviceAccountActor permissions cannot be added to custom roles, so we would need 2 user roles plus 3 IAM operations.
c) The quota and limits on these custom roles are not clear.


One option I have floated is https://github.com/openshift/installer/pull/2611, which would give `cluster create` more chances to succeed.

Comment 4 Scott Dodson 2020-01-31 18:53:04 UTC

16 occurrences over 2 days as of most recent count.

We should investigate putting leases around installation and teardown on GCP that are lower than the overall number of leases available for concurrent CI jobs. Ideally we'd rate limit rather than applying leases but leases may be a good start to approximate rate limiting.

Batching may be relevant, though probably not; see https://github.com/terraform-providers/terraform-provider-google/pull/4207

Comment 5 Hongkai Liu 2020-01-31 19:24:16 UTC

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/271

Comment 6 slowrie 2020-02-12 21:19:10 UTC

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/1497

Comment 7 Aniket Bhat 2020-02-28 21:42:58 UTC

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/1810

Comment 8 Ben Parees 2020-03-10 17:26:33 UTC

this is still relatively common (5 incidents in the last day):
https://search.svc.ci.openshift.org/?search=There+were+concurrent+policy+changes&maxAge=336h&context=2&type=all

recent example:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/7470/rehearse-7470-pull-ci-openshift-installer-master-e2e-gcp-upi/6

Comment 9 Scott Dodson 2020-03-23 13:45:00 UTC

It seems more prevalent in UPI installations than in IPI installations. Did we change something in IPI to be more efficient?
Currently happening ~6 times over 14 days, leaving at medium.

Comment 10 Abhinav Dahiya 2020-03-27 17:19:53 UTC

> It seems more prevalent in UPI installation

I think we should most definitely fix this in UPI, because UPI runs gcloud CLI commands, which don't do any retries whatsoever.
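
One way to give a gcloud invocation the retries it lacks is a small shell wrapper that reruns the whole command with exponential back-off. The following is a minimal sketch, not the change that was eventually merged. retry_with_backoff, PROJECT and SA_EMAIL are hypothetical names, and the attempt count and initial delay are arbitrary. It retries on any non-zero exit status, not only on 409 responses.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds, doubling the sleep after every failure.
# Returns 1 once MAX_ATTEMPTS attempts have failed.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Hypothetical usage (PROJECT and SA_EMAIL are placeholders):
#   retry_with_backoff 6 1 \
#     gcloud projects add-iam-policy-binding "$PROJECT" \
#       --member "serviceAccount:${SA_EMAIL}" \
#       --role "roles/compute.networkAdmin"
```

Rerunning the whole command, rather than only the final write, matches the error message's request to retry the whole read-modify-write.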

Comment 12 Abhinav Dahiya 2020-04-06 16:59:24 UTC

moving to modified as https://github.com/openshift/release/pull/8023 merged.

Comment 17 Yang Yang 2020-04-07 03:49:12 UTC

Hi Ben Parees,

Does it still occur in your 4.5 CI jobs with the PR merged?

As it's a CI configuration change, do you have anything that needs QE to do to verify it?

Thanks

Comment 18 Ben Parees 2020-04-07 03:52:15 UTC

looking at:
https://search.svc.ci.openshift.org/?search=googleapi%3A+Error+409%3A+There+were+concurrent+policy+changes.&maxAge=48h&context=2&type=junit

I do not see it in 4.5, only 4.4 and 4.3.  Will mark verified.

Comment 20 errata-xmlrpc 2020-07-13 17:12:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
