Description of problem:

Getting a 409 error on OCP GCP installs:

level=error msg="Error: Error applying IAM policy to project \"openshift-gce-devel-ci\": Too many conflicts. Latest error: Error setting IAM policy for project \"openshift-gce-devel-ci\": googleapi: Error 409: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff., aborted"

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/489#1:build-log.txt%3A23

How reproducible:

11 times over 24 hours, pretty common:
https://ci-search-ci-search-next.svc.ci.openshift.org/?search=googleapi%3A+Error+409%3A+There+were+concurrent+policy+changes&maxAge=336h&context=2&type=all

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:

Expected results:

Additional info:
Here's our backoff attempt from [1], which was apparently not deep enough:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/489/artifacts/e2e-gcp/installer/.openshift_install.log | grep '409 Conflict'
time="2019-10-30T18:16:43Z" level=debug msg="2019-10-30T18:16:43.985Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"
time="2019-10-30T18:16:45Z" level=debug msg="2019-10-30T18:16:45.900Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"
time="2019-10-30T18:16:48Z" level=debug msg="2019-10-30T18:16:48.845Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"
time="2019-10-30T18:16:53Z" level=debug msg="2019-10-30T18:16:53.657Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"
time="2019-10-30T18:17:03Z" level=debug msg="2019-10-30T18:17:03.170Z [DEBUG] plugin.terraform-provider-google: HTTP/2.0 409 Conflict"

More context from the error itself, for folks searching for this [1]:

level=error msg="Error: Error applying IAM policy to project \"openshift-gce-devel-ci\": Too many conflicts. Latest error: Error setting IAM policy for project \"openshift-gce-devel-ci\": googleapi: Error 409: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff., aborted"
level=error
level=error msg=" on ../tmp/openshift-install-319784952/master/main.tf line 11, in resource \"google_project_iam_member\" \"master-network-admin\":"
level=error msg=" 11: resource \"google_project_iam_member\" \"master-network-admin\" {"

That line is [2].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/489
[2]: https://github.com/openshift/installer/blame/2ec16e5336ffb15b90ac6a46eb41f43e34ec7f3f/data/data/gcp/master/main.tf#L11
So the GCP API for updating project-level permissions is highly prone to conflicts. The Terraform backend already ensures that all IAM policy update operations are serialized, but with multiple GCP clusters being created at the same time, the conflict back-off is more likely to run out and cause failures. We currently already back off up to 30 seconds (5 steps). The chance of the back-off causing a failure is increased because we perform 7 IAM policy updates for each cluster (all serialized within one cluster create), since we need to grant the control-plane and worker machines 5 and 2 GCP user roles respectively.

We don't use custom GCP user roles because, as of now:
a) custom user roles are slow to materialize (1 min to 10 mins);
b) serviceAccountActor permissions cannot be added to custom roles, therefore we would need 2 user roles, plus 3 IAM operations;
c) the quota and limit on these custom roles is not clear.

One option I have floated is https://github.com/openshift/installer/pull/2611, which would give `cluster create` more chances to succeed.
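For readers unfamiliar with the pattern the error message asks for, here is a minimal sketch of "retry the whole read-modify-write with exponential backoff". It is not the terraform-provider-google implementation. `FakePolicyStore`, `ConflictError`, and `add_binding` are hypothetical names, the in-memory etag check only stands in for the real API's conflict detection, and the default delays are illustrative rather than the provider's actual schedule.

```python
import random
import time


class ConflictError(Exception):
    """Stands in for an HTTP 409 returned when the policy changed under us."""


class FakePolicyStore:
    """In-memory stand-in for a project IAM policy guarded by an etag (hypothetical)."""

    def __init__(self):
        self.bindings = set()
        self.etag = 0

    def get(self):
        return set(self.bindings), self.etag

    def set(self, bindings, etag):
        # Reject the write if someone else changed the policy since our read.
        if etag != self.etag:
            raise ConflictError("concurrent policy change")
        self.bindings = set(bindings)
        self.etag += 1


def add_binding(store, binding, steps=5, initial=1.0, factor=2.0, cap=30.0,
                sleep=time.sleep):
    """Retry the whole read-modify-write, backing off exponentially on conflict.

    Returns the number of attempts used; re-raises ConflictError once
    `steps` attempts have all conflicted.
    """
    delay = initial
    for attempt in range(1, steps + 1):
        bindings, etag = store.get()      # read
        bindings.add(binding)             # modify
        try:
            store.set(bindings, etag)     # write
            return attempt
        except ConflictError:
            if attempt == steps:
                raise
            # Jitter so concurrent installers do not retry in lockstep.
            sleep(min(cap, delay) * random.uniform(0.5, 1.0))
            delay *= factor
```

The point of the sketch is that every retry re-reads the policy. With 7 such updates per cluster and several clusters sharing one project, each update has to win this race separately, which is why a fixed number of steps can be exhausted.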
16 occurrences over 2 days as of most recent count. We should investigate putting leases around installation and teardown on GCP that are lower than the overall number of leases available for concurrent CI jobs. Ideally we'd rate limit rather than applying leases, but leases may be a good start to approximate rate limiting. Batching may be relevant, though probably not; see https://github.com/terraform-providers/terraform-provider-google/pull/4207
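As a rough illustration of the lease idea, the sketch below caps how many install or teardown phases run at once, independently of how many CI jobs are running overall. `LeasePool` is a hypothetical helper, not part of any CI tooling mentioned here, and it only works within one process; a real CI setup would need a shared lease service.

```python
import threading
from contextlib import contextmanager


class LeasePool:
    """Hands out at most `size` leases at a time (hypothetical helper)."""

    def __init__(self, size):
        self._sem = threading.BoundedSemaphore(size)

    @contextmanager
    def lease(self, timeout=None):
        # Block until a lease is free, or give up after `timeout` seconds.
        if not self._sem.acquire(timeout=timeout):
            raise TimeoutError("no lease available")
        try:
            yield
        finally:
            # Always hand the lease back, even if the guarded phase failed.
            self._sem.release()
```

A job would wrap only the IAM-heavy phases, for example `with pool.lease(): install()`, so the rest of the test run does not hold a lease. This bounds concurrency rather than request rate, which is why it only approximates rate limiting.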
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/271
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/1497
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/1810
This is still relatively common (5 incidents in the last day):
https://search.svc.ci.openshift.org/?search=There+were+concurrent+policy+changes&maxAge=336h&context=2&type=all

Recent example:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/7470/rehearse-7470-pull-ci-openshift-installer-master-e2e-gcp-upi/6
It seems more prevalent in UPI installations than IPI installations. Did we change something in IPI to be more efficient? Currently happening ~6 times over 14 days, leaving at medium.
> It seems more prevalent in UPI installation

I think we should most definitely fix this in UPI, because UPI runs gcloud CLI commands, which don't do any retries whatsoever.
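One way to add retries around a CLI call is a small wrapper like the sketch below. It is not what the UPI CI scripts do. `run_with_retry` is a hypothetical name, and detecting the conflict by matching the "concurrent policy changes" text from the error above is an assumption about what the CLI prints to stderr.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, initial=2.0, sleep=time.sleep,
                   run=subprocess.run):
    """Run cmd, retrying with doubling delays while stderr reports a policy conflict."""
    delay = initial
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Assumption: the conflict is recognizable by this message fragment.
        conflict = "concurrent policy changes" in result.stderr
        if not conflict or attempt == attempts:
            raise RuntimeError(f"{cmd[0]} failed: {result.stderr.strip()}")
        sleep(delay)
        delay *= 2
```

Illustrative usage, with `project`, `member`, and `role` as placeholders: `run_with_retry(["gcloud", "projects", "add-iam-policy-binding", project, "--member", member, "--role", role])`. Only conflict failures are retried; any other non-zero exit is raised immediately so real errors are not hidden.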
moving to modified as https://github.com/openshift/release/pull/8023 merged.
Hi Ben Parees,

Does it still occur in your 4.5 CI jobs with the PR merged? As it's a CI configuration change, is there anything QE needs to do to verify it?

Thanks
Looking at:
https://search.svc.ci.openshift.org/?search=googleapi%3A+Error+409%3A+There+were+concurrent+policy+changes.&maxAge=48h&context=2&type=junit

I do not see it in 4.5, only 4.4 and 4.3. Will mark verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409