Bug 1896218 - Occasional GCP install failures: Error setting IAM policy for project ...: googleapi: Error 400: Service account ... does not exist., badRequest
Summary: Occasional GCP install failures: Error setting IAM policy for project ...: go...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.7.0
Assignee: Patrick Dillon
QA Contact: To Hung Sze
URL:
Whiteboard:
: 1903720 (view as bug list)
Depends On:
Blocks: 1903720 1929872
TreeView+ depends on / blocked
 
Reported: 2020-11-10 03:35 UTC by W. Trevor King
Modified: 2021-02-24 15:33 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: service accounts were being deleted before iam role bindings were deleted. Consequence: when terraform rewrote the policy to add new bindings, service accounts from a different cluster may have been deleted in between the read & write in the current cluster, causing an inconsistent state with the new policy being written. Fix: remove bindings before deleting service accounts. Result: service accounts are no longer associated with the iam policy at the time they are deleted, therefore they do not affect policies being written by other clusters. This allows successful cluster creation without conflicts between clusters.
Clone Of:
: 1903720 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:32:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 4589 0 None closed Bug 1896218: remove GCP role bindings before service accounts 2021-02-17 23:55:31 UTC
Github openshift installer pull 4602 0 None closed Bug 1896218: remove GCP role bindings before service accounts 2021-02-17 23:55:31 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:33:02 UTC

Description W. Trevor King 2020-11-10 03:35:43 UTC
Occasionally in CI for at least the past 13 days:

$ w3m -dump -cols 200 'https://search.ci.openshift.org?maxAge=24h&type=all&search=Error%20setting%20IAM%20policy%20for%20project.*serviceaccount.com%20does%20not%20exist' | grep 'failures match' | sort
periodic-ci-openshift-cnv-cnv-ci-master-e2e-test-cron - 4 runs, 25% failed, 100% of failures match
periodic-ci-openshift-kni-cnf-features-deploy-sctpci-release-v4.3-cnf-sctp-ovn-gcp-periodic - 4 runs, 100% failed, 25% of failures match
pull-ci-opendatahub-io-odh-manifests-master-odh-manifests-e2e - 5 runs, 20% failed, 100% of failures match
...
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade - 4 runs, 100% failed, 25% of failures match
rehearse-13388-pull-ci-kubevirt-ssp-operator-release-4.8-e2e-functests - 3 runs, 100% failed, 33% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 2 runs, 50% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-serial-4.2 - 2 runs, 100% failed, 50% of failures match
release-openshift-origin-installer-e2e-gcp-4.2 - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-4.3 - 3 runs, 33% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-4.8 - 9 runs, 100% failed, 11% of failures match
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci - 4 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.4 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.3 - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-launch-gcp - 87 runs, 40% failed, 6% of failures match

Example job [1]:

level=error msg="Error: Request \"Create IAM Members roles/compute.viewer serviceAccount:ci-op-f9p4hy7m-ad214-g4lgc-w.gserviceaccount.com for \\\"project \\\\\\\"openshift-gce-devel-ci\\\\\\\"\\\"\" returned error: Error applying IAM policy for project \"openshift-gce-devel-ci\": Error setting IAM policy for project \"openshift-gce-devel-ci\": googleapi: Error 400: Service account ci-op-7ijd7tr8-f92fc-j2vtt-m.gserviceaccount.com does not exist., badRequest"
level=error
level=error msg="  on ../tmp/openshift-install-559599373/iam/main.tf line 6, in resource \"google_project_iam_member\" \"worker-compute-viewer\":"
level=error msg="   6: resource \"google_project_iam_member\" \"worker-compute-viewer\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1325830980283404288

Comment 1 Patrick Dillon 2020-11-10 15:23:16 UTC
Looking at the logs here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25672/pull-ci-openshift-origin-master-e2e-agnostic-cmd/1325942865557196800/artifacts/e2e-agnostic-cmd/ipi-install-install/.openshift_install.log

I notice that the API request that is failing has a very long list of service accounts listed as members. If you search for "POST /v1/projects/openshift-gce-devel-ci:setIamPolicy?" you can find the request. To me, many of those service accounts (e.g. deleted ones) do not look relevant. Why are they being included in this API call? It may be worth investigating because (1) we don't want to be including extra service accounts by accident (2) if the list of service accounts included is growing out of control, it could be causing problems with the API.

I haven't had much time to look into it, but I thought there should only be one service account--the one created by the installer--included, but I could be wrong if there's a special setup for CI or I'm misreading the code.

Comment 2 Patrick Dillon 2020-11-10 20:31:28 UTC
We are seeing a related error:

level=error msg=Error: Request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-s3jzbvw9-1354f-4w9q4-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" returned error: Batch request and retried single request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-s3jzbvw9-1354f-4w9q4-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" both failed. Final error: Error applying IAM policy for project "openshift-gce-devel-ci": Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 400: The number of members in the policy (1,501) is larger than the maximum allowed size 1,500., badRequest

CI run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1986/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326217434331353088

The main leak was fixed in September with this PR in master: https://github.com/openshift/installer/pull/4193 and several cherry picks. 

I think a good resolution of this BZ would be:

1. Verify that we're not still leaking
2. Check whether the leaked policies were correctly deleted (why are we hitting this limit?)
3. If possible, narrow down the service accounts that are included in the API requests as discussed here: https://bugzilla.redhat.com/show_bug.cgi?id=1896218#c1

Comment 3 To Hung Sze 2020-11-11 14:50:36 UTC
QE project is also still seeing some IAM left behind marked as deleted:serviceAccount:xxx

It doesn't seem to happen with (successful) manual IPI install / cleanup with latest 4.6 / 4.7.

Will provide more information if I see any pattern after doing some more experiments like UPI and automation through Flexy.

Comment 4 Patrick Dillon 2020-11-11 14:56:35 UTC
Note: the service account that is not found and triggers the error is not the service account created in this installation run. The call to setIAMPolicy includes the entire list of service account members in the request and it looks like this account from another cluster has gone missing (been deleted). So this seems like a race. 

> QE project is also still seeing some IAM left behind marked as deleted:serviceAccount:xxx

That is expected.  It looks like GCP marks them as deleted and then eventually deletes them. The problem there is that it takes up some of our quota.

Comment 5 Matthew Staebler 2020-11-11 15:03:11 UTC
(In reply to Patrick Dillon from comment #2)
> The main leak was fixed in September with this PR in master:
> https://github.com/openshift/installer/pull/4193 and several cherry picks. 

We cherry-picked back to 4.4, but CI still runs tests against 4.2 and 4.3 a couple times a day. Although, that does not seem like quite enough volume to account for the number of leaked service accounts.

Comment 6 Scott Dodson 2020-11-11 15:52:34 UTC
Jeremiah and Abhinav looked into 4.3 and earlier, the problem which introduced the leak there was never present.

https://coreos.slack.com/archives/CBUT43E94/p1600449793021600?thread_ts=1600344979.193800&cid=CBUT43E94

Comment 7 Jeremiah Stuever 2020-11-11 19:18:49 UTC
I believe this is a race condition where an unrelated service account is deleted in between when the policy is fetched and when it is posted for an update (to add or remove a related service account). The action is attempted multiple times (with a backoff); however, it is not clear if it uses the same stale list of members on each retry. This is going to expodentially be a problem in our CI project as we increase the number of clusters we are testing concurrently and thereby increase the churn on these policies. Perhaps it is beneficial to use pass-through credentials for most of our repos to reduce this churn while somewhere still ensuring that the non pass-through credentials work as expected.

Comment 8 Matthew Staebler 2020-11-11 19:32:34 UTC
(In reply to Jeremiah Stuever from comment #7)
> I believe this is a race condition where an unrelated service account is
> deleted in between when the policy is fetched and when it is posted for an
> update (to add or remove a related service account). The action is attempted
> multiple times (with a backoff); however, it is not clear if it uses the
> same stale list of members on each retry. This is going to expodentially be
> a problem in our CI project as we increase the number of clusters we are
> testing concurrently and thereby increase the churn on these policies.
> Perhaps it is beneficial to use pass-through credentials for most of our
> repos to reduce this churn while somewhere still ensuring that the non
> pass-through credentials work as expected.

I don't think that there is even any backoff. Terraform treats the 400 error as non-recoverable. See https://github.com/openshift/installer/blob/8b21f015c42a3604e5737998780d17795bc831ff/vendor/github.com/terraform-providers/terraform-provider-google/google/iam.go#L151

Comment 9 Jeremiah Stuever 2020-11-11 19:41:19 UTC
(In reply to Matthew Staebler from comment #8)
> I don't think that there is even any backoff. Terraform treats the 400 error
> as non-recoverable. See
> https://github.com/openshift/installer/blob/
> 8b21f015c42a3604e5737998780d17795bc831ff/vendor/github.com/terraform-
> providers/terraform-provider-google/google/iam.go#L151

You appear to be correct, I confused this as being a 'conflict' scenario. With only a single pass attempt, the number of clusters we can concurrently have installing in the CI project will be even more directly limited by the chance of this collision.

Comment 10 Brenton Leanhardt 2020-11-30 19:00:49 UTC
We don't think this we be resolved in 4.7 due to the upstream dependencies.

Comment 12 Matthew Staebler 2020-12-14 15:45:46 UTC
*** Bug 1903720 has been marked as a duplicate of this bug. ***

Comment 13 jamo luhrsen 2020-12-17 21:19:21 UTC
this is still happening, fyi. I noticed it seemed to be a problem recently in getting 4.4-ci payload promotions:

https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.4.0-0.ci
https://search.ci.openshift.org/?maxAge=24h&type=all&search=Error%20setting%20IAM%20policy%20for%20project.*serviceaccount.com%20does%20not%20exist

Comment 16 Aravindh Puthiyaparambil 2021-01-13 21:19:25 UTC
More examples of this issue:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.7/1349320537217699840
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/7440/pull-ci-openshift-console-master-e2e-gcp-console/1349435665762226176
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/392/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade/1349434199265775616

The error is slightly different but according to @wking, the issue is similar:

evel=error msg=Error: Request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-brxc1n30-810bb-n49fh-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" returned error: Batch request and retried single request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-brxc1n30-810bb-n49fh-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" both failed. Final error: Error applying IAM policy for project "openshift-gce-devel-ci": Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 400: Service account ci-op-3tyk6539-f0666-qzmdc-w.gserviceaccount.com does not exist., badRequest

Comment 17 To Hung Sze 2021-01-14 14:28:58 UTC
From a hread in slack, the above (service account does not exist) is due to
"multiple jobs contending for some central GCP resource.  Hopefully as CI load quiets down tonight, the contention will decrease and the backlog of blocked PRs will be drained out by the retest bot"

Comment 18 Patrick Dillon 2021-01-22 15:40:37 UTC
This is a non-deterministic failure. So /retest your CI jobs if you hit this before this bug has been resolved, but the issue is more likely to occur during periods of heavy GCP use.

The issue occurs when a cluster installation tries to create a service account at the same as a different deprovisioning cluster is deleting its service accounts. Creating service accounts through terraform is a two-step process, where first the existing IAM policy is read and then the newly updated policy is written in a second step. If the policy pulled in the first step contains a service account which has been deleted by a different process, the API returns a 400 error, which Staebler pointed out https://bugzilla.redhat.com/show_bug.cgi?id=1896218#c8 is treated as an immediate failure. Here is an example of two conflicting jobs:

level=error
level=error msg=Error: Request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-1cth4672-75d12-w5x98-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" returned error: Batch request and retried single request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-1cth4672-75d12-w5x98-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" both failed. Final error: Error applying IAM policy for project "openshift-gce-devel-ci": Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 400: Service account ci-op-nsfmht7q-77eb3-29f7p-m.gserviceaccount.com does not exist., badRequest

The installer logs (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/7836/pull-ci-openshift-console-master-e2e-gcp-console/1351549945999527936/artifacts/e2e-gcp-console/ipi-install-install/.openshift_install.log) for the installing cluster show that the policy is pulled at approximately 15:41. We can see in the pulled policy that service accounts have already started to be deleted because the worker service account is marked as deleted:serviceAccount:ci-op-nsfmht7q-77eb3-29f7p-w. This install fails on the master service account, which is not marked as deleted in the pulled policy, but must have been deleted by the time the second step to write the policy is posted.

Looking at the logs for the deprovisioning cluster here: https://prow.ci.openshift.org/log?container=test&id=1351538594103169024&job=rehearse-15040-pull-ci-openshift-cluster-kube-apiserver-operator-master-k8s-e2e-gcp shows that the time is consistent. Deprovisioning started at 15:40 and ended at 15:44.


The upstream GCP provider code does intend to handle conflicts, so it seems like it might be possible to add a retry upstream. We should also update CI. Course of action:

1. Create issue/pr upstream
2. Change most CI jobs to use passthrough credentials (this was a good idea from Trevor) it should also increase GCP capacity
3. Investigate installer service accounts w/ pass through, this would likely be a separate bug.

Comment 19 Patrick Dillon 2021-01-22 17:44:00 UTC
Upstream PR: https://github.com/hashicorp/terraform-provider-google/pull/8281

This BZ should not depend on this PR. Reducing the number of service accounts we use in CI will be the mitigating factor.

Comment 20 Patrick Dillon 2021-01-25 21:25:15 UTC
I opened https://github.com/openshift/installer/pull/4589 which would delete IAM role bindings before the service accounts. By disassociating the service account from the policy, there should be no conflicts with other policies when deleting the service account.

I don't know how these conflicts could be reliably reproduced by QE, and we could continue to see this error even in the branches where this patch has merged if other branches continue to delete service accounts without removing them from the policies.

Comment 24 To Hung Sze 2021-02-01 23:43:01 UTC
Closing as the original problem hasn't been observed in last few days.
Please reopen if it happens again.

Comment 25 ravig 2021-02-17 00:05:03 UTC
Hitting this error again in CI:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1361704876559568896

I know that this bug has been targeted to 4.7.

Comment 26 Matthew Staebler 2021-02-17 05:41:41 UTC
I do not see convincing evidence that the leaking issue that this BZ addresses has not been fixed.

In the last 7 days, there have been 25 occurrences of this failure. Of those 25 failures, all but 3 of them occurred within a 1- to 2-hour window around noon EST.

There are currently 29 members in the compute.viewer role. Of those 16 are active, 12 are deleted, and 1 is the CI provisioner. I count at most 21 VPCs that could reasonably be associated with a cluster. So there are no more active members in the compute.viewer role than there are clusters.

The more likely culprit here is that there was a spike in the number of GCP clusters created resulting in temporarily exceeding the maximum size of the policy. Unfortunately, it does not take all that many clusters to exceed the limit.

Comment 27 Patrick Dillon 2021-02-17 20:40:21 UTC
The error message Ravi is seeing is a quota limit error, and is ultimately unrelated to what was determined to be the root cause of this bz. We don't have reason to suspect leaks. I have created https://issues.redhat.com/browse/CORS-1640 which should solve this issue.

I'm moving this back to Verified for that reason.

Regarding the status of this BZ, the PR for this BZ solved the root cause in 4.7 but as long as the PR is not cherry-picked we can still see the error in master. I just created a cherry-pick for 4.6 and there is a conflict in 4.5 so I will hold off until the need arises. CORS-1640 would also solve this problem.

Also the patch merged in the upstream GCP terraform provider, which would add retry logic. We will hopefully bring that in in 4.8.

Comment 30 errata-xmlrpc 2021-02-24 15:32:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633


Note You need to log in before you can comment on or make changes to this bug.