Bug 1903720

Summary:	Occasional GCP install failures: Error setting IAM policy for project ...: googleapi: Error 400: Service account ... does not exist., badRequest
Product:	OpenShift Container Platform	Reporter:	Sohan Kunkerkar <skunkerk>
Component:	Installer	Assignee:	aos-install
Installer sub component:	openshift-installer	QA Contact:	Gaoyun Pei <gpei>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	medium
Priority:	medium	CC:	aos-install, bleanhar, mstaeble, padillon, tsze, wking, yanyang
Version:	4.7
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1896218	Environment:
Last Closed:	2020-12-14 15:45:48 UTC	Type:	---
Bug Depends On:	1896218
Bug Blocks:

Description Sohan Kunkerkar 2020-12-02 17:12:29 UTC

+++ This bug was initially created as a clone of Bug #1896218 +++

Occasionally in CI for at least the past 13 days:

$ w3m -dump -cols 200 'https://search.ci.openshift.org?maxAge=24h&type=all&search=Error%20setting%20IAM%20policy%20for%20project.*serviceaccount.com%20does%20not%20exist' | grep 'failures match' | sort
periodic-ci-openshift-cnv-cnv-ci-master-e2e-test-cron - 4 runs, 25% failed, 100% of failures match
periodic-ci-openshift-kni-cnf-features-deploy-sctpci-release-v4.3-cnf-sctp-ovn-gcp-periodic - 4 runs, 100% failed, 25% of failures match
pull-ci-opendatahub-io-odh-manifests-master-odh-manifests-e2e - 5 runs, 20% failed, 100% of failures match
...
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade - 4 runs, 100% failed, 25% of failures match
rehearse-13388-pull-ci-kubevirt-ssp-operator-release-4.8-e2e-functests - 3 runs, 100% failed, 33% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 2 runs, 50% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-serial-4.2 - 2 runs, 100% failed, 50% of failures match
release-openshift-origin-installer-e2e-gcp-4.2 - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-4.3 - 3 runs, 33% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-4.8 - 9 runs, 100% failed, 11% of failures match
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci - 4 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.4 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.3 - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-launch-gcp - 87 runs, 40% failed, 6% of failures match

Example job [1]:

level=error msg="Error: Request \"Create IAM Members roles/compute.viewer serviceAccount:ci-op-f9p4hy7m-ad214-g4lgc-w.gserviceaccount.com for \\\"project \\\\\\\"openshift-gce-devel-ci\\\\\\\"\\\"\" returned error: Error applying IAM policy for project \"openshift-gce-devel-ci\": Error setting IAM policy for project \"openshift-gce-devel-ci\": googleapi: Error 400: Service account ci-op-7ijd7tr8-f92fc-j2vtt-m.gserviceaccount.com does not exist., badRequest"
level=error
level=error msg="  on ../tmp/openshift-install-559599373/iam/main.tf line 6, in resource \"google_project_iam_member\" \"worker-compute-viewer\":"
level=error msg="   6: resource \"google_project_iam_member\" \"worker-compute-viewer\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1325830980283404288

--- Additional comment from Patrick Dillon on 2020-11-10 15:23:16 UTC ---

Looking at the logs here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25672/pull-ci-openshift-origin-master-e2e-agnostic-cmd/1325942865557196800/artifacts/e2e-agnostic-cmd/ipi-install-install/.openshift_install.log

I notice that the failing API request lists a very large number of service accounts as members. If you search for "POST /v1/projects/openshift-gce-devel-ci:setIamPolicy?" you can find the request. To me, many of those service accounts (e.g. deleted ones) do not look relevant. Why are they being included in this API call? It may be worth investigating because (1) we don't want to include extra service accounts by accident, and (2) if the list of included service accounts is growing out of control, it could cause problems with the API.

I haven't had much time to look into it, but I expected only one service account, the one created by the installer, to be included. I could be wrong if there's a special setup for CI or I'm misreading the code.

--- Additional comment from Patrick Dillon on 2020-11-10 20:31:28 UTC ---

We are seeing a related error:

level=error msg=Error: Request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-s3jzbvw9-1354f-4w9q4-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" returned error: Batch request and retried single request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-s3jzbvw9-1354f-4w9q4-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" both failed. Final error: Error applying IAM policy for project "openshift-gce-devel-ci": Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 400: The number of members in the policy (1,501) is larger than the maximum allowed size 1,500., badRequest

CI run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1986/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326217434331353088

The main leak was fixed in September with this PR in master: https://github.com/openshift/installer/pull/4193 and several cherry picks. 

I think a good resolution of this BZ would be:

1. Verify that we're not still leaking
2. Check whether the leaked policies were correctly deleted (why are we hitting this limit?)
3. If possible, narrow down the service accounts that are included in the API requests as discussed here: https://bugzilla.redhat.com/show_bug.cgi?id=1896218#c1
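
For items 1 and 2, a minimal sketch of how the policy could be audited offline. It assumes the policy has been exported as JSON (for example with `gcloud projects get-iam-policy PROJECT --format=json`) and loaded into a dict with the usual `bindings` / `members` layout. The function name `summarize_policy` and the sample member names are hypothetical. The 1,500 figure is taken from the error message above; whether the limit counts every appearance of a member or only unique members is an assumption here, so both numbers are reported.

```python
MAX_MEMBERS = 1500  # limit quoted in the setIamPolicy error message


def summarize_policy(policy):
    """Count member appearances and collect members marked as deleted."""
    appearances = 0
    unique = set()
    deleted = set()
    for binding in policy.get("bindings", []):
        members = binding.get("members", [])
        appearances += len(members)
        unique.update(members)
        # Leftover entries show up with a "deleted:" prefix.
        deleted.update(m for m in members if m.startswith("deleted:"))
    return {
        "appearances": appearances,
        "unique": len(unique),
        "deleted": sorted(deleted),
        # Assumption: the limit is applied to total appearances.
        "over_limit": appearances > MAX_MEMBERS,
    }


# Hypothetical sample policy, not taken from the CI project.
sample = {
    "bindings": [
        {
            "role": "roles/compute.viewer",
            "members": [
                "serviceAccount:a@example.iam.gserviceaccount.com",
                "deleted:serviceAccount:b@example.iam.gserviceaccount.com?uid=1",
            ],
        },
        {
            "role": "roles/storage.admin",
            "members": ["serviceAccount:a@example.iam.gserviceaccount.com"],
        },
    ]
}
summary = summarize_policy(sample)
```

Running this periodically against the CI project would show whether the deleted entries are draining or still accumulating.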

--- Additional comment from To Hung Sze on 2020-11-11 14:50:36 UTC ---

QE project is also still seeing some IAM left behind marked as deleted:serviceAccount:xxx

It doesn't seem to happen with (successful) manual IPI install / cleanup with latest 4.6 / 4.7.

Will provide more information if I see any pattern after doing some more experiments like UPI and automation through Flexy.

--- Additional comment from Patrick Dillon on 2020-11-11 14:56:35 UTC ---

Note: the service account that is not found and triggers the error is not the service account created in this installation run. The call to setIAMPolicy includes the entire list of service account members in the request and it looks like this account from another cluster has gone missing (been deleted). So this seems like a race. 

> QE project is also still seeing some IAM left behind marked as deleted:serviceAccount:xxx

That is expected. It looks like GCP marks them as deleted and then eventually removes them. The problem is that, until then, they count against our quota.

--- Additional comment from Matthew Staebler on 2020-11-11 15:03:11 UTC ---

(In reply to Patrick Dillon from comment #2)
> The main leak was fixed in September with this PR in master:
> https://github.com/openshift/installer/pull/4193 and several cherry picks. 

We cherry-picked back to 4.4, but CI still runs tests against 4.2 and 4.3 a couple of times a day. That does not seem like enough volume to account for the number of leaked service accounts, though.

--- Additional comment from Scott Dodson on 2020-11-11 15:52:34 UTC ---

Jeremiah and Abhinav looked into 4.3 and earlier; the problem that introduced the leak was never present there.

https://coreos.slack.com/archives/CBUT43E94/p1600449793021600?thread_ts=1600344979.193800&cid=CBUT43E94

--- Additional comment from Jeremiah Stuever on 2020-11-11 19:18:49 UTC ---

I believe this is a race condition where an unrelated service account is deleted in between when the policy is fetched and when it is posted for an update (to add or remove a related service account). The action is attempted multiple times (with a backoff); however, it is not clear if it uses the same stale list of members on each retry. This is going to become an exponentially bigger problem in our CI project as we increase the number of clusters we are testing concurrently and thereby increase the churn on these policies. Perhaps it would be beneficial to use pass-through credentials for most of our repos to reduce this churn, while still ensuring somewhere that the non-pass-through credentials work as expected.
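
The race described above can be modeled with a small in-memory sketch. Nothing here calls GCP or Terraform: `FakeProject`, `MemberNotFound`, `add_member`, and the account names are all hypothetical, and the model simplifies by dropping a deleted account from the policy immediately. It only illustrates why a retry has to re-read the policy rather than resend the stale member list; etag-based conflict detection is not modeled.

```python
class MemberNotFound(Exception):
    """Stands in for the 400 'Service account ... does not exist' response."""


class FakeProject:
    """In-memory model of one role binding; not a real GCP client."""

    def __init__(self, accounts):
        self.accounts = set(accounts)  # service accounts that currently exist
        self.members = set(accounts)   # members currently bound to the role

    def create_account(self, name):
        self.accounts.add(name)

    def delete_account(self, name):
        # Simplification: the member disappears from the policy at once.
        self.accounts.discard(name)
        self.members.discard(name)

    def get_policy(self):
        return set(self.members)  # a copy, like a fetched policy

    def set_policy(self, members):
        missing = set(members) - self.accounts
        if missing:
            raise MemberNotFound(sorted(missing))
        self.members = set(members)


def add_member(project, member, attempts=3, after_first_read=None):
    """Read-modify-write that re-reads the policy on every attempt."""
    last_error = None
    for attempt in range(1, attempts + 1):
        members = project.get_policy()  # fresh read, never a cached list
        if attempt == 1 and after_first_read is not None:
            after_first_read()  # hook used to simulate a concurrent change
        members.add(member)
        try:
            project.set_policy(members)
            return attempt
        except MemberNotFound as err:
            last_error = err
    raise last_error


project = FakeProject({"cluster-a-w", "cluster-b-w"})
project.create_account("cluster-c-w")
# Another cluster's account is deleted between the fetch and the post.
attempts_used = add_member(
    project,
    "cluster-c-w",
    after_first_read=lambda: project.delete_account("cluster-a-w"),
)
```

In this model the first attempt fails on the unrelated, now-missing account and the second succeeds because it starts from a fresh read. With `attempts=1` the same scenario fails outright.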

--- Additional comment from Matthew Staebler on 2020-11-11 19:32:34 UTC ---

(In reply to Jeremiah Stuever from comment #7)
> I believe this is a race condition where an unrelated service account is
> deleted in between when the policy is fetched and when it is posted for an
> update (to add or remove a related service account). The action is attempted
> multiple times (with a backoff); however, it is not clear if it uses the
> same stale list of members on each retry. This is going to exponentially be
> a problem in our CI project as we increase the number of clusters we are
> testing concurrently and thereby increase the churn on these policies.
> Perhaps it is beneficial to use pass-through credentials for most of our
> repos to reduce this churn while somewhere still ensuring that the non
> pass-through credentials work as expected.

I don't think that there is even any backoff. Terraform treats the 400 error as non-recoverable. See https://github.com/openshift/installer/blob/8b21f015c42a3604e5737998780d17795bc831ff/vendor/github.com/terraform-providers/terraform-provider-google/google/iam.go#L151

--- Additional comment from Jeremiah Stuever on 2020-11-11 19:41:19 UTC ---

(In reply to Matthew Staebler from comment #8)
> I don't think that there is even any backoff. Terraform treats the 400 error
> as non-recoverable. See
> https://github.com/openshift/installer/blob/8b21f015c42a3604e5737998780d17795bc831ff/vendor/github.com/terraform-providers/terraform-provider-google/google/iam.go#L151

You appear to be correct; I mistook this for a 'conflict' scenario. With only a single attempt, the number of clusters we can have installing concurrently in the CI project will be even more directly limited by the chance of this collision.

--- Additional comment from Brenton Leanhardt on 2020-11-30 19:00:49 UTC ---

We don't think this will be resolved in 4.7 due to the upstream dependencies.

Comment 1 Matthew Staebler 2020-12-14 15:45:48 UTC


*** This bug has been marked as a duplicate of bug 1896218 ***