Occasionally in CI for at least the past 13 days:

$ w3m -dump -cols 200 'https://search.ci.openshift.org?maxAge=24h&type=all&search=Error%20setting%20IAM%20policy%20for%20project.*serviceaccount.com%20does%20not%20exist' | grep 'failures match' | sort
periodic-ci-openshift-cnv-cnv-ci-master-e2e-test-cron - 4 runs, 25% failed, 100% of failures match
periodic-ci-openshift-kni-cnf-features-deploy-sctpci-release-v4.3-cnf-sctp-ovn-gcp-periodic - 4 runs, 100% failed, 25% of failures match
pull-ci-opendatahub-io-odh-manifests-master-odh-manifests-e2e - 5 runs, 20% failed, 100% of failures match
...
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade - 4 runs, 100% failed, 25% of failures match
rehearse-13388-pull-ci-kubevirt-ssp-operator-release-4.8-e2e-functests - 3 runs, 100% failed, 33% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 2 runs, 50% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-serial-4.2 - 2 runs, 100% failed, 50% of failures match
release-openshift-origin-installer-e2e-gcp-4.2 - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-4.3 - 3 runs, 33% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-4.8 - 9 runs, 100% failed, 11% of failures match
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci - 4 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.4 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.3 - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-launch-gcp - 87 runs, 40% failed, 6% of failures match

Example job [1]:

level=error msg="Error: Request \"Create IAM Members roles/compute.viewer serviceAccount:ci-op-f9p4hy7m-ad214-g4lgc-w.gserviceaccount.com for \\\"project \\\\\\\"openshift-gce-devel-ci\\\\\\\"\\\"\" returned error: Error applying IAM policy for project \"openshift-gce-devel-ci\": Error setting IAM policy for project \"openshift-gce-devel-ci\": googleapi: Error 400: Service account ci-op-7ijd7tr8-f92fc-j2vtt-m.gserviceaccount.com does not exist., badRequest"
level=error
level=error msg=" on ../tmp/openshift-install-559599373/iam/main.tf line 6, in resource \"google_project_iam_member\" \"worker-compute-viewer\":"
level=error msg=" 6: resource \"google_project_iam_member\" \"worker-compute-viewer\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1325830980283404288
Looking at the logs here:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25672/pull-ci-openshift-origin-master-e2e-agnostic-cmd/1325942865557196800/artifacts/e2e-agnostic-cmd/ipi-install-install/.openshift_install.log

I notice that the failing API request has a very long list of service accounts listed as members. If you search for "POST /v1/projects/openshift-gce-devel-ci:setIamPolicy?" you can find the request. Many of those service accounts (e.g. the deleted ones) do not look relevant to me. Why are they being included in this API call? It may be worth investigating because (1) we don't want to be including extra service accounts by accident, and (2) if the list of included service accounts is growing out of control, it could be causing problems with the API. I haven't had much time to look into it, but I expected only one service account, the one created by the installer, to be included; I could be wrong if there's a special setup for CI or I'm misreading the code.
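For reference, a member list like the one in that POST can be dumped straight from the project policy, since setIamPolicy replaces the policy wholesale rather than patching it. The sketch below uses the Cloud Resource Manager Go client; the project name and the use of default credentials are assumptions, and this is only an inspection aid, not something the installer itself runs:

// dump_members.go: print every member bound to each role in the project IAM
// policy, i.e. roughly the member list any policy update has to carry.
package main

import (
	"context"
	"fmt"
	"log"

	cloudresourcemanager "google.golang.org/api/cloudresourcemanager/v1"
)

func main() {
	ctx := context.Background()
	// Uses Application Default Credentials; adjust for the CI environment.
	svc, err := cloudresourcemanager.NewService(ctx)
	if err != nil {
		log.Fatalf("failed to create Resource Manager client: %v", err)
	}

	// The CI project from the logs above.
	policy, err := svc.Projects.GetIamPolicy("openshift-gce-devel-ci",
		&cloudresourcemanager.GetIamPolicyRequest{}).Context(ctx).Do()
	if err != nil {
		log.Fatalf("getIamPolicy failed: %v", err)
	}

	for _, binding := range policy.Bindings {
		fmt.Printf("%s (%d members)\n", binding.Role, len(binding.Members))
		for _, member := range binding.Members {
			fmt.Printf("  %s\n", member)
		}
	}
}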
We are seeing a related error:

level=error msg=Error: Request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-s3jzbvw9-1354f-4w9q4-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" returned error: Batch request and retried single request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-s3jzbvw9-1354f-4w9q4-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" both failed. Final error: Error applying IAM policy for project "openshift-gce-devel-ci": Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 400: The number of members in the policy (1,501) is larger than the maximum allowed size 1,500., badRequest

CI run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1986/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326217434331353088

The main leak was fixed in September with this PR in master: https://github.com/openshift/installer/pull/4193 and several cherry-picks. I think a good resolution of this BZ would be:

1. Verify that we're not still leaking.
2. Check whether the leaked policies were correctly deleted (why are we hitting this limit? see the counting sketch below).
3. If possible, narrow down the service accounts that are included in the API requests, as discussed here: https://bugzilla.redhat.com/show_bug.cgi?id=1896218#c1
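For item 2, a rough tally of the policy size against that 1,500-member limit could be done along these lines. This is only a sketch: the project name is the CI project from the error, and treating every member entry across all bindings as one member is an approximation of how GCP counts toward the limit:

// policy_size.go: count member entries in the project IAM policy and how many
// of them are "deleted:" tombstones, as a rough comparison against the
// 1,500-member limit from the error above.
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	cloudresourcemanager "google.golang.org/api/cloudresourcemanager/v1"
)

func main() {
	ctx := context.Background()
	svc, err := cloudresourcemanager.NewService(ctx)
	if err != nil {
		log.Fatalf("failed to create Resource Manager client: %v", err)
	}

	policy, err := svc.Projects.GetIamPolicy("openshift-gce-devel-ci",
		&cloudresourcemanager.GetIamPolicyRequest{}).Context(ctx).Do()
	if err != nil {
		log.Fatalf("getIamPolicy failed: %v", err)
	}

	total, tombstoned := 0, 0
	for _, binding := range policy.Bindings {
		for _, member := range binding.Members {
			total++
			if strings.HasPrefix(member, "deleted:") {
				tombstoned++
			}
		}
	}
	fmt.Printf("member entries: %d (of which %d are deleted: tombstones); limit is 1,500\n", total, tombstoned)
}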
The QE project is also still seeing some IAM members left behind, marked as deleted:serviceAccount:xxx. It doesn't seem to happen with a (successful) manual IPI install/cleanup on the latest 4.6/4.7. I will provide more information if I see any pattern after doing more experiments, such as UPI and automation through Flexy.
Note: the service account that is not found and triggers the error is not the service account created in this installation run. The call to setIamPolicy includes the entire list of service account members in the request, and it looks like this account from another cluster has gone missing (been deleted). So this seems like a race.

> The QE project is also still seeing some IAM members left behind, marked as deleted:serviceAccount:xxx

That is expected. It looks like GCP marks them as deleted and then eventually deletes them. The problem there is that it takes up some of our quota.
(In reply to Patrick Dillon from comment #2)
> The main leak was fixed in September with this PR in master:
> https://github.com/openshift/installer/pull/4193 and several cherry-picks.

We cherry-picked back to 4.4, but CI still runs tests against 4.2 and 4.3 a couple of times a day, although that does not seem like quite enough volume to account for the number of leaked service accounts.
Jeremiah and Abhinav looked into 4.3 and earlier; the problem that introduced the leak was never present there. https://coreos.slack.com/archives/CBUT43E94/p1600449793021600?thread_ts=1600344979.193800&cid=CBUT43E94
I believe this is a race condition where an unrelated service account is deleted in between when the policy is fetched and when it is posted for an update (to add or remove a related service account). The action is attempted multiple times (with a backoff); however, it is not clear if it uses the same stale list of members on each retry. This is going to be an exponentially bigger problem in our CI project as we increase the number of clusters we are testing concurrently and thereby increase the churn on these policies. Perhaps it would be beneficial to use pass-through credentials for most of our repos to reduce this churn, while still ensuring somewhere that the non-pass-through credentials work as expected.
(In reply to Jeremiah Stuever from comment #7)
> I believe this is a race condition where an unrelated service account is
> deleted in between when the policy is fetched and when it is posted for an
> update (to add or remove a related service account). The action is attempted
> multiple times (with a backoff); however, it is not clear if it uses the
> same stale list of members on each retry. This is going to be an
> exponentially bigger problem in our CI project as we increase the number of
> clusters we are testing concurrently and thereby increase the churn on these
> policies. Perhaps it would be beneficial to use pass-through credentials for
> most of our repos to reduce this churn, while still ensuring somewhere that
> the non-pass-through credentials work as expected.

I don't think that there is even any backoff. Terraform treats the 400 error as non-recoverable. See https://github.com/openshift/installer/blob/8b21f015c42a3604e5737998780d17795bc831ff/vendor/github.com/terraform-providers/terraform-provider-google/google/iam.go#L151
(In reply to Matthew Staebler from comment #8)
> I don't think that there is even any backoff. Terraform treats the 400 error
> as non-recoverable. See
> https://github.com/openshift/installer/blob/8b21f015c42a3604e5737998780d17795bc831ff/vendor/github.com/terraform-providers/terraform-provider-google/google/iam.go#L151

You appear to be correct; I had mistaken this for a 'conflict' scenario. With only a single attempt, the number of clusters we can have installing concurrently in the CI project will be even more directly limited by the chance of this collision.
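For anyone following along, here is a minimal sketch of the read-modify-write the provider conceptually performs, written directly against the Cloud Resource Manager Go client rather than being the provider's actual code; the project, role, and member values are hypothetical. Any service account that another job deletes between the Get and the Set is still carried in the written policy, and that is what surfaces as the 400 above:

// race_window.go: sketch of the two-step policy update. A service account
// deleted elsewhere between GetIamPolicy and SetIamPolicy is still listed in
// the policy we write back, so the Set fails with the 400 "does not exist".
package main

import (
	"context"
	"log"

	cloudresourcemanager "google.golang.org/api/cloudresourcemanager/v1"
)

// addMember adds member to role in the project policy and writes the whole
// policy back. This mirrors the shape of the operation, not the provider code.
func addMember(ctx context.Context, svc *cloudresourcemanager.Service, project, role, member string) error {
	// Step 1: read the current policy, including every unrelated CI member.
	policy, err := svc.Projects.GetIamPolicy(project,
		&cloudresourcemanager.GetIamPolicyRequest{}).Context(ctx).Do()
	if err != nil {
		return err
	}

	// Mutate only the binding for the requested role.
	var binding *cloudresourcemanager.Binding
	for _, b := range policy.Bindings {
		if b.Role == role {
			binding = b
			break
		}
	}
	if binding == nil {
		binding = &cloudresourcemanager.Binding{Role: role}
		policy.Bindings = append(policy.Bindings, binding)
	}
	binding.Members = append(binding.Members, member)

	// Step 2: write the whole policy back. The etag carried in the policy only
	// protects against concurrent policy writes, not against one of the listed
	// members' service accounts being deleted in the meantime.
	_, err = svc.Projects.SetIamPolicy(project,
		&cloudresourcemanager.SetIamPolicyRequest{Policy: policy}).Context(ctx).Do()
	return err
}

func main() {
	ctx := context.Background()
	svc, err := cloudresourcemanager.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical values for illustration only.
	err = addMember(ctx, svc, "openshift-gce-devel-ci", "roles/compute.viewer",
		"serviceAccount:example-w@openshift-gce-devel-ci.iam.gserviceaccount.com")
	if err != nil {
		log.Fatal(err)
	}
}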
We don't think this will be resolved in 4.7 due to the upstream dependencies.
*** Bug 1903720 has been marked as a duplicate of this bug. ***
This is still happening, FYI. I noticed it recently seemed to be a problem for getting 4.4-ci payload promotions:

https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.4.0-0.ci

https://search.ci.openshift.org/?maxAge=24h&type=all&search=Error%20setting%20IAM%20policy%20for%20project.*serviceaccount.com%20does%20not%20exist
Some recent examples of this in 4.7 jobs:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.7/1346408197933502464
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.7/1346119130780012544

Not sure if it's related or not, but an earlier run of the same job hit some GCP quota-exceeded issues:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.7/1346109074986504192

The same search from comment #13 shows the frequency with which we are hitting this.
We're hitting the same for machine-api CI runs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/101/pull-ci-openshift-cluster-machine-approver-master-e2e-gcp-operator/1347197931492478976
More examples of this issue:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.7/1349320537217699840
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/7440/pull-ci-openshift-console-master-e2e-gcp-console/1349435665762226176
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/392/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade/1349434199265775616

The error is slightly different, but according to @wking the issue is similar:

level=error msg=Error: Request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-brxc1n30-810bb-n49fh-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" returned error: Batch request and retried single request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-brxc1n30-810bb-n49fh-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" both failed. Final error: Error applying IAM policy for project "openshift-gce-devel-ci": Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 400: Service account ci-op-3tyk6539-f0666-qzmdc-w.gserviceaccount.com does not exist., badRequest
From a thread in Slack, the above (service account does not exist) is due to "multiple jobs contending for some central GCP resource. Hopefully as CI load quiets down tonight, the contention will decrease and the backlog of blocked PRs will be drained out by the retest bot".
This is a non-deterministic failure, so /retest your CI jobs if you hit this before the bug has been resolved; the issue is more likely to occur during periods of heavy GCP use.

The issue occurs when a cluster installation tries to set up its service accounts at the same time as a different, deprovisioning cluster is deleting its service accounts. Granting roles to the new service accounts through Terraform is a two-step process: first the existing IAM policy is read, and then the updated policy is written back in a second step. If the policy pulled in the first step contains a service account which has been deleted by a different process, the API returns a 400 error, which, as Staebler pointed out in https://bugzilla.redhat.com/show_bug.cgi?id=1896218#c8, is treated as an immediate failure.

Here is an example of two conflicting jobs:

level=error
level=error msg=Error: Request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-1cth4672-75d12-w5x98-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" returned error: Batch request and retried single request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-1cth4672-75d12-w5x98-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" both failed. Final error: Error applying IAM policy for project "openshift-gce-devel-ci": Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 400: Service account ci-op-nsfmht7q-77eb3-29f7p-m.gserviceaccount.com does not exist., badRequest

The installer logs (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/7836/pull-ci-openshift-console-master-e2e-gcp-console/1351549945999527936/artifacts/e2e-gcp-console/ipi-install-install/.openshift_install.log) for the installing cluster show that the policy is pulled at approximately 15:41. We can see in the pulled policy that service accounts had already started to be deleted, because the worker service account is marked as deleted:serviceAccount:ci-op-nsfmht7q-77eb3-29f7p-w. This install fails on the master service account, which is not marked as deleted in the pulled policy but must have been deleted by the time the second step to write the policy is posted. The logs for the deprovisioning cluster (https://prow.ci.openshift.org/log?container=test&id=1351538594103169024&job=rehearse-15040-pull-ci-openshift-cluster-kube-apiserver-operator-master-k8s-e2e-gcp) show that the timing is consistent: deprovisioning started at 15:40 and ended at 15:44.

The upstream GCP provider code does intend to handle conflicts, so it seems like it might be possible to add a retry upstream (a rough sketch of what such a retry could look like is below). We should also update CI.

Course of action:
1. Create an issue/PR upstream.
2. Change most CI jobs to use passthrough credentials (this was a good idea from Trevor); it should also increase GCP capacity.
3. Investigate installer service accounts with passthrough credentials; this would likely be a separate bug.
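Roughly, the retry under discussion could look like the following. This is only a conceptual sketch, not the provider's code or an eventual upstream patch; the error matching, backoff, and tombstone pruning are assumptions:

// retry_set_policy.go: conceptual retry for the "Service account ... does not
// exist" 400: re-read the policy so the member list and etag are fresh, drop
// tombstoned members, and try the write again.
package main

import (
	"context"
	"errors"
	"log"
	"strings"
	"time"

	cloudresourcemanager "google.golang.org/api/cloudresourcemanager/v1"
	"google.golang.org/api/googleapi"
)

func hasMember(members []string, member string) bool {
	for _, m := range members {
		if m == member {
			return true
		}
	}
	return false
}

// grantWithRetry adds member to role, re-reading the policy and retrying when
// the write fails because an unrelated service account vanished mid-flight.
func grantWithRetry(ctx context.Context, svc *cloudresourcemanager.Service, project, role, member string, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		// Always start from a fresh read so the member list and etag are current.
		policy, err := svc.Projects.GetIamPolicy(project,
			&cloudresourcemanager.GetIamPolicyRequest{}).Context(ctx).Do()
		if err != nil {
			return err
		}

		// Drop members GCP has already tombstoned and add ours to the role.
		found := false
		for _, b := range policy.Bindings {
			kept := b.Members[:0]
			for _, m := range b.Members {
				if !strings.HasPrefix(m, "deleted:") {
					kept = append(kept, m)
				}
			}
			b.Members = kept
			if b.Role == role {
				found = true
				if !hasMember(b.Members, member) {
					b.Members = append(b.Members, member)
				}
			}
		}
		if !found {
			policy.Bindings = append(policy.Bindings,
				&cloudresourcemanager.Binding{Role: role, Members: []string{member}})
		}

		_, lastErr = svc.Projects.SetIamPolicy(project,
			&cloudresourcemanager.SetIamPolicyRequest{Policy: policy}).Context(ctx).Do()
		if lastErr == nil {
			return nil
		}
		var gerr *googleapi.Error
		if errors.As(lastErr, &gerr) && gerr.Code == 400 && strings.Contains(gerr.Message, "does not exist") {
			// Another job deleted an account between our Get and Set; back off and retry.
			time.Sleep(time.Duration(i+1) * 2 * time.Second)
			continue
		}
		return lastErr // anything else stays fatal, matching current behavior
	}
	return lastErr
}

func main() {
	ctx := context.Background()
	svc, err := cloudresourcemanager.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical member for illustration only.
	if err := grantWithRetry(ctx, svc, "openshift-gce-devel-ci", "roles/compute.viewer",
		"serviceAccount:example-w@openshift-gce-devel-ci.iam.gserviceaccount.com", 5); err != nil {
		log.Fatal(err)
	}
}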
Upstream PR: https://github.com/hashicorp/terraform-provider-google/pull/8281

This BZ should not depend on this PR. Reducing the number of service accounts we use in CI will be the mitigating factor.
I opened https://github.com/openshift/installer/pull/4589, which would delete IAM role bindings before the service accounts. By disassociating the service account from the policy first, there should be no conflicts with other policies when deleting the service account. I don't know how QE could reliably reproduce these conflicts, and we could continue to see this error even in branches where this patch has merged if other branches continue to delete service accounts without removing them from the policies.
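In case it helps review, the teardown ordering that PR is going for looks conceptually like this. The sketch below uses the Resource Manager and IAM Go clients directly with hypothetical identifiers, and is not the installer's destroy code:

// teardown_order.go: strip the cluster's service account out of the project
// policy first, then delete the account itself, so no other job's policy write
// races against a member that exists in the policy but no longer in IAM.
package main

import (
	"context"
	"fmt"
	"log"

	cloudresourcemanager "google.golang.org/api/cloudresourcemanager/v1"
	iam "google.golang.org/api/iam/v1"
)

func main() {
	ctx := context.Background()
	crm, err := cloudresourcemanager.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}
	iamSvc, err := iam.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	project := "openshift-gce-devel-ci"
	email := "example-w@openshift-gce-devel-ci.iam.gserviceaccount.com" // hypothetical
	member := "serviceAccount:" + email

	// 1. Remove the member from every role binding in the project policy.
	policy, err := crm.Projects.GetIamPolicy(project,
		&cloudresourcemanager.GetIamPolicyRequest{}).Context(ctx).Do()
	if err != nil {
		log.Fatal(err)
	}
	for _, b := range policy.Bindings {
		kept := b.Members[:0]
		for _, m := range b.Members {
			if m != member {
				kept = append(kept, m)
			}
		}
		b.Members = kept
	}
	if _, err := crm.Projects.SetIamPolicy(project,
		&cloudresourcemanager.SetIamPolicyRequest{Policy: policy}).Context(ctx).Do(); err != nil {
		log.Fatal(err)
	}

	// 2. Only then delete the service account itself.
	name := fmt.Sprintf("projects/-/serviceAccounts/%s", email)
	if _, err := iamSvc.Projects.ServiceAccounts.Delete(name).Context(ctx).Do(); err != nil {
		log.Fatal(err)
	}
}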
Closing as the original problem hasn't been observed in the last few days. Please reopen if it happens again.
Hitting this error again in CI:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1361704876559568896

I know that this bug has been targeted to 4.7.
I do not see convincing evidence that the leak this BZ addresses remains unfixed. In the last 7 days, there have been 25 occurrences of this failure. Of those 25 failures, all but 3 occurred within a 1- to 2-hour window around noon EST.

There are currently 29 members in the compute.viewer role. Of those, 16 are active, 12 are deleted, and 1 is the CI provisioner. I count at most 21 VPCs that could reasonably be associated with a cluster, so there are no more active members in the compute.viewer role than there are clusters.

The more likely culprit here is a spike in the number of GCP clusters created, temporarily exceeding the maximum size of the policy. Unfortunately, it does not take all that many clusters to exceed the limit.
The error message Ravi is seeing is a quota-limit error and is ultimately unrelated to what was determined to be the root cause of this BZ; we don't have reason to suspect leaks. I have created https://issues.redhat.com/browse/CORS-1640, which should solve this issue, and I'm moving this back to Verified for that reason.

Regarding the status of this BZ: the PR for this BZ solved the root cause in 4.7, but as long as the PR is not cherry-picked we can still see the error in master. I just created a cherry-pick for 4.6; there is a conflict in 4.5, so I will hold off until the need arises. CORS-1640 would also solve this problem. Also, the patch that adds retry logic merged in the upstream GCP Terraform provider; we will hopefully bring that in for 4.8.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633