The job used to clean up clusters very regularly fails to clean up GCP networks because they have dependent resources that did not get cleaned up: https://prow.ci.openshift.org/?job=periodic-ipi-deprovision

Sample:
Job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ipi-deprovision/1336743701849837568
Output for that cluster: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ipi-deprovision/1336743701849837568/artifacts/deprovision/ci-op-4f9hpyqt-5ceb0-cw2hn/.openshift_install.log

The script that runs in this job is here: https://github.com/openshift/ci-tools/blob/45a30e1bea25badceb476602c2ab009f920560ba/cmd/ipi-deprovision/ipi-deprovision.sh#L1

Expectation: the job successfully cleans up clusters on GCP.
@aaleman Can you estimate how often "very regularly" is? I am struggling to find other examples of this beyond the sample linked. For the sample, the cluster installation was aborted, which could be what is causing the difficulty in the cleanup. Unfortunately, with an aborted install, the install logs are not captured, so it is hard to confirm how far the install had progressed when it was aborted.
About daily, I guess? You can go through the job history to find out: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ipi-deprovision
It is tedious to pin down because the job runs very frequently, and a single underlying failure can make many runs fail.
I found a few occurrences of this from Dec 4 [1]. One of those clusters was created recently enough that I could find the logs [2]. The install failed because it could not create one of the IAM members. We may have a situation where a failed installation leaves some resources in a state where the destroyer cannot find them, which in this case blocks the deletion of another resource.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ipi-deprovision/1334754122909356032
[2] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/13961/rehearse-13961-pull-ci-openshift-cluster-authentication-operator-master-e2e-agnostic-upgrade/1334514961300328448/build-log.txt
The install error from the previous comment is the following.

level=error msg=Error: Request "Create IAM Members roles/storage.admin serviceAccount:ci-op-8j7scnqg-28b57-ztn4z-w.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" returned error: Error applying IAM policy for project "openshift-gce-devel-ci": Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 400: Service account ci-op-st8y14-openshift-g-pcl7k.gserviceaccount.com does not exist., badRequest
level=error
level=error msg=  on ../tmp/openshift-install-220613468/iam/main.tf line 11, in resource "google_project_iam_member" "worker-storage-admin":
level=error msg=  11: resource "google_project_iam_member" "worker-storage-admin" {
level=error
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change

The destroy error is the following.

time="2020-12-04T07:05:30Z" level=debug msg="failed to delete network ci-op-6t11kq1y-5f4c2-hjrkp-network with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource 'projects/openshift-gce-devel-ci/global/networks/ci-op-6t11kq1y-5f4c2-hjrkp-network' is already being used by 'projects/openshift-gce-devel-ci/global/firewalls/k8s-fw-a976e64f33c9b4a1299d1e565af87c0f'"
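For what it's worth, the destroy failure above looks like an ordering problem: the k8s-fw-* firewall is the kind the in-cluster Kubernetes cloud provider creates (e.g. for a LoadBalancer Service), so it does not carry the installer's cluster infra-ID prefix and the destroyer appears to miss it, while GCP refuses to delete a network that still has firewalls attached. A minimal sketch of the ordering a cleanup would need, with stub shell functions standing in for the real gcloud/API calls (the function names and echoed output are illustrative, not part of any existing tool):

```shell
set -euo pipefail

network="ci-op-6t11kq1y-5f4c2-hjrkp-network"

# Stand-in for: gcloud compute firewall-rules list --filter="network:${1}" --format="value(name)"
list_firewalls_using() {
  printf '%s\n' "k8s-fw-a976e64f33c9b4a1299d1e565af87c0f"
}

# Stand-in for: gcloud compute firewall-rules delete "${1}" --quiet
delete_firewall() {
  echo "deleted firewall ${1}"
}

# Stand-in for: gcloud compute networks delete "${1}" --quiet
delete_network() {
  echo "deleted network ${1}"
}

# Delete dependents first, then the network; deleting the network while a
# firewall still references it is what produces the
# RESOURCE_IN_USE_BY_ANOTHER_RESOURCE error in the log above.
list_firewalls_using "${network}" | while read -r fw; do
  delete_firewall "${fw}"
done
delete_network "${network}"
```

Listing by the network a firewall attaches to, rather than by the cluster name prefix, is what catches the cloud-provider-created rules.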
For context: this has been discussed in prior bugs, and we have a JIRA card as well.
https://bugzilla.redhat.com/show_bug.cgi?id=1801968
https://bugzilla.redhat.com/show_bug.cgi?id=1875511
https://bugzilla.redhat.com/show_bug.cgi?id=1788708
https://issues.redhat.com/browse/CORS-1573
*** This bug has been marked as a duplicate of bug 1875511 ***