Bug 1875511

Summary: openshift-install destroy cluster fails to delete a network in GCP
Product: OpenShift Container Platform
Reporter: Petr Muller <pmuller>
Component: Installer
Assignee: aos-install
Installer sub component: openshift-installer
QA Contact: To Hung Sze <tsze>
Status: CLOSED DEFERRED
Severity: medium
Priority: low
CC: aaleman, adahiya, bleanhar, tsze, wking, yanyang
Version: 4.5
Keywords: UpcomingSprint
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-11-02 19:05:53 UTC
Type: Bug

Description Petr Muller 2020-09-03 16:13:46 UTC
Description of problem:

The DPTP ipi-deprovisioner tool that runs openshift-install destroy cluster [1] gets stuck while deleting a network and logs the following message:

level=debug msg="Networks: failed to delete network ci-op-sq9x1it6-0df6f-kdt74-network with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource 'projects/openshift-gce-devel-ci/global/networks/ci-op-sq9x1it6-0df6f-kdt74-network' is already being used by 'projects/openshift-gce-devel-ci/global/firewalls/k8s-a091b5cea9ce44d1589ce122fe0b62bb-http-hc'"

[1] https://github.com/openshift/ci-tools/blob/f977bb476cfacf74b8ecea1df1178a13cfa7a3e3/cmd/ipi-deprovision/ipi-deprovision.sh#L29-L60

Example occurrence: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ipi-deprovision/1301547877104881664#1:build-log.txt%3A444
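
The debug message quoted above names both the network and the resource that blocks its deletion. As a minimal sketch for triaging such log lines, the following Python pulls the two resource paths out of the message. The regular expression is an assumption based only on the single log line in this report, and `blocking_resource` is a hypothetical helper name, not part of openshift-install or the ipi-deprovision script.

```python
import re

# Pattern derived from the one error message quoted in this report.
# Other wordings of the GCP error are not covered (assumption).
IN_USE_RE = re.compile(
    r"The network resource '(?P<network>[^']+)' "
    r"is already being used by '(?P<blocker>[^']+)'"
)


def blocking_resource(log_line):
    """Return (network, blocker) resource paths, or None if the line does not match."""
    match = IN_USE_RE.search(log_line)
    if match is None:
        return None
    return match.group("network"), match.group("blocker")


line = (
    "Networks: failed to delete network ci-op-sq9x1it6-0df6f-kdt74-network "
    "with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource "
    "'projects/openshift-gce-devel-ci/global/networks/ci-op-sq9x1it6-0df6f-kdt74-network' "
    "is already being used by "
    "'projects/openshift-gce-devel-ci/global/firewalls/k8s-a091b5cea9ce44d1589ce122fe0b62bb-http-hc'"
)
result = blocking_resource(line)
print(result)
```

For the line in this report, the blocker is the leftover firewall rule, which is the resource that has needed manual cleanup.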


How reproducible:
Roughly 1-2 times per week, our CI produces something like this, and it needs manual intervention.

Comment 1 Abhinav Dahiya 2020-09-08 16:53:23 UTC
The health checks are created with random names, and the only way the installer can associate them with a cluster is to look up which LB -> which machines -> which cluster. So if the machines are gone, there is no way for us to re-associate them.
Secondly, the de-provision script is run on the same cluster multiple times, with previously _deleted / left around_ clusters, which makes this problem more apparent. There is no good way to circumvent this unless we involve upstream to tag them appropriately.

Will need a lot more work and planning, moving to 4.7
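
The lookup chain described in this comment (health check -> LB -> machines -> cluster) can be illustrated with a small in-memory sketch. This is not the installer's code: the function name, the dictionaries, and all resource names below are hypothetical and exist only to show why the association is lost once the machines are deleted.

```python
# Hypothetical in-memory model of the lookup chain; the names and data
# layout are illustrative and are not the installer's data structures.


def owning_cluster(health_check, lb_by_health_check, machines_by_lb, cluster_by_machine):
    """Walk health check -> LB -> machines -> cluster; return None if any link is missing."""
    lb = lb_by_health_check.get(health_check)
    if lb is None:
        return None
    for machine in machines_by_lb.get(lb, []):
        cluster = cluster_by_machine.get(machine)
        if cluster is not None:
            return cluster
    # No surviving machine maps to a cluster, so ownership is unknown.
    return None


lb_by_health_check = {"hc-random-1": "lb-1"}
machines_by_lb = {"lb-1": ["worker-a", "worker-b"]}

# While the machines still exist, the chain resolves to a cluster.
before = owning_cluster(
    "hc-random-1", lb_by_health_check, machines_by_lb, {"worker-a": "ci-op-example"}
)

# After the machines are deleted, the same health check can no longer be attributed.
after = owning_cluster("hc-random-1", lb_by_health_check, machines_by_lb, {})
print(before, after)
```

In this model, a second destroy run that starts after the machines are gone sees only the randomly named health check and has nothing to link it back to the cluster, which matches the failure mode described here.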

Comment 4 Abhinav Dahiya 2020-10-12 17:47:28 UTC
*** Bug 1801968 has been marked as a duplicate of this bug. ***

Comment 5 To Hung Sze 2020-10-14 14:36:15 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1801968 was closed as a duplicate of this bug.

Comment 7 Abhinav Dahiya 2020-11-02 18:27:13 UTC
https://issues.redhat.com/browse/CORS-1573 should be good enough to also include this fix.

Comment 8 Brenton Leanhardt 2020-11-02 19:05:53 UTC
Thanks.  We'll track the work for this in Jira.

Comment 9 Matthew Staebler 2020-12-11 17:46:07 UTC
*** Bug 1906172 has been marked as a duplicate of this bug. ***