Bug 1807037 - Cluster does not work any longer after a failed cluster with the same name is destroyed
Summary: Cluster does not work any longer after a failed cluster with the same name is destroyed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.3.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.3.z
Assignee: Abhinav Dahiya
QA Contact: yangyang
URL:
Whiteboard:
Depends On: 1807036
Blocks:
 
Reported: 2020-02-25 13:58 UTC by Scott Dodson
Modified: 2020-03-24 14:34 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1807036
Environment:
Last Closed: 2020-03-24 14:33:46 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift installer pull 3185 None closed [release-4.3] Bug 1807037: data/data/gcp,azure: block private dns zone on public record 2020-05-06 02:10:00 UTC
Red Hat Product Errata RHBA-2020:0858 None None None 2020-03-24 14:34:14 UTC

Description Scott Dodson 2020-02-25 13:58:02 UTC
+++ This bug was initially created as a clone of Bug #1807036 +++

+++ This bug was initially created as a clone of Bug #1775873 +++

Description of problem:

Cluster does not work after a failed cluster with the same name is destroyed

Version-Release number of the following components:
# ./openshift-install version
./openshift-install v4.3.0
built from commit a702fd4beb593932067fe1b31f2d911feaa6d93e
release image registry.svc.ci.openshift.org/ocp/release@sha256:15132234d0b753aea0af00f5cfff429cf5eca57513e7cce530207b05167a999f


How reproducible:
100%

Steps to Reproduce:
1. Create cluster #1 
# openshift-install create cluster --dir cluster1
Platform : gcp
Project ID: openshift-qe
Base Domain: qe.gcp.devcluster.openshift.com
Cluster Name: yangyang

2. Create cluster #2
# openshift-install create cluster --dir cluster2
Make sure cluster #2 has the same cluster name and base domain as cluster #1

Platform : gcp
Project ID: openshift-qe
Base Domain: qe.gcp.devcluster.openshift.com
Cluster Name: yangyang

3. Make sure cluster #1 works well after installation completes

4. Destroy failed cluster #2

Actual results:
Cluster #1 does not work any longer after cluster #2 is destroyed
# oc get  co
Unable to connect to the server: dial tcp: lookup api.yangyang.qe.gcp.devcluster.openshift.com on 10.11.5.19:53: no such host

Expected results:
Cluster #1 still works after cluster #2 is destroyed

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

--- Additional comment from Scott Dodson on 2019-12-02 10:14:33 EST ---

This does not seem like a workflow that a customer is likely to encounter.

--- Additional comment from yangyang on 2020-02-19 02:28:25 EST ---

Scott Dodson, as it was deferred to 4.4 but with closed state, is it going to be fixed in 4.4?

--- Additional comment from Scott Dodson on 2020-02-19 09:14:22 EST ---

(In reply to yangyang from comment #2)
> Scott Dodson, as it was deferred to 4.4 but with closed state, is it going
> to be fixed in 4.4?

No. We feel that this is working as expected, and it is not something a customer is likely to do.

--- Additional comment from yangyang on 2020-02-19 22:39:35 EST ---

Scott Dodson, although it's an edge scenario, it can happen by chance. If a customer creates a cluster with a name that is already used by an up-and-running cluster, the running cluster stops working once the customer destroys the failed cluster. From a UX perspective, it deserves a fix.

I think we can look up the DNS and validate the cluster name before creating resources. Thanks
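The pre-flight check suggested above can be sketched as follows. This is a hypothetical illustration, not installer code; the function name `cluster_name_in_use` and the assumption that an existing cluster is detectable via its `api.<cluster>.<domain>` record are mine:

```python
import socket

def cluster_name_in_use(cluster_name: str, base_domain: str) -> bool:
    """Return True if the cluster's API hostname already resolves,
    which suggests a cluster with this name already exists.

    Hypothetical pre-flight check; not part of openshift-install."""
    api_host = f"api.{cluster_name}.{base_domain}"
    try:
        # Port 6443 is the Kubernetes API port; only resolution matters here.
        socket.getaddrinfo(api_host, 6443)
        return True
    except socket.gaierror:
        return False
```

Run before resource creation, a check like `cluster_name_in_use("yangyang", "qe.gcp.devcluster.openshift.com")` returning True would let the installer refuse the duplicate name up front instead of sharing DNS records with the existing cluster.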

--- Additional comment from Scott Dodson on 2020-02-20 11:55:50 EST ---

This is not a technically straightforward problem to solve.

On AWS, we defer removing resources that cannot be attributed to a unique cluster until the resources that can be attributed directly to a unique cluster have been removed, which lessens the likelihood of running into this problem. We can make sure that happens on GCP as well, but it's not a complete solution. Given that we feel this problem is incredibly unlikely to happen, we'll defer it to 4.5. If we enter 4.5 bug burn-down without any indication of this happening in the field, we'll close this again at that time and expect that it will remain closed until there's indication that this is actually a problem we see in the field.
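The ordering described above can be sketched abstractly. This is a hypothetical model of the behavior, not the installer's actual destroy code; the resource dicts, the `cluster_tag` key, and `delete_fn` are my assumptions:

```python
def destroy(resources, delete_fn):
    """Delete cluster-attributable resources first; touch shared
    resources only after every tagged deletion has succeeded.

    resources: list of dicts with 'name' and 'cluster_tag' keys, where
    cluster_tag is None for shared resources (e.g. public DNS records)
    that cannot be attributed to a unique cluster.
    delete_fn(resource) -> bool reports whether a deletion succeeded.

    Hypothetical sketch; not the installer's implementation."""
    tagged = [r for r in resources if r.get("cluster_tag")]
    shared = [r for r in resources if not r.get("cluster_tag")]

    for r in tagged:
        if not delete_fn(r):
            # Abort before touching shared resources; a retried destroy
            # will pick up where this one left off.
            return False

    # Only once every cluster-owned resource is gone do we remove
    # resources that another cluster might also be relying on.
    for r in shared:
        delete_fn(r)
    return True
```

The design point is that a destroy which fails partway through never strips shared records out from under a healthy cluster with the same name.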

--- Additional comment from Abhinav Dahiya on 2020-02-20 15:52:32 EST ---

Can you include the .openshift_install.log from both of the runs?

and include `oc -v=6` output after each run from the original report.

--- Additional comment from yangyang on 2020-02-21 03:41:05 EST ---



--- Additional comment from yangyang on 2020-02-21 03:42:07 EST ---



--- Additional comment from yangyang on 2020-02-21 03:44:18 EST ---

> and include `oc -v=6` after each run from original report.

It's not quite clear to me. I can't find a -v option for oc.

Comment 3 yangyang 2020-03-16 07:21:39 UTC
Verified with 4.3.0-0.nightly-2020-03-15-221412

GCP destroy only purges what was created by it.

level=debug msg="Images: 1 items pending"
level=debug msg="Listing DNS Zones"
level=debug msg="Private DNS zone not found"
level=debug msg="Listing storage buckets"

The previously healthy cluster still works after the failed cluster is destroyed, so I'm moving this to the verified state.
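The verified behavior (and the intent of the linked PR, "block private dns zone on public record") can be sketched like this. A hypothetical illustration, not the installer's Terraform/Go code; `records_to_delete` and the `(name, type)` tuple representation are my assumptions:

```python
def records_to_delete(public_records, private_zone_records):
    """Select public-zone DNS records for deletion only if a matching
    record exists in this cluster's own private zone -- i.e. only purge
    what this cluster created.

    Records are (name, type) tuples, e.g. ("api.foo.example.com.", "A").
    Hypothetical sketch of the fixed destroy behavior."""
    own = set(private_zone_records)
    return [r for r in public_records if r in own]
```

In the failed-cluster case above, the private DNS zone was never created ("Private DNS zone not found"), so the selection is empty and the healthy cluster's public records survive the destroy.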

Comment 5 errata-xmlrpc 2020-03-24 14:33:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0858

