Description of problem:
A working cluster stops working after a failed cluster with the same name is destroyed.

Version-Release number of the following components:
# openshift-install version
./openshift-install v4.3.0
built from commit a702fd4beb593932067fe1b31f2d911feaa6d93e
release image registry.svc.ci.openshift.org/ocp/release@sha256:15132234d0b753aea0af00f5cfff429cf5eca57513e7cce530207b05167a999f

How reproducible: 100%

Steps to Reproduce:
1. Create cluster #1:
# openshift-install create cluster --dir cluster1
Platform: gcp
Project ID: openshift-qe
Base Domain: qe.gcp.devcluster.openshift.com
Cluster Name: yangyang
2. Create cluster #2, making sure it has the same cluster name and base domain as cluster #1:
# openshift-install create cluster --dir cluster2
Platform: gcp
Project ID: openshift-qe
Base Domain: qe.gcp.devcluster.openshift.com
Cluster Name: yangyang
3. Verify that cluster #1 works well after its installation completes.
4. Destroy the failed cluster #2.

Actual results:
Cluster #1 no longer works after cluster #2 is destroyed:
# oc get co
Unable to connect to the server: dial tcp: lookup api.yangyang.qe.gcp.devcluster.openshift.com on 10.11.5.19:53: no such host

Expected results:
Cluster #1 still works after cluster #2 is destroyed.
This does not seem like a workflow that a customer is likely to encounter.
Scott Dodson, since it was deferred to 4.4 but is in the closed state, is it going to be fixed in 4.4?
(In reply to yangyang from comment #2)
> Scott Dodson, since it was deferred to 4.4 but is in the closed state, is it going
> to be fixed in 4.4?

No. We do not feel this warrants a fix, as it is not something a customer is likely to do.
Scott Dodson, although it's an edge scenario, it can happen by chance. If a customer creates a cluster with a name that is already used by an up-and-running cluster, the running cluster stops working once the customer destroys the failed cluster. From a UX perspective, it deserves a fix. I think we can look up DNS and validate the cluster name before creating resources. Thanks
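The pre-flight validation suggested here could be sketched as follows. This is a minimal illustration, not the installer's actual implementation; the function names and the simple resolver-based check are assumptions for the sketch (the real check added later queries the GCP DNS zone directly).

```python
import socket

def record_resolves(hostname):
    """Return True if the hostname currently resolves in DNS."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def validate_cluster_name(cluster_name, base_domain):
    """Refuse to proceed if the cluster's api record already exists,
    since it might belong to another live cluster."""
    api_record = "api.%s.%s" % (cluster_name, base_domain)
    if record_resolves(api_record):
        raise ValueError(
            "record %s already exists in DNS and might be in use "
            "by another cluster; remove it to continue" % api_record)
```

Run before any cloud resources are created, this would fail cluster #2's installation up front instead of letting it clobber cluster #1's DNS records.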
This is not a technically straightforward problem to solve. On AWS, we wait to remove resources that cannot be attributed to a unique cluster until after removal of those resources which can be attributed directly to a unique cluster, which lessens the likelihood of running into this problem. We can make sure that happens on GCP as well, but it's not a complete solution, and given that we feel this problem is incredibly unlikely to happen, we'll defer it to 4.5. If we enter 4.5 bug burn-down without any indication of this happening in the field, we'll close this again at that time and expect it to remain closed until there's indication that this is actually a problem we see in the field.
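The destroy ordering described above can be sketched with a toy resource model (hypothetical; the real installer tags cloud resources with a unique infra ID, and the details differ per platform):

```python
def destroy_cluster(infra_id, resources):
    """Delete resources tagged with this cluster's unique infra ID first.
    Shared, un-attributable resources (e.g. DNS records carrying no tag)
    are only deleted if this run actually owned tagged resources, which
    lessens the chance of deleting records belonging to a different live
    cluster that happens to share the same name."""
    tagged = [r for r in resources if r.get("tag") == infra_id]
    untagged = [r for r in resources if "tag" not in r]

    deleted = [r["name"] for r in tagged]
    if tagged:  # only attribute shared resources to us if we owned something
        deleted.extend(r["name"] for r in untagged)
    return deleted
```

In the reported scenario, cluster #2's destroy run owns no surviving tagged resources, so under this ordering it would leave the shared api DNS record (which cluster #1 depends on) alone. As the comment notes, this is a mitigation, not a complete solution.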
Can you include the .openshift_install.log from both runs? And include `oc -v=6` output after each run from the original report.
Created attachment 1664655 [details] installation log for passed cluster
Created attachment 1664656 [details] installation log for failed cluster
> and include `oc -v=6` after each run from original report.

That's not clear to me; I don't find a -v option for oc.
Found an issue when verifying this. Assuming there is an up-and-running GCP cluster #1, creating a GCP cluster #2 with the same install config as cluster #1 succeeds, yet cluster #1 no longer works afterwards. An installation should not have side effects on a running cluster. Prior to 4.5, the cluster #2 installation would fail since the DNS record already exists, so this is probably an issue introduced in 4.5.
> Found an issue when verifying it.

Logged a bz for this issue: Bug 1815071. It's difficult to verify this bug without the fix for Bug 1815071. Moving to the MODIFIED state; I will move it back to ON_QA once a nightly build includes the fix for Bug 1815071. Thanks.
Verified with 4.5.0-0.nightly-2020-05-15-011814.

Steps to verify:
1. Install a GCP cluster #1 with install-config.yaml.
2. Check that cluster #1 works well.
3. Install a GCP cluster #2 with the same install-config.yaml. The installation of cluster #2 fails with the error below, and cluster #1 still works:
level=fatal msg="failed to fetch Cluster: failed to fetch dependency of \"Cluster\": failed to generate asset \"Platform Provisioning Check\": metadata.name: Invalid value: \"yanyang\": record api.yanyang.qe.gcp.devcluster.openshift.com. already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue"
4. Destroy cluster #2.
5. Check that cluster #1 still works well.

The test results are as expected, hence moving this to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409