Description of problem:
A working cluster stops working after a failed cluster with the same name is destroyed.

Version-Release number of the following components:
# openshift-install version
./openshift-install v4.3.0
built from commit a702fd4beb593932067fe1b31f2d911feaa6d93e
release image registry.svc.ci.openshift.org/ocp/release@sha256:15132234d0b753aea0af00f5cfff429cf5eca57513e7cce530207b05167a999f

How reproducible: 100%

Steps to Reproduce:
1. Create cluster #1:
# openshift-install create cluster --dir cluster1
Platform: gcp
Project ID: openshift-qe
Base Domain: qe.gcp.devcluster.openshift.com
Cluster Name: yangyang
2. Create cluster #2, making sure it has the same cluster name and base domain as cluster #1:
# openshift-install create cluster --dir cluster2
Platform: gcp
Project ID: openshift-qe
Base Domain: qe.gcp.devcluster.openshift.com
Cluster Name: yangyang
3. Verify that cluster #1 works well after its installation completes.
4. Destroy the failed cluster #2.

Actual results:
Cluster #1 no longer works after cluster #2 is destroyed:
# oc get co
Unable to connect to the server: dial tcp: lookup api.yangyang.qe.gcp.devcluster.openshift.com on 10.11.5.19:53: no such host

Expected results:
Cluster #1 still works after cluster #2 is destroyed.
This does not seem like a workflow that a customer is likely to encounter.
Scott Dodson, since it was deferred to 4.4 but is in the closed state, is it going to be fixed in 4.4?
(In reply to yangyang from comment #2)
> Scott Dodson, since it was deferred to 4.4 but is in the closed state, is it going
> to be fixed in 4.4?

No. We do not feel this warrants a fix, as it is not something a customer is likely to do.
Scott Dodson, although it's an edge scenario, it can happen by chance. If a customer creates a cluster with a name that is already used by an up-and-running cluster, the running cluster stops working once the customer destroys the failed cluster. From a UX perspective, it deserves a fix. I think we can look up DNS and validate the cluster name before creating resources. Thanks
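The pre-flight validation suggested here could be sketched as follows. This is a minimal illustration, not the installer's actual implementation; the function names and the simple resolver-based check are assumptions for the sketch (the real check added later queries the GCP DNS zone directly).

```python
import socket

def record_resolves(hostname):
    """Return True if the hostname currently resolves in DNS."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def validate_cluster_name(cluster_name, base_domain):
    """Refuse to proceed if the cluster's api record already exists,
    since it might belong to another live cluster."""
    api_record = "api.%s.%s" % (cluster_name, base_domain)
    if record_resolves(api_record):
        raise ValueError(
            "record %s already exists in DNS and might be in use "
            "by another cluster; remove it to continue" % api_record)
```

Run before any cloud resources are created, this would fail cluster #2's installation up front instead of letting it clobber cluster #1's DNS records.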
This is not a technically straightforward problem to solve. On AWS, we wait to remove resources that cannot be attributed to a unique cluster until after removal of those resources which can be attributed directly to a unique cluster, which lessens the likelihood of running into this problem. We can make sure that happens on GCP as well, but it's not a complete solution, and given that we feel this problem is incredibly unlikely to happen, we'll defer it to 4.5. If we enter 4.5 bug burn-down without any indication of this happening in the field, we'll close this again at that time and expect it to remain closed until there's indication that this is actually a problem we see in the field.
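The destroy ordering described above can be sketched with a toy resource model (hypothetical; the real installer tags cloud resources with a unique infra ID, and the details differ per platform):

```python
def destroy_cluster(infra_id, resources):
    """Delete resources tagged with this cluster's unique infra ID first.
    Shared, un-attributable resources (e.g. DNS records carrying no tag)
    are only deleted if this run actually owned tagged resources, which
    lessens the chance of deleting records belonging to a different live
    cluster that happens to share the same name."""
    tagged = [r for r in resources if r.get("tag") == infra_id]
    untagged = [r for r in resources if "tag" not in r]

    deleted = [r["name"] for r in tagged]
    if tagged:  # only attribute shared resources to us if we owned something
        deleted.extend(r["name"] for r in untagged)
    return deleted
```

In the reported scenario, cluster #2's destroy run owns no surviving tagged resources, so under this ordering it would leave the shared api DNS record (which cluster #1 depends on) alone. As the comment notes, this is a mitigation, not a complete solution.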
Can you include the .openshift_install.log from both runs? And include `oc -v=6` output after each run from the original report.
Created attachment 1664655 [details] installation log for passed cluster
Created attachment 1664656 [details] installation log for failed cluster
> and include `oc -v=6` after each run from original report.

That's not clear to me; I don't find a -v option for oc.
Found an issue when verifying this. Assuming there is an up-and-running GCP cluster #1, creating a GCP cluster #2 with the same install config as cluster #1 succeeds, yet cluster #1 no longer works afterwards. An installation should not have side effects on a running cluster. Prior to 4.5, the cluster #2 installation would fail since the DNS record already exists, so this is probably an issue introduced in 4.5.
> Found an issue when verifying it.

Logged a bz for this issue: Bug 1815071. It's difficult to verify this bug without the fix for Bug 1815071. Moving to the MODIFIED state; I will move it back to ON_QA once a nightly build includes the fix for Bug 1815071. Thanks.
Verified with 4.5.0-0.nightly-2020-05-15-011814.

Steps to verify:
1. Install a GCP cluster #1 with install-config.yaml.
2. Check that cluster #1 works well.
3. Install a GCP cluster #2 with the same install-config.yaml. The installation of cluster #2 fails with the error below, and cluster #1 still works:
level=fatal msg="failed to fetch Cluster: failed to fetch dependency of \"Cluster\": failed to generate asset \"Platform Provisioning Check\": metadata.name: Invalid value: \"yanyang\": record api.yanyang.qe.gcp.devcluster.openshift.com. already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue"
4. Destroy cluster #2.
5. Check that cluster #1 still works well.

The test results are as expected, hence moving this to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409