Description of problem:

Version-Release number of the following components:

# ./openshift-install version
./openshift-install v0.7.0-master-6-g8f02020b59147c933a08c5e248a8e2c69dad24ae

# oc version
oc v4.0.0-0.82.0
kubernetes v1.11.0+6855010f70
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-jialiu-api.devcluster.openshift.com:6443
kubernetes v1.11.0+3d38233

How reproducible:
Always

Steps to Reproduce:
1. Create cluster 1 with cluster name 'qe-jialiu' and base domain 'devcluster.openshift.com', using the '--dir ./test1' option.
2. Cluster 1 installs successfully.
3. Create cluster 2 with the same cluster name 'qe-jialiu' and base domain 'devcluster.openshift.com', using the '--dir ./test2' option, but in a different region.
4. Cluster 2 installs successfully.

Actual results:
Cluster 1 can no longer be reached:

# oc get node
No resources found.
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "root-ca")

This is because cluster 1's master API record in Route 53 was overwritten with cluster 2's load-balancer address.

Expected results:
The installer should detect in advance that the 'qe-jialiu' API record already exists in Route 53 when installing cluster 2, and exit the install.

Additional info:
> Expected results:
> The installer should detect in advance that the 'qe-jialiu' API record already exists in Route 53 when installing cluster 2, and exit the install.

This would be nice, but pre-checks like this are going to be racy (the pre-check finds no colliding resource, a separate process creates the resource, then the real creation collides and fails). And would you pre-check all of the resources the cluster would create (a matching VPC? Matching instance profiles?)? I think this has no generic solution, but we can obviously code in specific checks per resource if they get prioritized over other work in our queue.
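For illustration, here is a minimal sketch of what one such per-resource check could look like, using the aws-sdk-go Route 53 client to see whether the 'api' A record already exists before the install creates anything. This is not installer code; the hosted-zone ID and record name below are just the ones from this report.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

// recordExists reports whether an A record with the given name already
// exists in the hosted zone, which would indicate a cluster-name collision.
func recordExists(svc *route53.Route53, zoneID, name string) (bool, error) {
	out, err := svc.ListResourceRecordSets(&route53.ListResourceRecordSetsInput{
		HostedZoneId:    aws.String(zoneID),
		StartRecordName: aws.String(name),
		StartRecordType: aws.String("A"),
		MaxItems:        aws.String("1"),
	})
	if err != nil {
		return false, err
	}
	for _, rs := range out.ResourceRecordSets {
		// Route 53 returns names with a trailing dot; normalize before comparing.
		if strings.TrimSuffix(aws.StringValue(rs.Name), ".") == strings.TrimSuffix(name, ".") &&
			aws.StringValue(rs.Type) == "A" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	svc := route53.New(session.Must(session.NewSession()))
	// Zone ID and record name from this bug report, used only as an example.
	exists, err := recordExists(svc, "Z3URY6TWQ91KVV", "api.qe-jialiu.devcluster.openshift.com")
	if err != nil {
		log.Fatal(err)
	}
	if exists {
		fmt.Println("api record already exists; refusing to install to avoid clobbering another cluster")
	}
}

As noted above, this is inherently racy between the check and the actual creation, so it can only narrow the window, not eliminate it.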
Just some update for tracking, with v4.0.0-0.173.0.0-dirty:

1. Create cluster 1 with cluster name 'qe-jialiu' and base domain 'devcluster.openshift.com', using the '--dir ./test1' option.
2. Cluster 1 installs successfully.
3. Create cluster 2 with the same cluster name 'qe-jialiu' and base domain 'devcluster.openshift.com', using the '--dir ./test2' option, but in a different region.
4. Cluster 2 fails due to already-existing IAM roles:

time="2019-02-15T06:04:13-05:00" level=error msg="\t* module.iam.aws_iam_role.worker_role: 1 error occurred:"
time="2019-02-15T06:04:13-05:00" level=error msg="\t* aws_iam_role.worker_role: Error creating IAM Role qe-jialiu-worker-role: EntityAlreadyExists: Role with name qe-jialiu-worker-role already exists."
time="2019-02-15T06:04:13-05:00" level=error msg="\tstatus code: 409, request id: f538f841-3110-11e9-8096-7fe5dd58ed1f"
time="2019-02-15T06:04:13-05:00" level=error msg="\t* module.masters.aws_iam_role.master_role: 1 error occurred:"
time="2019-02-15T06:04:13-05:00" level=error msg="\t* aws_iam_role.master_role: Error creating IAM Role qe-jialiu-master-role: EntityAlreadyExists: Role with name qe-jialiu-master-role already exists."
time="2019-02-15T06:04:13-05:00" level=error msg="\tstatus code: 409, request id: f536d548-3110-11e9-975a-ff85437cae00"

It seems the instance IAM roles are global to the account, so this error is expected behavior. Meanwhile, cluster 1's API record is still overridden by cluster 2's:

# oc get node
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "root-ca")
After cluster 2's installation failure, I tried to clean it up and found that cluster 1's IAM roles "qe-jialiu-worker-role" and "qe-jialiu-master-role" were cleaned up along with it.
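Since IAM role names are global to the AWS account rather than per-region, both the creation collision above and this accidental cross-cluster cleanup stem from two clusters sharing role names. A check like the following could flag that before install; this is only a minimal sketch using aws-sdk-go with the role names from this report, not the installer's actual code.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/iam"
)

// roleExists reports whether an IAM role with the given name already exists
// in the account. IAM is not regional, so this catches cross-region collisions.
func roleExists(svc *iam.IAM, name string) (bool, error) {
	_, err := svc.GetRole(&iam.GetRoleInput{RoleName: aws.String(name)})
	if err != nil {
		if aerr, ok := err.(awserr.Error); ok && aerr.Code() == iam.ErrCodeNoSuchEntityException {
			return false, nil
		}
		return false, err
	}
	return true, nil
}

func main() {
	svc := iam.New(session.Must(session.NewSession()))
	// Role names from this reproduction; a real check would derive them from the cluster name.
	for _, role := range []string{"qe-jialiu-master-role", "qe-jialiu-worker-role"} {
		exists, err := roleExists(svc, role)
		if err != nil {
			log.Fatal(err)
		}
		if exists {
			fmt.Printf("IAM role %q already exists; it may belong to another cluster with the same name\n", role)
		}
	}
}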
I'm pushing this to 4.1 since this would be a good eventual feature. We just don't have time to do it before 4.0.
The IAM-role-collision portion of this was addressed by [1]. I can still reproduce the Route 53 record clobber with openshift-install unreleased-master-581-gb4e06b04294af0ca17da53ea5dcc3f45a5a69fc2, despite Terraform's nominal default being to not allow overwrites of existing records not managed by that Terraform install [2]:

$ grep 'aws_route53_record\.api_external' wking*/.openshift_install.log
wking/.openshift_install.log:time="2019-03-19T15:20:13-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Creating..."
wking/.openshift_install.log:time="2019-03-19T15:20:23-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (10s elapsed)"
wking/.openshift_install.log:time="2019-03-19T15:20:33-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (20s elapsed)"
wking/.openshift_install.log:time="2019-03-19T15:20:43-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (30s elapsed)"
wking/.openshift_install.log:time="2019-03-19T15:20:45-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Creation complete after 33s (ID: Z3URY6TWQ91KVV_api.wking.devcluster.openshift.com_A)"
wking2/.openshift_install.log:time="2019-03-19T16:04:52-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Creating..."
wking2/.openshift_install.log:time="2019-03-19T16:05:02-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (10s elapsed)"
wking2/.openshift_install.log:time="2019-03-19T16:05:12-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (20s elapsed)"
wking2/.openshift_install.log:time="2019-03-19T16:05:22-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (30s elapsed)"
wking2/.openshift_install.log:time="2019-03-19T16:05:30-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Creation complete after 38s (ID: Z3URY6TWQ91KVV_api.wking.devcluster.openshift.com_A)"

Aha, our vendored Terraform AWS provider predates [3]. I'll figure out how to bump our dependency.

[1]: https://github.com/openshift/installer/pull/1280 (v0.13.0)
[2]: https://www.terraform.io/docs/providers/aws/r/route53_record.html#allow_overwrite
[3]: https://github.com/terraform-providers/terraform-provider-aws/pull/7734 (v2.0.0)
Re-tested this bug with openshift-install v4.0.22-201903272149-dirty (extracted from 4.0.0-0.nightly-2019-03-29-144824) plus the 4.0.0-0.nightly-2019-03-28-030453 release payload; it is partially fixed.

1. Create an install-config.yaml.
2. Install cluster 1 using the generated install-config.yaml; it succeeds.
3. Install cluster 2 using the same generated install-config.yaml; it fails with a clear message:

# ./openshift-install create cluster --dir demo2
WARNING Found override for ReleaseImage. Please be warned, this is not advised
INFO Consuming "Install Config" from target directory
INFO Creating infrastructure resources...
ERROR
ERROR Error: Error applying plan:
ERROR
ERROR 1 error occurred:
ERROR * module.dns.aws_route53_record.api_external: 1 error occurred:
ERROR * aws_route53_record.api_external: [ERR]: Error building changeset: InvalidChangeBatch: [Tried to create resource record set [name='api.qe-jialiu.qe.devcluster.openshift.com.', type='A'] but it already exists]
ERROR status code: 400, request id: 7e522cc8-5442-11e9-89ab-d78b444a875d
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR Terraform does not automatically rollback in the face of errors.
ERROR Instead, your Terraform state file has been partially updated with
ERROR any resources that successfully completed. Please address the error
ERROR above and apply again to incrementally change your infrastructure.
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform

4. Run oc commands against cluster 1: working well.
5. Destroy the failed cluster 2, then run oc commands against cluster 1: no longer working.

# oc get node
Unable to connect to the server: dial tcp: lookup api.qe-jialiu.qe.devcluster.openshift.com on 10.11.5.19:53: no such host

Searching the installer log, I found the following lines:

time="2019-04-01T02:03:28-04:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/qe-jialiu-24rlf\":\"owned\"}"
time="2019-04-01T02:03:28-04:00" level=debug msg="listing AWS hosted zones \"qe-jialiu.qe.devcluster.openshift.com.\" (page 0)" arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD
time="2019-04-01T02:03:28-04:00" level=debug msg="listing AWS hosted zones \"qe.devcluster.openshift.com.\" (page 0)" arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="SRV _etcd-server-ssl._tcp.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD public zone=/hostedzone/Z3B3KOVA3TRCWP record set="A api.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="A api.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="A etcd-0.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="A etcd-1.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="A etcd-2.qe-jialiu.qe.devcluster.openshift.com." time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD time="2019-04-01T02:03:29-04:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"2cb62292-d5dc-4fa6-8926-cc4941f9391f\"}" "qe-jialiu.qe.devcluster.openshift.com" A record for cluster 1 seem like be deleted by mistake.
> "qe-jialiu.qe.devcluster.openshift.com" A record for cluster 1 seem like be deleted by mistake. I've filed [1] to guard against this. [1]: https://github.com/openshift/installer/pull/1508
(In reply to W. Trevor King from comment #8)
> [1]: https://github.com/openshift/installer/pull/1508

Merged.
Verified this bug with 4.0.0-0.nightly-2019-04-03-202419, and it PASSED.

# ./openshift-install version
./openshift-install v4.0.22-201904031458-dirty
built from commit 9818510c907d4c195cf3d2532154b7d359ee0fcc
release image registry.svc.ci.openshift.org/openshift/origin-release:v4.0