Bug 1659970

Summary: route53 record might be overwritten when using the same cluster name and base domain
Product: OpenShift Container Platform Reporter: Johnny Liu <jialiu>
Component: Installer Assignee: W. Trevor King <wking>
Installer sub component: openshift-installer QA Contact: Johnny Liu <jialiu>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: medium    
Priority: medium CC: bleanhar, crawford, sponnaga, vlaad, wking
Version: 4.1.0   
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-08 22:55:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Johnny Liu 2018-12-17 09:55:20 UTC
Description of problem:

Version-Release number of the following components:
# ./openshift-install version
./openshift-install v0.7.0-master-6-g8f02020b59147c933a08c5e248a8e2c69dad24ae

# oc version
oc v4.0.0-0.82.0
kubernetes v1.11.0+6855010f70
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-jialiu-api.devcluster.openshift.com:6443
kubernetes v1.11.0+3d38233


How reproducible:
Always

Steps to Reproduce:
1. Create cluster 1 with cluster name 'qe-jialiu' and base domain 'devcluster.openshift.com', using the '--dir ./test1' option.
2. Cluster 1 installs successfully.
3. Create cluster 2 with the same cluster name 'qe-jialiu' and base domain 'devcluster.openshift.com', using the '--dir ./test2' option but a different region.
4. Cluster 2 installs successfully.

Actual results:
Cluster 1 can no longer be reached.
# oc get node
No resources found.
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "root-ca")

This happens because cluster 1's master API record in Route 53 is overwritten with cluster 2's load balancer address.

Expected results:
When installing cluster 2, the installer should detect in advance that the 'qe-jialiu' API record already exists in Route 53, and exit the install.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 W. Trevor King 2019-01-24 09:00:43 UTC
> Expected results:
> When installing cluster 2, the installer should detect in advance that the 'qe-jialiu' API record already exists in Route 53, and exit the install.

This would be nice, but pre-checks like this are going to be racy (pre-check finds no colliding resource, separate process creates the resource, real creation collides and fails).  And would you pre-check all of the resources the cluster would create (matching VPC?  Matching instance profiles?)?  I think this has no generic solution, but we can obviously code in specific checks per-resource if they get prioritized over other work in our queue.
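The race described above can be sketched with a toy in-memory zone. This is only an illustration under stated assumptions: `Zone`, `racy_install`, and `safe_install` are hypothetical names, not installer or AWS APIs. It shows why a separate pre-check can still clobber a record, while a creation step that itself refuses duplicates cannot.

```python
import threading


class RecordExists(Exception):
    """Raised when a record with the same name is already present."""


class Zone:
    """Toy in-memory stand-in for a DNS zone (hypothetical, not an AWS API)."""

    def __init__(self):
        self._records = {}
        self._lock = threading.Lock()

    def exists(self, name):
        with self._lock:
            return name in self._records

    def put(self, name, target):
        # Unconditional write: silently replaces whatever was there.
        with self._lock:
            self._records[name] = target

    def create(self, name, target):
        # Atomic create-if-absent: the check and the write share one lock.
        with self._lock:
            if name in self._records:
                raise RecordExists(name)
            self._records[name] = target

    def get(self, name):
        with self._lock:
            return self._records.get(name)


def racy_install(zone, name, target, between_check_and_write=lambda: None):
    """Pre-check, then write. Another writer can slip in between the two."""
    if zone.exists(name):
        raise RecordExists(name)
    between_check_and_write()  # window in which a competing install may run
    zone.put(name, target)


def safe_install(zone, name, target):
    """Let the creation itself fail if the record is already there."""
    zone.create(name, target)
```

Passing a callback such as `lambda: zone.put(name, "lb-1")` as `between_check_and_write` simulates a competing install landing inside the window: the pre-check passes, yet the final write still replaces the other cluster's record.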

Comment 2 Johnny Liu 2019-02-15 11:24:07 UTC
An update for tracking:

v4.0.0-0.173.0.0-dirty

1. Create cluster 1 with cluster name 'qe-jialiu' and base domain 'devcluster.openshift.com', using the '--dir ./test1' option.
2. Cluster 1 installs successfully.
3. Create cluster 2 with the same cluster name 'qe-jialiu' and base domain 'devcluster.openshift.com', using the '--dir ./test2' option but a different region.
4. Cluster 2 fails because the IAM roles already exist.
time="2019-02-15T06:04:13-05:00" level=error msg="\t* module.iam.aws_iam_role.worker_role: 1 error occurred:"
time="2019-02-15T06:04:13-05:00" level=error msg="\t* aws_iam_role.worker_role: Error creating IAM Role qe-jialiu-worker-role: EntityAlreadyExists: Role with name qe-jialiu-worker-role already exists."
time="2019-02-15T06:04:13-05:00" level=error msg="\tstatus code: 409, request id: f538f841-3110-11e9-8096-7fe5dd58ed1f"

time="2019-02-15T06:04:13-05:00" level=error msg="\t* module.masters.aws_iam_role.master_role: 1 error occurred:"
time="2019-02-15T06:04:13-05:00" level=error msg="\t* aws_iam_role.master_role: Error creating IAM Role qe-jialiu-master-role: EntityAlreadyExists: Role with name qe-jialiu-master-role already exists."
time="2019-02-15T06:04:13-05:00" level=error msg="\tstatus code: 409, request id: f536d548-3110-11e9-975a-ff85437cae00"


It seems that instance IAM roles are global rather than per-region, so this error is expected behavior.


However, cluster 1's API record is still overwritten by cluster 2's:
# oc get node
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "root-ca")

Comment 3 Johnny Liu 2019-02-15 11:33:41 UTC
After cluster 2's installation failed, I tried to clean it up and found that cluster 1's IAM roles "qe-jialiu-worker-role" and "qe-jialiu-master-role" were cleaned up along with it.

Comment 4 Alex Crawford 2019-03-01 19:24:31 UTC
I'm pushing this to 4.1 since this would be a good eventual feature. We just don't have time to do it before 4.0.

Comment 5 W. Trevor King 2019-03-19 23:39:22 UTC
The IAM-role-collision portion of this was addressed by [1].  I can still reproduce the Route 53 record clobber with openshift-install unreleased-master-581-gb4e06b04294af0ca17da53ea5dcc3f45a5a69fc2, despite Terraform's nominal default being to not allow overwrites for existing records not managed by that Terraform install [2]:

$ grep 'aws_route53_record\.api_external' wking*/.openshift_install.log
wking/.openshift_install.log:time="2019-03-19T15:20:13-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Creating..."
wking/.openshift_install.log:time="2019-03-19T15:20:23-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (10s elapsed)"
wking/.openshift_install.log:time="2019-03-19T15:20:33-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (20s elapsed)"
wking/.openshift_install.log:time="2019-03-19T15:20:43-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (30s elapsed)"
wking/.openshift_install.log:time="2019-03-19T15:20:45-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Creation complete after 33s (ID: Z3URY6TWQ91KVV_api.wking.devcluster.openshift.com_A)"
wking2/.openshift_install.log:time="2019-03-19T16:04:52-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Creating..."
wking2/.openshift_install.log:time="2019-03-19T16:05:02-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (10s elapsed)"
wking2/.openshift_install.log:time="2019-03-19T16:05:12-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (20s elapsed)"
wking2/.openshift_install.log:time="2019-03-19T16:05:22-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Still creating... (30s elapsed)"
wking2/.openshift_install.log:time="2019-03-19T16:05:30-07:00" level=debug msg="module.dns.aws_route53_record.api_external: Creation complete after 38s (ID: Z3URY6TWQ91KVV_api.wking.devcluster.openshift.com_A)"

Aha, our vendored Terraform predates [3].  I'll figure out how to bump our dependency.

[1]: https://github.com/openshift/installer/pull/1280 (v0.13.0)
[2]: https://www.terraform.io/docs/providers/aws/r/route53_record.html#allow_overwrite
[3]: https://github.com/terraform-providers/terraform-provider-aws/pull/7734 (v2.0.0)
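The difference between clobbering and refusing comes down to which Route 53 change action is submitted: CREATE is rejected when the record set already exists, while UPSERT replaces it. The following is a minimal in-memory model of those two actions, not a call into any AWS SDK; `apply_change` and the dict-based zone are assumptions made for illustration, and the error text is modeled loosely on Route 53's wording.

```python
class InvalidChangeBatch(Exception):
    """Stand-in for the Route 53 error of the same name."""


def apply_change(zone, action, name, rtype, value):
    """Apply one change to `zone`, a dict keyed by (name, type).

    Only CREATE and UPSERT are modeled here.
    """
    key = (name, rtype)
    if action == "CREATE":
        # CREATE refuses to touch an existing record set.
        if key in zone:
            raise InvalidChangeBatch(
                f"Tried to create resource record set "
                f"[name='{name}', type='{rtype}'] but it already exists"
            )
        zone[key] = value
    elif action == "UPSERT":
        # UPSERT creates the record set or silently replaces it.
        zone[key] = value
    else:
        raise ValueError(f"unsupported action: {action}")
```

In this model, a second install that submits UPSERT for the same name and type reproduces the clobber, whereas one that submits CREATE fails and leaves the first cluster's record in place.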

Comment 7 Johnny Liu 2019-04-01 06:23:16 UTC
Retested this bug with openshift-install v4.0.22-201903272149-dirty (extracted from 4.0.0-0.nightly-2019-03-29-144824) and the 4.0.0-0.nightly-2019-03-28-030453 release payload; it is only partially fixed.

1. Create an install-config.yaml.
2. Install cluster 1 using the generated install-config.yaml; it succeeds.
3. Install cluster 2 using the same install-config.yaml; it fails with a clear message:
# ./openshift-install create cluster --dir demo2
WARNING Found override for ReleaseImage. Please be warned, this is not advised 
INFO Consuming "Install Config" from target directory 
INFO Creating infrastructure resources...         
ERROR                                              
ERROR Error: Error applying plan:                  
ERROR                                              
ERROR 1 error occurred:                            
ERROR 	* module.dns.aws_route53_record.api_external: 1 error occurred: 
ERROR 	* aws_route53_record.api_external: [ERR]: Error building changeset: InvalidChangeBatch: [Tried to create resource record set [name='api.qe-jialiu.qe.devcluster.openshift.com.', type='A'] but it already exists] 
ERROR 	status code: 400, request id: 7e522cc8-5442-11e9-89ab-d78b444a875d 
ERROR                                              
ERROR                                              
ERROR                                              
ERROR                                              
ERROR                                              
ERROR Terraform does not automatically rollback in the face of errors. 
ERROR Instead, your Terraform state file has been partially updated with 
ERROR any resources that successfully completed. Please address the error 
ERROR above and apply again to incrementally change your infrastructure. 
ERROR                                              
ERROR                                              
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform 
4. Run an oc command against cluster 1; it works.
5. Destroy the failed cluster 2, then run an oc command against cluster 1; it no longer works:
# oc get node
Unable to connect to the server: dial tcp: lookup api.qe-jialiu.qe.devcluster.openshift.com on 10.11.5.19:53: no such host


Searching the installer log turns up the following lines:
time="2019-04-01T02:03:28-04:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/qe-jialiu-24rlf\":\"owned\"}"
time="2019-04-01T02:03:28-04:00" level=debug msg="listing AWS hosted zones \"qe-jialiu.qe.devcluster.openshift.com.\" (page 0)" arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD
time="2019-04-01T02:03:28-04:00" level=debug msg="listing AWS hosted zones \"qe.devcluster.openshift.com.\" (page 0)" arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="SRV _etcd-server-ssl._tcp.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD public zone=/hostedzone/Z3B3KOVA3TRCWP record set="A api.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="A api.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="A etcd-0.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="A etcd-1.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD record set="A etcd-2.qe-jialiu.qe.devcluster.openshift.com."
time="2019-04-01T02:03:29-04:00" level=info msg=Deleted arn="arn:aws:route53:::hostedzone/Z25G1GJA4S3PKD" id=Z25G1GJA4S3PKD
time="2019-04-01T02:03:29-04:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"2cb62292-d5dc-4fa6-8926-cc4941f9391f\"}"


The "qe-jialiu.qe.devcluster.openshift.com" A record for cluster 1 seems to have been deleted by mistake.
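The hazard in the log above is that records in a shared zone are selected by name, and both clusters use the same names. One possible guard, sketched here purely as an assumption and not as a description of any actual installer change, is to delete a shared record only when it fully mirrors a record this install owns (same name, type, and target). `Record`, `match_by_name`, and `match_by_full_record` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    rtype: str
    target: str


def match_by_name(owned, shared):
    """Hazardous selection: any shared record whose name we also use."""
    owned_names = {r.name for r in owned}
    return [r for r in shared if r.name in owned_names]


def match_by_full_record(owned, shared):
    """Guarded selection: only shared records identical to one we own."""
    owned_set = set(owned)
    return [r for r in shared if r in owned_set]
```

For example, if the failed install expected `api.qe-jialiu.qe.devcluster.openshift.com.` to point at its own load balancer while the shared zone's record points at cluster 1's, the name-only match selects cluster 1's record for deletion and the full-record match selects nothing.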

Comment 8 W. Trevor King 2019-04-01 06:58:10 UTC
> The "qe-jialiu.qe.devcluster.openshift.com" A record for cluster 1 seems to have been deleted by mistake.

I've filed [1] to guard against this.

[1]: https://github.com/openshift/installer/pull/1508

Comment 9 W. Trevor King 2019-04-01 18:40:31 UTC
(In reply to W. Trevor King from comment #8)
> [1]: https://github.com/openshift/installer/pull/1508

Merged.

Comment 10 Johnny Liu 2019-04-04 07:24:37 UTC
Verified this bug with 4.0.0-0.nightly-2019-04-03-202419; it passes.

# ./openshift-install version
./openshift-install v4.0.22-201904031458-dirty
built from commit 9818510c907d4c195cf3d2532154b7d359ee0fcc
release image registry.svc.ci.openshift.org/openshift/origin-release:v4.0