Bug 1674440

Summary: installer failed to create cluster due to leftover IAM instance profile from a previous run
Product: OpenShift Container Platform Reporter: Johnny Liu <jialiu>
Component: Installer    Assignee: W. Trevor King <wking>
Installer sub component: openshift-installer QA Contact: Johnny Liu <jialiu>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: crawford, sponnaga, wking
Version: 4.1.0   
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:42:43 UTC Type: Bug

Description Johnny Liu 2019-02-11 11:05:11 UTC
Description of problem:

Version-Release number of the following components:
v4.0.0-0.169.0.0-dirty

How reproducible:
Always

Steps to Reproduce:
1. # cat ./demo2/metadata.json 
{"clusterName":"qe-jialiu","clusterID":"5e9150a6-c27b-4ef7-9470-5209676940d0","aws":{"region":"us-east-2","identifier":[{"openshiftClusterID":"5e9150a6-c27b-4ef7-9470-5209676940d0"},{"kubernetes.io/cluster/qe-jialiu":"owned"}]}}
2. # ./openshift-install destroy cluster --dir ./demo2/
3.
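
Side note, not part of the reproduction steps: the tag filters recorded in metadata.json can also be queried manually to see roughly what the destroy code searches for. A sketch using the values above:

# aws resourcegroupstaggingapi get-resources --region us-east-2 --tag-filters Key=openshiftClusterID,Values=5e9150a6-c27b-4ef7-9470-5209676940d0
# aws resourcegroupstaggingapi get-resources --region us-east-2 --tag-filters Key=kubernetes.io/cluster/qe-jialiu,Values=owned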

Actual results:
The installer would not exit; it keeps retrying deletion of some non-existent instances.
time="2019-02-11T05:52:08-05:00" level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"openshiftClusterID\":\"5e9150a6-c27b-4ef7-9470-5209676940d0\"}"
time="2019-02-11T05:52:13-05:00" level=debug msg="deleting arn:aws:ec2:us-east-2:301721915996:volume/vol-087350d1e97a7ed17: RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: fc68d6d2-8ab9-4939-9555-df1e447ac364"
time="2019-02-11T05:52:18-05:00" level=debug msg="deleting arn:aws:ec2:us-east-2:301721915996:natgateway/nat-08696293109405515: RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: a79b3e37-4345-4a2e-8218-61c22e05a49f"
time="2019-02-11T05:52:21-05:00" level=info msg=Deleted arn="arn:aws:ec2:us-east-2:301721915996:natgateway/nat-09b17f28856f6f079" id=nat-09b17f28856f6f079
time="2019-02-11T05:52:21-05:00" level=info msg=Deleted arn="arn:aws:ec2:us-east-2:301721915996:natgateway/nat-0fd0094e3e953fbfe" id=nat-0fd0094e3e953fbfe
time="2019-02-11T05:52:28-05:00" level=debug msg="deleting arn:aws:ec2:us-east-2:301721915996:volume/vol-0b98219c51fdb8f0c: RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: 30207bff-bb69-4b08-a5c6-a7fd242cefbd"
time="2019-02-11T05:52:28-05:00" level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"kubernetes.io/cluster/qe-jialiu\":\"owned\"}"
time="2019-02-11T05:52:34-05:00" level=debug msg="deleting arn:aws:ec2:us-east-2:301721915996:volume/vol-087350d1e97a7ed17: RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: 81c4b82e-8d17-48a2-9545-9520f8c7f634"
time="2019-02-11T05:52:39-05:00" level=debug msg="deleting arn:aws:ec2:us-east-2:301721915996:natgateway/nat-08696293109405515: RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: 0912d258-cc85-417c-a409-84b2e704c17b"
time="2019-02-11T05:52:43-05:00" level=debug msg="deleting arn:aws:ec2:us-east-2:301721915996:volume/vol-0b98219c51fdb8f0c: RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: 6962f33b-83d0-4802-bb9f-1dc77069b59c"
time="2019-02-11T05:52:43-05:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"5e9150a6-c27b-4ef7-9470-5209676940d0\"}"
time="2019-02-11T05:52:43-05:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/qe-jialiu\":\"owned\"}"
time="2019-02-11T05:52:43-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-0d9f7ce1307459951: InvalidInstanceID.NotFound: The instance ID 'i-0d9f7ce1307459951' does not exist\n\tstatus code: 400, request id: 9d818294-6dae-47f2-9a4d-739983038a79"
time="2019-02-11T05:52:44-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-0c0ec911b109516ce: InvalidInstanceID.NotFound: The instance ID 'i-0c0ec911b109516ce' does not exist\n\tstatus code: 400, request id: 66e150ed-3bf2-4417-b150-fb5de337ee08"
time="2019-02-11T05:52:44-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-09ec554c567d893fa: InvalidInstanceID.NotFound: The instance ID 'i-09ec554c567d893fa' does not exist\n\tstatus code: 400, request id: 3f01ba9c-4044-4be1-ad5b-ae409f858190"
time="2019-02-11T05:52:44-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-061363c67f8b27783: InvalidInstanceID.NotFound: The instance ID 'i-061363c67f8b27783' does not exist\n\tstatus code: 400, request id: 802e37ae-f98e-4ee7-989c-d3933f44e1ad"
time="2019-02-11T05:52:44-05:00" level=debug msg="search for IAM roles"
time="2019-02-11T05:52:45-05:00" level=debug msg="search for IAM users"
time="2019-02-11T05:52:56-05:00" level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"openshiftClusterID\":\"5e9150a6-c27b-4ef7-9470-5209676940d0\"}"
time="2019-02-11T05:52:59-05:00" level=info msg=Deleted arn="arn:aws:ec2:us-east-2:301721915996:natgateway/nat-08696293109405515" id=nat-08696293109405515
time="2019-02-11T05:52:59-05:00" level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"kubernetes.io/cluster/qe-jialiu\":\"owned\"}"
time="2019-02-11T05:52:59-05:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"5e9150a6-c27b-4ef7-9470-5209676940d0\"}"
time="2019-02-11T05:52:59-05:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/qe-jialiu\":\"owned\"}"
time="2019-02-11T05:52:59-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-0d9f7ce1307459951: InvalidInstanceID.NotFound: The instance ID 'i-0d9f7ce1307459951' does not exist\n\tstatus code: 400, request id: cc097e11-7f05-48ac-bad7-1042e77c21ca"
time="2019-02-11T05:53:00-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-0c0ec911b109516ce: InvalidInstanceID.NotFound: The instance ID 'i-0c0ec911b109516ce' does not exist\n\tstatus code: 400, request id: 18e4ad5a-4b6b-43a0-94fc-b9c7b5fadea5"
time="2019-02-11T05:53:00-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-09ec554c567d893fa: InvalidInstanceID.NotFound: The instance ID 'i-09ec554c567d893fa' does not exist\n\tstatus code: 400, request id: 7b655006-be3b-496f-a659-0cbbfe3eb981"
time="2019-02-11T05:53:00-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-061363c67f8b27783: InvalidInstanceID.NotFound: The instance ID 'i-061363c67f8b27783' does not exist\n\tstatus code: 400, request id: 75c46335-a0da-442b-a8bc-1015d109044c"
time="2019-02-11T05:53:00-05:00" level=debug msg="search for IAM roles"
time="2019-02-11T05:53:00-05:00" level=debug msg="search for IAM users"
time="2019-02-11T05:53:00-05:00" level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"openshiftClusterID\":\"5e9150a6-c27b-4ef7-9470-5209676940d0\"}"
time="2019-02-11T05:53:00-05:00" level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"kubernetes.io/cluster/qe-jialiu\":\"owned\"}"
time="2019-02-11T05:53:00-05:00" level=debug msg="no deletions from us-east-2, removing client"
time="2019-02-11T05:53:00-05:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"5e9150a6-c27b-4ef7-9470-5209676940d0\"}"
time="2019-02-11T05:53:00-05:00" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/qe-jialiu\":\"owned\"}"
time="2019-02-11T05:53:00-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-0d9f7ce1307459951: InvalidInstanceID.NotFound: The instance ID 'i-0d9f7ce1307459951' does not exist\n\tstatus code: 400, request id: 42a77824-6956-4173-a0c0-4d2051b64971"
time="2019-02-11T05:53:00-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-0c0ec911b109516ce: InvalidInstanceID.NotFound: The instance ID 'i-0c0ec911b109516ce' does not exist\n\tstatus code: 400, request id: 25447971-e7cc-4d7b-bae4-c373bdf5ea7f"
time="2019-02-11T05:53:01-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-09ec554c567d893fa: InvalidInstanceID.NotFound: The instance ID 'i-09ec554c567d893fa' does not exist\n\tstatus code: 400, request id: be75b8d3-f498-457f-9b05-0f4d371c3b87"
time="2019-02-11T05:53:01-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-061363c67f8b27783: InvalidInstanceID.NotFound: The instance ID 'i-061363c67f8b27783' does not exist\n\tstatus code: 400, request id: aa439793-90a7-479c-ab7a-fdb32d2b0c12"
time="2019-02-11T05:53:01-05:00" level=debug msg="search for IAM roles"
time="2019-02-11T05:53:01-05:00" level=debug msg="search for IAM users"



Expected results:
The installer should exit if the searched-for instances are not found.

Additional info:
In my testing, my cluster was set up in us-east-2, so why is the installer trying to delete instances in us-east-1 and hitting InvalidInstanceID.NotFound? I had run some earlier tests in us-east-1 before this one.
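
For what it's worth, a quick way to confirm whether those IDs still exist in us-east-1 is to query one of them directly with the aws CLI (a sketch only, using an instance ID from the log above; expect an InvalidInstanceID.NotFound error if the instance is gone):

# aws ec2 describe-instances --region us-east-1 --instance-ids i-0d9f7ce1307459951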

Comment 1 Johnny Liu 2019-02-13 10:16:00 UTC
This issue has since disappeared, so I am lowering its severity.

In my subsequent testing, I hit another issue:
level=error msg="Error: Error applying plan:"
level=error
level=error msg="3 errors occurred:"
level=error msg="\t* module.iam.aws_iam_instance_profile.worker: 1 error occurred:"
level=error msg="\t* aws_iam_instance_profile.worker: Error creating IAM instance profile qe-akostadi-worker-profile: EntityAlreadyExists: Instance Profile qe-akostadi-worker-profile already exists."
level=error msg="\tstatus code: 409, request id: 0d0a1849-2f08-11e9-9889-5be3143234f8"
level=error
level=error
level=error msg="\t* module.bootstrap.aws_iam_instance_profile.bootstrap: 1 error occurred:"
level=error msg="\t* aws_iam_instance_profile.bootstrap: Error creating IAM instance profile qe-akostadi-bootstrap-profile: EntityAlreadyExists: Instance Profile qe-akostadi-bootstrap-profile already exists."
level=error msg="\tstatus code: 409, request id: 0d0c895d-2f08-11e9-a004-b375d31ba96b"
level=error
level=error
level=error msg="\t* module.masters.aws_iam_instance_profile.master: 1 error occurred:"
level=error msg="\t* aws_iam_instance_profile.master: Error creating IAM instance profile qe-akostadi-master-profile: EntityAlreadyExists: Instance Profile qe-akostadi-master-profile already exists."
level=error msg="\tstatus code: 409, request id: 0d090799-2f08-11e9-899b-516565cdca0e"

But when I query it with the `aws` tool, it says it cannot find such an instance profile. Could these two issues be caused by some Terraform cache when talking to AWS?
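
For reference, the check was along these lines (a rough sketch; the profile name comes from the error above):

# aws iam get-instance-profile --instance-profile-name qe-akostadi-worker-profile
# aws iam list-instance-profiles | grep qe-akostadi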

Comment 3 Johnny Liu 2019-02-13 10:23:05 UTC
I was using the v4.0.0-0.171.0.0-dirty installer when I hit the issue described in comment 1.

Comment 4 Johnny Liu 2019-02-14 04:14:17 UTC
The initial issue will be tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1672374; let us use this bug to track the issue described in comment 1.

Comment 6 W. Trevor King 2019-02-14 07:17:47 UTC
> level=error msg="\t* module.iam.aws_iam_instance_profile.worker: 1 error occurred:"
> level=error msg="\t* aws_iam_instance_profile.worker: Error creating IAM instance profile qe-akostadi-worker-profile: EntityAlreadyExists: Instance Profile qe-akostadi-worker-profile already exists."
> level=error msg="\tstatus code: 409, request id: 0d0a1849-2f08-11e9-9889-5be3143234f8"

Dup of bug 1669274?

Comment 7 W. Trevor King 2019-02-14 07:23:06 UTC
> time="2019-02-11T05:52:43-05:00" level=debug msg="deleting arn:aws:ec2:us-east-1:301721915996:instance/i-0d9f7ce1307459951: InvalidInstanceID.NotFound: The instance ID 'i-0d9f7ce1307459951' does not exist\n\tstatus code: 400, request id: 9d818294-6dae-47f2-9a4d-739983038a79"

I've filed [1] to address this.

[1]: https://github.com/openshift/installer/pull/1250

Comment 8 Johnny Liu 2019-02-15 02:27:11 UTC
(In reply to W. Trevor King from comment #6)
> > level=error msg="\t* module.iam.aws_iam_instance_profile.worker: 1 error occurred:"
> > level=error msg="\t* aws_iam_instance_profile.worker: Error creating IAM instance profile qe-akostadi-worker-profile: EntityAlreadyExists: Instance Profile qe-akostadi-worker-profile already exists."
> > level=error msg="\tstatus code: 409, request id: 0d0a1849-2f08-11e9-9889-5be3143234f8"
> 
> Dup of bug 1669274?

I do not think so.
Here is my thinking: in my case, my running instances were terminated by a prune script. The prune script only terminates instances; it does not remove the dependent resources created by the installer, so it left these instance profiles behind as orphans. After that, I followed https://access.redhat.com/solutions/3826921 to have the installer destroy the cluster. I guess the installer cannot find those leftover instance profiles, which have become orphans now that the instances no longer exist.
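
If it helps, a manual cleanup of such orphaned profiles could look roughly like this (a sketch only; the profile name is taken from the error above, and the attached role name has to be read from the get-instance-profile output first):

# aws iam get-instance-profile --instance-profile-name qe-akostadi-worker-profile
# aws iam remove-role-from-instance-profile --instance-profile-name qe-akostadi-worker-profile --role-name <role-from-previous-output>
# aws iam delete-instance-profile --instance-profile-name qe-akostadi-worker-profile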

Comment 9 W. Trevor King 2019-02-15 07:12:52 UTC
> I've filed [1] to address this.
>
> [1]: https://github.com/openshift/installer/pull/1250

This landed in master today.

> > Dup of bug 1669274?
>
> I do not think so. 
> Here is my thinking: in my case, my running instances were terminated by a prune script. The prune script only terminates instances; it does not remove the dependent resources created by the installer, so it left these instance profiles behind as orphans.

Tracker for this in [1].

> After that, I followed https://access.redhat.com/solutions/3826921 to ask installer help me destroy the cluster. I guess installer can not find out those leftover instance profiles which already become orphan (due to the instances not existing any more).

Right.  And there's not much we can do about that since AWS doesn't support tagging for instance profiles.  So can we close this as a dup of [1]?  Or adjust the title and component here to clarify that the issue is a buggy pruning script in account 301721915996?

[1]: https://jira.coreos.com/browse/DPP-1130

Comment 10 Johnny Liu 2019-02-15 07:48:50 UTC
Okay, it seems like the Jira issue is the same as mine. I am okay with closing this.

Comment 11 W. Trevor King 2019-02-15 08:13:19 UTC
Looks like I can't close this as a dup of the Jira issue, since that's an external tracker. And I haven't found a Bugzilla product for the DPP team. So I've left this on the OCP product, shifted it to the Unknown component, and assigned it to John.

Comment 12 W. Trevor King 2019-02-26 17:30:49 UTC
[1] and follow-ups landed some uniquifying to avoid collisions. That's enough for me to consider this fixed. [2] is in flight in this space to support users who try to clean up after a broken pruner has partially deleted their cluster, but I don't think we need that to close this issue.

[1]: https://github.com/openshift/installer/pull/1280
[2]: https://github.com/openshift/installer/pull/1268

Comment 15 Johnny Liu 2019-03-07 10:41:25 UTC
Verified this bug with the v4.0.16-1-dirty installer extracted from 4.0.0-0.nightly-2019-03-06-074438, and it PASSED.

Installed two clusters (cluster-1 and cluster-2) with the same cluster name.
The installer now names all resources using a unique string that includes the infraID.

[root@preserve-jialiu-ansible 20190307]# cat demo1/metadata.json 
{"clusterName":"qe-jialiu","clusterID":"073cf6c0-126e-45c5-afc7-96ded57458c4","infraID":"qe-jialiu-wcpnw","aws":{"region":"us-east-2","identifier":[{"kubernetes.io/cluster/qe-jialiu-wcpnw":"owned"},{"openshiftClusterID":"073cf6c0-126e-45c5-afc7-96ded57458c4"}]}}

[root@preserve-jialiu-ansible 20190307]# cat  demo2/metadata.json 
{"clusterName":"qe-jialiu","clusterID":"769e5dbc-6b67-486e-ab63-d49e6d14aec6","infraID":"qe-jialiu-hzfxg","aws":{"region":"us-east-2","identifier":[{"kubernetes.io/cluster/qe-jialiu-hzfxg":"owned"},{"openshiftClusterID":"769e5dbc-6b67-486e-ab63-d49e6d14aec6"}]}} 

Destroyed cluster-2, then ran oc commands against cluster-1; cluster-1 is still working well.
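
As a rough cross-check (a sketch only; the exact profile naming based on the infraID prefix is an assumption), the uniquified names can be listed like this:

# aws iam list-instance-profiles --query "InstanceProfiles[?starts_with(InstanceProfileName, 'qe-jialiu-wcpnw')].InstanceProfileName"

Each cluster's resources should show up only under its own infraID prefix, so the two clusters sharing a clusterName no longer collide.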

Comment 18 errata-xmlrpc 2019-06-04 10:42:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758