Description of problem:

`openshift-install create cluster` often fails with "aws_s3_bucket.ignition: error getting S3 Bucket location: timeout while waiting for state to become 'success' (timeout: 1m0s)". When this happens, we need to `destroy cluster` and run `create cluster` again, after which it usually succeeds. However, sometimes even the second `create cluster` fails and only a third attempt succeeds.

Version-Release number of the following components:
openshift-install extracted from v4.0.0-0.146.0.0-ose-installer

How reproducible:
Often, about 50%

Steps to Reproduce:
1. time bin/openshift-install create cluster --dir=install-xxia --log-level=debug
...
DEBUG data.aws_vpc.cluster_vpc: Refreshing state...
DEBUG aws_security_group.master: Refreshing state... (ID: sg-089514abe32614cc3)
DEBUG aws_subnet.master_subnet[0]: Refreshing state... (ID: subnet-010d8a6deab88aced)
DEBUG aws_subnet.master_subnet[1]: Refreshing state... (ID: subnet-01240f6063acc6244)
DEBUG aws_subnet.master_subnet[2]: Refreshing state... (ID: subnet-006852f832fc537db)
DEBUG aws_security_group.bootstrap: Refreshing state... (ID: sg-029ccb6af2ad66aa0)
DEBUG aws_security_group_rule.bootstrap_kubelet_secure: Refreshing state... (ID: sgrule-339798961)
DEBUG aws_security_group_rule.bootstrap_journald_gateway: Refreshing state... (ID: sgrule-1925301900)
ERROR
ERROR Error: Error refreshing state: 1 error occurred:
ERROR * module.bootstrap.aws_s3_bucket.ignition: 1 error occurred:
ERROR * module.bootstrap.aws_s3_bucket.ignition: aws_s3_bucket.ignition: error getting S3 Bucket location: timeout while waiting for state to become 'success' (timeout: 1m0s)
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
FATAL Terraform destroy: failed to destroy using Terraform

real    18m56.122s
user    0m14.818s
sys     0m1.699s

Actual results:
1. Often hit the 1m0s timeout failure

Expected results:
1. The timeout in [1], which produces the error in [2], should be enlarged to be more tolerant, e.g. 2 or 5 minutes (see the sketch after this report). That should be acceptable compared to the time cost of retrying `create/destroy cluster`.

[1] https://github.com/openshift/installer/blob/master/pkg/terraform/exec/plugins/vendor/github.com/terraform-providers/terraform-provider-aws/aws/awserr.go#L37

func retryOnAwsCode(code string, f func() (interface{}, error)) (interface{}, error) {
...
	err := resource.Retry(1*time.Minute, func() *resource.RetryError {

[2] https://github.com/openshift/installer/blob/master/pkg/terraform/exec/plugins/vendor/github.com/terraform-providers/terraform-provider-aws/aws/resource_aws_s3_bucket.go#L1087-L1095 :

locationResponse, err := retryOnAwsCode("NoSuchBucket", func() (interface{}, error) {
	return s3conn.GetBucketLocation(
...
if err != nil {
	return fmt.Errorf("error getting S3 Bucket location: %s", err)

Additional info:
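For reference, a minimal sketch of what the suggested change to the vendored helper in [1] could look like. The retry plumbing is reconstructed from the snippet quoted above and relies on the file's existing imports (time, awserr, and the Terraform helper/resource package); the 5-minute value is only the suggestion from "Expected results", not an agreed-upon number:

func retryOnAwsCode(code string, f func() (interface{}, error)) (interface{}, error) {
	var resp interface{}
	// Widen the window from 1*time.Minute to 5*time.Minute so slow
	// GetBucketLocation round-trips do not abort the whole apply.
	err := resource.Retry(5*time.Minute, func() *resource.RetryError {
		var err error
		resp, err = f()
		if err != nil {
			// Keep retrying only on the AWS error code the caller named
			// (e.g. "NoSuchBucket" from resource_aws_s3_bucket.go [2]).
			if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == code {
				return resource.RetryableError(err)
			}
			return resource.NonRetryableError(err)
		}
		return nil
	})
	return resp, err
}

Since resource.Retry gives up as soon as f() returns a non-retryable error, the only behavior change is how long a still-missing bucket is tolerated; successful calls return immediately either way.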
> https://github.com/openshift/installer/blob/master/pkg/terraform/exec/plugins/vendor/github.com/terraform-providers/terraform-provider-aws/aws/awserr.go#L37

This is vendored code. If we need to tune it, we'd want to patch upstream first to avoid having to rebase downstream work as upstream evolves. But I've never seen this myself. What region are you using?
I use us-east-2. I also previously tried other regions such as us-east-1 and us-west-1/2; they hit it too, so it's not region-specific.

> If we need to tune it, we'd want to patch upstream first to avoid having to rebase downstream work as upstream evolves. But I've never seen this myself.

If it is indeed an issue, why not patch upstream? Thanks
From a recent CI run in us-east-1 [1]:

time="2019-01-27T17:28:36Z" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creation complete after 0s (ID: bootstrap.ign)"

so I'm still curious about how this is happening. Maybe the timeout depends on the time it takes to upload the Ignition config and you have a slow network connection to us-east-1? If we had a tuneable timeout, why do you expect your proposed 2 or 5 minutes to be sufficient? What sorts of completion times do you see when it works?

> If it is indeed an issue, why not patch upstream?

Right, that's what we'd do. It just usually takes longer when an external upstream is involved, and I wanted to set expectations around us not being able to turn this around in a day ;).

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1133/pull-ci-openshift-installer-master-e2e-aws/3206/artifacts/e2e-aws/installer/.openshift_install.log
(In reply to W. Trevor King from comment #4)
> time="2019-01-27T17:28:36Z" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creation complete after 0s (ID: bootstrap.ign)"

Is the 0s shown here the same duration that resource.Retry(1*time.Minute, ...) measures against its timeout?

> why do you expect your proposed 2 or 5 minutes to be sufficient?

I don't know what value is sufficient. BTW, my office is in Beijing. Today I changed the timeout to 5 minutes to ensure success and checked what happened in the `create cluster` log:

$ oc image info quay.io/...:v4.0.0-0.148.0.0-ose-installer # Get commit of latest OCP installer
...
io.openshift.build.commit.url=https://github.com/openshift/installer/commit/7af6b141c34dd133f104b61b64c78d82e57f3d52
$ git checkout -b v4.0.0-0.148.0.0-my 7af6b141c34dd133f104b61b64c78d82e57f3d52 # Use above commit
$ vi pkg/terraform/exec/plugins/vendor/github.com/terraform-providers/terraform-provider-aws/aws/awserr.go # Update it to 5*time.Minute
$ hack/build.sh
$ bin/openshift-install create cluster --dir=install-xxia2nd --log-level=debug # Succeeded
$ grep module.bootstrap.aws_s3_bucket install-xxia2nd/.openshift_install.log
time="2019-01-28T14:30:47+08:00" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Creating..."
time="2019-01-28T14:30:57+08:00" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Still creating... (10s elapsed)"
time="2019-01-28T14:31:02+08:00" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Creation complete after 16s (ID: terraform-20190128063047243000000001)"
time="2019-01-28T14:31:02+08:00" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creating..."
time="2019-01-28T14:31:06+08:00" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creation complete after 4s (ID: bootstrap.ign)"
time="2019-01-28T14:48:40+08:00" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Destroying... (ID: bootstrap.ign)"
...
It makes more sense to add beta2blocker. Updating bug fields.
I realize I quoted the wrong resource in comment 4. It should have been:

Creation complete after 0s (ID: dopt-095fb98c6117a1a0a)"
time="2019-01-27T17:28:36Z" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Creation complete after 1s (ID: terraform-20190127172834516800000001)"

While there's an outside chance that aws_s3_bucket_object depends on upload bandwidth, bandwidth should not be an issue for aws_s3_bucket. How many buckets are in this account? Maybe AWS starts slowing down allocations for accounts with many buckets?

For comment 7, 16 seconds is slower than the CI run I looked at, but still well short of a minute. Have you seen times over a minute with that build?
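One way to check the over-a-minute question independently of Terraform would be to time GetBucketLocation directly from the affected host. This is only a rough diagnostic sketch using the aws-sdk-go v1 client; the region and bucket name are placeholders, not values taken from this report:

package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Placeholder region; use whichever region the install targets.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-2")}))
	s3conn := s3.New(sess)

	start := time.Now()
	// Placeholder bucket; point this at any bucket the credentials can see.
	out, err := s3conn.GetBucketLocation(&s3.GetBucketLocationInput{
		Bucket: aws.String("some-existing-bucket"),
	})
	fmt.Printf("GetBucketLocation took %s (location=%v, err=%v)\n", time.Since(start), out, err)
}

Running that a few times from the Beijing host versus a US host would show whether individual calls, rather than the retry loop, are what approach the one-minute mark.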
Is this still an ongoing issue? We haven't changed anything on our end, but we also didn't expect to be able to fix this particular issue.
Sorry for the delayed reply due to holidays and other work since then. I tried the latest v4.0.0-0.176.0.0 installer and still hit the issue. Again, as in comment 7, if I check out the v4.0.0-0.176.0.0 commit, change the timeout to 5 minutes, and build the binary, then `create cluster` with it succeeds.

PS: as in comment 7, the region is us-east-2, and the computer where I run `create cluster` is in our Beijing office. I'm not sure whether geographical distance/network latency is a factor.
Hit this issue many times today. :( Maybe because of a bad network.

DEBUG module.dns.aws_route53_record.api_external: Creation complete after 38s (ID: Z3B3KOVA3TRCWP_api.jian-10.qe.devcluster.openshift.com_A)
ERROR
ERROR Error: Error applying plan:
ERROR
ERROR 1 error occurred:
ERROR * module.bootstrap.aws_s3_bucket.ignition: 1 error occurred:
ERROR * aws_s3_bucket.ignition: error getting S3 Bucket location: timeout while waiting for state to become 'success' (timeout: 1m0s)

[jzhang@dhcp-140-18 test2]$ ./openshift-install version
./openshift-install v4.0.5-1-dirty

I can see in the AWS web console that the corresponding S3 bucket is already there, so I wonder if we could set a longer default waiting time (maybe 3 minutes).
Jian, thank you for continuing to report this. As my comments above said, the host where I ran `openshift-install create cluster` was at our local office site. FYI, to avoid this annoyance I have given up on my local host: I set up a VM hosted in UpShift, which seems to be located in a US lab, and ssh to it to run `openshift-install create cluster` there. So far that has not hit this bug.

(In reply to Jian Zhang from comment #15)
> Hit this issue many times today. :( Maybe because of the bad network.

So, if this bug is not fixed, it would affect customers who aren't lucky enough to have a good alternative host like that US VM :)
Are we still seeing this? If so, I'm curious about whether AMI copies into a local region (cn-north-1 for Beijing) followed by a cluster launched in that region would work. Maybe try manually copying the current AMI from ap-northeast-2 with:

$ aws --region cn-north-1 ec2 copy-image --source-region ap-northeast-2 --source-image-id ami-0ba6edf20991dee91 --name rhcos-410.8.20190325.0 --output text
ami-0f7c2e27964ee1b28  # not the AMI you'll get, but match and replace this below
$ STATE=pending
$ while test "${STATE}" = pending; do sleep 10; STATE="$(aws --region cn-north-1 ec2 describe-images --image-ids ami-0f7c2e27964ee1b28 --query 'Images[].State' --output text)"; echo "${STATE}"; done
pending
pending
...  # about 5 minutes, but can be up to 40+.
pending
available
$ mkdir whatever
$ cat <<EOF >whatever/install-config.yaml
apiVersion: v1beta4
metadata:
  name: wking-test
baseDomain: qe.devcluster.openshift.com
platform:
  aws:
    region: cn-north-1
pullSecret: ...yours here...
EOF
$ OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=ami-0f7c2e27964ee1b28 openshift-install --dir=whatever --log-level=debug create cluster

After you're done testing, you can use:

$ aws --region cn-north-1 ec2 deregister-image --image-id ami-0f7c2e27964ee1b28

to release the AMI.
We have never been able to reproduce this issue and I have never heard of any others running into this. Given that, I'm inclined to believe that this may be a temporary issue with the Great Firewall. I tried to create a cluster in us-east-2 from Azure's China North 2 region, but was successful each time I tried. Maybe I was too late. I'm going to close this out since this is a very isolated problem and one that seems to be out of reach.
Hi, I've been running into this problem for the last two days. Has anybody tracked it down to the root cause? In my case the new S3 bucket is not created, and there is no CloudTrail log showing up for the S3 API calls.