Bug 1669396
| Summary: | Time is wasted retrying installation due to the frequently seen error "error getting S3 Bucket location ... timeout: 1m0s" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
| Component: | Installer | Assignee: | W. Trevor King <wking> |
| Installer sub component: | openshift-installer | QA Contact: | Johnny Liu <jialiu> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | crawford, florian.zouhar, hongli, jiazha, sponnaga, wking, wmeng, xxia |
| Version: | 4.1.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-03 23:21:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Xingxing Xia
2019-01-25 06:12:40 UTC
> https://github.com/openshift/installer/blob/master/pkg/terraform/exec/plugins/vendor/github.com/terraform-providers/terraform-provider-aws/aws/awserr.go#L37
This is vendored code. If we need to tune it, we'd want to patch upstream first to avoid having to rebase downstream work as upstream evolves. But I've never seen this myself. What region are you using?
I use us-east-2. Earlier I also tried other regions such as us-east-1 and us-west-1/2; they hit the same error, so it is not region-specific.
> If we need to tune it, we'd want to patch upstream first to avoid having to rebase downstream work as upstream evolves. But I've never seen this myself.
If it is indeed an issue, why not patch upstream?
Thanks
From a recent CI run in us-east-1 [1]:

```
time="2019-01-27T17:28:36Z" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creation complete after 0s (ID: bootstrap.ign)"
```

so I'm still curious about how this is happening. Maybe the timeout depends on the time it takes to upload the Ignition config, and you have a slow network connection to us-east-1? If we had a tunable timeout, why do you expect your proposed 2 or 5 minutes to be sufficient? What sorts of completion times do you see when it works?

> If it is indeed an issue, why not patch upstream?

Right, that's what we'd do. It just usually takes longer when an external upstream is involved, and I wanted to set expectations around us not being able to turn this around in a day ;).

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1133/pull-ci-openshift-installer-master-e2e-aws/3206/artifacts/e2e-aws/installer/.openshift_install.log

(In reply to W. Trevor King from comment #4)
> time="2019-01-27T17:28:36Z" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creation complete after 0s (ID: bootstrap.ign)"

Is the 0s shown here the same duration that resource.Retry(1*time.Minute, ...) limits?

> why do you expect your proposed 2 or 5 minutes to be sufficient?

I don't know what value is sufficient. BTW, my office is in Beijing. Today I modified the timeout to 5 minutes to ensure success and see what would happen in the `create cluster` log:

```console
$ oc image info quay.io/...:v4.0.0-0.148.0.0-ose-installer  # Get commit of latest OCP installer
...
io.openshift.build.commit.url=https://github.com/openshift/installer/commit/7af6b141c34dd133f104b61b64c78d82e57f3d52
$ git checkout -b v4.0.0-0.148.0.0-my 7af6b141c34dd133f104b61b64c78d82e57f3d52  # Use the commit above
$ vi pkg/terraform/exec/plugins/vendor/github.com/terraform-providers/terraform-provider-aws/aws/awserr.go  # Update it to 5*time.Minute
$ hack/build.sh
$ bin/openshift-install create cluster --dir=install-xxia2nd --log-level=debug  # Succeeds
$ grep module.bootstrap.aws_s3_bucket install-xxia2nd/.openshift_install.log
time="2019-01-28T14:30:47+08:00" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Creating..."
time="2019-01-28T14:30:57+08:00" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Still creating... (10s elapsed)"
time="2019-01-28T14:31:02+08:00" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Creation complete after 16s (ID: terraform-20190128063047243000000001)"
time="2019-01-28T14:31:02+08:00" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creating..."
time="2019-01-28T14:31:06+08:00" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creation complete after 4s (ID: bootstrap.ign)"
time="2019-01-28T14:48:40+08:00" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Destroying... (ID: bootstrap.ign)"
...
```

It makes more sense to add beta2blocker. Updating bug fields.

I realize I quoted the wrong resource in comment 4. It should have been:

```
... Creation complete after 0s (ID: dopt-095fb98c6117a1a0a)"
time="2019-01-27T17:28:36Z" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Creation complete after 1s (ID: terraform-20190127172834516800000001)"
```

While there's an outside chance that aws_s3_bucket_object depends on upload bandwidth, bandwidth should not be an issue for aws_s3_bucket. How many buckets are in this account? Maybe AWS starts slowing down allocations for accounts with many buckets?
For comment 7, 16 seconds is slower than the CI run I looked at, but still well short of a minute. Have you seen times over a minute with that build?

Is this still an ongoing issue? We haven't changed anything on our end, but we also didn't expect to be able to fix this particular issue.

Sorry for the delayed reply, due to holidays and later other work. I tried the latest v4.0.0-0.176.0.0 installer and still hit the issue. Again, as in comment 7, I checked out the v4.0.0-0.176.0.0 commit, modified the timeout to 5 minutes, built the binary, and used it to create a cluster; that succeeds. PS: as in comment 7, the region is us-east-2 and the computer where I run `create cluster` is in the Beijing office. I'm not sure whether geographical distance/network affects this.

Hit this issue many times today. :( Maybe because of the bad network.

```
DEBUG module.dns.aws_route53_record.api_external: Creation complete after 38s (ID: Z3B3KOVA3TRCWP_api.jian-10.qe.devcluster.openshift.com_A)
ERROR
ERROR Error: Error applying plan:
ERROR
ERROR 1 error occurred:
ERROR * module.bootstrap.aws_s3_bucket.ignition: 1 error occurred:
ERROR * aws_s3_bucket.ignition: error getting S3 Bucket location: timeout while waiting for state to become 'success' (timeout: 1m0s)

[jzhang@dhcp-140-18 test2]$ ./openshift-install version
./openshift-install v4.0.5-1-dirty
```

I can see the corresponding S3 bucket already exists in the AWS web console, so I was wondering if we could set a longer waiting time (maybe 3 minutes) as the default.

Jian, thank you for continuing to report this. As my comments above said, the host where I ran `openshift-install create cluster` is at our local office site. FYI, to avoid this annoyance, I have now given up on my local host and set up a VM hosted in UpShift, which seems to be located in a US lab; I ssh to it and run `openshift-install create cluster` there. So far that has not hit this bug.

(In reply to Jian Zhang from comment #15)
> Hit this issue many times today. :( Maybe because of the bad network.

So, if this bug is not fixed, it would affect customers who aren't lucky enough to have a good alternative host like the US VM above :)

Are we still seeing this? If so, I'm curious whether copying the AMI into a local region (cn-north-1 for Beijing) and then launching the cluster in that region would work. Maybe try manually copying the current AMI from ap-northeast-2 with:
```console
$ aws --region cn-north-1 ec2 copy-image --source-region ap-northeast-2 --source-image-id ami-0ba6edf20991dee91 --name rhcos-410.8.20190325.0 --output text
ami-0f7c2e27964ee1b28  # not the AMI you'll get, but match and replace this below
$ STATE=pending
$ while test "${STATE}" = pending; do sleep 10; STATE="$(aws --region cn-north-1 ec2 describe-images --image-ids ami-0f7c2e27964ee1b28 --query 'Images[].State' --output text)"; echo "${STATE}"; done
pending
pending
...  # about 5 minutes, but can be up to 40+
pending
available
$ mkdir whatever
$ cat <<EOF >whatever/install-config.yaml
apiVersion: v1beta4
metadata:
  name: wking-test
baseDomain: qe.devcluster.openshift.com
platform:
  aws:
    region: cn-north-1
pullSecret: ...yours here...
EOF
$ OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=ami-0f7c2e27964ee1b28 openshift-install --dir=whatever --log-level=debug create cluster
```

After you're done testing, you can use:

```console
$ aws --region cn-north-1 ec2 deregister-image --image-id ami-0f7c2e27964ee1b28
```

to release the AMI.
We have never been able to reproduce this issue, and I have never heard of anyone else running into it. Given that, I'm inclined to believe this may be a temporary issue with the Great Firewall. I tried to create a cluster in us-east-2 from Azure's China North 2 region, but was successful each time I tried; maybe I was too late. I'm going to close this out, since this is a very isolated problem and one that seems to be out of reach.

Hi, I have been running into this problem for the past two days. Has anybody tracked it down to the root cause? In my case the new S3 bucket is not created at all; there is no CloudTrail log showing any S3 API calls.