Bug 1669396
| Summary: | Time is wasted retrying installation due to the frequently seen error "error getting S3 Bucket location ... timeout: 1m0s" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
| Component: | Installer | Assignee: | W. Trevor King <wking> |
| Installer sub component: | openshift-installer | QA Contact: | Johnny Liu <jialiu> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | crawford, florian.zouhar, hongli, jiazha, sponnaga, wking, wmeng, xxia |
| Version: | 4.1.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-03 23:21:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Xingxing Xia
2019-01-25 06:12:40 UTC
> https://github.com/openshift/installer/blob/master/pkg/terraform/exec/plugins/vendor/github.com/terraform-providers/terraform-provider-aws/aws/awserr.go#L37
This is vendored code. If we need to tune it, we'd want to patch upstream first to avoid having to rebase downstream work as upstream evolves. But I've never seen this myself. What region are you using?
I use us-east-2. Earlier I also tried other regions such as us-east-1 and us-west-1/2; they hit the same error, so it is not region-specific.
> If we need to tune it, we'd want to patch upstream first to avoid having to rebase downstream work as upstream evolves. But I've never seen this myself.
If it is indeed an issue, why not patch upstream?
Thanks
From a recent CI run in us-east-1 [1]:

```
time="2019-01-27T17:28:36Z" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creation complete after 0s (ID: bootstrap.ign)"
```

so I'm still curious about how this is happening. Maybe the timeout depends on the time it takes to upload the Ignition config, and you have a slow network connection to us-east-1? If we had a tunable timeout, why do you expect your proposed 2 or 5 minutes to be sufficient? What sorts of completion times do you see when it works?

> If it is indeed an issue, why not patch upstream?

Right, that's what we'd do. It just usually takes longer when an external upstream is involved, and I wanted to set expectations around us not being able to turn this around in a day ;).

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1133/pull-ci-openshift-installer-master-e2e-aws/3206/artifacts/e2e-aws/installer/.openshift_install.log

(In reply to W. Trevor King from comment #4)
> time="2019-01-27T17:28:36Z" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creation complete after 0s (ID: bootstrap.ign)"

Is the 0s shown here the same duration that resource.Retry(1*time.Minute, ...) limits?

> why do you expect your proposed 2 or 5 minutes to be sufficient?

I don't know what value is sufficient. BTW, my office is in Beijing. Today I modified the timeout to 5 minutes to ensure success and see what would happen in the `create cluster` log:

```console
$ oc image info quay.io/...:v4.0.0-0.148.0.0-ose-installer  # Get commit of latest OCP installer
...
io.openshift.build.commit.url=https://github.com/openshift/installer/commit/7af6b141c34dd133f104b61b64c78d82e57f3d52
$ git checkout -b v4.0.0-0.148.0.0-my 7af6b141c34dd133f104b61b64c78d82e57f3d52  # Use the commit above
$ vi pkg/terraform/exec/plugins/vendor/github.com/terraform-providers/terraform-provider-aws/aws/awserr.go  # Update it to 5*time.Minute
$ hack/build.sh
$ bin/openshift-install create cluster --dir=install-xxia2nd --log-level=debug  # Succeeds
$ grep module.bootstrap.aws_s3_bucket install-xxia2nd/.openshift_install.log
time="2019-01-28T14:30:47+08:00" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Creating..."
time="2019-01-28T14:30:57+08:00" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Still creating... (10s elapsed)"
time="2019-01-28T14:31:02+08:00" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Creation complete after 16s (ID: terraform-20190128063047243000000001)"
time="2019-01-28T14:31:02+08:00" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creating..."
time="2019-01-28T14:31:06+08:00" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Creation complete after 4s (ID: bootstrap.ign)"
time="2019-01-28T14:48:40+08:00" level=debug msg="module.bootstrap.aws_s3_bucket_object.ignition: Destroying... (ID: bootstrap.ign)"
...
```

It makes more sense to add beta2blocker. Updating bug fields.

I realize I quoted the wrong resource in comment 4. It should have been:

```
... Creation complete after 0s (ID: dopt-095fb98c6117a1a0a)"
time="2019-01-27T17:28:36Z" level=debug msg="module.bootstrap.aws_s3_bucket.ignition: Creation complete after 1s (ID: terraform-20190127172834516800000001)"
```

While there's an outside chance that aws_s3_bucket_object depends on upload bandwidth, bandwidth should not be an issue for aws_s3_bucket. How many buckets are in this account? Maybe AWS starts slowing down allocations for accounts with many buckets?
For comment 7, 16 seconds is slower than the CI run I looked at, but still well short of a minute. Have you seen times over a minute with that build?

Is this still an ongoing issue? We haven't changed anything on our end, but we also didn't expect to be able to fix this particular issue.

Sorry for the delayed reply, due to holidays and later other work. I tried the latest v4.0.0-0.176.0.0 installer and still hit the issue. Again, as in comment 7, I checked out the v4.0.0-0.176.0.0 commit, modified the timeout to 5 minutes, built the binary, and used it to create a cluster; that succeeds. PS: as in comment 7, the region is us-east-2 and the computer where I run `create cluster` is in the Beijing office. I'm not sure whether geographical distance/network affects this.

Hit this issue many times today. :( Maybe because of the bad network.

```
DEBUG module.dns.aws_route53_record.api_external: Creation complete after 38s (ID: Z3B3KOVA3TRCWP_api.jian-10.qe.devcluster.openshift.com_A)
ERROR
ERROR Error: Error applying plan:
ERROR
ERROR 1 error occurred:
ERROR * module.bootstrap.aws_s3_bucket.ignition: 1 error occurred:
ERROR * aws_s3_bucket.ignition: error getting S3 Bucket location: timeout while waiting for state to become 'success' (timeout: 1m0s)

[jzhang@dhcp-140-18 test2]$ ./openshift-install version
./openshift-install v4.0.5-1-dirty
```

I can see the corresponding S3 bucket already exists in the AWS web console, so I was wondering if we could set a longer waiting time (maybe 3 minutes) as the default.

Jian, thank you for continuing to report this. As my comments above said, the host where I ran `openshift-install create cluster` is at our local office site. FYI, to avoid this annoyance, I have now given up on my local host and set up a VM hosted in UpShift, which seems to be located in a US lab; I ssh to it and run `openshift-install create cluster` there. So far that has not hit this bug.

(In reply to Jian Zhang from comment #15)
> Hit this issue many times today. :( Maybe because of the bad network.

So, if this bug is not fixed, it would affect customers who aren't lucky enough to have a good alternative host like the US VM above :)

Are we still seeing this? If so, I'm curious whether copying the AMI into a local region (cn-north-1 for Beijing) and then launching the cluster in that region would work. Maybe try manually copying the current AMI from ap-northeast-2 with:
```console
$ aws --region cn-north-1 ec2 copy-image --source-region ap-northeast-2 --source-image-id ami-0ba6edf20991dee91 --name rhcos-410.8.20190325.0 --output text
ami-0f7c2e27964ee1b28  # not the AMI you'll get, but match and replace this below
$ STATE=pending
$ while test "${STATE}" = pending; do sleep 10; STATE="$(aws --region cn-north-1 ec2 describe-images --image-ids ami-0f7c2e27964ee1b28 --query 'Images[].State' --output text)"; echo "${STATE}"; done
pending
pending
...  # about 5 minutes, but can be up to 40+
pending
available
$ mkdir whatever
$ cat <<EOF >whatever/install-config.yaml
apiVersion: v1beta4
metadata:
  name: wking-test
baseDomain: qe.devcluster.openshift.com
platform:
  aws:
    region: cn-north-1
pullSecret: ...yours here...
EOF
$ OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=ami-0f7c2e27964ee1b28 openshift-install --dir=whatever --log-level=debug create cluster
```

After you're done testing, you can use:

```console
$ aws --region cn-north-1 ec2 deregister-image --image-id ami-0f7c2e27964ee1b28
```

to release the AMI.
We have never been able to reproduce this issue, and I have never heard of anyone else running into it. Given that, I'm inclined to believe this may be a temporary issue with the Great Firewall. I tried to create a cluster in us-east-2 from Azure's China North 2 region, but was successful each time I tried; maybe I was too late. I'm going to close this out, since this is a very isolated problem and one that seems to be out of reach.

Hi, I have been running into this problem for the past two days. Has anybody tracked it down to the root cause? In my case the new S3 bucket is not created at all; there is no CloudTrail log showing any S3 API calls.