1752370 – copying ami on aws is timeout

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Abhinav Dahiya
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-09-16 08:10 UTC by sheng.lao
Modified:	2020-05-04 11:14 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-04 11:13:57 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:0581	0	None	None	None	2020-05-04 11:14:26 UTC

Description sheng.lao 2019-09-16 08:10:51 UTC

Description of problem:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.2/124

level=info msg="Consuming \"Install Config\" from target directory"
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Common Manifests\" from target directory"
level=info msg="Consuming \"Worker Machines\" from target directory"
level=info msg="Consuming \"Master Machines\" from target directory"
level=info msg="Consuming \"Openshift Manifests\" from target directory"
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: Error waiting for AMI (ami-0b59ba2db1a2d2747) to be ready: timeout while waiting for state to become 'available' (timeout: 40m0s)"
level=error
level=error msg=" on ../tmp/openshift-install-838476254/main.tf line 91, in resource \"aws_ami_copy\" \"main\":"
level=error msg=" 91: resource \"aws_ami_copy\" \"main\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
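
For triage, log output like this can be scanned mechanically for the AMI timeout. The following is a minimal sketch, assuming the logrus-style `level=... msg="..."` layout seen in this report; the function names are hypothetical and not part of the installer.

```python
import re

# Matches one entry: level=<name>, optionally followed by msg="...".
# The msg body may contain backslash-escaped quotes.
ENTRY_RE = re.compile(r'level=(\w+)(?:\s+msg="((?:[^"\\]|\\.)*)")?')

# Matches the AMI timeout message quoted in this report.
AMI_TIMEOUT_RE = re.compile(
    r"Error waiting for AMI \((ami-[0-9a-f]+)\) to be ready: "
    r"timeout while waiting for state to become 'available' "
    r"\(timeout: ([^)]+)\)"
)


def parse_entries(log_text):
    """Split log text into (level, message) pairs, one per level= entry."""
    return [(m.group(1), m.group(2) or "") for m in ENTRY_RE.finditer(log_text)]


def find_ami_timeouts(log_text):
    """Return (ami_id, timeout) for every AMI-copy timeout error in the log."""
    hits = []
    for level, message in parse_entries(log_text):
        match = AMI_TIMEOUT_RE.search(message)
        if level == "error" and match:
            hits.append((match.group(1), match.group(2)))
    return hits
```

The parser does not care whether entries are separated by newlines or by spaces, so it works on both a raw build log and a flattened copy of it.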


Comment 1 Abhinav Dahiya 2019-09-16 16:00:28 UTC

> Error: Error waiting for AMI (ami-0b59ba2db1a2d2747) to be ready: timeout while waiting for state to become 'available' (timeout: 40m0s)"

It seems we are already waiting long enough (40 minutes) for the AMI copy to finish.

I'm not sure the installer could have done anything differently.
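
For readers unfamiliar with this kind of wait, the error comes from a poll-until-state loop with a deadline. The following is a minimal sketch of that pattern, not the Terraform provider's actual code; `get_state` is a hypothetical callable (for example a wrapper around an EC2 DescribeImages call), and the 15-second interval is an assumption.

```python
import time


class StateTimeout(Exception):
    """Raised when the target state is not reached before the deadline."""


def wait_for_state(get_state, target="available", timeout=40 * 60,
                   interval=15, clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns target or timeout seconds elapse.

    clock and sleep are injectable so the loop can be tested without
    actually waiting.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state == target:
            return state
        if state == "failed":
            raise RuntimeError("AMI copy failed")
        if clock() >= deadline:
            raise StateTimeout(
                f"timeout while waiting for state to become {target!r}")
        sleep(interval)
```

Raising the timeout in such a loop only helps if the copy eventually finishes; it does nothing for a copy that the backend has slowed down or queued indefinitely.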

Comment 2 W. Trevor King 2020-01-09 22:31:43 UTC

About these slow AMI copies, here's what Darren L. at AWS told us on Fri Mar 22 2019, 16:48:00 GMT-0700:

> 1. We have discussed with the internal team. Internally, we would try to address issues and shorten the duration of image creation/copy process, but however at the moment we do not have SLAs or official guidelines for expected copy times.
>
> 2. I have submitted a product feature request to our product development so that they could consider improving the image copy feature in the future.

Comment 3 Jan Safranek 2020-01-22 16:24:56 UTC

It started impacting our CI a lot today:

https://search.svc.ci.openshift.org/?search=Error+waiting+for+AMI&maxAge=6h&context=2&type=build-log
Found 200 results in past 6 hours.

It's the main reason AWS CI jobs fail. From https://prow.svc.ci.openshift.org/?job=*pull*aws* :
Success rate over time: 3h: 6%, 12h: 26%

We're 2 days before feature freeze and we cannot merge anything because of failed tests. I know it's mostly AWS's fault; however, we should work around it somehow (not necessarily in the installer).

Comment 4 Ben Parees 2020-01-22 21:20:39 UTC

At a minimum perhaps we need to reduce our boskos lease quota so we have fewer jobs running simultaneously?

Comment 5 W. Trevor King 2020-01-22 22:39:28 UTC

In our CI account, this can also manifest as:

* AMI copies get slow.
* Jobs start failing on the 40+ minute copies. Each failed job closes and releases its Boskos lease (making way for new jobs), but its slow copy keeps running.
* Copy Task limits also apply to same-region copies, which AWS does not currently document.  They do have [1]

    Destination Regions are limited to 50 concurrent AMI copies.

  for cross-region copies, but the installer is always doing same-region copies.    
* We started hitting the Task limit between the currently-running jobs and the dangling slow copies from previous jobs.  For example, [2]:

    level=error msg="Error: ResourceLimitExceeded: Task limit exceeded"
    level=error msg="\tstatus code: 400, request id: db3b367b-d2c0-4d1a-bec7-73a0c1b05dde"
    level=error
    level=error msg="  on ../tmp/openshift-install-917883190/main.tf line 104, in resource \"aws_ami_copy\" \"main\":"
    level=error msg=" 104: resource \"aws_ami_copy\" \"main\" {"
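
The way leaked copies pile up against the Task limit can be illustrated with a toy model. This is only a sketch under stated assumptions: every lease slot starts a job (and one AMI copy) at minute 0, a job aborts at the timeout and is replaced immediately, and an aborted job never cancels its copy. The quota and copy duration used below are hypothetical, not measured values from the CI account.

```python
def in_flight_copies(now, lease_quota, copy_minutes, timeout_minutes=40):
    """Count AMI copies still running at minute `now` in the toy model.

    Only the slow-copy case is modelled, so copy_minutes must exceed
    timeout_minutes.
    """
    if copy_minutes <= timeout_minutes:
        raise ValueError("model only covers copies slower than the timeout")
    per_slot = 0
    start = 0
    # Each slot starts a new job, and a new copy, every timeout_minutes.
    while start <= now:
        if now < start + copy_minutes:
            per_slot += 1  # started at or before `now`, still running
        start += timeout_minutes
    return per_slot * lease_quota
```

With a hypothetical quota of 30 leases and 100-minute copies, the model gives 30 copies at minute 0 and 90 at minute 120, well past a limit of 50 even though only 30 jobs hold leases at any time.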

Updates from AWS:

On Thu Jan 09 2020 14:42:05 GMT-0800, Derek H. wrote:
> I have asked for our internal AMI team to confirm this is the case here and also to confirm the 50 limit applies for copies occuring within the same region. I will respond back once I have received more information from them.

but no further details on that front yet.

On Thu Jan 16 2020 16:02:09 GMT-0800, Derek H. wrote:
> A few curiosities that popped up. There are ~5000 copies of snap-02ad473199123fb7a and snap-00196d4ecd27734ee each (and there are more such examples over just last one month). Can you explain why you would need to copy the same AMI/snapshot so many times?
>
> The problem seems to stem from multiple copy calls for the same resource. If you were copying a different resource each time we would expect to see less of an issue and faster completion. Please provide the use case and details so I can forward it to our internal team. From there they can evaluate or provide some suggestions on how we can improve the performance here.

I explained that we use this account for OpenShift CI.  Currently, creating a cluster involves copying official, unencrypted AMIs into the target AWS account so they can be encrypted.  We can transition to launching encrypted-disk instances from the official AMI without the copy, but that's blocked on us bumping some internal Terraform [3].  Longer-term, we will probably transition to creating AMIs from VMDK instead of using official AMIs.  But that means right now, and with our long-term approach, we are going to have lots of very similar AMIs being constantly created and destroyed in this account, because we spin up and tear down a lot of clusters.  Most OpenShift consumers will likely spin up longer-running clusters, unless they are also performing CI on cluster creation and teardown.

About the VMDK upload approach (e.g. the upload to S3 followed by an ImportImage call described in [4]):

On Thu Jan 09 2020 15:41:57 GMT-0800, Derek H. wrote:
> Checking here and it appears by default is a limit of 20 concurrent VM Import tasks. It does appear that this limit is not a hard limit and can be changed with approval from the VM Import team.

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/CopyingAMIs.html#copy-amis-across-regions
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console-operator/370/pull-ci-openshift-console-operator-master-e2e-aws-operator/1577
[3]: https://github.com/openshift/installer/pull/2160#issuecomment-554239946
[4]: https://docs.aws.amazon.com/vm-import/latest/userguide/vmimport-image-import.html#import-vm-image

Comment 6 W. Trevor King 2020-01-22 22:43:39 UTC

> At a minimum perhaps we need to reduce our boskos lease quota so we have fewer jobs running simultaneously?

I do not think this would help, because I expect the slow copies to be leaked out of aborted jobs.  E.g. there is no 'Deleted ...' comment about the AMI getting deregistered [1] in [2].  It's possible that adjusting the installer to close the destroy-time leak of in-flight AMIs might be easier than bumping Terraform so we can drop the AMI copy entirely; I'm not sure.

[1]: https://github.com/openshift/installer/blob/dd770f02b5263983fe04a41a2a4cb0669b4cf25e/pkg/destroy/aws/aws.go#L622
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/370/pull-ci-openshift-console-operator-master-e2e-aws-operator/1577/artifacts/e2e-aws-operator/installer/.openshift_install.log
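
As an illustration of what closing the destroy-time leak might involve, here is a minimal sketch of the selection step only. It assumes images are dicts shaped like EC2 DescribeImages entries and that cluster resources carry a tag such as `kubernetes.io/cluster/<infra-id>` with the value `owned`; both the tag convention and the function name are assumptions here, and this is not the code in pkg/destroy/aws.

```python
def images_to_deregister(images, cluster_tag_key):
    """Return IDs of images tagged as owned by the cluster, in any state.

    The State field is deliberately not used as a filter, so copies that
    are still "pending" are selected along with "available" ones.
    """
    selected = []
    for image in images:
        tags = {t["Key"]: t["Value"] for t in image.get("Tags", [])}
        if tags.get(cluster_tag_key) == "owned":
            selected.append(image["ImageId"])
    return selected
```

Whether an in-flight copy already carries the cluster tag, and whether EC2 accepts deregistering it before it becomes available, is not established in this thread and would need checking.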

Comment 7 W. Trevor King 2020-01-23 13:58:18 UTC

[1] landed a drop in the Boskos quota, so we'll see if that helps.  [2] is in flight with some region-sharding.  These are both just mitigations, so not linking formally from the issue.

[1]: https://github.com/openshift/release/pull/6832
[2]: https://github.com/openshift/release/pull/6833

Comment 8 Jesus M. Rodriguez 2020-01-23 14:06:36 UTC

[1] is another example of the AMI copy error

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/210

Comment 9 W. Trevor King 2020-01-24 00:29:13 UTC

Setting a target release so it doesn't show up as untriaged, although I don't think we need to deliver anything here for 4.4 if we can mitigate CI (e.g. via comment 7), because end users who aren't creating/destroying thousands of clusters a day in a single AWS account won't hit this (unless AWS breaks something else on their end).

Comment 10 W. Trevor King 2020-01-24 00:30:22 UTC

Oh, and Boskos leases were dropped via [1], so we can clear the NEEDINFO from comment 4.

[1]: https://github.com/openshift/release/pull/6832

Comment 11 Scott Dodson 2020-01-24 19:16:23 UTC

This has not happened in the 24 hours since we landed the region sharding and lease reduction. 

The installer team has a story to evaluate a potential change that would reduce the overhead, but unless we foresee customers continuously building and tearing down more than 50 concurrent clusters per region per account, this is not likely to be something customers encounter.


https://issues.redhat.com/browse/CORS-1352
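
For context, region sharding spreads jobs over several regions so that each region sees fewer concurrent copies. The following is one possible sketch, assuming jobs have a stable identifier and picking the region by hash; it is not how the openshift/release change implements sharding, and the region list in the usage note is hypothetical.

```python
import hashlib


def pick_region(job_id, regions):
    """Deterministically map a job identifier to one of the regions."""
    if not regions:
        raise ValueError("at least one region is required")
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer index.
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]
```

For example, `pick_region("pull-ci-example-e2e-aws/210", ["us-east-1", "us-east-2", "us-west-1", "us-west-2"])` always returns the same region for the same job identifier, and a large set of identifiers spreads roughly evenly across the four.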

Comment 12 Johnny Liu 2020-02-03 08:45:41 UTC

LGTM. From the QE side, we did not hit this issue.

Comment 14 errata-xmlrpc 2020-05-04 11:13:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
