Bug 1717604 - buildcop: failures in e2e-aws container setup related to \"aws_lb\" \"api_external\"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: W. Trevor King
QA Contact: sheng.lao
URL:
Whiteboard:
Duplicates: 1729525 1746958
Depends On:
Blocks: 1749624
 
Reported: 2019-06-05 19:15 UTC by Kirsten Garrison
Modified: 2019-10-16 06:31 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1749624
Environment:
Last Closed: 2019-10-16 06:31:10 UTC
Target Upstream Version:


Attachments
List slow CI ELBs from the past 24 hours (2.37 KB, application/x-shellscript)
2019-09-10 19:27 UTC, W. Trevor King


Links
Github openshift installer pull 2279 (last updated 2019-08-28 04:46:45 UTC)
Github terraform-providers terraform-provider-aws issue 8054 (last updated 2019-08-27 23:14:44 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:31:26 UTC)

Description Kirsten Garrison 2019-06-05 19:15:15 UTC
Description of problem:

Seeing failures similar to:
Installing from release registry.svc.ci.openshift.org/ci-op-n9cc84cd/release@sha256:1a22bc4527c3d0d33811bbcacd5565643321f49f30c3a3fabbb1381a9fe2e30a
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)"
level=error
level=error msg="  on ../tmp/openshift-install-375369954/vpc/master-elb.tf line 19, in resource \"aws_lb\" \"api_external\":"
level=error msg="  19: resource \"aws_lb\" \"api_external\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
Version-Release number of selected component (if applicable):

Seen in a few PRs:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/240/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws/1145

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_console/1674/pull-ci-openshift-console-master-e2e-aws/2032

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_console/1674/pull-ci-openshift-console-master-e2e-aws/2027

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/23026/pull-ci-openshift-origin-master-e2e-aws/9465

Comment 1 Kirsten Garrison 2019-06-05 20:44:11 UTC
Note: This failure accounts for 71 out of 79 failures that contain: level=error.*timeout while waiting for state.*

The occurrences have started appearing consistently in the last 6 hours.
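
For reference, a count along those lines can be approximated by grepping a local mirror of the failed jobs' build logs (a sketch, not the tooling actually used to produce the numbers above; ./build-logs/ is a made-up placeholder path):

  # Jobs whose build log hit any Terraform wait timeout:
  $ grep -rlE 'level=error.*timeout while waiting for state' build-logs/ | wc -l
  # ...and the subset whose timeout was on an aws_lb resource:
  $ grep -rlE 'in resource \\"aws_lb\\"' build-logs/ | wc -l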

Comment 3 Kirsten Garrison 2019-06-05 23:19:29 UTC
Also hitting e2e-aws-upgrade tests
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/40
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-scheduler-operator/140/pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-aws/529

While investigating, we found that these errors started appearing before we were anywhere near our peak AWS usage, and we aren't seeing anything in our account-specific alerts on AWS either.

Comment 4 Nick Hale 2019-07-12 03:10:01 UTC
Was seeing this again during a usage spike today:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3831

Comment 5 OpenShift BugZilla Robot 2019-07-12 18:09:02 UTC
Abhinav, can we explicitly define the number of clusters we support concurrently in AWS?

Comment 6 Ben Parees 2019-07-12 20:49:13 UTC
This needs attention.  It may be pointing to a product bug in which something is exhausting our api budget.

Comment 7 Ben Parees 2019-07-12 20:53:56 UTC
*** Bug 1729525 has been marked as a duplicate of this bug. ***

Comment 10 Steve Kuznetsov 2019-07-24 15:31:59 UTC
Seeing this again today during the EDT morning, at 00:00, when we had ~40 concurrent clusters. It's unlikely it was load.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/2381#0:build-log.txt%3A26

Comment 11 W. Trevor King 2019-08-27 23:11:03 UTC
So for load balancers this is not necessarily API throttling, although it may be correlated.  Checking a recent instance [1] for a slow ci-op-8zpt2ntt-257f3-pj6mj-int/2b37a3db41c66e55:

time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: 2019/08/27 21:11:20 [DEBUG] [aws-sdk-go] DEBUG: Response elasticloadbalancing/CreateLoadBalancer Detai
ls:"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 200 OK"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: Connection: close"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: Content-Length: 1533"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: Content-Type: text/xml"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: Date: Tue, 27 Aug 2019 21:11:19 GMT"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: X-Amzn-Requestid: 371cfdea-c90f-11e9-a722-bfeb2f807c8c"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: 2019/08/27 21:11:20 [DEBUG] [aws-sdk-go] <CreateLoadBalancerResponse xmlns=\"http://elasticloadbalanci
ng.amazonaws.com/doc/2015-12-01/\">"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws:   <CreateLoadBalancerResult>"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws:     <LoadBalancers>"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws:       <member>"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws:         <LoadBalancerArn>arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-8zpt2ntt-257f3-pj6mj-int/2b37a3db41c66e55</LoadBalancerArn>"
...
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws:         <CreatedTime>2019-08-27T21:11:20.120Z</CreatedTime>"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws:         <State>"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws:           <Code>provisioning</Code>"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws:         </State>"

For the next 10 minutes we keep polling:

time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: 2019/08/27 21:21:16 [DEBUG] [aws-sdk-go] DEBUG: Response elasticloadbalancing/DescribeLoadBalancers De
tails:"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 200 OK"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: Connection: close"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: Content-Length: 1545"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: Content-Type: text/xml"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: Date: Tue, 27 Aug 2019 21:21:15 GMT"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: X-Amzn-Requestid: 9a97ccad-c910-11e9-9bbb-e3805984d5f1"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: 2019/08/27 21:21:16 [DEBUG] [aws-sdk-go] <DescribeLoadBalancersResponse xmlns=\"http://elasticloadbala
ncing.amazonaws.com/doc/2015-12-01/\">"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws:   <DescribeLoadBalancersResult>"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws:     <LoadBalancers>"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws:       <member>"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws:         <LoadBalancerArn>arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-8zpt2ntt-257f3-pj6mj-int/2b37a3db41c66e55</LoadBalancerArn>"
...
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws:         <CreatedTime>2019-08-27T21:11:20.120Z</CreatedTime>"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws:         <State>"
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws:           <Code>provisioning</Code>"

before giving up:

time="2019-08-27T21:21:20Z" level=error msg="1 error occurred:"
time="2019-08-27T21:21:20Z" level=error msg="\t* module.vpc.aws_lb.api_internal: 1 error occurred:"
time="2019-08-27T21:21:20Z" level=error msg="\t* aws_lb.api_internal: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)"

We can monitor the flow of provisioning network load balancers with:

$ AWS_PROFILE=ci aws elbv2 describe-load-balancers --output json | jq -r '[.LoadBalancers[] | .age = now - (.CreatedTime | sub("[.][0-9]*"; "") | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime)] | sort_by(.age)[] | select(.age < 600) | (.age / 60 * 100 | floor | . / 100 | tostring) + "m " + .State.Code'
5.29m active

And maybe polling that to see how long things are taking to provision?
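
For example, a minimal polling sketch along those lines (not from the original investigation; the 30-second interval is an arbitrary choice) could be:

  # Sample the young-load-balancer listing every 30 seconds, timestamping each
  # pass, so slow 'provisioning' -> 'active' transitions stand out.
  while true; do
    date -u +%Y-%m-%dT%H:%M:%SZ
    AWS_PROFILE=ci aws elbv2 describe-load-balancers --output json | jq -r '[.LoadBalancers[] | .age = now - (.CreatedTime | sub("[.][0-9]*"; "") | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime)] | sort_by(.age)[] | select(.age < 600) | (.age / 60 * 100 | floor | . / 100 | tostring) + "m " + .State.Code'
    sleep 30
  done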

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-nightly/297/artifacts/e2e-aws-upgrade/installer/.openshift_install.log

Comment 12 W. Trevor King 2019-08-27 23:14:44 UTC
Linking the upstream issue (about application load balancers, but it's probably the same backing code as we use for network load balancers, since both are v2 load balancers).  And here's a more interesting current state:

$ AWS_PROFILE=ci aws elbv2 describe-load-balancers --output json | jq -r '[.LoadBalancers[] | .age = now - (.CreatedTime | sub("[.][0-9]*"; "") | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime)] | sort_by(.age)[] | select(.age < 600) | (.age / 60 * 100 | floor | . / 100 | tostring) + "m " + .State.Code + " " + .LoadBalancerArn'
0.91m provisioning arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wrpzgtib-ab887-6xxbz-int/141889bbcf589db2
0.91m provisioning arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wrpzgtib-ab887-6xxbz-ext/48ed5585bbe2e456
3.79m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-j07q0cw1-1d3f3-r5l88-int/880ec8994f6d2ad1
3.79m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-j07q0cw1-1d3f3-r5l88-ext/c8fd1d4a6c514df6
4.33m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-j07q0cw1-76f5e-9qw6x-ext/aa9943e4845c0eb7
4.34m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-j07q0cw1-76f5e-9qw6x-int/66a16b033e055941
9.88m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/a0b88f8ddc91f11e9a0f112059dc191d/b6c50ecd9f73f930

Comment 13 W. Trevor King 2019-08-27 23:56:27 UTC
Grabbing the day's installer logs with:

--- a/deck-build-log/deck-build-log-pull
+++ b/deck-build-log/deck-build-log-pull
@@ -30 +30 @@ sync() {
-                       for DETAIL in #clusteroperators.json events.json nodes.json pods.json
+                       for DETAIL in installer/.openshift_install.log #clusteroperators.json events.json nodes.json pods.json
@@ -31,0 +32 @@ sync() {
+                               DETAIL_BASE="$(basename "${DETAIL}")" &&
@@ -33 +34 @@ sync() {
-                               DETAIL_PATH="${BUILD_LOG_PATH/build-log.txt/}${DETAIL}" &&
+                               DETAIL_PATH="${BUILD_LOG_PATH/build-log.txt/}${DETAIL_BASE}" &&

applied to [1].  It's still pulling down data, but binning the creation times from the content I've already downloaded, with floor(minute) granularity, gives:

$ grep -r '\(aws_lb\..*Creation complete after \|debug.*aws_lb\..*timeout while waiting for state\)' ~/.cache/openshift-deck-build-logs | sort | uniq | sed 's/.*Creation complete after \([0-9]*\)m.*/\1/;s/.*timeout: 10m0s.*/10/' | sort -n | uniq -c
    705 2
    206 3
      2 4
     60 5
     64 6
     33 7
     26 8
     15 9
     59 10

I'm not sure how long the tail would have been without the current 10m cap.

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 15 W. Trevor King 2019-08-28 01:16:35 UTC
Log-slurping complete.  Final stats are:

  $ grep -r '\(aws_lb\..*Creation complete after \|debug.*aws_lb\..*timeout while waiting for state\)' ~/.cache/openshift-deck-build-logs | sort | uniq | sed 's/.*Creation complete after \([0-9]*\)m.*/\1/;s/.*timeout: 10m0s.*/10/' | sort -n | uniq -c
   1584 2
    465 3
      5 4
    106 5
     86 6
     46 7
     37 8
     19 9
     83 10
      1 19

The 19m case is from [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_openshift-controller-manager/18/pull-ci-openshift-openshift-controller-manager-master-e2e-aws-builds/16/artifacts/e2e-aws-builds/installer/.openshift_install.log | grep '\.aws_lb\..*after 19m'
time="2019-08-20T16:34:29Z" level=debug msg="module.vpc.aws_lb.api_external: Creation complete after 19m6s [id=arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-xdts8ybz-5894a-7j67p-ext/e379e829580fd4cd]"

Dunno what was going on there, but it suggests that if we do end up bumping the timeout, we should bump it to 20+m (although by that point we're pushing our 30m cap for all infrastructure creation).  Total number of creation attempts:

  $ grep -r '\(aws_lb\..*Creation complete after \|debug.*aws_lb\..*timeout while waiting for state\)' ~/.cache/openshift-deck-build-logs | sort | uniq | sed 's/.*Creation complete after \([0-9]*\)m.*/\1/;s/.*timeout: 10m0s.*/10/' | wc -l
  2432

So the rate of 10+m timeouts is 83 / 2432 = 3.4% for the ~24 hours covered by my deck-build-log-pull slurp.  I'm following up with AWS to see if they think that's a reasonable percentage.
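
As a quick sanity check of that arithmetic (a trivial sketch; bc is just one way to do it):

  $ echo 'scale=3; 83 * 100 / 2432' | bc
  3.412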

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_openshift-controller-manager/18/pull-ci-openshift-openshift-controller-manager-master-e2e-aws-builds/16

Comment 20 W. Trevor King 2019-08-29 04:24:04 UTC
Heard back from AWS:

On Wed, Aug 28, 2019 at 8:40 PM Jack S. wrote:
> While we do not have SLA's regarding the time to provision a Network Load Balancer, in my experience NLB's take less than 10 minutes to provision. Our SLA's are around Monthly Uptime Percentage [1].
> ...
> [1] https://aws.amazon.com/elasticloadbalancing/sla/

They're continuing to dig into our higher incidence of 10+m provisioning times, so we will hopefully have more details on why this suddenly flared up again soon.

Comment 22 W. Trevor King 2019-08-29 17:16:10 UTC
*** Bug 1746958 has been marked as a duplicate of this bug. ***

Comment 23 W. Trevor King 2019-08-29 18:01:05 UTC
From AWS:

On Wed, Aug 28, 2019 at 10:13 PM Jack S. wrote:
> Our investigation into this issue has indicated an internal issue which impacted the provisioning times for NLB in the us-east-1 Region. The issue had been resolved as of  2019-08-28 11:32:10AM GMT-0700.

But we've seen three 20+m timeouts today since then [1].  I'm continuing to follow up with them.

On the bright side, overall incidence of these timeouts is down in CI.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=aws&maxAge=12h&search=timeout%20while%20waiting%20for%20state.*20m

Comment 24 W. Trevor King 2019-08-29 22:05:57 UTC
> Abhinav, can we explicitly define the number of clusters we support concurrently in AWS?

Clearing the NEEDINFO; at least this most recent flare-up seems unrelated to concurrent cluster counts.

On the CI-health front:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?maxAge=24h&context=3&search=timeout+while+waiting+for+state' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's|.*in resource \\\"\([^\]*\).*|\1|p' | sort | uniq -c | sort -n
      4 aws_ami_copy
      4 aws_route_table_association
      8 aws_lb_listener
     38 aws_lb

is a lot less than:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?maxAge=48h&context=3&search=timeout+while+waiting+for+state' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's|.*in resource \\\"\([^\]*\).*|\1|p' | sort | uniq -c | sort -n
      1 aws_route
      1 aws_security_group
      2 aws_iam_instance_profile
      6 aws_subnet
      7 aws_internet_gateway
      8 aws_ami_copy
      8 aws_lb_listener
     20 aws_route_table_association
    288 aws_lb

on the aws_lb front, so we're mostly out of the fire.  The AWS ticket is still ongoing, as mentioned in the previous comment.  The aws_lb_listener issues are new in the past 24 hours, so we might want to spin off a new bug for them if we want to dig into them.  But 8 failures out of the ~1.2k AWS jobs we've run today isn't a major issue.

Comment 25 sheng.lao 2019-09-03 05:41:04 UTC
It has worked for days, so I am moving it to Status: `VERIFIED`.

Comment 26 W. Trevor King 2019-09-10 19:27:00 UTC
Created attachment 1613717
List slow CI ELBs from the past 24 hours

Got a run of these today passing the 20m timeout.  AWS wants me to forward the ARNs to help with their internal debugging, and there were enough that it seemed worth scripting it out.  Script attached.  You need a bit of post-processing to dedup things like:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/11717
  artifacts: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/11717/
  job type: e2e-aws
  install log: https://storage.googleapis.com/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/11717/artifacts/e2e-aws/installer/.openshift_install.log
  install log: 4
  created: arn:aws:elasticloadbalancing:us-east-1:...h7771-342ef-d9dnp-int/648d83c2fa6e4644
  slow: arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wiwh7771-342ef-d9dnp-ext/48e24dedb81b6d47
  slow: arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wiwh7771-342ef-d9dnp-int/648d83c2fa6e4644
  slow: arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wiwh7771-342ef-d9dnp-ext/48e24dedb81b6d47

where the internal LB wasn't actually slow (Terraform just elided its ARN when announcing creation).  And because of our tag processing we attempted to delete the external LB twice (once via the VPC and once directly).
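
For example, the repeated "slow:" lines could be collapsed before forwarding the ARNs to AWS with something like (a sketch; slow-elbs.txt is a hypothetical capture of the attached script's output):

  # Extract the "slow:" ARNs and drop the duplicates.
  $ sed -n 's/.*slow: //p' slow-elbs.txt | sort -u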

Comment 28 errata-xmlrpc 2019-10-16 06:31:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

