Bug 1717604
| Field | Value |
| --- | --- |
| Summary | buildcop: failures in e2e-aws container setup related to "aws_lb" "api_external" |
| Product | OpenShift Container Platform |
| Reporter | Kirsten Garrison <kgarriso> |
| Component | Installer |
| Assignee | W. Trevor King <wking> |
| Installer sub component | openshift-installer |
| QA Contact | sheng.lao <shlao> |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| CC | adahiya, aos-bugs, bparees, jokerman, krizza, mmccomas, nhale, openshift-bugzilla-robot, rmarasch, sdodson, wking |
| Version | 4.2.0 |
| Target Milestone | --- |
| Target Release | 4.2.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Cloned to | 1749624 (view as bug list) |
| Last Closed | 2019-10-16 06:31:10 UTC |
| Type | Bug |
| Bug Blocks | 1749624 |
| Attachments | List slow CI ELBs from the past 24 hours (attachment 1613717) |
Description
Kirsten Garrison
2019-06-05 19:15:15 UTC
Note: This failure accounts for 71 out of 79 failures that contain: level=error.*timeout while waiting for state.*

The occurrences have started appearing consistently in the last 6 hours. 7 out of 10 of the last failures in master-e2e-aws were due to this error.

Also seen in origin runs:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2276
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/640
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2275

Also hitting e2e-aws-upgrade tests:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/40
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-scheduler-operator/140/pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-aws/529

We were investigating a bug and found that these errors started appearing before we were anywhere near our peak AWS usage, and we aren't seeing anything in our account-specific alerts on AWS either.

Was seeing this again during a usage spike today:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3831

Abhinav, can we explicitly define the number of clusters we support concurrently in AWS? This needs attention. It may be pointing to a product bug in which something is exhausting our API budget.

*** Bug 1729525 has been marked as a duplicate of this bug. ***

Seeing this again today during the EDT morning, at 00:00 when we had ~40 concurrent clusters; unlikely it was load.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/2381#0:build-log.txt%3A26

So for load balancers this is not necessarily API throttling, although it may be correlated.
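(As an aside, and purely illustrative: with a batch of CI install logs mirrored to a local directory, the ./build-logs directory below is hypothetical, the same failure signature can be tallied offline:)

$ # Illustrative: count mirrored install logs hitting the timeout signature.
$ grep -rlE 'level=error.*timeout while waiting for state' ./build-logs | wc -l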
Checking a recent instance [1] for a slow ci-op-8zpt2ntt-257f3-pj6mj-int/2b37a3db41c66e55:

time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: 2019/08/27 21:11:20 [DEBUG] [aws-sdk-go] DEBUG: Response elasticloadbalancing/CreateLoadBalancer Details:"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 200 OK"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: Connection: close"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: Content-Length: 1533"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: Content-Type: text/xml"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: Date: Tue, 27 Aug 2019 21:11:19 GMT"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: X-Amzn-Requestid: 371cfdea-c90f-11e9-a722-bfeb2f807c8c"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: 2019/08/27 21:11:20 [DEBUG] [aws-sdk-go] <CreateLoadBalancerResponse xmlns=\"http://elasticloadbalancing.amazonaws.com/doc/2015-12-01/\">"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: <CreateLoadBalancerResult>"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: <LoadBalancers>"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: <member>"
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: <LoadBalancerArn>arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-8zpt2ntt-257f3-pj6mj-int/2b37a3db41c66e55</LoadBalancerArn>"
...
time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: <CreatedTime>2019-08-27T21:11:20.120Z</CreatedTime>" time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: <State>" time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: <Code>provisioning</Code>" time="2019-08-27T21:11:20Z" level=debug msg="2019-08-27T21:11:20.166Z [DEBUG] plugin.terraform-provider-aws: </State>" For the next 10 minutes we keep polling: time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: 2019/08/27 21:21:16 [DEBUG] [aws-sdk-go] DEBUG: Response elasticloadbalancing/DescribeLoadBalancers De tails:" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 200 OK" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: Connection: close" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: Content-Length: 1545" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: Content-Type: text/xml" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: Date: Tue, 27 Aug 2019 21:21:15 GMT" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: X-Amzn-Requestid: 9a97ccad-c910-11e9-9bbb-e3805984d5f1" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: " time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: " time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: 2019/08/27 21:21:16 [DEBUG] [aws-sdk-go] <DescribeLoadBalancersResponse xmlns=\"http://elasticloadbala ncing.amazonaws.com/doc/2015-12-01/\">" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: <DescribeLoadBalancersResult>" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: <LoadBalancers>" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: <member>" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: <LoadBalancerArn>arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-8zpt2ntt-257f3-pj6mj-int/2b37a3db41c66e55</LoadBalancerArn>" ... 
time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: <CreatedTime>2019-08-27T21:11:20.120Z</CreatedTime>" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: <State>" time="2019-08-27T21:21:16Z" level=debug msg="2019-08-27T21:21:16.223Z [DEBUG] plugin.terraform-provider-aws: <Code>provisioning</Code>" before giving up: time="2019-08-27T21:21:20Z" level=error msg="1 error occurred:" time="2019-08-27T21:21:20Z" level=error msg="\t* module.vpc.aws_lb.api_internal: 1 error occurred:" time="2019-08-27T21:21:20Z" level=error msg="\t* aws_lb.api_internal: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)" We can monitor the flow of provisioning NAT load balancers with: $ $ AWS_PROFILE=ci aws elbv2 describe-load-balancers --output json | jq -r '[.LoadBalancers[] | .age = now - (.CreatedTime | sub("[.][0-9]*"; "") | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime)] | sort_by(.age)[] | select(.age < 600) | (.age / 60 * 100 | floor | . / 100 | tostring) + "m " + .State.Code' 5.29m active And maybe polling that to see how long things are taking to provision? [1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-nightly/297/artifacts/e2e-aws-upgrade/installer/.openshift_install.log Linking the upstream issue (about application load balancers, but it's probably the same backing code as we use for network load balancers, since both are v2 load balancers). And here's a more interesting current state: $ AWS_PROFILE=ci aws elbv2 describe-load-balancers --output json | jq -r '[.LoadBalancers[] | .age = now - (.CreatedTime | sub("[.][0-9]*"; "") | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime)] | sort_by(.age)[] | select(.age < 600) | (.age / 60 * 100 | floor | . / 100 | tostring) + "m " + .State.Code + " " + .LoadBalancerArn' 0.91m provisioning arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wrpzgtib-ab887-6xxbz-int/141889bbcf589db2 0.91m provisioning arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wrpzgtib-ab887-6xxbz-ext/48ed5585bbe2e456 3.79m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-j07q0cw1-1d3f3-r5l88-int/880ec8994f6d2ad1 3.79m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-j07q0cw1-1d3f3-r5l88-ext/c8fd1d4a6c514df6 4.33m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-j07q0cw1-76f5e-9qw6x-ext/aa9943e4845c0eb7 4.34m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-j07q0cw1-76f5e-9qw6x-int/66a16b033e055941 9.88m active arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/a0b88f8ddc91f11e9a0f112059dc191d/b6c50ecd9f73f930 Grabbing the day's installer logs with: --- a/deck-build-log/deck-build-log-pull +++ b/deck-build-log/deck-build-log-pull @@ -30 +30 @@ sync() { - for DETAIL in #clusteroperators.json events.json nodes.json pods.json + for DETAIL in installer/.openshift_install.log #clusteroperators.json events.json nodes.json pods.json @@ -31,0 +32 @@ sync() { + DETAIL_BASE="$(basename "${DETAIL}")" && @@ -33 +34 @@ sync() { - DETAIL_PATH="${BUILD_LOG_PATH/build-log.txt/}${DETAIL}" && + DETAIL_PATH="${BUILD_LOG_PATH/build-log.txt/}${DETAIL_BASE}" && applied to [1]. 
Grabbing the day's installer logs with:

--- a/deck-build-log/deck-build-log-pull
+++ b/deck-build-log/deck-build-log-pull
@@ -30 +30 @@ sync() {
-    for DETAIL in #clusteroperators.json events.json nodes.json pods.json
+    for DETAIL in installer/.openshift_install.log #clusteroperators.json events.json nodes.json pods.json
@@ -31,0 +32 @@ sync() {
+    DETAIL_BASE="$(basename "${DETAIL}")" &&
@@ -33 +34 @@ sync() {
-    DETAIL_PATH="${BUILD_LOG_PATH/build-log.txt/}${DETAIL}" &&
+    DETAIL_PATH="${BUILD_LOG_PATH/build-log.txt/}${DETAIL_BASE}" &&

applied to [1].

It's still sucking down data, but looking at the content I've already downloaded and binning creation times with floor(minute) granularity:

$ grep -r '\(aws_lb\..*Creation complete after \|debug.*aws_lb\..*timeout while waiting for state\)' ~/.cache/openshift-deck-build-logs | sort | uniq | sed 's/.*Creation complete after \([0-9]*\)m.*/\1/;s/.*timeout: 10m0s.*/10/' | sort -n | uniq -c
    705 2
    206 3
      2 4
     60 5
     64 6
     33 7
     26 8
     15 9
     59 10

I'm not sure how long the tail would have been without the current 10m cap.

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Log-slurping complete. Final stats are:

$ grep -r '\(aws_lb\..*Creation complete after \|debug.*aws_lb\..*timeout while waiting for state\)' ~/.cache/openshift-deck-build-logs | sort | uniq | sed 's/.*Creation complete after \([0-9]*\)m.*/\1/;s/.*timeout: 10m0s.*/10/' | sort -n | uniq -c
   1584 2
    465 3
      5 4
    106 5
     86 6
     46 7
     37 8
     19 9
     83 10
      1 19

The 19m case is from [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_openshift-controller-manager/18/pull-ci-openshift-openshift-controller-manager-master-e2e-aws-builds/16/artifacts/e2e-aws-builds/installer/.openshift_install.log | grep '\.aws_lb\..*after 19m'
time="2019-08-20T16:34:29Z" level=debug msg="module.vpc.aws_lb.api_external: Creation complete after 19m6s [id=arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-xdts8ybz-5894a-7j67p-ext/e379e829580fd4cd]"

Dunno what was going on there, but it suggests that if we do end up bumping the timeout, we should bump it to 20+m (although by that point we're pushing our 30m cap for all infrastructure creation).

Total number of creation attempts:

$ grep -r '\(aws_lb\..*Creation complete after \|debug.*aws_lb\..*timeout while waiting for state\)' ~/.cache/openshift-deck-build-logs | sort | uniq | sed 's/.*Creation complete after \([0-9]*\)m.*/\1/;s/.*timeout: 10m0s.*/10/' | wc -l
2432

So the rate of 10+m timeouts is 83 / 2432 = 3.4% for the ~24 hours covered by my deck-build-log-pull slurp. I'm following up with AWS to see if they think that's a reasonable percentage.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_openshift-controller-manager/18/pull-ci-openshift-openshift-controller-manager-master-e2e-aws-builds/16
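(The 3.4% is just the 10-minute bin over the total; if we wanted to script that arithmetic, a sketch would bolt an awk reduction onto the same pipeline. The awk step is illustrative, not something from the logs:)

$ # Illustrative: compute the share of attempts landing in the 10m (timeout) bin.
$ grep -r '\(aws_lb\..*Creation complete after \|debug.*aws_lb\..*timeout while waiting for state\)' ~/.cache/openshift-deck-build-logs | sort | uniq | sed 's/.*Creation complete after \([0-9]*\)m.*/\1/;s/.*timeout: 10m0s.*/10/' | sort -n | uniq -c | awk '{total += $1; if ($2 == 10) slow += $1} END {printf "%d/%d = %.1f%%\n", slow, total, 100 * slow / total}'
83/2432 = 3.4%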
Heard back from AWS:

On Wed, Aug 28, 2019 at 8:40 PM Jack S. wrote:
> While we do not have SLA's regarding the time to provision a Network Load Balancer, in my experience NLB's take less than 10 minutes to provision. Our SLA's are around Monthly Uptime Percentage [1].
> ...
> [1] https://aws.amazon.com/elasticloadbalancing/sla/
They're continuing to dig into our higher incidence of 10+m provisioning times, so hopefully we will soon have more details on why this suddenly flared up again.
*** Bug 1746958 has been marked as a duplicate of this bug. ***

From AWS:

On Wed, Aug 28, 2019 at 10:13 PM Jack S. wrote:
> Our investigation into this issue has indicated an internal issue which impacted the provisioning times for NLB in the us-east-1 Region. The issue had been resolved as of 2019-08-28 11:32:10AM GMT-0700.

But we've seen three 20+m timeouts today since then [1]. I'm continuing to follow up with them. On the bright side, the overall incidence of these timeouts is down in CI.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=aws&maxAge=12h&search=timeout%20while%20waiting%20for%20state.*20m

> Abhinav, can we explicitly define the number of clusters we support concurrently in AWS?

Clearing the NEEDINFO; at least this most recent flare-up seems unrelated to concurrent cluster counts.

On the CI-health front:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?maxAge=24h&context=3&search=timeout+while+waiting+for+state' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's|.*in resource \\\"\([^\]*\).*|\1|p' | sort | uniq -c | sort -n
      4 aws_ami_copy
      4 aws_route_table_association
      8 aws_lb_listener
     38 aws_lb

is a lot less than:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?maxAge=48h&context=3&search=timeout+while+waiting+for+state' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's|.*in resource \\\"\([^\]*\).*|\1|p' | sort | uniq -c | sort -n
      1 aws_route
      1 aws_security_group
      2 aws_iam_instance_profile
      6 aws_subnet
      7 aws_internet_gateway
      8 aws_ami_copy
      8 aws_lb_listener
     20 aws_route_table_association
    288 aws_lb

on the aws_lb front, so we're mostly out of the fire. The AWS ticket is still ongoing, as mentioned in the previous comment. The aws_lb_listener issues are new in the past 24 hours, so we might want to spin off a new bug for them if we want to dig into it. But 8 failures out of the ~1.2k AWS jobs we've run today isn't a major issue.

It has worked for days, so I am moving it to Status: VERIFIED.

Created attachment 1613717 [details]
List slow CI ELBs from the past 24 hours

Got a run of these today passing the 20m timeout. AWS wants me to forward the ARNs to help with their internal debugging, and there were enough that it seemed worth scripting it out. Script attached. You need a bit of post-processing to dedup things like:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/11717
  artifacts: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/11717/
  job type: e2e-aws
  install log: https://storage.googleapis.com/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/11717/artifacts/e2e-aws/installer/.openshift_install.log
  install log: 4
  created: arn:aws:elasticloadbalancing:us-east-1:...h7771-342ef-d9dnp-int/648d83c2fa6e4644
  slow: arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wiwh7771-342ef-d9dnp-ext/48e24dedb81b6d47
  slow: arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wiwh7771-342ef-d9dnp-int/648d83c2fa6e4644
  slow: arn:aws:elasticloadbalancing:us-east-1:460538899914:loadbalancer/net/ci-op-wiwh7771-342ef-d9dnp-ext/48e24dedb81b6d47

where the internal LB wasn't actually slow (Terraform just elided its ARN when announcing creation). And because of our tag processing we attempted to delete the external LB twice (once via the VPC and once directly).
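The attachment itself is not inlined in this thread; as a rough, hypothetical sketch only (not the actual attachment 1613717, and assuming install logs mirrored under ~/.cache/openshift-deck-build-logs by deck-build-log-pull), a slow-ELB lister could look like:

#!/bin/sh
# Hypothetical sketch of a slow-ELB lister (not the real attachment
# 1613717): scan mirrored install logs for aws_lb creations that took
# 10+ minutes and print their ARNs for forwarding to AWS.
grep -rh 'aws_lb\..*Creation complete after' ~/.cache/openshift-deck-build-logs |
  sed -n 's/.*after \([0-9][0-9]*\)m[^[]*\[id=\([^]]*\)\].*/\1 \2/p' |
  while read -r MINUTES ARN; do
    if [ "${MINUTES}" -ge 10 ]; then
      echo "slow: ${ARN} (${MINUTES}m)"
    fi
  done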
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922