Bug 1690588 - DynamicProvisioner tests failing in 4.0 e2e
Summary: DynamicProvisioner tests failing in 4.0 e2e
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.3.0
Assignee: Jan Safranek
QA Contact: Chao Yang
URL:
Whiteboard:
Duplicates: 1707495 1765275
Depends On:
Blocks: 1690620
 
Reported: 2019-03-19 18:39 UTC by Chance Zibolski
Modified: 2020-01-23 11:04 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1690620
Environment:
Last Closed: 2020-01-23 11:03:45 UTC
Target Upstream Version:
Embargoed:


Attachments
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (342.18 KB, image/svg+xml)
2019-03-22 05:20 UTC, W. Trevor King


Links
- GitHub openshift/origin pull 24004 (closed): Bug 1759998: Bug 1690588: UPSTREAM: 84181: Lower AWS DescribeVolume frequency (last updated 2020-02-24 15:16:18 UTC)
- Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:04:13 UTC)

Description Chance Zibolski 2019-03-19 18:39:01 UTC
Description of problem:

Release promotion is failing on storage-related e2e tests, specifically the dynamic provisioning tests.

Release payloads https://openshift-release.svc.ci.openshift.org/releasestream/4.0.0-0.ci/release/4.0.0-0.ci-2019-03-19-143430 through https://openshift-release.svc.ci.openshift.org/releasestream/4.0.0-0.ci/release/4.0.0-0.ci-2019-03-19-144430 are failing on the same storage-related test:

[sig-storage] Dynamic Provisioning DynamicProvisioner should provision storage with different parameters [Suite:openshift/conformance/parallel] [Suite:k8s]

Comment 1 W. Trevor King 2019-03-19 18:45:53 UTC
Here are the 144430 e2e-aws logs for digging: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5872

Comment 2 Matthew Wong 2019-03-19 19:27:56 UTC
144430 is an issue we have seen before: there are 3 masters but only 2 workers, for whatever reason (install failure?). Some of the e2e tests randomly pick a zone from among those that contain schedulable nodes, but the 3 masters are tainted, so there are effectively only 2 usable zones; the tests fail when they pick the third zone, which has no worker in it.
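
As a rough check of this failure mode, something along these lines shows which zones actually have schedulable workers (a sketch assuming live kubeconfig access to the cluster, which the CI artifacts alone don't give you):

  $ oc get nodes -L failure-domain.beta.kubernetes.io/zone                                     # zone of every node
  $ oc get nodes -l node-role.kubernetes.io/worker -L failure-domain.beta.kubernetes.io/zone   # zones that actually have workers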

Still looking at 143430.

Comment 3 W. Trevor King 2019-03-19 20:11:37 UTC
> 144430 is an issue we have seen before: there are 3 masters but only 2 workers, for whatever reason (install failure?).

The installer doesn't create compute nodes; that's the machine-API's job. I'll create a separate bug for them.

Comment 4 Matthew Wong 2019-03-19 20:13:16 UTC
143430 is due to AWS API throttling. I see repeated RequestLimitExceeded messages and attempts by the controller to avoid it, e.g.:
W0319 15:32:46.442550       1 retry_handler.go:55] Inserting delay before AWS request (ec2::DetachVolume) to avoid RequestLimitExceeded: 23s
W0319 15:30:12.803852       1 retry_handler.go:99] Got RequestLimitExceeded error on AWS request (ec2::DetachVolume)
W0319 15:33:48.804427       1 retry_handler.go:55] Inserting delay before AWS request (ec2::DeleteVolume) to avoid RequestLimitExceeded: 12s
W0319 15:34:00.814020       1 retry_handler.go:99] Got RequestLimitExceeded error on AWS request (ec2::DeleteVolume)


The first detach attempt fails with a 503:
E0319 15:30:51.088249       1 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0d81484f71692454e\"" failed. No retries permitted until 2019-03-19 15:30:51.588213212 +0000 UTC m=+1371.492833671 (durationBeforeRetry 500ms). Error: "DetachVolume.Detach failed for volume \"pvc-300a62d2-4a5b-11e9-9ff4-0a97e445762a\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0d81484f71692454e\") on node \"ip-10-0-145-137.ec2.internal\" : error detaching EBS volume \"vol-0d81484f71692454e\" from \"i-0bdce73584da52045\": \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 75967ad2-325b-4cbb-8e6b-e88fa7347351\""

2 1/2 minutes later it succeeds:
I0319 15:33:27.800684       1 operation_generator.go:420] DetachVolume.Detach succeeded for volume "pvc-300a62d2-4a5b-11e9-9ff4-0a97e445762a" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0d81484f71692454e") on node "ip-10-0-145-137.ec2.internal" 

But then the DeleteVolume call gets throttled.
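
As a rough gauge of how much throttling a run hit, something like this against a locally downloaded controller-manager log works (the file name is a placeholder, not the actual CI artifact path):

  $ grep -c 'Got RequestLimitExceeded error' kube-controller-manager.log                 # how many AWS calls were throttled
  $ grep 'Inserting delay before AWS request' kube-controller-manager.log | tail -n 5    # how far the client-side backoff grew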

Comment 5 W. Trevor King 2019-03-22 05:20:46 UTC
Created attachment 1546763
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This has caused 25 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed: .*Dynamic Provisioning DynamicProvisioner should provision storage with different parameters'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 6 Matthew Wong 2019-03-22 17:20:27 UTC
I went through recent failures, and most of the 'Dynamic Provisioning DynamicProvisioner should provision storage with different parameters' failures (14/17) are due to the machine-API bug; the remaining 3/17 are AWS API rate-limit/throttling failures, since provisioning/deletion takes 2+ minutes every time and the test times out at 15m. I am not sure how to investigate what is causing us to hit the limits; it seems that for these PR logs only the build-log is available. Are there AWS API metrics available somewhere?

multi-AZ bug:
/logs/release-openshift-origin-installer-e2e-aws-4.0/6046/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/5967/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/6012/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/5980/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/5954/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/6022/build-log.txt:1
/pr-logs/pull/22156/pull-ci-openshift-origin-master-e2e-aws/5841/build-log.txt:1
/pr-logs/pull/batch/pull-ci-openshift-console-master-e2e-aws/185/build-log.txt:1
/pr-logs/pull/openshift_console/1319/pull-ci-openshift-console-master-e2e-aws/208/build-log.txt:1
/pr-logs/pull/openshift_cluster-storage-operator/21/pull-ci-openshift-cluster-storage-operator-master-e2e-aws/70/build-log.txt:1
/pr-logs/pull/openshift_machine-config-operator/564/pull-ci-openshift-machine-config-operator-master-e2e-aws/2687/build-log.txt:1
/pr-logs/pull/openshift_installer/1439/pull-ci-openshift-installer-master-e2e-aws/4549/build-log.txt:1
/pr-logs/pull/openshift_installer/1445/pull-ci-openshift-installer-master-e2e-aws/4572/build-log.txt:1
/pr-logs/pull/operator-framework_operator-registry/39/pull-ci-operator-framework-operator-registry-master-e2e-aws/70/build-log.txt:1

throttle:
/pr-logs/pull/openshift_cluster-kube-apiserver-operator/350/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/1733/build-log.txt:1
/pr-logs/pull/openshift_cluster-logging-operator/123/pull-ci-openshift-cluster-logging-operator-master-e2e-aws/323/build-log.txt:1
/pr-logs/pull/operator-framework_operator-registry/40/pull-ci-operator-framework-operator-registry-master-e2e-aws/71/build-log.txt:1

Comment 7 Hemant Kumar 2019-05-01 18:42:23 UTC
BTW, the metrics are available in the CI build itself as a WAL file. It took me a bit of time to figure out how to use the WAL file to start Prometheus (because I did not know which version we are using in OCP - https://bugzilla.redhat.com/show_bug.cgi?id=1704915) and query the API. You need Prometheus version 2.7.2 locally; run it locally, point it at your local download of the WAL file directory as its data directory, and then query.
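
A minimal sketch of that workflow; the paths, port, and metric name below are examples, not values taken from the CI job:

  $ # put the downloaded WAL under an otherwise-empty data directory; Prometheus replays it on startup
  $ mkdir -p promdata && cp -r /path/to/downloaded/wal promdata/wal
  $ # Prometheus 2.7.2, matching the version shipped in OCP at the time
  $ ./prometheus --storage.tsdb.path=./promdata --web.listen-address=127.0.0.1:9090 &
  $ # query the replayed data, e.g. the AWS cloud-provider request metrics
  $ curl -s 'http://127.0.0.1:9090/api/v1/query?query=cloudprovider_aws_api_request_duration_seconds_count'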

wking also has some tools for fetching global API usage for the AWS account - https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events

Comment 10 Jan Safranek 2019-09-16 14:06:43 UTC
*** Bug 1707495 has been marked as a duplicate of this bug. ***

Comment 16 Jan Safranek 2019-11-15 16:11:35 UTC
*** Bug 1765275 has been marked as a duplicate of this bug. ***

Comment 18 W. Trevor King 2020-01-10 17:11:19 UTC
Jan, I don't know if the Errata sweeper will pick this up and move it from MODIFIED to ON_QE unless you have a link to the relevant PR(s) in the structured bug links. Is it just https://github.com/openshift/origin/pull/24004? Looks like that was against master, but landed before 4.3 split off from master?

Comment 19 Jan Safranek 2020-01-13 12:20:40 UTC
Manually added link to the PR.

Comment 21 Chao Yang 2020-01-16 03:22:24 UTC
This case passes in CI and no regression issues were found.
Updating it to VERIFIED.

Comment 23 errata-xmlrpc 2020-01-23 11:03:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

