Bug 1690588

Summary: DynamicProvisioner tests failing in 4.0 e2e
Product: OpenShift Container Platform
Component: Storage
Version: 4.1.0
Target Release: 4.3.0
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Chance Zibolski <chancez>
Assignee: Jan Safranek <jsafrane>
QA Contact: Chao Yang <chaoyang>
CC: aos-bugs, aos-storage-staff, bbennett, bchilds, hekumar, jupierce, lxia, rmeggins, weliang, wking
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Bug Blocks: 1690620
Last Closed: 2020-01-23 11:03:45 UTC
Attachments:
  Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

Description Chance Zibolski 2019-03-19 18:39:01 UTC
Description of problem:

Release promotion is failing on storage-related e2e tests, specifically the dynamic provisioning ones.

https://openshift-release.svc.ci.openshift.org/releasestream/4.0.0-0.ci/release/4.0.0-0.ci-2019-03-19-143430 to https://openshift-release.svc.ci.openshift.org/releasestream/4.0.0-0.ci/release/4.0.0-0.ci-2019-03-19-144430 are failing on the same storage-related test:

[sig-storage] Dynamic Provisioning DynamicProvisioner should provision storage with different parameters [Suite:openshift/conformance/parallel] [Suite:k8s]
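
For reference, the failing test can be run on its own against a live cluster with the openshift-tests binary from openshift/origin (a sketch; exact subcommand support depends on the origin version in use):

  $ export KUBECONFIG=/path/to/kubeconfig
  $ openshift-tests run-test "[sig-storage] Dynamic Provisioning DynamicProvisioner should provision storage with different parameters [Suite:openshift/conformance/parallel] [Suite:k8s]"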

Comment 1 W. Trevor King 2019-03-19 18:45:53 UTC
Here are the 144340 e2e-aws logs for digging: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5872

Comment 2 Matthew Wong 2019-03-19 19:27:56 UTC
144340 is an issue we have seen before: there are 3 masters but only 2 workers for whatever reason (install failure?). Some of the e2e tests randomly pick a zone from among those that contain schedulable nodes, but the 3 masters are tainted, so there are effectively only 2 usable zones; the tests fail when they pick the third zone, which has no worker in it.
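
A quick way to check for this state on a live cluster is to compare the zones that actually have schedulable workers against the zones the masters occupy (a sketch; the zone label shown is the pre-topology.kubernetes.io one used in this release):

  # one line per node with its AZ; masters carry a NoSchedule taint and so don't count
  $ oc get nodes -L failure-domain.beta.kubernetes.io/zone
  $ oc describe nodes -l node-role.kubernetes.io/master | grep -i taint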

Still looking at 143430.

Comment 3 W. Trevor King 2019-03-19 20:11:37 UTC
> 144340 is an issue we have seen before where there are 3 masters but 2 workers for whatever reason (install fail?).

The installer doesn't create compute nodes; that's the machine-API. I'll create a separate bug for them.

Comment 4 Matthew Wong 2019-03-19 20:13:16 UTC
143430 is due to AWS API throttling. I see repeated RequestLimitExceeded messages and attempts by the controller to avoid it, e.g.:
W0319 15:32:46.442550       1 retry_handler.go:55] Inserting delay before AWS request (ec2::DetachVolume) to avoid RequestLimitExceeded: 23s
W0319 15:30:12.803852       1 retry_handler.go:99] Got RequestLimitExceeded error on AWS request (ec2::DetachVolume)
W0319 15:33:48.804427       1 retry_handler.go:55] Inserting delay before AWS request (ec2::DeleteVolume) to avoid RequestLimitExceeded: 12s
W0319 15:34:00.814020       1 retry_handler.go:99] Got RequestLimitExceeded error on AWS request (ec2::DeleteVolume)


The first detach attempt fails because of a 503:
E0319 15:30:51.088249       1 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0d81484f71692454e\"" failed. No retries permitted until 2019-03-19 15:30:51.588213212 +0000 UTC m=+1371.492833671 (durationBeforeRetry 500ms). Error: "DetachVolume.Detach failed for volume \"pvc-300a62d2-4a5b-11e9-9ff4-0a97e445762a\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0d81484f71692454e\") on node \"ip-10-0-145-137.ec2.internal\" : error detaching EBS volume \"vol-0d81484f71692454e\" from \"i-0bdce73584da52045\": \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 75967ad2-325b-4cbb-8e6b-e88fa7347351\""

2 1/2 minutes later it succeeds:
I0319 15:33:27.800684       1 operation_generator.go:420] DetachVolume.Detach succeeded for volume "pvc-300a62d2-4a5b-11e9-9ff4-0a97e445762a" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0d81484f71692454e") on node "ip-10-0-145-137.ec2.internal" 

But then DeleteVolume gets throttled.
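
For gauging how much throttling a given run hit, a grep over the kube-controller-manager log from the CI artifacts is usually enough (a sketch; the file name depends on how the job's artifacts are laid out):

  # count throttled AWS calls and look at the delays the retry handler inserted
  $ grep -c 'RequestLimitExceeded' kube-controller-manager.log
  $ grep 'Inserting delay before AWS request' kube-controller-manager.log | tail -n 5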

Comment 5 W. Trevor King 2019-03-22 05:20:46 UTC
Created attachment 1546763 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This has caused 25 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed: .*Dynamic Provisioning DynamicProvisioner should provision storage with different parameters'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 6 Matthew Wong 2019-03-22 17:20:27 UTC
I went through recent failures, and most of the 'Dynamic Provisioning DynamicProvisioner should provision storage with different parameters' failures are due to the machine-API bug (14/17); the remaining 3/17 are AWS API rate limiting/throttling, since provisioning/deletion takes 2+ minutes every time and the test times out at 15m. I am not sure how to investigate what is causing us to hit the limits; it seems that for these PR logs only the build log is available. Are there AWS API metrics available somewhere?

multi-AZ bug:
/logs/release-openshift-origin-installer-e2e-aws-4.0/6046/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/5967/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/6012/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/5980/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/5954/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/6022/build-log.txt:1
/pr-logs/pull/22156/pull-ci-openshift-origin-master-e2e-aws/5841/build-log.txt:1
/pr-logs/pull/batch/pull-ci-openshift-console-master-e2e-aws/185/build-log.txt:1
/pr-logs/pull/openshift_console/1319/pull-ci-openshift-console-master-e2e-aws/208/build-log.txt:1
/pr-logs/pull/openshift_cluster-storage-operator/21/pull-ci-openshift-cluster-storage-operator-master-e2e-aws/70/build-log.txt:1
/pr-logs/pull/openshift_machine-config-operator/564/pull-ci-openshift-machine-config-operator-master-e2e-aws/2687/build-log.txt:1
/pr-logs/pull/openshift_installer/1439/pull-ci-openshift-installer-master-e2e-aws/4549/build-log.txt:1
/pr-logs/pull/openshift_installer/1445/pull-ci-openshift-installer-master-e2e-aws/4572/build-log.txt:1
/pr-logs/pull/operator-framework_operator-registry/39/pull-ci-operator-framework-operator-registry-master-e2e-aws/70/build-log.txt:1

throttle:
/pr-logs/pull/openshift_cluster-kube-apiserver-operator/350/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/1733/build-log.txt:1
/pr-logs/pull/openshift_cluster-logging-operator/123/pull-ci-openshift-cluster-logging-operator-master-e2e-aws/323/build-log.txt:1
/pr-logs/pull/operator-framework_operator-registry/40/pull-ci-operator-framework-operator-registry-master-e2e-aws/71/build-log.txt:1

Comment 7 Hemant Kumar 2019-05-01 18:42:23 UTC
BTW, the metrics are available in the CI build itself as a WAL file. It took me a bit of time to figure out how to use the WAL file to start Prometheus (because I did not know which version we are using in OCP - https://bugzilla.redhat.com/show_bug.cgi?id=1704915) and query the API. You need Prometheus 2.7.2 locally; run it and point it at your local download of the WAL directory as its data directory, then query away.
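
Roughly, that looks like this (a sketch; the paths and the example query are mine, not taken from the CI job):

  # drop the downloaded WAL into a fresh Prometheus data directory
  $ mkdir -p prom-data/wal && cp -r ./artifacts/metrics/wal/* prom-data/wal/
  # Prometheus still wants a config file; an empty one is fine since we only read old data
  $ touch empty.yml
  $ ./prometheus --config.file=empty.yml --storage.tsdb.path=./prom-data
  # then query http://localhost:9090, e.g. the cloudprovider_aws_api_* request metrics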

wking also has some tools for fetching global API usage for the AWS account - https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events

Comment 10 Jan Safranek 2019-09-16 14:06:43 UTC
*** Bug 1707495 has been marked as a duplicate of this bug. ***

Comment 16 Jan Safranek 2019-11-15 16:11:35 UTC
*** Bug 1765275 has been marked as a duplicate of this bug. ***

Comment 18 W. Trevor King 2020-01-10 17:11:19 UTC
Jan, I don't know if the Errata sweeper will pick this up and move it from MODIFIED to ON_QA unless you have a link to the relevant PR(s) in the structured bug links. Is it just https://github.com/openshift/origin/pull/24004 ? It looks like that was against master, but it landed before 4.3 split off from master?

Comment 19 Jan Safranek 2020-01-13 12:20:40 UTC
Manually added link to the PR.

Comment 21 Chao Yang 2020-01-16 03:22:24 UTC
This case passes in CI and no regression was found.
Updating it to VERIFIED.

Comment 23 errata-xmlrpc 2020-01-23 11:03:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062