Description of problem: Release promotion is failing on storage-related e2e tests, specifically dynamic provisioning. https://openshift-release.svc.ci.openshift.org/releasestream/4.0.0-0.ci/release/4.0.0-0.ci-2019-03-19-143430 and https://openshift-release.svc.ci.openshift.org/releasestream/4.0.0-0.ci/release/4.0.0-0.ci-2019-03-19-144430 are both failing on the same storage test:

[sig-storage] Dynamic Provisioning DynamicProvisioner should provision storage with different parameters [Suite:openshift/conformance/parallel] [Suite:k8s]
Here are the 144430 e2e-aws logs for digging: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5872
144430 is an issue we have seen before: there are 3 masters but only 2 workers (install failure?). Some of the e2e tests randomly pick a zone from among those that have schedulable nodes, but the 3 masters are tainted, so there are effectively only 2 usable zones; the tests fail whenever they pick the third zone, which has no worker in it. Still looking at 143430.
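A quick way to see that shape on a live cluster is to compare zones against taints (a sketch; assumes the 4.0-era zone label failure-domain.beta.kubernetes.io/zone and the node-role.kubernetes.io/master taint key):

# Zone label per node:
$ oc get nodes -L failure-domain.beta.kubernetes.io/zone
# Taint keys per node:
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'

If the first command shows three zones but only two of them have a node without the master taint, any test that randomly picks the third zone will fail as described.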
> 144430 is an issue we have seen before: there are 3 masters but only 2 workers (install failure?).

The installer doesn't create compute nodes; that's the machine API. I'll create a separate bug for them.
143430 is due to AWS API throttling. I see repeated RequestLimitExceeded messages and attempts by the controller to avoid it, e.g.:

W0319 15:32:46.442550 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DetachVolume) to avoid RequestLimitExceeded: 23s
W0319 15:30:12.803852 1 retry_handler.go:99] Got RequestLimitExceeded error on AWS request (ec2::DetachVolume)
W0319 15:33:48.804427 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DeleteVolume) to avoid RequestLimitExceeded: 12s
W0319 15:34:00.814020 1 retry_handler.go:99] Got RequestLimitExceeded error on AWS request (ec2::DeleteVolume)

The first detach attempt fails because of a 503:

E0319 15:30:51.088249 1 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0d81484f71692454e\"" failed. No retries permitted until 2019-03-19 15:30:51.588213212 +0000 UTC m=+1371.492833671 (durationBeforeRetry 500ms). Error: "DetachVolume.Detach failed for volume \"pvc-300a62d2-4a5b-11e9-9ff4-0a97e445762a\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0d81484f71692454e\") on node \"ip-10-0-145-137.ec2.internal\" : error detaching EBS volume \"vol-0d81484f71692454e\" from \"i-0bdce73584da52045\": \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 75967ad2-325b-4cbb-8e6b-e88fa7347351\""

Two and a half minutes later it succeeds:

I0319 15:33:27.800684 1 operation_generator.go:420] DetachVolume.Detach succeeded for volume "pvc-300a62d2-4a5b-11e9-9ff4-0a97e445762a" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0d81484f71692454e") on node "ip-10-0-145-137.ec2.internal"

But then DeleteVolume gets throttled.
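For anyone digging through similar logs, counting the throttled calls per EC2 operation shows where the request volume is going (a sketch; kube-controller-manager.log is a placeholder for wherever you saved the controller logs):

$ grep 'Got RequestLimitExceeded error on AWS request' kube-controller-manager.log \
      | sed 's/.*(\(ec2::[A-Za-z]*\)).*/\1/' | sort | uniq -c | sort -rn

That prints a count per operation (ec2::DetachVolume, ec2::DeleteVolume, ...), which helps tell whether one controller is generating most of the throttled traffic.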
Created attachment 1546763 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This has caused 25 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

$ deck-build-log-plot 'failed: .*Dynamic Provisioning DynamicProvisioner should provision storage with different parameters'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
I went through recent failures, and most of the 'Dynamic Provisioning DynamicProvisioner should provision storage with different parameters' failures are due to the machine-API bug (14/17); the remaining 3/17 are AWS API rate-limit/throttling failures, since provisioning/deletion takes 2+ minutes each time and the test times out at 15m. I am not sure how to investigate what is causing us to hit the limits; for these PR jobs only the build-log seems to be available. Are there AWS API metrics available somewhere? (See the grep sketch after the lists below.)

multi-AZ bug:
/logs/release-openshift-origin-installer-e2e-aws-4.0/6046/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/5967/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/6012/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/5980/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/5954/build-log.txt:1
/logs/release-openshift-origin-installer-e2e-aws-4.0/6022/build-log.txt:1
/pr-logs/pull/22156/pull-ci-openshift-origin-master-e2e-aws/5841/build-log.txt:1
/pr-logs/pull/batch/pull-ci-openshift-console-master-e2e-aws/185/build-log.txt:1
/pr-logs/pull/openshift_console/1319/pull-ci-openshift-console-master-e2e-aws/208/build-log.txt:1
/pr-logs/pull/openshift_cluster-storage-operator/21/pull-ci-openshift-cluster-storage-operator-master-e2e-aws/70/build-log.txt:1
/pr-logs/pull/openshift_machine-config-operator/564/pull-ci-openshift-machine-config-operator-master-e2e-aws/2687/build-log.txt:1
/pr-logs/pull/openshift_installer/1439/pull-ci-openshift-installer-master-e2e-aws/4549/build-log.txt:1
/pr-logs/pull/openshift_installer/1445/pull-ci-openshift-installer-master-e2e-aws/4572/build-log.txt:1
/pr-logs/pull/operator-framework_operator-registry/39/pull-ci-operator-framework-operator-registry-master-e2e-aws/70/build-log.txt:1

throttle:
/pr-logs/pull/openshift_cluster-kube-apiserver-operator/350/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/1733/build-log.txt:1
/pr-logs/pull/openshift_cluster-logging-operator/123/pull-ci-openshift-cluster-logging-operator-master-e2e-aws/323/build-log.txt:1
/pr-logs/pull/operator-framework_operator-registry/40/pull-ci-operator-framework-operator-registry-master-e2e-aws/71/build-log.txt:1
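A split like the one above can be reproduced roughly like this (a sketch over locally downloaded build logs; it treats RequestLimitExceeded as the throttle signature and defaults everything else to the multi-AZ bucket, which is an assumption since the multi-AZ failure message varies):

$ for log in */build-log.txt; do
>     if grep -q 'RequestLimitExceeded' "$log"; then
>         echo "throttle: $log"
>     else
>         echo "multi-AZ (likely): $log"
>     fi
> done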
BTW, the metrics are available in the CI build itself as a Prometheus WAL file. It took me a bit of time to figure out how to start Prometheus from the WAL file (because I did not know which version we are using in OCP - https://bugzilla.redhat.com/show_bug.cgi?id=1704915) and query the API. You need Prometheus 2.7.2 locally; run it with your local download of the WAL directory as its data directory, and then go ahead and query. wking also has some tools for fetching global API usage for the AWS account: https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events
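For reference, the local replay looks roughly like this (a sketch; promdata and the downloaded WAL path are placeholders for wherever you extracted the job's WAL):

$ curl -LO https://github.com/prometheus/prometheus/releases/download/v2.7.2/prometheus-2.7.2.linux-amd64.tar.gz
$ tar xzf prometheus-2.7.2.linux-amd64.tar.gz
# Prometheus expects the WAL in a wal/ subdirectory of its data dir:
$ mkdir -p promdata/wal && cp -r /path/to/downloaded/wal/* promdata/wal/
$ ./prometheus-2.7.2.linux-amd64/prometheus \
      --config.file=prometheus-2.7.2.linux-amd64/prometheus.yml \
      --storage.tsdb.path=promdata

Prometheus replays the WAL into its head block on startup; once it is up, query at http://localhost:9090 as usual.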
*** Bug 1707495 has been marked as a duplicate of this bug. ***
*** Bug 1765275 has been marked as a duplicate of this bug. ***
Jan, I don't know if the Errata sweeper will pick this up and move it from MODIFIED to ON_QE unless you have a link to the relevant PR(s) in the structured bug links. Is it just https://github.com/openshift/origin/pull/24004 ? Looks like that was against master, but landed before 4.3 split off from master?
Manually added link to the PR.
This case passes in CI and no regression was found. Updating it to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062