Bug 1810470 - [Flake] volume expansion tests occasionally flake with EBS CSI driver
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Hemant Kumar
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks: 1879192
 
Reported: 2020-03-05 10:26 UTC by Fabio Bertinatto
Modified: 2020-09-15 16:11 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift aws-ebs-csi-driver pull 163 None closed Bug 1810470:Make EBS controllerexpansion idempotent 2020-09-14 05:50:22 UTC
Github openshift aws-ebs-csi-driver pull 164 None closed Bug 1810470: Drop optimizing check from resizing call 2020-09-14 05:50:21 UTC
Github openshift aws-ebs-csi-driver pull 165 None closed Bug 1810470: Verify pending volume modifications and size both 2020-09-14 05:50:21 UTC
Github openshift aws-ebs-csi-driver pull 167 None closed Bug 1810470: carry: Check for optimizing state too 2020-09-14 05:50:21 UTC
Github openshift csi-external-resizer pull 111 None closed Bug 1810470: Update with rc3 which fixes idempotency issues 2020-09-14 05:50:21 UTC
Github openshift origin pull 24873 None closed Bug 1810470: UPSTREAM: <drop> Increase timeout in volume expansion test 2020-09-14 05:50:21 UTC

Description Fabio Bertinatto 2020-03-05 10:26:05 UTC
A few examples:

Block: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24633/pull-ci-openshift-origin-master-e2e-aws-csi/158
FS: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24622/pull-ci-openshift-origin-master-e2e-aws-csi/165

External Storage [Driver: ebs.csi.aws.com] [Testpattern: Dynamic PV (block volmode)(allowExpansion)] volume-expand Verify if offline PVC expansion works 5m57s
fail [k8s.io/kubernetes/test/e2e/storage/testsuites/volume_expand.go:213]: while recreating pod for resizing
Unexpected error:
    <*errors.errorString | 0xc002801c80>: {
        s: "pod \"security-context-70334232-54ff-4328-b510-acc0ac7a3a95\" is not Running: timed out waiting for the condition",
    }
    pod "security-context-70334232-54ff-4328-b510-acc0ac7a3a95" is not Running: timed out waiting for the condition
occurred

Comment 9 Chao Yang 2020-05-11 07:20:26 UTC
The PR is not merged yet.
Updating the status to ASSIGNED.

Comment 11 Fabio Bertinatto 2020-05-27 15:21:30 UTC
Inspected a new failure here: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/24680/pull-ci-openshift-origin-master-e2e-aws-csi/2908

1. This is the first time the controller tried to attach the resized volume to a new pod (15:05:37):

May 26 15:15:45.314: INFO: At 2020-05-26 15:05:37 +0000 UTC - event for security-context-eb8d1bb4-af5b-4fbb-bcc5-35bfcd225d3e: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-ffe583b7-5d6d-4937-84ff-5cb5e02df464" : rpc error: code = Internal desc = Could not attach volume "vol-069d2b1853f909252" to node "i-045220a861c8e0b43": could not attach volume "vol-069d2b1853f909252" to node "i-045220a861c8e0b43": OperationNotPermitted: Cannot attach volume vol-069d2b1853f909252 when it is in modification state: MODIFYING


2. As we can see, it failed. However, it retries many times, until it succeeds more than 8 minutes later:

May 26 15:15:45.314: INFO: At 2020-05-26 15:13:51 +0000 UTC - event for security-context-eb8d1bb4-af5b-4fbb-bcc5-35bfcd225d3e: {attachdetach-controller } SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-ffe583b7-5d6d-4937-84ff-5cb5e02df464" 

3. At this point, 8 min 14 s of the 10-minute deadline have been spent.

4. The 10-minute deadline is reached, so the pod is deleted and the test fails:

May 26 15:15:37.004: INFO: Deleting pod "security-context-eb8d1bb4-af5b-4fbb-bcc5-35bfcd225d3e" in namespace "e2e-volume-expand-4993"

Comment 12 Fabio Bertinatto 2020-05-28 08:27:56 UTC
The cause of this flake: reattaching a resized volume in AWS can take many minutes, exceeding the deadlines we tried.

We'll have to fix this upstream. One possible solution is to add some granularity to the deadlines, making them configurable per plugin/driver.

Comment 14 Fabio Bertinatto 2020-06-19 15:26:29 UTC
Will work on the upstream issue: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/498

Comment 17 Fabio Bertinatto 2020-08-21 11:54:43 UTC
@hekumar is working on a fix.

Comment 23 Chao Yang 2020-08-31 09:57:43 UTC
Passed on 4.6.0-0.nightly-2020-08-27-005538/4.6.0-0.nightly-2020-08-26-202109/4.6.0-0.nightly-2020-08-26-093617

Comment 27 Hemant Kumar 2020-09-01 05:22:41 UTC
The test failed again.

Comment 28 Chao Yang 2020-09-01 05:35:22 UTC
Yes, it failed again.
@fbertina Thanks for finding this.

Comment 32 Chao Yang 2020-09-15 06:25:46 UTC
Checked several jobs here: https://prow.ci.openshift.org/?job=pull-ci-openshift-origin-master-e2e-aws-csi
This BZ passes.

