Bug 1810470

Summary: [Flake] volume expansion tests occasionally flake with EBS CSI driver
Product: OpenShift Container Platform
Reporter: Fabio Bertinatto <fbertina>
Component: Storage
Assignee: OpenShift Storage Bugzilla Bot <ocp-storage-bot>
Storage sub component: Storage
QA Contact: Wei Duan <wduan>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: medium
Priority: unspecified
CC: aos-bugs, chaoyang, jsafrane, piqin, vlaad
Version: 4.4
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-25 20:41:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1879192

Description Fabio Bertinatto 2020-03-05 10:26:05 UTC
A few examples:

Block: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24633/pull-ci-openshift-origin-master-e2e-aws-csi/158
FS: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24622/pull-ci-openshift-origin-master-e2e-aws-csi/165

External Storage [Driver: ebs.csi.aws.com] [Testpattern: Dynamic PV (block volmode)(allowExpansion)] volume-expand Verify if offline PVC expansion works 5m57s
fail [k8s.io/kubernetes/test/e2e/storage/testsuites/volume_expand.go:213]: while recreating pod for resizing
Unexpected error:
    <*errors.errorString | 0xc002801c80>: {
        s: "pod \"security-context-70334232-54ff-4328-b510-acc0ac7a3a95\" is not Running: timed out waiting for the condition",
    }
    pod "security-context-70334232-54ff-4328-b510-acc0ac7a3a95" is not Running: timed out waiting for the condition
occurred

Comment 9 Chao Yang 2020-05-11 07:20:26 UTC
The PR is not merged yet.
Updating the status to ASSIGNED.

Comment 11 Fabio Bertinatto 2020-05-27 15:21:30 UTC
Inspected a new failure here: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/24680/pull-ci-openshift-origin-master-e2e-aws-csi/2908

1. This is the first time the controller tried to attach the resized volume to a new pod (15:05:37):

May 26 15:15:45.314: INFO: At 2020-05-26 15:05:37 +0000 UTC - event for security-context-eb8d1bb4-af5b-4fbb-bcc5-35bfcd225d3e: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-ffe583b7-5d6d-4937-84ff-5cb5e02df464" : rpc error: code = Internal desc = Could not attach volume "vol-069d2b1853f909252" to node "i-045220a861c8e0b43": could not attach volume "vol-069d2b1853f909252" to node "i-045220a861c8e0b43": OperationNotPermitted: Cannot attach volume vol-069d2b1853f909252 when it is in modification state: MODIFYING


2. That attempt failed. The controller kept retrying until the attach finally succeeded more than 8 minutes later (15:13:51):

May 26 15:15:45.314: INFO: At 2020-05-26 15:13:51 +0000 UTC - event for security-context-eb8d1bb4-af5b-4fbb-bcc5-35bfcd225d3e: {attachdetach-controller } SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-ffe583b7-5d6d-4937-84ff-5cb5e02df464" 

3. At this point, 8 min 14 s of the 10-minute deadline have already been spent.

4. The 10-minute deadline is then reached, so the pod is deleted and the test fails:

May 26 15:15:37.004: INFO: Deleting pod "security-context-eb8d1bb4-af5b-4fbb-bcc5-35bfcd225d3e" in namespace "e2e-volume-expand-4993"
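
The arithmetic behind steps 1 to 4 can be checked with a short sketch. The timestamps are copied from the events quoted above. The variable names are made up for illustration, and the sketch assumes the 10-minute deadline started counting at about the time of the first attach attempt, which the comment implies but does not state.

```python
from datetime import datetime, timedelta

# Timestamps copied from the events quoted above (2020-05-26, UTC).
first_attach_failure = datetime(2020, 5, 26, 15, 5, 37)
attach_succeeded = datetime(2020, 5, 26, 15, 13, 51)
pod_deleted = datetime(2020, 5, 26, 15, 15, 37, 4000)  # 15:15:37.004

# Deadline as stated in the comment. Assumption: the clock started at
# roughly the first attach attempt.
deadline = timedelta(minutes=10)

spent_attaching = attach_succeeded - first_attach_failure
left_after_attach = deadline - spent_attaching
total_until_delete = pod_deleted - first_attach_failure

print(spent_attaching)     # 0:08:14
print(left_after_attach)   # 0:01:46
print(total_until_delete)  # 0:10:00.004000
```

Under that assumption, the pod had under two minutes left to reach Running once the volume was attached, and it was deleted almost exactly 10 minutes after the first failed attach.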

Comment 12 Fabio Bertinatto 2020-05-28 08:27:56 UTC
The cause of this flake is: reattaching a resized volume in AWS can take many minutes, exceeding the deadlines we tried.

We'll have to fix this upstream. One possible solution is to add some granularity to the deadlines, making them configurable per plugin/driver.
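
As a rough illustration of the per-driver idea, a lookup table with a default fallback might look like the sketch below. Everything in it is hypothetical: the names `DEFAULT_POD_START_TIMEOUT`, `DRIVER_POD_START_TIMEOUTS` and `pod_start_timeout` are invented here, the 15-minute override is an arbitrary example value, and none of it is taken from the Kubernetes e2e framework or from the eventual fix.

```python
from datetime import timedelta

# Hypothetical default: the 10-minute deadline mentioned in this bug.
DEFAULT_POD_START_TIMEOUT = timedelta(minutes=10)

# Hypothetical per-driver overrides. The 15-minute value is illustrative only.
DRIVER_POD_START_TIMEOUTS = {
    "ebs.csi.aws.com": timedelta(minutes=15),
}


def pod_start_timeout(driver_name: str) -> timedelta:
    """Return the pod-start deadline for a driver, falling back to the default."""
    return DRIVER_POD_START_TIMEOUTS.get(driver_name, DEFAULT_POD_START_TIMEOUT)


print(pod_start_timeout("ebs.csi.aws.com"))     # 0:15:00
print(pod_start_timeout("some.other.driver"))   # 0:10:00
```

With a table like this, drivers with slow reattach after resize could get a longer deadline without lengthening the test run for every other driver.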

Comment 14 Fabio Bertinatto 2020-06-19 15:26:29 UTC
Will work on the upstream issue: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/498

Comment 17 Fabio Bertinatto 2020-08-21 11:54:43 UTC
@hekumar is working on a fix.

Comment 23 Chao Yang 2020-08-31 09:57:43 UTC
Passed on 4.6.0-0.nightly-2020-08-27-005538/4.6.0-0.nightly-2020-08-26-202109/4.6.0-0.nightly-2020-08-26-093617

Comment 27 Hemant Kumar 2020-09-01 05:22:41 UTC
failed again.

Comment 28 Chao Yang 2020-09-01 05:35:22 UTC
Yes, it failed again.
@fbertina Thanks for finding this.

Comment 32 Chao Yang 2020-09-15 06:25:46 UTC
Checked several jobs here: https://prow.ci.openshift.org/?job=pull-ci-openshift-origin-master-e2e-aws-csi
This BZ passes.