1810470 – [Flake] volume expansion tests occasionally flake with EBS CSI driver

Summary: [Flake] volume expansion tests occasionally flake with EBS CSI driver

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	OpenShift Storage Bugzilla Bot
QA Contact:	Wei Duan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1879192

Reported:	2020-03-05 10:26 UTC by Fabio Bertinatto
Modified:	2022-08-25 20:41 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-25 20:41:14 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift aws-ebs-csi-driver pull 163	None	closed	Bug 1810470:Make EBS controllerexpansion idempotent	2021-01-21 21:43:42 UTC
Github	openshift aws-ebs-csi-driver pull 164	None	closed	Bug 1810470: Drop optimizing check from resizing call	2021-01-21 21:43:42 UTC
Github	openshift aws-ebs-csi-driver pull 165	None	closed	Bug 1810470: Verify pending volume modifications and size both	2021-01-21 21:43:42 UTC
Github	openshift aws-ebs-csi-driver pull 167	None	closed	Bug 1810470: carry: Check for optimizing state too	2021-01-21 21:43:42 UTC
Github	openshift csi-external-resizer pull 111	None	closed	Bug 1810470: Update with rc3 which fixes idempotency issues	2021-01-21 21:43:46 UTC
Github	openshift origin pull 24873	None	closed	Bug 1810470: UPSTREAM: <drop> Increase timeout in volume expansion test	2021-01-21 21:43:42 UTC

Description Fabio Bertinatto 2020-03-05 10:26:05 UTC

A few examples:

Block: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24633/pull-ci-openshift-origin-master-e2e-aws-csi/158
FS: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24622/pull-ci-openshift-origin-master-e2e-aws-csi/165

External Storage [Driver: ebs.csi.aws.com] [Testpattern: Dynamic PV (block volmode)(allowExpansion)] volume-expand Verify if offline PVC expansion works	5m57s
fail [k8s.io/kubernetes/test/e2e/storage/testsuites/volume_expand.go:213]: while recreating pod for resizing
Unexpected error:
    <*errors.errorString | 0xc002801c80>: {
        s: "pod \"security-context-70334232-54ff-4328-b510-acc0ac7a3a95\" is not Running: timed out waiting for the condition",
    }
    pod "security-context-70334232-54ff-4328-b510-acc0ac7a3a95" is not Running: timed out waiting for the condition
occurred

Comment 9 Chao Yang 2020-05-11 07:20:26 UTC

The PR has not been merged yet.
Updating the status to ASSIGNED.

Comment 11 Fabio Bertinatto 2020-05-27 15:21:30 UTC

Inspected a new failure here: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/24680/pull-ci-openshift-origin-master-e2e-aws-csi/2908

1. This is the first time the controller tried to attach the resized volume to a new pod (15:05:37):

May 26 15:15:45.314: INFO: At 2020-05-26 15:05:37 +0000 UTC - event for security-context-eb8d1bb4-af5b-4fbb-bcc5-35bfcd225d3e: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-ffe583b7-5d6d-4937-84ff-5cb5e02df464" : rpc error: code = Internal desc = Could not attach volume "vol-069d2b1853f909252" to node "i-045220a861c8e0b43": could not attach volume "vol-069d2b1853f909252" to node "i-045220a861c8e0b43": OperationNotPermitted: Cannot attach volume vol-069d2b1853f909252 when it is in modification state: MODIFYING


2. As we can see, that attempt failed. The controller kept retrying, and the attach only succeeded more than 8 minutes later (15:13:51):

May 26 15:15:45.314: INFO: At 2020-05-26 15:13:51 +0000 UTC - event for security-context-eb8d1bb4-af5b-4fbb-bcc5-35bfcd225d3e: {attachdetach-controller } SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-ffe583b7-5d6d-4937-84ff-5cb5e02df464" 

3. At this point, 8 min 14 s of the 10-minute deadline had already been spent.

4. The 10-minute deadline was reached before the pod became Running, so the pod was deleted and the test failed:

May 26 15:15:37.004: INFO: Deleting pod "security-context-eb8d1bb4-af5b-4fbb-bcc5-35bfcd225d3e" in namespace "e2e-volume-expand-4993"
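
The timing in the steps above can be double-checked with a few lines of Python. This is only a sketch: the timestamps are copied from the log lines quoted in this comment, the year on the pod deletion line is assumed to be 2020, and treating the first attach attempt as the start of the 10-minute deadline is an assumption (the log does not show when the wait actually began).

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps taken from the events quoted above.
first_attach_attempt = datetime.strptime("2020-05-26 15:05:37", FMT)
attach_success = datetime.strptime("2020-05-26 15:13:51", FMT)
# The deletion log line carries no year; 2020 is assumed, fractional seconds dropped.
pod_deleted = datetime.strptime("2020-05-26 15:15:37", FMT)

# Deadline for the pod to reach Running, as stated in this comment.
deadline = timedelta(minutes=10)

# Time consumed by the failed attach retries.
spent_attaching = attach_success - first_attach_attempt

# Time left for the pod to start after the volume was finally attached
# (assumes the deadline started at the first attach attempt).
remaining = deadline - spent_attaching

print(spent_attaching)  # 0:08:14
print(remaining)        # 0:01:46
```

Under that assumption, less than two minutes were left for mounting the volume and starting the containers, and the pod was deleted exactly 10 minutes after the first attach attempt.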

Comment 12 Fabio Bertinatto 2020-05-28 08:27:56 UTC

The cause of this flake is: reattaching a resized volume in AWS can take many minutes, exceeding the deadlines we tried.

We'll have to fix this upstream. One possible solution is to add some granularity to the deadlines, making them configurable per plugin/driver.
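
To illustrate what per-driver deadlines could look like, here is a minimal sketch in Python. It is not the upstream e2e framework code: the names `DEFAULT_POD_START_TIMEOUT`, `PER_DRIVER_POD_START_TIMEOUT` and `pod_start_timeout` are hypothetical, and the 20-minute value for the EBS driver is an arbitrary placeholder, not a measured or agreed number. Only the 10-minute default comes from this bug.

```python
from datetime import timedelta

# Default deadline for a pod to reach Running (the 10 minutes seen in this bug).
DEFAULT_POD_START_TIMEOUT = timedelta(minutes=10)

# Hypothetical per-driver overrides; the 20-minute value is a placeholder.
PER_DRIVER_POD_START_TIMEOUT = {
    "ebs.csi.aws.com": timedelta(minutes=20),
}


def pod_start_timeout(driver_name: str) -> timedelta:
    """Return the deadline for a driver, falling back to the default."""
    return PER_DRIVER_POD_START_TIMEOUT.get(driver_name, DEFAULT_POD_START_TIMEOUT)


# A driver with an override gets the longer deadline; any other driver
# (the name below is made up) keeps the default.
print(pod_start_timeout("ebs.csi.aws.com"))
print(pod_start_timeout("some.other.driver"))
```

Keeping the default untouched means drivers that reattach quickly would still fail fast, while slow backends such as EBS after a resize get extra time.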

Comment 14 Fabio Bertinatto 2020-06-19 15:26:29 UTC

Will work on the upstream issue: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/498

Comment 17 Fabio Bertinatto 2020-08-21 11:54:43 UTC

@hekumar is working on a fix.

Comment 23 Chao Yang 2020-08-31 09:57:43 UTC

Passed on 4.6.0-0.nightly-2020-08-27-005538/4.6.0-0.nightly-2020-08-26-202109/4.6.0-0.nightly-2020-08-26-093617

Comment 24 Fabio Bertinatto 2020-08-31 11:34:44 UTC

I hit this recently (around 2 hours ago): https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_aws-ebs-csi-driver-operator/83/pull-ci-openshift-aws-ebs-csi-driver-operator-master-e2e-operator/1300345097472184320

@chao, could you take a look?

Comment 27 Hemant Kumar 2020-09-01 05:22:41 UTC

failed again.

Comment 28 Chao Yang 2020-09-01 05:35:22 UTC

Yes, it failed again.
@fbertina Thanks for finding this.

Comment 30 Chao Yang 2020-09-04 09:05:10 UTC

Still hitting a similar error here: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25442/pull-ci-openshift-origin-master-e2e-aws-csi/1301692073983873024

Comment 32 Chao Yang 2020-09-15 06:25:46 UTC

Checked several jobs here: https://prow.ci.openshift.org/?job=pull-ci-openshift-origin-master-e2e-aws-csi
Passed this bz.
