Bug 1948603 - Azure CSI driver does not pass e2e-azure-csi tests
Summary: Azure CSI driver does not pass e2e-azure-csi tests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Fabio Bertinatto
QA Contact: Wei Duan
URL:
Whiteboard:
Duplicates: 1948535
Depends On:
Blocks:
Reported: 2021-04-12 15:03 UTC by Jan Safranek
Modified: 2021-10-18 17:30 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:29:50 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift azure-disk-csi-driver-operator pull 14 0 None closed Bug 1948603: Fix some failing tests 2021-04-27 15:55:53 UTC
Github openshift azure-disk-csi-driver-operator pull 15 0 None closed Bug 1948603: Disable volume expansion e2e tests 2021-04-27 15:55:53 UTC
Github openshift azure-disk-csi-driver-operator pull 18 0 None None None 2021-08-25 11:55:25 UTC
Github openshift azure-disk-csi-driver pull 6 0 None closed Bug 1948603: Rebase v1.1.1 2021-04-27 15:55:57 UTC
Github openshift release pull 15360 0 None closed Add workflow for CSI migration tests 2021-04-27 15:55:58 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:30:13 UTC

Description Jan Safranek 2021-04-12 15:03:13 UTC
The e2e-azure-csi job runs our CSI certification tests, and the Azure CSI driver consistently fails them.

Example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_azure-disk-csi-driver-operator/12/pull-ci-openshift-azure-disk-csi-driver-operator-master-e2e-azure-csi/1380559175712509952

Full history (in the operator repo):
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-azure-disk-csi-driver-operator-master-e2e-azure-csi

Failed tests:

[Testpattern: Dynamic PV (default fs)] volumes should allow exec of files on the volume
[Testpattern: Dynamic PV (default fs)(allowExpansion)] volume-expand should resize volume when PVC is edited while pod is using it
[Testpattern: Dynamic PV (default fs)(allowExpansion)] volume-expand Verify if offline PVC expansion works
[Testpattern: Pre-provisioned Snapshot (retain policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works after modifying source data, check deletion
[Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options
[Testpattern: Dynamic PV (default fs)] provisioning should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource]
[Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (OnRootMismatch)[LinuxOnly], pod created with an initial fsgroup, volume contents ownership changed in first pod, new pod with different fsgroup applied to the volume contents
[Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents
[Testpattern: Dynamic PV (block volmode)] multiVolume [Slow] should access to two volumes with different volume mode and retain data across pod recreation on the same node [LinuxOnly]
[Testpattern: Dynamic PV (default fs)] subPath should fail if non-existent subpath is outside the volume [Slow][LinuxOnly]
[Testpattern: Dynamic PV (block volmode)(allowExpansion)] volume-expand Verify if offline PVC expansion works
[Testpattern: Dynamic PV (block volmode)] multiVolume [Slow] should access to two volumes with the same volume mode and retain data across pod recreation on the same node [LinuxOnly]
[Testpattern: Dynamic PV (immediate binding)] topology should provision a volume and schedule a pod with AllowedTopologies
[Testpattern: Dynamic PV (default fs)] subPath should support restarting containers using file as subpath [Slow][LinuxOnly]
[Testpattern: Dynamic PV (default fs)] volumes should store data
[Testpattern: Dynamic PV (block volmode)(allowExpansion)] volume-expand should resize volume when PVC is edited while pod is using it
[Testpattern: Dynamic PV (xfs)][Slow] volumes should store data
[Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource]
[Testpattern: Dynamic PV (block volmode)] volumeMode should not mount / map unused volumes in a pod [LinuxOnly]
[Testpattern: Dynamic Snapshot (retain policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works after modifying source data, check deletion

Comment 1 Fabio Bertinatto 2021-04-15 15:26:24 UTC
*** Bug 1948535 has been marked as a duplicate of this bug. ***

Comment 2 Jan Safranek 2021-04-15 18:13:10 UTC
What I noticed today is that when our CI job enables featureSet: TechPreviewNoUpgrade, it starts the tests relatively quickly afterwards. However, the FeatureSet also enables CSI migration, so MCO starts draining and restarting machines while the CSI tests are running. I don't think it's the root cause of *all* test failures, but it at least increases the flakiness of the CI job.

In https://github.com/openshift/release/pull/15360 I am trying to make the job wait until CSI migration is applied everywhere before starting the tests.
I.e. when testing manually, wait ~10 minutes after setting the FeatureSet (or watch `oc get node -w` until everything is restarted).
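
For manual testing, the "watch `oc get node -w` until everything is restarted" step could be scripted roughly as below. This is only a sketch under stated assumptions: it treats a node as settled when it reports `Ready=True` and is not cordoned, and the function names, timeout, poll interval, and the "several consecutive polls" rule are all illustrative choices, not what the release PR implements.

```python
import json
import subprocess
import time


def node_is_settled(node: dict) -> bool:
    """Assumption: 'settled' means Ready=True and not cordoned (spec.unschedulable)."""
    if node.get("spec", {}).get("unschedulable", False):
        return False
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def all_nodes_settled(node_list: dict) -> bool:
    """True only if the list is non-empty and every node is settled."""
    items = node_list.get("items", [])
    return bool(items) and all(node_is_settled(n) for n in items)


def wait_for_nodes(timeout_s: int = 1800, poll_s: int = 30, stable_polls: int = 5) -> bool:
    """Poll `oc get nodes -o json` until all nodes look settled for several
    consecutive polls. The numbers here are hypothetical defaults."""
    deadline = time.monotonic() + timeout_s
    streak = 0
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["oc", "get", "nodes", "-o", "json"], capture_output=True, text=True
        )
        # A failed call (e.g. API server restarting) counts as "not settled".
        ok = result.returncode == 0 and all_nodes_settled(json.loads(result.stdout))
        streak = streak + 1 if ok else 0
        if streak >= stable_polls:
            return True
        time.sleep(poll_s)
    return False
```

Requiring several consecutive good polls matters because MCO restarts machines one after another, so all nodes can briefly look ready between two drains, or before the rollout has started at all. A check like this does not replace waiting the ~10 minutes mentioned above.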

Comment 4 Qin Ping 2021-04-16 02:33:54 UTC
Yes, we need to wait about 10 minutes for the feature gates to be enabled in other components. I waited until all the components were ready and then ran the CSI verification tool, but some cases still failed.

I think we can use this bug to track the fix in the release repo and use separate bugs to track the other issues. So I'll reopen bug 1948535 (currently marked as a duplicate of this bug) and try to verify this bug first.


Hi Fabio,

If you think bug 1948535 is still a duplicate, feel free to close it.

Comment 5 Jan Safranek 2021-04-16 14:41:42 UTC
*** Bug 1948535 has been marked as a duplicate of this bug. ***

Comment 6 Qin Ping 2021-04-19 06:00:19 UTC
Since PR https://github.com/openshift/release/pull/15360 is not merged yet, I'll set the status to POST for now.

Comment 8 Fabio Bertinatto 2021-05-03 08:13:26 UTC
Currently, two categories of tests are still failing with the Azure Disk CSI driver: snapshot tests and volume expansion tests.

Regarding the snapshot tests, openshift/origin needs to get a k8s.io/* bump so that it contains commit [1]. This should be done in PR [2].

Regarding the expansion tests, the rebase done in PR [3] should have fixed some of the failing tests, but some investigation is still needed to determine whether more driver fixes are required.

[1] https://github.com/openshift/kubernetes/commit/ad4f896bdef4619f63b9df878a6e78213db4eef0
[2] https://github.com/openshift/origin/pull/26126
[3] https://github.com/openshift/azure-disk-csi-driver/pull/6

Comment 9 Ben Parees 2021-05-13 13:45:27 UTC
This is blocking https://github.com/openshift/origin/pull/26131

sample failure:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26131/pull-ci-openshift-origin-master-e2e-aws-csi/1392813571842248704


If we are not close to a fix, can we get these tests temporarily disabled to unblock teams?

Comment 10 Fabio Bertinatto 2021-05-17 16:06:46 UTC
Ben, this ticket isn't related to that failure (this is Azure CSI driver). The correct ticket tracking this issue is bug 1913974.

There is some ongoing work to fix that, I'll check if we can disable the test in the meantime.

Comment 11 Fabio Bertinatto 2021-05-18 11:40:39 UTC
For reference, this is the upstream PR that tries to fix the snapshot issue: https://github.com/kubernetes/kubernetes/pull/102021.

Comment 15 Wei Duan 2021-08-26 12:52:07 UTC
Verified pass

Comment 18 errata-xmlrpc 2021-10-18 17:29:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

