Bug 1832627 - CI: azure-disk: volumes should allow exec of files on the volume: Persistent Volume ... not deleted by dynamic provisioner
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Hemant Kumar
QA Contact: Qin Ping
URL:
Whiteboard:
Duplicates: 1852598 (view as bug list)
Depends On:
Blocks: 1871301
 
Reported: 2020-05-06 23:36 UTC by W. Trevor King
Modified: 2023-09-15 00:31 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1871301 (view as bug list)
Environment:
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (default fs)] volumes should allow exec of files on the volume
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] subPath should support readOnly file specified in the volumeMount [LinuxOnly]
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (ext4)] volumes should allow exec of files on the volume
Last Closed: 2020-10-02 15:00:52 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2020-05-06 23:36:40 UTC
test:

[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (default fs)] volumes should allow exec of files on the volume 

is failing frequently in CI; see the search results [1]. At least some of these errors seem to be due to deprovision timeouts [2], e.g. [3]:

fail [k8s.io/kubernetes/test/e2e/framework/util.go:894]: Unexpected error:
    <*errors.errorString | 0xc000ace230>: {
        s: "expected pod \"exec-volume-test-azure-dynamicpv-m24m\" success: Gave up after waiting 5m0s for pod \"exec-volume-test-azure-dynamicpv-m24m\" to be \"success or failure\"",
    }
    expected pod "exec-volume-test-azure-dynamicpv-m24m" success: Gave up after waiting 5m0s for pod "exec-volume-test-azure-dynamicpv-m24m" to be "success or failure"
occurred

Possibly just a provider flake, but it shows up in 15% of the past two days' release-openshift-ocp-installer-e2e-azure-4.4 failures [2], so if nothing else it may be worth raising a timeout.

[1]: https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-storage%5C%5D+In-tree+Volumes+%5C%5BDriver%3A+azure-disk%5C%5D+%5C%5BTestpattern%3A+Pre-provisioned+PV+%5C%28default+fs%5C%29%5C%5D+volumes+should+allow+exec+of+files+on+the+volume
[2]: https://search.svc.ci.openshift.org/?search=Persistent+Volume.*not+deleted+by+dynamic+provisioner.*still+exists+within&maxAge=48h&context=1&type=bug%2Bjunit&name=azure
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1579
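
For context on what is timing out above: the test gives the exec pod a fixed 5-minute budget to reach success, so a slow Azure disk attach can consume most of that budget before the container ever runs. Below is a minimal sketch of that kind of bounded wait, assuming client-go; it is illustrative only, not the actual origin/Kubernetes test code, and the function name is made up.

package volumewait

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodSuccess polls the pod phase every 2s until it succeeds, fails, or
// the timeout (5m in the failing test) expires. A multi-minute attach leaves
// the pod Pending long enough for the poll to give up, exactly as in the
// "Gave up after waiting 5m0s" error above.
func waitForPodSuccess(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		switch pod.Status.Phase {
		case corev1.PodSucceeded:
			return true, nil
		case corev1.PodFailed:
			return false, fmt.Errorf("pod %q failed", name)
		default:
			// Still Pending/Running, e.g. waiting for the Azure disk to attach.
			return false, nil
		}
	})
}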

Comment 1 W. Trevor King 2020-05-06 23:41:01 UTC
I would not be surprised if the issue behind these other recent Azure storage failures was slow deprovisioning:

[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Inline-volume (ext3)] volumes should store data
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (default fs)] volumes should allow exec of files on the volume
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (block volmode)] volumes should store data
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Inline-volume (default fs)] volumes should store data
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (block volmode)] volumeMode should not mount / map unused volumes in a pod
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Inline-volume (default fs)] volumes should allow exec of files on the volume
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (filesystem volmode)] volumeMode should not mount / map unused volumes in a pod
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (ext4)] volumes should allow exec of files on the volume
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (immediate binding)] topology should provision a volume and schedule a pod with AllowedTopologies
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (ext3)] volumes should allow exec of files on the volume

Listing in full so CI search/Sippy can find this bug [1].

[1]: https://sippy-bparees.svc.ci.openshift.org/?release=4.5

Comment 2 Clayton Coleman 2020-05-07 00:19:10 UTC
This in theory should have been fixed in https://github.com/openshift/origin/pull/24933

Comment 3 Clayton Coleman 2020-05-07 00:19:47 UTC
Yeah, I'm going to mark this as on QA.

Comment 4 Clayton Coleman 2020-05-07 00:20:04 UTC

*** This bug has been marked as a duplicate of bug 1828174 ***

Comment 5 Hemant Kumar 2020-05-07 02:29:04 UTC
I investigated a bunch of failing builds in the test run wking linked. For example:

STEP: creating a claim
May  6 16:54:22.862: INFO: Warning: Making PVC: VolumeMode specified as invalid empty string, treating as nil
May  6 16:54:22.923: INFO: Waiting up to 5m0s for PersistentVolumeClaims [azure-disk2zwtb] to have phase Bound
May  6 16:54:22.967: INFO: PersistentVolumeClaim azure-disk2zwtb found but phase is Pending instead of Bound.
May  6 16:54:25.033: INFO: PersistentVolumeClaim azure-disk2zwtb found but phase is Pending instead of Bound.
May  6 16:54:27.079: INFO: PersistentVolumeClaim azure-disk2zwtb found but phase is Pending instead of Bound.
May  6 16:54:29.126: INFO: PersistentVolumeClaim azure-disk2zwtb found but phase is Pending instead of Bound.
May  6 16:54:31.174: INFO: PersistentVolumeClaim azure-disk2zwtb found but phase is Pending instead of Bound.
May  6 16:54:33.221: INFO: PersistentVolumeClaim azure-disk2zwtb found but phase is Pending instead of Bound.
May  6 16:54:35.268: INFO: PersistentVolumeClaim azure-disk2zwtb found and phase=Bound (12.345400974s)
STEP: starting azure-injector
STEP: Deleting pod azure-injector in namespace e2e-volume-7764
May  6 16:59:35.608: INFO: Waiting for pod azure-injector to disappear
May  6 16:59:35.652: INFO: Pod azure-injector still exists
May  6 16:59:37.652: INFO: Waiting for pod azure-injector to disappear
May  6 16:59:37.700: INFO: Pod azure-injector still exists
May  6 16:59:39.652: INFO: Waiting for pod azure-injector to disappear
May  6 16:59:39.697: INFO: Pod azure-injector no longer exists
May  6 16:59:39.697: FAIL: Failed to create injector pod: timed out waiting for the condition

And further down:
May  6 17:04:03.135: INFO: At 2020-05-06 16:53:54 +0000 UTC - event for azure-injector: {default-scheduler } Scheduled: Successfully assigned e2e-volume-1634/azure-injector to ci-op-w9z3168r-fe6b0-xbgpz-worker-centralus1-htp58
May  6 17:04:03.135: INFO: At 2020-05-06 16:55:57 +0000 UTC - event for azure-injector: {kubelet ci-op-w9z3168r-fe6b0-xbgpz-worker-centralus1-htp58} FailedMount: Unable to attach or mount volumes: unmounted volumes=[azure-volume-0], unattached volumes=[default-token-wsbdb azure-volume-0]: timed out waiting for the condition
May  6 17:04:03.135: INFO: At 2020-05-06 16:59:11 +0000 UTC - event for azure-injector: {attachdetach-controller } SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "azure-disk-c87x5" 
May  6 17:04:03.135: INFO: At 2020-05-06 17:00:28 +0000 UTC - event for azure-injector: {kubelet ci-op-w9z3168r-fe6b0-xbgpz-worker-centralus1-htp58} FailedMount: Unable to attach or mount volumes: unmounted volumes=[default-token-wsbdb azure-volume-0], unattached volumes=[default-token-wsbdb azure-volume-0]: timed

So what is happening is that the attach operation is simply taking too long, causing the test to fail. Jan filed an issue with the Azure team to track this: https://github.com/Azure/aks-engine/issues/2888

The fixes we have put in (https://github.com/openshift/origin/pull/24900 and openshift/origin/pull/24933) address volume provisioning errors where the zone and node don't match or nodes are created without availability zones. These PRs do not fix attach/detach issues, and hence I am afraid this bug is not a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1828174

Comment 6 Hemant Kumar 2020-05-07 02:33:28 UTC
Having said that, some test runs do have the same problem as the one being fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1828174 . For example, the test run https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1044 fails with:

fail [k8s.io/kubernetes/test/e2e/storage/drivers/in_tree.go:1608]: Unexpected error:
    <*errors.errorString | 0xc001495690>: {
        s: "Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {\"error\":{\"code\":\"LocationNotSupportAvailabilityZones\",\"message\":\"The resource 'Microsoft.Compute/disks/e2e-9baa396f-bb14-4553-9998-5d0b677d7d56' does not support availability zones at location 'westus'.\"}}",
    }
    Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {"error":{"code":"LocationNotSupportAvailabilityZones","message":"The resource 'Microsoft.Compute/disks/e2e-9baa396f-bb14-4553-9998-5d0b677d7d56' does not support availability zones at location 'westus'."}}
occurred

So this bug should be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1828174 .

Comment 7 W. Trevor King 2020-05-07 02:51:06 UTC
The PR behind bug 1828174 landed yesterday.  Looking for 4.5 jobs hitting this failure mode in the past 12h [1] turns up [2].  Checking the version of tests in that job:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1053/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "tests").annotations["io.openshift.build.commit.id"]'
f978cdc178b022951c5c6bbdc418d6c957826d9d

Checking vs. origin:

$ git --no-pager log --oneline --first-parent origin/master | grep 'f978cdc17\|a476cd7' 
f978cdc178 Merge pull request #24951 from mfojtik/improve-auth-test
a476cd7100 Merge pull request #24933 from gnufied/azure-zone-e2e

So that run did have the change from bug 1828174, but still timed out trying to delete the volume.

[1]: https://search.apps.build01.ci.devcluster.openshift.com/?search=Persistent+Volume.*not+deleted+by+dynamic+provisioner.*still+exists+within&maxAge=12h&context=1&type=bug%2Bjunit&name=azure-4.5&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1053

Comment 8 Hemant Kumar 2020-05-07 03:01:41 UTC
Before bumping the timeouts, it is probably worth loading the attach/detach call metrics from https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1053/artifacts/e2e-azure/metrics/ and comparing. Another problem is that the e2e tests don't allow fine-tuning of timeouts for specific providers.
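
For anyone doing that comparison, here is a rough sketch of pulling attach/detach latencies out of a Prometheus text-format metrics dump. The storage_operation_duration_seconds histogram and its operation_name label are the upstream controller-manager/kubelet storage metrics; the input file name is only an assumption about how the CI artifact is laid out.

package main

import (
	"fmt"
	"os"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

func main() {
	// Hypothetical file name: a Prometheus text-format dump of
	// kube-controller-manager metrics saved from the artifacts directory above.
	f, err := os.Open("kube-controller-manager-metrics.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(f)
	if err != nil {
		panic(err)
	}

	mf, ok := families["storage_operation_duration_seconds"]
	if !ok {
		fmt.Println("storage_operation_duration_seconds not present in this dump")
		return
	}
	for _, m := range mf.GetMetric() {
		var op string
		for _, l := range m.GetLabel() {
			if l.GetName() == "operation_name" {
				op = l.GetValue()
			}
		}
		// Only the attach/detach operations matter for this comparison.
		if op != "volume_attach" && op != "volume_detach" {
			continue
		}
		printHistogram(op, m.GetHistogram())
	}
}

// printHistogram prints the call count and average duration for one histogram.
func printHistogram(op string, h *dto.Histogram) {
	if h == nil || h.GetSampleCount() == 0 {
		return
	}
	avg := h.GetSampleSum() / float64(h.GetSampleCount())
	fmt.Printf("%s: %d calls, average %.1fs\n", op, h.GetSampleCount(), avg)
}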

Comment 9 W. Trevor King 2020-05-07 03:43:51 UTC
More Azure volume test names for CI-search to match on:

[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (delayed binding)] topology should provision a volume and schedule a pod with AllowedTopologies
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (ext4)] volumes should store data
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (default fs)] volumes should store data
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (ext3)] volumes should store data
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Inline-volume (ext4)] volumes should allow exec of files on the volume
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Inline-volume (ext3)] volumes should allow exec of files on the volume

Comment 10 Jan Safranek 2020-05-07 07:17:05 UTC
> Another problem is - e2e don't allow fine tuning of timeouts for certain providers.

Another problem is that all tests have a hard 15-minute timeout. There is a StatefulSet test that starts/stops pods that use volumes, and it will hit those 15 minutes if we increase the timeouts for individual volume attach / pod start. IMO it's quite a useful test and I would not like to disable it.

[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]

From the last few Azure runs:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1061: 10m38s
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1060: 15m0s (!!)
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1059: 9m57s
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1058: 12m8s
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1057: 10m17s

Comment 12 Fabiano Franz 2020-05-22 20:04:34 UTC
More Azure volume test names for CI-search to match on:

[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (ext3)] volumes should store data

Comment 13 Fabiano Franz 2020-05-26 15:02:28 UTC
More Azure volume test names for CI-search to match on:

[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (default fs)] volumes should store data

Comment 14 Fabiano Franz 2020-05-26 21:33:23 UTC
Since 

1) the underlying issue is not really actionable on our side; Azure being slow on attach/detach is known to Microsoft and, according to this comment [1], they're working on it, but a solution shouldn't be expected before mid-2020;
2) increasing the timeout on attach/detach would result in hitting the hard 15-minute e2e deadline, which is not configurable per platform;

I'd like to advocate for temporarily disabling these tests or, even better depending on the amount of work involved, moving them to a separate set of jobs outside release-openshift-ocp-installer-e2e-azure-4.x.

We have been hit pretty hard by this [2], across multiple tests in most Azure jobs. The reasoning for disabling them is not simply to increase pass rates, but that these failures have been so widespread that they could already be obscuring other legitimate test failures that may be happening. It already happened with the 'westus' issue, which was a separate problem, and there might be others we're missing.

WDYT?


[1] https://github.com/Azure/aks-engine/issues/2888#issuecomment-598693484
[2] https://sippy-bparees.svc.ci.openshift.org/?release=4.5

Comment 15 W. Trevor King 2020-05-26 22:22:08 UTC
I don't have a personal stake in this, but bumping the timeout for this test (even if it's across all platforms, not just Azure) seems reasonable to me. Also, yeah, making these operator-side tests that are not part of the default suite everyone runs would work too. I don't know what the exposure is for these tests, but my outsider guess would be that there are probably not all that many repositories in a position to break these storage tests without also breaking some other test.

Comment 16 Hemant Kumar 2020-05-29 19:01:36 UTC
So we can increase the test timeouts for tests where we have the driver configuration available, but elsewhere it is super ugly to increase the timeout just for Azure (or we would have to increase the timeout for every test regardless of driver). There is another issue: other tests rely on certain operations succeeding or failing within 5 minutes. For example, the default timeout for namespace deletion is 5 minutes, but if an action that involves these volumes takes 7 minutes, 5 minutes may not be enough. So it is a bit of a domino effect and should be considered carefully.

Having said that, this is not the only Azure issue causing e2e failures. We have recently identified that because Azure uses immediate binding, tests will fail if the number of worker and master nodes is not the same: https://bugzilla.redhat.com/show_bug.cgi?id=1838730

I need more time to tune the timeouts carefully to work around the slow attach/detach issue, and I will be fixing this next sprint.
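
To make the trade-off concrete, here is a purely illustrative sketch of the kind of per-provider scaling being discussed. None of these identifiers exist in the e2e framework; it only shows the shape of stretching the pod-start budget for Azure while leaving other providers (and the dependent 5-minute waits) untouched.

package timeouts

import "time"

// slowAttachFactors are hypothetical per-provider multipliers; Azure
// attach/detach is known to be slow, so its budget is stretched the most.
var slowAttachFactors = map[string]float64{
	"azure": 2.0,
	"aws":   1.0,
	"gce":   1.0,
}

// ScaledPodStartTimeout stretches a base timeout (e.g. the 5m pod-start wait)
// for providers with slow volume attach. Any caller that chains off this wait
// (namespace deletion, StatefulSet rollouts) would need the same treatment,
// which is the domino effect described above.
func ScaledPodStartTimeout(provider string, base time.Duration) time.Duration {
	f, ok := slowAttachFactors[provider]
	if !ok {
		f = 1.0
	}
	return time.Duration(float64(base) * f)
}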

Comment 17 Colin Walters 2020-06-09 23:01:10 UTC
https://search.apps.build01.ci.devcluster.openshift.com/?search=volumes+should+allow+exec+of+files+on+the+volume&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
has hits outside Azure, but I am not finding an existing bug.
In particular it hit a few times in MCO PRs recently:

pull-ci-openshift-machine-config-operator-master-e2e-aws - 14 runs, 36% failed, 20% of failures match

Is storage side tracking this somewhere?

Comment 19 Jan Safranek 2020-07-01 15:03:40 UTC
*** Bug 1852598 has been marked as a duplicate of this bug. ***

Comment 21 Steve Kuznetsov 2020-07-22 18:22:06 UTC
More matching:

[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Inline-volume (ext4)] volumes should store data

Comment 23 Maru Newby 2020-08-22 00:22:19 UTC
@hekumar I've proposed an extension of the test timeout for azure in the linked PR.

Comment 28 Red Hat Bugzilla 2023-09-15 00:31:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

