Bug 1913974

Summary: snapshot test periodically failing with "can't open '/mnt/test/data': No such file or directory"
Product: OpenShift Container Platform
Component: Storage
Sub component: Storage
Version: 4.7
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Reporter: jamo luhrsen <jluhrsen>
Assignee: Christian Huffman <chuffman>
QA Contact: Wei Duan <wduan>
CC: aos-bugs, fbertina, jsafrane
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Keywords: Reopened
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-07-27 22:36:03 UTC

Description jamo luhrsen 2021-01-07 21:51:44 UTC
Description of problem:

Found in CI while analyzing 4.7 release health. This job [0] appears to have a new test case
added in the last few weeks:
  External Storage [Driver: ebs.csi.aws.com] [Testpattern: Dynamic Snapshot (retain policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller  should check snapshot fields, check restore correctly works after modifying source data, check deletion

It has failed twice in the past ~20 runs. The stdout from the test case is this:

fail [k8s.io/kubernetes.0/test/e2e/framework/util.go:598]: Unexpected error:
    <exec.CodeExitError>: {
        Err: {
            s: "error running /tmp/shared/kubectl --server=https://api.ci-op-ivzy36s9-7d661.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/shared/admin.kubeconfig --namespace=e2e-snapshotting-488 exec restored-pvc-tester-4p8lh --namespace=e2e-snapshotting-488 -- cat /mnt/test/data:\nCommand stdout:\n\nstderr:\ncat: can't open '/mnt/test/data': No such file or directory\ncommand terminated with exit code 1\n\nerror:\nexit status 1",
        },
        Code: 1,
    }
    error running /tmp/shared/kubectl --server=https://api.ci-op-ivzy36s9-7d661.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/shared/admin.kubeconfig --namespace=e2e-snapshotting-488 exec restored-pvc-tester-4p8lh --namespace=e2e-snapshotting-488 -- cat /mnt/test/data:
    Command stdout:

    stderr:
    cat: can't open '/mnt/test/data': No such file or directory
    command terminated with exit code 1

    error:
    exit status 1
occurred


I noticed https://bugzilla.redhat.com/show_bug.cgi?id=1891459 seemed to be similar, but it is closed, and it
looks like the work there was just to get stdout printed in the logs [1], which is where the output above came from.



[0] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-aws-csi-4.7&grid=old
[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-csi-4.7/1347223175288590336/build-log.txt

Comment 3 Christian Huffman 2021-02-02 21:00:40 UTC
It looks like this job has stabilized. Just to reiterate, I saw no obvious issues in the original test failures. Closing this, as it's no longer consistently reproducible, with only two failures in the past six weeks.

Comment 4 jamo luhrsen 2021-02-02 22:18:27 UTC
Agreed that it is likely happening a lot less frequently now, but it is still happening.
A 14-day search shows it occurring 9 times in the last 14 days. Most of those were on
pre-submit jobs, so they get in the way of folks trying to get a passing CI run on their
pull requests.

Also, every affected job is somehow related to CSI and runs on AWS (it has those strings in its name, at least), so
maybe that is a basic clue?

Here's the search I did:
https://search.ci.openshift.org/?search=cat%3A+can%27t+open+%27%2Fmnt%2Ftest%2Fdata%27%3A+No+such+file+or+directory&maxAge=336h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

And here are all the jobs that showed up:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25848/pull-ci-openshift-origin-master-e2e-aws-csi/1356675660923277312
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25810/pull-ci-openshift-origin-master-e2e-aws-csi/1352388116643581952
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25813/pull-ci-openshift-origin-master-e2e-aws-csi/1352014510130466816
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/510/pull-ci-openshift-kubernetes-master-e2e-aws-csi/1356560051409522688
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/531/pull-ci-openshift-kubernetes-master-e2e-aws-csi/1351768076806262784
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_aws-ebs-csi-driver-operator/111/pull-ci-openshift-aws-ebs-csi-driver-operator-master-e2e-aws-csi/1356554769547661312
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/15187/rehearse-15187-pull-ci-openshift-aws-ebs-csi-driver-master-e2e-aws-csi/1352712267904323584
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-csi-4.7/1352299413669679104
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/15135/rehearse-15135-pull-ci-openshift-aws-ebs-csi-driver-master-e2e-aws-csi/1352300267629973504

Comment 5 Christian Huffman 2021-02-03 15:21:19 UTC
Reopening this per the previous comment. I had been looking at [1] for the test failures. I'll investigate the test failures reported in the previous comment as well.

[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-aws-csi-4.7&grid=old

Comment 7 Jan Safranek 2021-04-19 15:50:01 UTC
I just reproduced this, and I'm filling the observed timestamps into the test steps:

> 1. Create a PVC and a Pod that consumes it.
> 2. Write a file to /mnt/test/data that records the Pod's namespace.

Expanding on step 2, the test does the following (a rough sketch of the writer pod is below):
2.1 Create a Pod that writes to /mnt/test/data
2.2 Delete the Pod.
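
For context, a minimal sketch of what the step-2.1 writer pod looks like in spirit, using client-go types; the names, image, and claim wiring here are assumptions for illustration, not the upstream test code:

package snapshotrace

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// writerPod mounts the PVC under /mnt/test and writes the pod's namespace to
// /mnt/test/data. The data is written, but nothing forces it out of the page
// cache before the pod is deleted and the snapshot is requested.
func writerPod(namespace, pvcName string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "pvc-tester-",
			Namespace:    namespace,
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "writer",
				Image:   "busybox",
				Command: []string{"sh", "-c", "echo " + namespace + " > /mnt/test/data"},
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "data",
					MountPath: "/mnt/test",
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: pvcName,
					},
				},
			}},
		},
	}
}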

In the driver logs on the node, I can see:

I0419 14:32:54.508906       1 node.go:393] NodeUnpublishVolume: called with args {VolumeId:vol-006a875808eaf3905 TargetPath:/var/lib/kubelet/pods/0b4cca95-3fea-4bac-889e-9d1d65afb5f3/volumes/kubernetes.io~csi/pvc-3f6e57e7-0c3b-44d9-b07b-425536cd307b/mount XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0419 14:32:54.620054       1 node.go:222] NodeUnstageVolume: called with args {VolumeId:vol-006a875808eaf3905 StagingTargetPath:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3f6e57e7-0c3b-44d9-b07b-425536cd307b/globalmount XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}


> 3. Create a VolumeSnapshot from the PVC.

In the driver logs (in the controller pod) I can see:
I0419 14:32:54.326754       1 controller.go:471] CreateSnapshot: called with args source_volume_id:"vol-006a875808eaf3905" name:"snapshot-47e60f81-a81b-44f7-a9c8-fdb2b0a10894" 
I0419 14:32:54.750763       1 controller.go:471] CreateSnapshot: called with args source_volume_id:"vol-006a875808eaf3905" name:"snapshot-47e60f81-a81b-44f7-a9c8-fdb2b0a10894" 
I0419 14:32:54.826142       1 controller.go:490] Snapshot snapshot-47e60f81-a81b-44f7-a9c8-fdb2b0a10894 of volume vol-006a875808eaf3905 already exists; nothing to do

CreateSnapshot was called *before* the CSI driver unmounted the volume in step 2.2 (NodeUnpublishVolume/NodeUnstageVolume). At that point, the node may not have written its caches to the volume, so the snapshot can contain an empty ext4 volume.
Or the clocks are out of sync - the difference is only 0.2s, and Pods should not be deleted from the API server until at least NodeUnstageVolume returns.

Comment 8 Jan Safranek 2021-04-19 15:51:57 UTC
Correction: NodeUnpublishVolume should succeed before a Pod is deleted.
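
In other words, once the Pod object is gone from the API server, only NodeUnpublishVolume is guaranteed to have completed. A minimal client-go sketch of that wait follows; the helper name and timeouts are illustrative, not the upstream test code:

package snapshotrace

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deletePodAndWait deletes the writer pod and waits until the Pod object is gone.
// At that point NodeUnpublishVolume has succeeded, but NodeUnstageVolume (and the
// unmount that flushes the filesystem cache) may still be pending, which is the
// window in which CreateSnapshot was observed above.
func deletePodAndWait(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	if err := cs.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		_, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // pod gone: NodeUnpublish done, NodeUnstage not guaranteed
		}
		return false, err
	})
}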

Comment 9 Christian Huffman 2021-04-20 13:50:27 UTC
I've spoken with Jan further, and he appears to be correct. Forcing a `sync` to occur remedies this issue in my tests. With no changes we can reproduce this failure over 50% of the time; after adding a `sync` or forcing a wait until NodeUnstageVolume completes, I do not see it recurring.

I've filed [1] to track the issue upstream. We already had logic in the test that waited for NodeUnstageVolume to complete in Windows environments, and I've required this for all distros upstream in [2].

[1] https://github.com/kubernetes/kubernetes/issues/101279
[2] https://github.com/kubernetes/kubernetes/pull/101280
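
For illustration, a rough sketch of that kind of wait, assuming (purely for this sketch, not necessarily what the upstream PR does) that the volume disappearing from the node's status.volumesInUse is taken as the signal that NodeUnstageVolume has completed:

package snapshotrace

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForVolumeUnstaged polls the node's status until the CSI volume no longer
// appears in status.volumesInUse, i.e. kubelet has finished unmounting and
// unstaging it. The unique-name format and the choice of volumesInUse as the
// signal are assumptions of this sketch.
func waitForVolumeUnstaged(ctx context.Context, cs kubernetes.Interface, nodeName, driverName, volumeHandle string) error {
	// CSI volumes are reported as "kubernetes.io/csi/<driver>^<volumeHandle>".
	unique := fmt.Sprintf("kubernetes.io/csi/%s^%s", driverName, volumeHandle)
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, v := range node.Status.VolumesInUse {
			if string(v) == unique {
				return false, nil // volume still mounted/staged on the node
			}
		}
		return true, nil // unstaged; it is now safe to take the snapshot
	})
}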

Comment 10 Fabio Bertinatto 2021-04-23 08:34:00 UTC
Just a reminder:

I'm disabling snapshot tests for Azure CSI in this PR: https://github.com/openshift/azure-disk-csi-driver-operator/pull/17. Once this issue is fixed, please re-enable them.

Comment 13 Christian Huffman 2021-05-14 18:59:13 UTC
I'm not certain what changed between my initial testing and now, but the patch that was merged in https://github.com/kubernetes/kubernetes/pull/101280 doesn't appear to be fully sufficient to prevent this issue, as these tests are still failing. Adding a `sync` to the original command (to force the cache write) does prevent the issue, but from previous discussions we don't want to take this route.

Alternatively, I've also had success adding a brief sleep immediately after we wait for the node to unstage the volume, which gives the cache write more time to complete. I've submitted [1] to add a 5-second sleep at that point; with it, the test has repeatedly passed without issues.

[1] https://github.com/kubernetes/kubernetes/pull/102021
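
For reference, the `sync` variant discussed above amounts to something like the following change to the writer command; this is a sketch with an assumed helper name, not the actual test diff:

package snapshotrace

// writerCommand builds the data-writer container command. With doSync set, the
// page cache is flushed before the container exits, so the data is on the volume
// before any snapshot is cut. Illustrative helper, not the actual e2e test code.
func writerCommand(namespace string, doSync bool) []string {
	cmd := "echo " + namespace + " > /mnt/test/data"
	if doSync {
		cmd += " && sync"
	}
	return []string{"sh", "-c", cmd}
}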

Comment 15 Christian Huffman 2021-05-25 12:16:10 UTC
It looks like we were using `pv.Name` instead of `volumeName` when waiting for the NodeUnpublish calls. For CSI volumes, which are the only ones these tests run against, we should use `volumeName` exclusively. The upstream PR has merged, and https://github.com/openshift/kubernetes/pull/769 has been submitted to fix this downstream.
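
To illustrate the distinction, assuming `volumeName` here refers to the CSI volume handle (the ID the driver logs, e.g. vol-006a875808eaf3905 above, as opposed to the PV name pvc-3f6e57e7-...), a minimal client-go sketch of retrieving it from the bound PV:

package snapshotrace

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// csiVolumeHandle returns the identifier the CSI driver itself reports for a
// bound PV. A wait keyed on pv.Name never matches what the driver/node reports
// for a CSI volume. Sketch only.
func csiVolumeHandle(ctx context.Context, cs kubernetes.Interface, pvName string) (string, error) {
	pv, err := cs.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	if pv.Spec.CSI == nil {
		return "", fmt.Errorf("PV %s is not a CSI volume", pvName)
	}
	return pv.Spec.CSI.VolumeHandle, nil
}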

Comment 18 Wei Duan 2021-06-03 10:45:36 UTC
Did not hit the error in 4.8 nightly runs over the last five days; verified as passing.

Comment 21 errata-xmlrpc 2021-07-27 22:36:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438