Description of problem:

Found in CI while analyzing 4.7 release health. This job [0] appears to have a new test case added in the last few weeks:

  External Storage [Driver: ebs.csi.aws.com] [Testpattern: Dynamic Snapshot (retain policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works after modifying source data, check deletion

It has failed twice in the past ~20 runs. stdout from the test case is this:

fail [k8s.io/kubernetes.0/test/e2e/framework/util.go:598]: Unexpected error:
    <exec.CodeExitError>: {
        Err: {
            s: "error running /tmp/shared/kubectl --server=https://api.ci-op-ivzy36s9-7d661.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/shared/admin.kubeconfig --namespace=e2e-snapshotting-488 exec restored-pvc-tester-4p8lh --namespace=e2e-snapshotting-488 -- cat /mnt/test/data:\nCommand stdout:\n\nstderr:\ncat: can't open '/mnt/test/data': No such file or directory\ncommand terminated with exit code 1\n\nerror:\nexit status 1",
        },
        Code: 1,
    }
error running /tmp/shared/kubectl --server=https://api.ci-op-ivzy36s9-7d661.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/shared/admin.kubeconfig --namespace=e2e-snapshotting-488 exec restored-pvc-tester-4p8lh --namespace=e2e-snapshotting-488 -- cat /mnt/test/data:
Command stdout:

stderr:
cat: can't open '/mnt/test/data': No such file or directory
command terminated with exit code 1

error:
exit status 1
occurred

I noticed https://bugzilla.redhat.com/show_bug.cgi?id=1891459 seemed similar, but it is closed, and it looks like the work there was just to get stdout printed in the logs [1], which is where the above came from.

[0] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-aws-csi-4.7&grid=old
[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-csi-4.7/1347223175288590336/build-log.txt
It looks like this job has stabilized. Just to reiterate, I saw no obvious issues in the original test failures. Closing this, as it's no longer consistently reproducible, with only two failures in the past six weeks.
Agreed that it is likely happening a lot less frequently now, but it is still happening: a 14-day search shows it occurring 9 times in the last 14 days. Most of those were pre-submit jobs, so it is getting in the way of folks trying to get a passing CI run on their pull requests. Also, every affected job is somehow related to CSI and runs on AWS (it has those strings in its name, at least), so maybe that is a basic clue?

Here's the search I did:
https://search.ci.openshift.org/?search=cat%3A+can%27t+open+%27%2Fmnt%2Ftest%2Fdata%27%3A+No+such+file+or+directory&maxAge=336h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

And all the jobs that showed up:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25848/pull-ci-openshift-origin-master-e2e-aws-csi/1356675660923277312
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25810/pull-ci-openshift-origin-master-e2e-aws-csi/1352388116643581952
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25813/pull-ci-openshift-origin-master-e2e-aws-csi/1352014510130466816
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/510/pull-ci-openshift-kubernetes-master-e2e-aws-csi/1356560051409522688
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/531/pull-ci-openshift-kubernetes-master-e2e-aws-csi/1351768076806262784
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_aws-ebs-csi-driver-operator/111/pull-ci-openshift-aws-ebs-csi-driver-operator-master-e2e-aws-csi/1356554769547661312
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/15187/rehearse-15187-pull-ci-openshift-aws-ebs-csi-driver-master-e2e-aws-csi/1352712267904323584
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-csi-4.7/1352299413669679104
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/15135/rehearse-15135-pull-ci-openshift-aws-ebs-csi-driver-master-e2e-aws-csi/1352300267629973504
Reopening this per the previous comment. I had been looking at [1] for the test failures. I'll investigate the failures reported in the previous comment as well.

[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-aws-csi-4.7&grid=old
I just reproduced this and am filling the times into the test steps:

> 1. Create a PVC and a Pod that consumes it.
> 2. Write a file to /mnt/test/data that records the Pod's namespace.

Expanding on step 2, it does:

2.1 Create a Pod that writes to /mnt/test/data.
2.2 Delete the Pod.

In the driver logs on the node, I can see:

I0419 14:32:54.508906 1 node.go:393] NodeUnpublishVolume: called with args {VolumeId:vol-006a875808eaf3905 TargetPath:/var/lib/kubelet/pods/0b4cca95-3fea-4bac-889e-9d1d65afb5f3/volumes/kubernetes.io~csi/pvc-3f6e57e7-0c3b-44d9-b07b-425536cd307b/mount XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0419 14:32:54.620054 1 node.go:222] NodeUnstageVolume: called with args {VolumeId:vol-006a875808eaf3905 StagingTargetPath:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3f6e57e7-0c3b-44d9-b07b-425536cd307b/globalmount XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

> 3. Create a VolumeSnapshot from the PVC.

In the driver logs (in the controller pod) I can see:

I0419 14:32:54.326754 1 controller.go:471] CreateSnapshot: called with args source_volume_id:"vol-006a875808eaf3905" name:"snapshot-47e60f81-a81b-44f7-a9c8-fdb2b0a10894"
I0419 14:32:54.750763 1 controller.go:471] CreateSnapshot: called with args source_volume_id:"vol-006a875808eaf3905" name:"snapshot-47e60f81-a81b-44f7-a9c8-fdb2b0a10894"
I0419 14:32:54.826142 1 controller.go:490] Snapshot snapshot-47e60f81-a81b-44f7-a9c8-fdb2b0a10894 of volume vol-006a875808eaf3905 already exists; nothing to do

CreateSnapshot was called *before* the CSI driver unmounted the volume in step 2.2 (NodeUnpublishVolume/NodeUnstageVolume). At that point, the node may not have written its caches to the volume, so the snapshot can contain an empty ext4 volume. Or the clocks are out of sync; but the difference is only 0.2s, and Pods should not be deleted from the API server until at least NodeUnstageVolume returns.
Correction: NodeUnpublishVolume should succeed before a Pod is deleted.
I've spoken with Jan further, and he appears to be correct. Forcing a `sync` to occur remedies this issue in my tests. With no changes we can reproduce this failure over 50% of the time, and after adding a `sync` or forcing a wait until NodeUnstageVolume occurs I do not see it reoccurring. I've filed [1] to track the issue upstream. We already had logic in the test that waited for NodeUnstageVolume to complete in Windows environments, and I've required this for all distros upstream in [2]. [1] https://github.com/kubernetes/kubernetes/issues/101279 [2] https://github.com/kubernetes/kubernetes/pull/101280
Just a reminder: I'm disabling snapshot tests for Azure CSI in this PR: https://github.com/openshift/azure-disk-csi-driver-operator/pull/17. Once this issue is fixed, please re-enable them.
I'm not certain what changed between my initial testing and now, but the patch merged in https://github.com/kubernetes/kubernetes/pull/101280 doesn't appear to be fully sufficient to prevent this issue, as these tests are still failing. Adding a `sync` to the original command (to force the cache write) does prevent the issue, but from previous discussions we don't want to take that route. Alternatively, I've also had success adding a brief sleep immediately after we wait for the node to unstage the volume, giving the cache write more time to conclude. I've submitted [1] to add a 5-second sleep after we ensure the node has unstaged the volume. So far the test has repeatedly passed after simply sleeping for 5 seconds.

[1] https://github.com/kubernetes/kubernetes/pull/102021
It looks like we were using `pv.Name` instead of `volumeName` when waiting for the NodeUnpublish calls. For CSI volumes, which are the only ones these tests run against, we should exclusively use `volumeName`. The upstream PR has merged, and https://github.com/openshift/kubernetes/pull/769 has been submitted to fix this downstream.
Did not encounter the error in 4.8 nightly builds over the last five days; verified as passing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438