
Bug 1900239

Summary: Skip "subPath should be able to unmount" NFS test
Product: OpenShift Container Platform
Reporter: Vadim Rutkovsky <vrutkovs>
Component: Storage
Assignee: Jan Safranek <jsafrane>
Storage sub component: Kubernetes
QA Contact: Wei Duan <wduan>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: high
CC: acaringi, airlied, aos-bugs, bskeggs, eparis, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jokerman, jonathan, josef, jsafrane, kernel-maint, lgoncalv, linville, masami256, mchehab, mjg59, ngompa13, piqin, steved, tbarron, walters
Version: 4.7
Keywords: Regression, Reopened
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1854379
Clones: 1900241 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:35:07 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1900241

Description Vadim Rutkovsky 2020-11-21 16:59:47 UTC
+++ This bug was initially created as a clone of Bug #1854379 +++

OKD promotion jobs are tracking the `testing-devel` stream, which regularly gets fresh kernels.

Recently a single storage test related to NFS has started failing - see https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/promote-release-openshift-okd-machine-os-content-e2e-aws-4.5/1280012242783309824

The latest test pass was https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/promote-release-openshift-okd-machine-os-content-e2e-aws-4.5/1278975437036326912 on Jul 02, using image https://builds.coreos.fedoraproject.org/prod/streams/testing-devel/builds/32.20200702.20.0/x86_64/meta.json. Package diff: https://github.com/coreos/fedora-coreos-config/compare/03c1fe0d1b09db7494e240a535fe8ce3fde18ab6...190b592282cc89d2f50045033763f65c492cbc8d

It appears this is most likely related to the kernel 5.7.7 update - see https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.7.7

--- Additional comment from Jan Safranek on 2020-07-13 14:42:19 UTC ---

Reproduced with kernel 5.7.8-200.fc32.x86_64. The log below is reformatted for readability.

E0713 14:30:19.540124    5301 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/nfs/15bc6521-d0be-459b-8e1b-37e307510db9-pvc-cb938e62-792c-41a2-b7a8-ec9650c97d6e podName:15bc6521-d0be-459b-8e1b-37e307510db9 nodeName:}" failed. No retries permitted until 2020-07-13 14:32:21.54008141 +0000 UTC m=+944.066210092 (durationBeforeRetry 2m2s).

Error: "error cleaning subPath mounts for volume \"test-volume\" (UniqueName: \"kubernetes.io/nfs/15bc6521-d0be-459b-8e1b-37e307510db9-pvc-cb938e62-792c-41a2-b7a8-ec9650c97d6e\") pod \"15bc6521-d0be-459b-8e1b-37e307510db9\" (UID: \"15bc6521-d0be-459b-8e1b-37e307510db9\") :
error processing /var/lib/kubelet/pods/15bc6521-d0be-459b-8e1b-37e307510db9/volume-subpaths/pvc-cb938e62-792c-41a2-b7a8-ec9650c97d6e/test-container-subpath-dynamicpv-wjts:
error cleaning subpath mount /var/lib/kubelet/pods/15bc6521-d0be-459b-8e1b-37e307510db9/volume-subpaths/pvc-cb938e62-792c-41a2-b7a8-ec9650c97d6e/test-container-subpath-dynamicpv-wjts/0:
  unmount failed: exit status 16
  Unmounting arguments: /var/lib/kubelet/pods/15bc6521-d0be-459b-8e1b-37e307510db9/volume-subpaths/pvc-cb938e62-792c-41a2-b7a8-ec9650c97d6e/test-container-subpath-dynamicpv-wjts/0
  Output: umount.nfs4: /var/lib/kubelet/pods/15bc6521-d0be-459b-8e1b-37e307510db9/volume-subpaths/pvc-cb938e62-792c-41a2-b7a8-ec9650c97d6e/test-container-subpath-dynamicpv-wjts/0: Stale file handle
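
As an aside, leftover subPath bind mounts like the one above can be listed directly on the node. A sketch; the path pattern follows the kubelet layout visible in the log, and nothing here is from the original report:

```
# Sketch: list subPath bind mounts still present under the kubelet directory
findmnt -rn -o TARGET,FSTYPE,SOURCE | grep '/volume-subpaths/'
```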

Brief summary of the test (a minimal manifest sketch follows the steps):
1. Create a pod with an NFS volume and 2 containers:
  - The first uses a subdirectory of the PV as a subPath
  - The second uses the whole volume

2. Exec into the second container and remove the subPath directory.

3. Delete the pod.
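
For illustration, a minimal pod of that shape might look like the sketch below. All names, the image, and the PVC are hypothetical stand-ins; the actual e2e test builds these objects programmatically.

```
# Sketch only: assumes an NFS-backed PVC named "test-pvc" already exists.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: subpath-repro
spec:
  containers:
  - name: subpath-user            # mounts only a subdirectory via subPath
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: test-volume
      mountPath: /test
      subPath: subdir
  - name: full-volume-user        # mounts the whole volume
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: test-volume
      mountPath: /whole
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: test-pvc
EOF

# Step 2: remove the subPath directory through the whole-volume container
kubectl exec subpath-repro -c full-volume-user -- rm -rf /whole/subdir

# Step 3: delete the pod; on affected kernels the kubelet's subPath cleanup
# fails with "Stale file handle" (see the log above)
kubectl delete pod subpath-repro
```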

--- Additional comment from Jan Safranek on 2020-07-17 10:41:01 UTC ---

Starting with 5.7.x, the kernel does not allow users to unmount NFS mounts that report "Stale file handle".

I tested with 5.7.4-200.fc32.x86_64, the first 5.7.x kernel in Fedora 32.

Steps to reproduce (basically, get a "Stale file handle" error on a bind-mounted NFS dir):

1. Use this dummy /etc/exports:
/var/tmp 127.0.0.1(rw,sync,all_squash,anonuid=1000)

2. Mount it to /mnt/test:
$ mkdir /mnt/test
$ mount localhost:/var/tmp /mnt/test

3. Bind-mount a subdirectory of it to /mnt/test2:
$ mkdir /mnt/test/reproduce
$ mkdir /mnt/test2
$ mount --bind /mnt/test/reproduce /mnt/test2

4. Remove the bind-mounted dir
$ rmdir /mnt/test/reproduce

5. Check that /mnt/test2 is not happy about that
$ ls /mnt/test2
ls: cannot access '/mnt/test2': Stale file handle

This is expected.

6. Try to unmount /mnt/test2
$ umount /mnt/test2
umount.nfs4: /mnt/test2: Stale file handle

This is not expected! There is no way to unmount the directory; it stays mounted forever. Even reboot gets stuck.


With kernel-core-5.6.19-300.fc32.x86_64 (the last 5.6.x in Fedora 32), step 6 succeeds.

--- Additional comment from Vadim Rutkovsky on 2020-07-23 07:37:29 UTC ---

Steve, could you have a look? This is reproducible with the latest 5.7.x kernel in F32.

--- Additional comment from Vadim Rutkovsky on 2020-08-20 07:48:56 UTC ---

No longer happening in 5.7.15-200.fc32.x86_64

--- Additional comment from Vadim Rutkovsky on 2020-09-23 13:53:51 UTC ---

I was wrong - the test didn't pass; it was skipped instead.

The issue still occurs on 5.8.10-200.fc32.x86_64

--- Additional comment from Colin Walters on 2020-09-23 14:20:11 UTC ---

Probably the best way to get traction on this is to bisect it and report it to the linux-nfs@ mailing list: https://linux-nfs.org/wiki/index.php/Main_Page

Speaking with a very broad brush, Fedora kernel BZs are mostly triaged to upstream bugs, and that's the best way to address them.
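
Such a bisect between the last known-good and first known-bad kernels mentioned in this bug might look like the sketch below; the tree URL, tags, and workflow are standard kernel practice, not something prescribed in this report.

```
# Sketch: bisect the stable tree between v5.6.19 (good) and v5.7.4 (bad),
# the boundary kernels identified in the comments above
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git bisect start v5.7.4 v5.6.19     # bad revision first, then good
# At each step: build and boot the kernel, run the bind-mount reproducer
# from the steps above, then mark the result
git bisect good                     # or: git bisect bad
```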

--- Additional comment from Colin Walters on 2020-09-23 14:30:56 UTC ---

Looking at the changes and the code, at a vague guess this may be related to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=779df6a5480f1307d51b66ea72352be592265cad
Specifically,
```
	if (ctx->clone_data.sb) {
		if (d_inode(fc->root)->i_fop != &nfs_dir_operations) {
			error = -ESTALE;
```

has the right appearance for this problem at least.
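
If someone wants to confirm that this commit is actually present in the first failing kernels, git can answer that directly. A sketch, assuming a local clone whose history includes the v5.7.x tags:

```
# Sketch: check whether the suspect commit is an ancestor of v5.7.4
git merge-base --is-ancestor \
    779df6a5480f1307d51b66ea72352be592265cad v5.7.4 \
  && echo "commit is contained in v5.7.4"
```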

--- Additional comment from Vadim Rutkovsky on 2020-10-09 15:29:57 UTC ---

Still occurs on 5.8.10-200.fc32.x86_64 (same test failing in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/promote-release-openshift-okd-machine-os-content-e2e-gcp-4.5/1314504432246853632)

--- Additional comment from Vadim Rutkovsky on 2020-11-12 15:57:22 UTC ---

kernel-5.9.8-200.fc33 from updates-testing is also affected

Comment 3 Jan Safranek 2021-01-05 15:05:07 UTC
*** Bug 1912720 has been marked as a duplicate of this bug. ***

Comment 4 Jan Safranek 2021-01-05 15:06:13 UTC
*** Bug 1912906 has been marked as a duplicate of this bug. ***

Comment 5 Jan Safranek 2021-01-05 15:07:28 UTC
The "fix" was to disable the faulty test in CI.

Comment 6 Wei Duan 2021-01-07 01:35:55 UTC
Jan, I see the PR disables this case:
"[Top Level] [sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should be able to unmount after the subpath directory is deleted": "should be able to unmount after the subpath directory is deleted [Disabled:Broken] [Suite:k8s]",

But I can still find it:
https://search.ci.openshift.org/?search=In-tree+Volumes+%5C%5BDriver%3A+nfs%5C%5D+%5C%5BTestpattern%3A+Inline-volume+%5C%28default+fs%5C%29%5C%5D+subPath+should+be+able+to+unmount+after+the+subpath+directory+is+deleted&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 7 Jan Safranek 2021-01-12 12:06:54 UTC
There are two issues:

1. This test was disabled:
> "[Top Level] [sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should be able to unmount after the subpath directory is deleted": "should be able to unmount after the subpath directory is deleted [Disabled:Broken] [Suite:k8s]",

But the search linked in the previous comment looks for "Testpattern: Inline-volume (default fs)", which is a different test. That one is skipped too, but later in the process: it goes through the common test initialization (i.e., a namespace is created and the number of Ready nodes is checked), then discovers it's an NFS test and that NFS supports dynamic provisioning, and skips itself (so the feature is tested only once rather than for every pre-provisioned/inline/dynamic PV combination, to save time). What you see in the search are errors from the test initialization itself, usually because the whole cluster is not really working ("connection refused" from the API server, "timeout", and similar errors). This is not related to NFS or storage in any way.

2. Searching for the disabled test "[Driver: nfs] [Testpattern: Dynamic PV (default fs)]" over the past 7 days in "4.7" or "master", I can see it failed only once, in a job called "periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere": https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1346818807812853760

The job name contains both "master" and "4.6", which is quite confusing, but the source says it's 4.6: https://github.com/openshift/release/blob/master/ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml#L1581

Comment 8 Wei Duan 2021-01-12 13:10:39 UTC
Sorry for the bad query; I copied and pasted the wrong thing.
I checked and there is no 4.7 CI running this case.
Marking it as VERIFIED.

Comment 10 errata-xmlrpc 2021-02-24 15:35:07 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633