Description of problem:
When an NFSv4 share from a server is mounted on a RHEL5 client, the file system exhibits odd behaviour, most visibly through the /proc/self/exe symlink. After an fstat on a file from an NFSv4 share, the string ' (deleted)' is appended to what the file system thinks the file's name is. Thus, /proc/self/exe for '/path/to/process' points to '/path/to/process (deleted)', which is an invalid symlink target.

This is particularly problematic when running programs from an NFSv4 share, because programs such as the Java JDK use /proc/self/exe to work out where they are installed, and fail when ' (deleted)' is appended. A temporary workaround is to create a copy of the file with ' (deleted)' appended to its name, so that the symlink target exists, but this is far from ideal.

The problem is on the client side, not the server side, and appears to be in the kernel (see Additional info). A Fedora 10 client can mount the server without problems. Likewise, mounting a Fedora 10 server (or any other NFSv4 server) on a RHEL client produces the same problem.

How reproducible:
Always on a RHEL5 client, across various versions and configurations of Red Hat.

Steps to Reproduce:
1. Set up an NFSv4 server and export, for example, /test.
2. Mount /test on a RHEL5 client using NFSv4.
3. Execute a program that demonstrates the broken behaviour (see below).

Demonstrating the broken behaviour:
The broken behaviour can easily be demonstrated with a simple C program:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main(void)
    {
        char buf[1024];
        /* readlink() does not NUL-terminate, so leave room for '\0' */
        ssize_t len = readlink("/proc/self/exe", buf, sizeof(buf) - 1);
        if (len < 0) {
            perror("readlink");
            return 1;
        }
        buf[len] = '\0';
        printf("/proc/self/exe = %s\n", buf);
        return 0;
    }

Compiling and running the program produces the following output (note how it alternates between the correct and the broken response):

    test@tester /test $ ./a.out
    /proc/self/exe = /test/a.out (deleted)
    test@tester /test $ ./a.out
    /proc/self/exe = /test/a.out
    test@tester /test $ ./a.out
    /proc/self/exe = /test/a.out (deleted)
    test@tester /test $ ./a.out
    /proc/self/exe = /test/a.out
    test@tester /test $ ./a.out
    /proc/self/exe = /test/a.out (deleted)

The broken behaviour can also be seen with Java. For example, extract the Java JDK into /test and attempt to run 'java' or 'javac'. The first execution fails; subsequent executions succeed until the file is stat'ed, after which the next execution fails again.

    test@tester test $ ./java -version
    execve(): No such file or directory
    Error trying to exec test/java (deleted). Check if file exists and permissions are set correctly.
    test@tester test $ ./java -version
    java version "1.6.0_14"
    Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
    Java HotSpot(TM) Client VM (build 14.0-b16, mixed mode)
    test@tester test $ ./java -version
    java version "1.6.0_14"
    Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
    Java HotSpot(TM) Client VM (build 14.0-b16, mixed mode)
    test@tester test $ ./java -version
    java version "1.6.0_14"
    Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
    Java HotSpot(TM) Client VM (build 14.0-b16, mixed mode)
    test@tester test $ stat ./java > /dev/null
    test@tester test $ ./java -version
    execve(): No such file or directory
    Error trying to exec test/java (deleted). Check if file exists and permissions are set correctly.
    test@tester test $ ./java -version
    java version "1.6.0_14"
    Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
    Java HotSpot(TM) Client VM (build 14.0-b16, mixed mode)

---

Actual results:
/path/to/file (deleted)

Expected results:
/path/to/file

Additional info:
When the problem occurs (i.e. when ' (deleted)' is appended), the client does not appear to send any request to the NFSv4 server. On the next attempt, a request is made and the result is correct (no ' (deleted)'). Nothing else in the NFS server or client logs obviously differs from a working system.

The problem appears to be in the kernel, not in the userspace nfs-utils: using the same nfs-utils that works on a Fedora 10 system still results in the same problem on the RHEL system. The problem does not occur on a Fedora 10 client mounting the same server.
As an additional comment, the problem only occurs when mounting with NFSv4. Mounting with an earlier version of NFS works fine.
Looks like the ' (deleted)' suffix is probably being appended by __d_path(). That string means the dentry has been unhashed. NFSv4 does some d_drops in the open codepath, so that's probably where this needs to be resolved.
This is also fixed in rawhide. Trawling through the changesets to see if anything stands out at me...
...also not a problem in RHEL4.
...and only occurs on RHEL5 on every 2nd run of the reproducer.
I believe this is a regression that was probably introduced by the patch for bug 321111. -77.el5 doesn't show this behavior, -79.el5 does. The only NFS patch that went in during that period was the patch for bug 321111.
Created attachment 358236: patch -- fix regression in nfs_open_revalidate

I think I found the bug. This seems to fix it, and a quick run with cthon didn't show any regressions. I'll add it to my test kernels to get it some testing exposure, but I think it's correct.
I've added the above patch to my test kernels here: http://people.redhat.com/jlayton ...could you test it and report back whether it fixes the issue for you (and whether you see any other problems)?
Thank you for your work in fixing this bug. I have tested the test kernel you supplied (kernel-2.6.18-164.el5.jtltest.84.x86_64.rpm) and can confirm that the problem seems to be gone. No unexpected problems arose: NFSv4 worked as expected, and I was unable to reproduce the problem. The patch seems to fix the problem with no other side effects. Many thanks for your time.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
The fix is in kernel-2.6.18-168.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug once per request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html