Description of problem:
NFSv4 file locking fails when running the 2.6.18-164.el5 kernel on our primary home directory server but works fine if I revert to the previous 2.6.18-128.7.1.el5 kernel.
How reproducible: Always
Steps to Reproduce:
1. Boot to 2.6.18-164.el5 on the NFSv4 server.
2. On an NFSv4 client, things relying on file locking fail.
3. Boot to 2.6.18-128.7.1.el5 on the NFSv4 server.
4. On the NFSv4 client, all is well again.
We rebooted our primary home directory server today to the latest
RHEL5.4 kernel (2.6.18-164.el5) and things went very badly. It looks
like file locking was broken with the symptoms being that
firefox/thunderbird were failing in ways I've seen before when file
locking was misbehaving. Also, ssh connections were complaining about
being unable to lock .Xauthority. We rebooted back to the previous
kernel (2.6.18-128.7.1.el5) and that solved the problem. We were able
to verify that the exact same thing happened to another similarly
configured system so it wasn't limited to this one machine. In both
cases, the only change required to fix it was to revert to the previous
kernel.
I should note that the nfsv4 *client* systems are running a variety of
kernels (including 2.6.18-164) and are all working fine now so the only
issue seems to be with 2.6.18-164 on the servers.
I'm afraid we weren't able to do much debugging in the heat of the
battle, so we don't have any more information than this. If further debugging is needed, I could bring up a test server that we can play with.
We see this too. vim is unusable. Rolling the server kernel back to 2.6.18-128.7.1.el5 makes the problem go away.
Are there any error messages being logged to /var/log/messages?
I just tested this on my rhel5 test box and didn't see an issue...
A bit more info would be helpful. We need to understand what's happening at the system call level. You say:
"on an nfsv4 client, things relying on file locking fail"
...what's happening here, exactly? Are fcntl calls returning errors when they shouldn't? An strace of such a program would be helpful.
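To narrow down whether fcntl itself is failing, a small probe run against a file on the NFSv4 mount might help. The following is only a minimal sketch, assuming a Python 3 interpreter is available on the client; the function name `probe_fcntl_lock` and the default file name are made up for illustration and are not part of any tool mentioned in this report.

```python
import errno
import fcntl
import os
import sys
import tempfile


def probe_fcntl_lock(path):
    """Try a non-blocking exclusive POSIX lock on path.

    Returns (True, None) on success, or (False, errno_name) if the
    lock request fails.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # lockf() is implemented on top of fcntl() byte-range locks.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True, None
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical default; pass a path on the NFSv4 mount instead.
    default = os.path.join(tempfile.gettempdir(), "locktest.tmp")
    target = sys.argv[1] if len(sys.argv) > 1 else default
    ok, err = probe_fcntl_lock(target)
    print("lock ok" if ok else "lock failed: %s" % err)
```

Running it with a path inside the NFSv4-mounted home directory, once against each server kernel, would show whether the lock call returns an error and which errno it reports.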
Just wondering if the issue seen here is actually the regression seen in bz 524520.
I don't see anything of interest being logged on the server or the client. However, I have a test server and client set up so it would be trivial for me to enable any type of debugging that might give useful information. Just let me know.
BTW, the simplest demonstration of the problem I have is to just ssh to a client that is getting my homedir via nfsv4 from a server running 2.6.18-164. I have ForwardX11 set so it looks like it tries to lock the .Xauthority and fails:
[robh@robwilco robh]$ ssh test
Last login: Thu Oct 8 12:09:46 2009 from robwilco.cs.indiana.edu
/usr/bin/xauth: error in locking authority file /u/robh/.Xauthority
I will attach a tcpdump showing the client<->server nfs traffic for one such ssh login.
Created attachment 364152
In this dump, the nfsv4 server is curie.cs.indiana.edu (126.96.36.199) and the client is test.cs.indiana.edu (188.8.131.52). This was captured on the server with:
tcpdump -s 0 -w /tmp/nfsv4locks.pcap host test.cs.indiana.edu
while I logged into test via ssh.
Created attachment 364158
strace of sshd at login
I generated this strace by running the following on the client while I logged in via ssh:
strace -f -v -o /tmp/strace.out -p PID_OF_SSHD
Also of note is that after the login and the error about locking .Xauthority I'm left with the following in my homedir:
---------- 1 robh staff 0 Jan 14 1970 .Xauthority-c
Perhaps this is similar to bz 524520... ???
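The leftover zero-mode, 1970-dated .Xauthority-c file suggests a quick check of what a freshly created file looks like on the mount. This is a rough sketch under an assumption the thread does not confirm, namely that the file comes from an exclusive create (O_CREAT|O_EXCL). The function name `check_exclusive_create`, the probe file name, and the one-hour timestamp tolerance are all made up for illustration, and Python 3 is assumed.

```python
import os
import stat
import sys
import tempfile
import time


def check_exclusive_create(directory):
    """Exclusively create a file in directory, then inspect its attributes.

    Returns a dict with the permission bits and whether the mode and
    mtime look sane. The probe file is removed afterwards.
    """
    path = os.path.join(directory, "excl-probe-%d.tmp" % os.getpid())
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    os.close(fd)
    try:
        st = os.stat(path)
    finally:
        os.unlink(path)
    mode = stat.S_IMODE(st.st_mode)
    age = abs(time.time() - st.st_mtime)
    return {
        "mode": oct(mode),
        "mode_ok": mode != 0,      # ---------- would show up as mode 0
        "mtime_ok": age < 3600,    # a 1970 timestamp fails this check
    }


if __name__ == "__main__":
    # Pass a directory on the NFSv4 mount; the temp dir is only a fallback.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    print(check_exclusive_create(target))
```

If the probe reports a zero mode or an implausible mtime only when the server runs 2.6.18-164.el5, that would match the leftover file shown above.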
Bug 524520 was what I was thinking too... the RHEL5 test kernels on my people.redhat.com page have patches to fix that bug:
...would you be able to test those someplace non-critical and let us know if they fix the problem?
I just booted my development server to 2.6.18-166.el5.jtltest.88 and a quick test seems to indicate that this fixes the problem. I didn't change the kernel on the client so it is still running the stock 2.6.18-164 kernel. I haven't done extensive testing but it definitely looks like this addresses the issue. Thanks!
Thanks for testing it. I'll go ahead and close this as a duplicate. Please reopen if it looks like it's not.
*** This bug has been marked as a duplicate of bug 524520 ***