Bug 523797

Summary: RHEL5.4 kernel (2.6.18-164.el5) breaks nfsv4 file locking
Product: Red Hat Enterprise Linux 5
Reporter: Rob Henderson <robh>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent
Priority: low
Version: 5.4
CC: jlayton, mb--redhat, sprabhu, steved
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-10-08 18:34:12 UTC
Attachments:
  tcpdump output (flags: none)
  strace of sshd at login (flags: none)

Description Rob Henderson 2009-09-16 17:27:24 UTC
Description of problem:


NFSv4 file locking fails when running the 2.6.18-164.el5 kernel on our primary home directory server but works fine if I revert to the previous 2.6.18-128.7.1.el5 kernel.


How reproducible:  Always


Steps to Reproduce:
1. Boot the nfsv4 server to 2.6.18-164.el5.
2. On an nfsv4 client, programs that rely on file locking fail.
3. Boot the nfsv4 server to 2.6.18-128.7.1.el5.
4. On the nfsv4 client, all is well again.
  

Additional info:

We rebooted our primary home directory server today to the latest
RHEL5.4 kernel (2.6.18-164.el5) and things went very badly.  It looks
like file locking was broken with the symptoms being that
firefox/thunderbird were failing in ways I've seen before when file
locking was misbehaving.  Also, ssh connections were complaining about
being unable to lock .Xauthority.  We rebooted back to the previous
kernel (2.6.18-128.7.1.el5) and that solved the problem.  We were able
to verify that the exact same thing happened to another similarly
configured system so it wasn't limited to this one machine.   In both
cases, the only change required to fix it was to revert to the previous
kernel.

I should note that the nfsv4 *client* systems are running a variety of
kernels (including 2.6.18-164) and are all working fine now so the only
issue seems to be with 2.6.18-164 on the servers.

I'm afraid we weren't able to do much debugging in the heat of the
battle, so we don't have any more information than this. If further
debugging is needed, I could bring up a test server that we can play with.

Comment 1 Matt Bernstein 2009-10-08 07:23:56 UTC
We see this too. vim is unusable. Rolling the server kernel back to 2.6.18-128.7.1.el5 makes the problem go away.

Comment 2 Steve Dickson 2009-10-08 15:26:27 UTC
Are there any error messages being logged to /var/log/messages?

Comment 3 Jeff Layton 2009-10-08 15:50:39 UTC
I just tested this on my rhel5 test box and didn't see an issue...

A bit more info would be helpful. We need to understand what's happening at the system call level. You say:

"on an nfsv4 client, things relying on file locking fail"

...what's happening here, exactly? Are fcntl calls returning errors when they shouldn't? An strace of such a program would be helpful.
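
One way to answer the fcntl question directly is a small probe run on the client against a file in the NFSv4-mounted home directory. The following is only a minimal sketch, not something from this report: the function name `try_lock` is hypothetical, and it assumes Python is available on the client. It requests a non-blocking exclusive POSIX lock and reports the errno name if the request fails.

```python
import errno
import fcntl
import os


def try_lock(path):
    """Try a non-blocking exclusive POSIX lock on path.

    Returns (True, None) on success, or (False, errno_name) on failure.
    """
    # Create the file if needed so the probe works on a fresh path.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # lockf() is implemented with fcntl() record locks; LOCK_NB makes
        # the call fail immediately instead of waiting for the lock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    else:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True, None
    finally:
        os.close(fd)
```

Calling `try_lock()` on a scratch file under the affected mount, once with each server kernel, would show whether the fcntl path itself returns an error. Running it under strace would capture the same information at the system call level.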

Comment 4 Sachin Prabhu 2009-10-08 16:01:32 UTC
Just wondering if the issue seen here is actually the regression seen in bz 524520.

Comment 5 Rob Henderson 2009-10-08 16:34:04 UTC
I don't see anything of interest being logged on the server or the client.  However, I have a test server and client set up so it would be trivial for me to enable any type of debugging that might give useful information.  Just let me know.

BTW, the simplest demonstration of the problem I have is to just ssh to a client that is getting my homedir via nfsv4 from a server running 2.6.18-164.  I have ForwardX11 set so it looks like it tries to lock the .Xauthority and fails:

    [robh@robwilco robh]$ ssh test
    robh@test's password: 
    Last login: Thu Oct  8 12:09:46 2009 from robwilco.cs.indiana.edu
    /usr/bin/xauth:  error in locking authority file /u/robh/.Xauthority
    -bash-3.2$

I will attach a tcpdump showing the client<->server nfs traffic for one such ssh login.

Comment 6 Rob Henderson 2009-10-08 16:38:16 UTC
Created attachment 364152
tcpdump output

In this dump, the nfsv4 server is curie.cs.indiana.edu (129.79.246.140) and the client is test.cs.indiana.edu (129.79.245.31).  This was captured on the server with:

  tcpdump -s 0 -w /tmp/nfsv4locks.pcap host test.cs.indiana.edu

while I logged into test via ssh.

Comment 7 Rob Henderson 2009-10-08 16:57:29 UTC
Created attachment 364158
strace of sshd at login

I generated this strace by running the following on the client while I logged in via ssh:

  strace -f -v -o /tmp/strace.out -p PID_OF_SSHD

Also of note is that after the login and the error about locking .Xauthority I'm left with the following in my homedir:

  ---------- 1 robh staff    0 Jan 14  1970 .Xauthority-c

Perhaps this is similar to bz 524520... ???
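
The leftover `.Xauthority-c` file, with mode 0 and a 1970 timestamp, hints that the file was created with bad attributes rather than that an fcntl lock was refused. As far as I know, xauth's lock is built from an exclusive create of a `-c` file followed by a hard link, but treat that as an assumption, along with the idea that this is the mechanism behind bz 524520. A rough sketch of a probe for it follows; the function name `exclusive_create_probe` and the one-day threshold are hypothetical. It creates a file with `O_CREAT|O_EXCL`, reads back the stored attributes, and removes the file again.

```python
import os
import stat
import time


def exclusive_create_probe(path):
    """Create path with O_CREAT|O_EXCL and report the attributes stored for it."""
    # O_EXCL makes the open fail with EEXIST if the file already exists.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    os.close(fd)

    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    # How far the stored mtime is from the local clock, in seconds.
    age = abs(time.time() - st.st_mtime)
    os.unlink(path)

    return {
        "mode": oct(mode),
        "mtime_age_seconds": age,
        # Assumption: mode 0 or an mtime more than a day off counts as
        # suspicious (a 1970 timestamp would be decades off).
        "looks_sane": mode != 0 and age < 86400,
    }
```

Run against a scratch path in the NFSv4-mounted home directory, a result with mode `0o0` or a very large `mtime_age_seconds` would match the listing above. Note that the requested mode is still subject to the caller's umask.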

Comment 8 Jeff Layton 2009-10-08 17:11:18 UTC
bug 524520 was what I was thinking too...the RHEL5 test kernels on my people.redhat.com page have patches to fix that bug:

http://people.redhat.com/jlayton/

...would you be able to test those someplace non-critical and let us know if they fix the problem?

Comment 9 Rob Henderson 2009-10-08 18:11:16 UTC
I just booted my development server to 2.6.18-166.el5.jtltest.88 and a quick test seems to indicate that this fixes the problem.  I didn't change the kernel on the client so it is still running the stock 2.6.18-164 kernel.  I haven't done extensive testing but it definitely looks like this addresses the issue.  Thanks!

Comment 10 Jeff Layton 2009-10-08 18:34:12 UTC
Thanks for testing it. I'll go ahead and close this as a duplicate. Please reopen if it looks like it's not.

*** This bug has been marked as a duplicate of bug 524520 ***