Bug 503862

Summary: NFS locks up randomly
Product: [Fedora] Fedora Reporter: Gordon Messmer <gordon.messmer>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 11 CC: cmc, itamar, kernel-maint, prgarcial
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-08-07 19:20:05 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                                Flags
tcpdump packet capture file                none
kernel messages from /var/log/messages     none

Description Gordon Messmer 2009-06-03 04:25:16 UTC
Created attachment 346351
tcpdump packet capture file

Description of problem:
NFS seems very flaky on kernel-2.6.29.4-167.fc11.x86_64.  The NFS server is CentOS, kernel-2.6.18-128.1.10.el5.x86_64.  Several times today, the NFS client has locked up.  The rest of the system seems fine, but any process accessing files on the NFS mount will be frozen.  The logs contain very little information:

Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:36:43 herald kernel: nfs: server ascension not responding, still trying

Oddly, after that sequence of messages, the client does not appear to actually be "still trying".  Network packet capture reveals only ACK packets between the two hosts, about every two minutes.  I don't think there's anything revealing in there, but I'll include the capture file anyway.

I hadn't seen the problem on kernel-2.6.29.3-140.fc11.x86_64 yet, so I'm trying that kernel out now.  However, I can trigger a similar hang by restarting the "nfslock" service on the NFS server when the client is running that version.

Version-Release number of selected component (if applicable):
kernel-2.6.29.3-140.fc11.x86_64

How reproducible:
Randomly

Comment 1 Bug Zapper 2009-06-09 17:00:46 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 2 Gordon Messmer 2009-06-12 06:32:33 UTC
I enabled debugging with:

echo 1 > /proc/sys/sunrpc/nfs_debug 
echo 2048 > /proc/sys/sunrpc/rpc_debug 

... and for whatever reason, the system was stable for much longer with debugging enabled.  Probably coincidence.

I'm attaching a file that contains kernel entries from the messages log starting just before NFS hung until the system shut down.

NFS hung at about 22:27:06.  About 35 seconds later I switched to a tty, which caused some [drm] messages.  All such messages were caused by me switching from X to a tty or vice versa.  I tried to use "df" in a console login session, which hung.  When I cancelled that with "ctrl+c", the kernel printed nfs_statfs errors.  All such errors in the log were the result of "df", cancelled with "ctrl+c".

I don't know if it's related, but I'm completely unable to use NFS over UDP.  I thought it might be another useful debugging step, but when I try that, I can see (using tcpdump) the client sending readdir packets to the server, which never replies.

Comment 3 Gordon Messmer 2009-06-12 06:33:29 UTC
Created attachment 347512
kernel messages from /var/log/messages

Comment 4 Gordon Messmer 2009-07-02 05:16:18 UTC
After turning off beagle, my system was stable for what seemed like a longer than average period.  Only a couple of hours after thinking about that fact, it locked up again.  This time, I was looking at a file while it was downloading.  I tried to repeat this, and sure enough, it seems like a reliable test case.

Test case:

1: reboot NFS client
2: log in on two tty consoles
3: in one console, begin copying a large file (like the Fedora DVD image) to a new file (cp Fedora-11-x86_64.iso tmpa)
4: in the second console, copy the same file to a different new file (cp Fedora-11-x86_64.iso tmpb)

I had to do this twice to cause the system to hang.  This is the simplest test case that I can provide to demonstrate the bug.
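
Steps 3 and 4 of the test case can be sketched as a small script. This is a minimal sketch, not the exact commands that were run: the function name `concurrent_copy` and its argument order are hypothetical, and it assumes it is run from a directory on the NFS mount.

```shell
#!/bin/sh
# Hypothetical helper: copy one source file to two destinations at the
# same time, mirroring steps 3 and 4 of the test case.
concurrent_copy() {
    src=$1
    dst_a=$2
    dst_b=$3

    # Start both copies in the background so they overlap.
    cp "$src" "$dst_a" &
    pid_a=$!
    cp "$src" "$dst_b" &
    pid_b=$!

    # Wait for both; report failure if either copy failed.
    status=0
    wait "$pid_a" || status=1
    wait "$pid_b" || status=1
    return "$status"
}

# Example (run from a directory on the NFS mount):
#   concurrent_copy Fedora-11-x86_64.iso tmpa tmpb
```

If the mount hangs, both `cp` processes block and the function never returns, so it is worth running it from a console that can be abandoned.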

Comment 5 Gordon Messmer 2009-07-23 18:22:50 UTC
As an update, kernel 2.6.29.6-213.fc11.x86_64 does not resolve this issue.

Comment 6 Gordon Messmer 2009-08-07 19:20:05 UTC
Going back to the UDP problem helped me figure this out.  The client's MTU had been reset to 1500 from 9000.  NFS seems to be the only thing that ever broke as a result.  This was a local configuration error, not a bug.
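
For anyone seeing the same symptom, a check along these lines would have caught the mismatch earlier. This is a sketch under stated assumptions: the function name `check_mtu` is hypothetical, the interface name `eth0` is only an example, and the optional third argument (an alternate sysfs root) exists only to make the function easy to test. It reads the value Linux exposes in `/sys/class/net/<iface>/mtu`.

```shell
#!/bin/sh
# Hypothetical helper: compare an interface's MTU with the expected value.
# Returns 0 on match, 1 on mismatch, 2 if the MTU cannot be read.
check_mtu() {
    iface=$1
    expected=$2
    sysfs_root=${3:-/sys/class/net}   # third argument is for testing only
    mtu_file="$sysfs_root/$iface/mtu"

    if [ ! -r "$mtu_file" ]; then
        echo "cannot read $mtu_file" >&2
        return 2
    fi

    actual=$(cat "$mtu_file")
    if [ "$actual" -eq "$expected" ]; then
        echo "$iface: MTU $actual (ok)"
        return 0
    fi
    echo "$iface: MTU $actual, expected $expected"
    return 1
}

# Example: check_mtu eth0 9000
```

With iputils ping, `ping -M do -s 8972 ascension` is another way to test the path end to end: 8972 bytes of payload plus the 8-byte ICMP header and 20-byte IPv4 header fill a 9000-byte frame, and `-M do` forbids fragmentation, so the ping fails if either end or anything in between is not passing jumbo frames.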