Bug 503862 - NFS locks up randomly
Summary: NFS locks up randomly
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 11
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-06-03 04:25 UTC by Gordon Messmer
Modified: 2009-08-07 19:20 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-08-07 19:20:05 UTC
Type: ---
Embargoed:


Attachments
tcpdump packet capture file (352 bytes, application/octet-stream)
2009-06-03 04:25 UTC, Gordon Messmer
kernel messages from /var/log/messages (13.46 KB, text/plain)
2009-06-12 06:33 UTC, Gordon Messmer

Description Gordon Messmer 2009-06-03 04:25:16 UTC
Created attachment 346351 [details]
tcpdump packet capture file

Description of problem:
NFS seems very flaky on kernel-2.6.29.4-167.fc11.x86_64.  The NFS server is CentOS, kernel-2.6.18-128.1.10.el5.x86_64.  Several times today, the NFS client has locked up.  The rest of the system seems fine, but any process accessing files on the NFS mount will be frozen.  The logs contain very little information:

Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:36:43 herald kernel: nfs: server ascension not responding, still trying

Oddly, after that sequence of messages, the client does not appear to actually be "still trying".  Network packet capture reveals only ACK packets between the two hosts, about every two minutes.  I don't think there's anything revealing in there, but I'll include the capture file anyway.

I hadn't seen the problem on kernel-2.6.29.3-140.fc11.x86_64 yet, so I'm trying that kernel out now.  However, I can trigger a similar hang by restarting the "nfslock" service on the NFS server when the client is running that version.

Version-Release number of selected component (if applicable):
kernel-2.6.29.3-140.fc11.x86_64

How reproducible:
Randomly

Comment 1 Bug Zapper 2009-06-09 17:00:46 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 2 Gordon Messmer 2009-06-12 06:32:33 UTC
I enabled debugging with:

echo 1 > /proc/sys/sunrpc/nfs_debug 
echo 2048 > /proc/sys/sunrpc/rpc_debug 

... and for whatever reason, the system was stable for much longer with debugging enabled.  Probably coincidence.

I'm attaching a file that contains kernel entries from the messages log starting just before NFS hung until the system shut down.

NFS hung at about 22:27:06.  About 35 seconds later I switched to a tty, which caused some [drm] messages.  All such messages were caused by me switching from X to a tty or vice versa.  I tried to use "df" in a console login session, which hung.  When I cancelled that with "ctrl+c", the kernel printed nfs_statfs errors.  All such errors in the log were the result of "df", cancelled with "ctrl+c".

I don't know if it's related, but I'm completely unable to use NFS over UDP.  I thought it might be another useful debugging step, but when I try that, I can see (using tcpdump) the client sending readdir packets to the server, which never replies.

Comment 3 Gordon Messmer 2009-06-12 06:33:29 UTC
Created attachment 347512 [details]
kernel messages from /var/log/messages

Comment 4 Gordon Messmer 2009-07-02 05:16:18 UTC
After turning off beagle, my system was stable for what seemed like a longer-than-average period.  Only a couple of hours after I noticed that, it locked up again.  This time, I was looking at a file while it was downloading.  I tried to repeat this, and sure enough, it seems like a reliable test case.

Test case:

1: reboot NFS client
2: log in on two tty consoles
3: in one console, begin copying a large file (like the Fedora DVD image) to a new file (cp Fedora-11-x86_64.iso tmpa)
4: in the second console, copy the same file to a different new file (cp Fedora-11-x86_64.iso tmpb)

I had to do this twice to cause the system to hang.  This is the simplest test case that I can provide to demonstrate the bug.
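
The two simultaneous copies in steps 3 and 4 could also be started from one shell, which may make the test easier to repeat.  This is only a sketch: the function name concurrent_copy and its arguments are hypothetical, and it assumes the source file and the destination directory both sit on the NFS mount.

```shell
#!/bin/sh
# Sketch: run two simultaneous copies of the same source file,
# mirroring steps 3 and 4 of the test case.
# concurrent_copy is a hypothetical helper, not part of any package.
#   $1 = source file (for example, a large ISO on the NFS mount)
#   $2 = directory that receives the copies tmpa and tmpb
concurrent_copy() {
    src=$1
    dest_dir=$2

    # Start both copies in the background so they overlap.
    cp "$src" "$dest_dir/tmpa" &
    pid_a=$!
    cp "$src" "$dest_dir/tmpb" &
    pid_b=$!

    # Collect each exit status; if the mount hangs, the script blocks here.
    wait "$pid_a"; status_a=$?
    wait "$pid_b"; status_b=$?

    # Succeed only when both copies completed without error.
    [ "$status_a" -eq 0 ] && [ "$status_b" -eq 0 ]
}
```

For example, `concurrent_copy Fedora-11-x86_64.iso .` run from a directory on the NFS mount.  If the hang occurs, neither `wait` returns, so a second tty is still useful for watching the logs.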

Comment 5 Gordon Messmer 2009-07-23 18:22:50 UTC
As an update, kernel 2.6.29.6-213.fc11.x86_64 does not resolve this issue.

Comment 6 Gordon Messmer 2009-08-07 19:20:05 UTC
Going back to the UDP problem helped me figure this out.  The client's MTU had been reset to 1500 from 9000.  NFS seems to be the only thing that ever broke as a result.  This was a local configuration error, not a bug.
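
For anyone who lands here with similar symptoms, a check along these lines might catch the same misconfiguration.  It is a minimal sketch, not what was actually run: the helper names and the SYSFS_NET override are hypothetical, and it assumes a Linux client that exposes the interface MTU under /sys/class/net.

```shell
#!/bin/sh
# Sketch: compare an interface's current MTU with the value the network expects.
# SYSFS_NET is a hypothetical override, present only so the helpers can be
# pointed at a test directory instead of the real sysfs tree.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Print the MTU the kernel currently reports for interface $1.
current_mtu() {
    cat "$SYSFS_NET/$1/mtu"
}

# Return 0 when interface $1 has MTU $2, 1 on a mismatch,
# and 2 when the MTU could not be read.
check_mtu() {
    actual=$(current_mtu "$1") || return 2
    if [ "$actual" -ne "$2" ]; then
        echo "MTU mismatch on $1: have $actual, expected $2"
        return 1
    fi
    echo "MTU on $1 is $actual, as expected"
}

# Largest ICMP echo payload that fits in one unfragmented IPv4 packet of
# MTU $1: subtract the 20-byte IPv4 header (no options) and the 8-byte
# ICMP header.
ping_payload() {
    echo $(( $1 - 28 ))
}
```

With a hypothetical interface name of eth0, `check_mtu eth0 9000` would have reported the 1500 value here.  On systems with iputils ping, `ping -M do -s "$(ping_payload 9000)" -c 3 ascension` sends full-size packets with fragmentation prohibited, which is one way to confirm whether jumbo frames actually pass between client and server.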

