Bug 503862 - NFS locks up randomly
Product: Fedora
Classification: Fedora
Component: kernel
Hardware: All
OS: Linux
Priority: low
Severity: medium
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2009-06-03 00:25 EDT by Gordon Messmer
Modified: 2009-08-07 15:20 EDT
Doc Type: Bug Fix
Last Closed: 2009-08-07 15:20:05 EDT

Attachments
tcpdump packet capture file (352 bytes, application/octet-stream)
2009-06-03 00:25 EDT, Gordon Messmer
kernel messages from /var/log/messages (13.46 KB, text/plain)
2009-06-12 02:33 EDT, Gordon Messmer

Description Gordon Messmer 2009-06-03 00:25:16 EDT
Created attachment 346351
tcpdump packet capture file

Description of problem:
NFS seems very flaky on kernel-  The NFS server is CentOS, kernel-2.6.18-128.1.10.el5.x86_64.  Several times today, the NFS client has locked up.  The rest of the system seems fine, but any process accessing files on the NFS mount will be frozen.  The logs contain very little information:

Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:35:43 herald kernel: nfs: server ascension not responding, still trying
Jun  2 20:36:43 herald kernel: nfs: server ascension not responding, still trying

Oddly, after that sequence of messages, the client does not appear to actually be "still trying".  Network packet capture reveals only ACK packets between the two hosts, about every two minutes.  I don't think there's anything revealing in there, but I'll include the capture file anyway.

I hadn't seen the problem on kernel- yet, so I'm trying that kernel out now.  However, I can trigger a similar hang by restarting the "nfslock" service on the NFS server when the client is running that version.

Version-Release number of selected component (if applicable):

How reproducible:
Comment 1 Bug Zapper 2009-06-09 13:00:46 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
Comment 2 Gordon Messmer 2009-06-12 02:32:33 EDT
I enabled debugging with:

echo 1 > /proc/sys/sunrpc/nfs_debug 
echo 2048 > /proc/sys/sunrpc/rpc_debug 

... and for whatever reason, the system was stable for much longer with debugging enabled.  Probably coincidence.

I'm attaching a file that contains kernel entries from the messages log starting just before NFS hung until the system shut down.

NFS hung at about 22:27:06.  About 35 seconds later I switched to a tty, which caused some [drm] messages.  All such messages were caused by me switching from X to a tty or vice versa.  I tried to use "df" in a console login session, which hung.  When I cancelled that with "ctrl+c", the kernel printed nfs_statfs errors.  All such errors in the log were the result of "df", cancelled with "ctrl+c".

I don't know if it's related, but I'm completely unable to use NFS over UDP.  I thought it might be another useful debugging step, but when I try that, I can see (using tcpdump) the client sending readdir packets to the server, which never replies.
Comment 3 Gordon Messmer 2009-06-12 02:33:29 EDT
Created attachment 347512
kernel messages from /var/log/messages
Comment 4 Gordon Messmer 2009-07-02 01:16:18 EDT
After turning off beagle, my system was stable for what seemed like a longer-than-average period.  Only a couple of hours after I noticed that, it locked up again.  This time, I was viewing a file while it was still downloading.  I tried to repeat this, and sure enough, it seems to be a reliable test case.

Test case:

1: reboot NFS client
2: log in on two tty consoles
3: in one console, begin copying a large file (like the Fedora DVD image) to a new file (cp Fedora-11-x86_64.iso tmpa)
4: in the second console, copy the same file to a different new file (cp Fedora-11-x86_64.iso tmpb)

I had to do this twice to cause the system to hang.  This is the simplest test case that I can provide to demonstrate the bug.
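
For anyone who wants to repeat steps 3 and 4 without juggling two consoles, here is a rough sketch that starts both copies from a single shell.  The function name concurrent_copy is made up for illustration; it is not part of any package.  It assumes the destination directory is on the NFS mount, and it does not cover steps 1 and 2 (the reboot and the two logins).

```shell
#!/bin/sh
# Sketch of steps 3 and 4: two concurrent copies of the same large file.
# concurrent_copy is a hypothetical helper name, not an existing tool.

# Usage: concurrent_copy SOURCE DESTDIR
# Starts both copies in the background, then waits for each in turn.
# Returns 0 only if both copies finish successfully.
concurrent_copy() {
    src=$1
    destdir=$2

    cp "$src" "$destdir/tmpa" &
    pid_a=$!
    cp "$src" "$destdir/tmpb" &
    pid_b=$!

    status=0
    wait "$pid_a" || status=1
    wait "$pid_b" || status=1
    return "$status"
}
```

Run from the directory on the NFS mount, the call would be something like "concurrent_copy Fedora-11-x86_64.iso .".  If the mount hangs the way it did here, the function simply never returns, so it may be worth running it from a console you can afford to lose.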
Comment 5 Gordon Messmer 2009-07-23 14:22:50 EDT
As an update, kernel does not resolve this issue.
Comment 6 Gordon Messmer 2009-08-07 15:20:05 EDT
Going back to the UDP problem helped me figure this out.  The client's MTU had been reset to 1500 from 9000.  NFS seems to be the only thing that ever broke as a result.  This was a local configuration error, not a bug.
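
In case it helps someone else who lands here, this is a sketch of how an MTU mismatch like this one could be checked.  The function names are made up for illustration, and the interface name (eth0 below) is an assumption; substitute your own.  It reads the configured MTU from sysfs and sends pings with the don't-fragment bit set, sized to fill one packet at the MTU you expect to work.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for checking an MTU mismatch.

# Print the configured MTU of a network interface, read from sysfs.
iface_mtu() {
    cat "/sys/class/net/$1/mtu"
}

# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) and 8 bytes of ICMP header.
max_ping_payload() {
    echo $(( $1 - 28 ))
}

# Send pings with the don't-fragment bit set.  They should only succeed
# if both end hosts and every hop between them handle that packet size.
probe_path_mtu() {
    host=$1
    mtu=$2
    ping -M do -c 3 -s "$(max_ping_payload "$mtu")" "$host"
}
```

For example, "iface_mtu eth0" on the client would have shown 1500 here, and "probe_path_mtu ascension 9000" would be expected to fail until the client's MTU was set back to 9000.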
