Bug 168894 - df occasionally hangs when accessing a netapp filer
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-09-21 00:57 UTC by Chris Walker
Modified: 2007-11-30 22:11 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-05-05 21:20:00 UTC
Type: ---
Embargoed:


Attachments
bzip'ed output of 'echo t > /proc/sysrq-trigger' (13.21 KB, application/octet-stream)
2005-09-21 01:02 UTC, Chris Walker
ethereal packet trace during execution of 'df' (2.79 KB, application/octet-stream)
2005-09-21 01:04 UTC, Chris Walker
sysrq-m memory trace and sysrq-t process trace (bzipped) on a 2.6.12 kernel machine (9.83 KB, application/binary)
2005-10-14 20:42 UTC, Chris Walker
excerpt of /var/log/messages with nfs_debug set to 16 during an NFS hang (273 bytes, application/binary)
2005-10-17 23:17 UTC, Chris Walker
packet trace during NFS hang (18.29 KB, application/octet-stream)
2005-10-17 23:18 UTC, Chris Walker

Description Chris Walker 2005-09-21 00:57:00 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6

Description of problem:
We have a 1500 node Linux cluster. Occasionally, processes will hang accessing files on a NetApp filer. If I run 'df' on these machines, it will hang trying to statfs() the filesystem that's hung. The following error is printed in the messages file:

Sep 20 17:55:34 u6311 kernel: nfs_statfs: statfs error = 512

These machines have 8GB of memory and are 3.4GHz Intel Xeons. I'm attaching a couple of packet traces while df is being run on an affected machine. I'm also including the output produced by 'echo t > /proc/sysrq-trigger'.

Version-Release number of selected component (if applicable):
kernel-2.6.8-521

How reproducible:
Sometimes

Steps to Reproduce:
1. This can't really be reproduced on purpose. Statistically, it will occur every day.

Additional info:

Comment 1 Chris Walker 2005-09-21 01:02:09 UTC
Created attachment 119055 [details]
bzip'ed output of 'echo t > /proc/sysrq-trigger'

Comment 2 Chris Walker 2005-09-21 01:04:46 UTC
Created attachment 119056 [details]
ethereal packet trace during execution of 'df'

The command I used to generate this is:

tethereal -w ~/pkts2 "host share-master and (port nfs or port rpc)"

Comment 3 Steve Dickson 2005-09-23 00:09:42 UTC
Is that the entire ethereal trace??? All I'm seeing is NFS_FSSTATs, and they
are all successful... What we need to look for is NFS_FSSTATs that
are not successful. Try using the following display filter:
nfs.procedure_v3 == 18 and nfs.status != 0
to try and catch the unsuccessful ones.
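
For example, applied to the saved capture (a sketch, assuming the capture from
comment 2 was written to ~/pkts2):

    tethereal -r ~/pkts2 -R "nfs.procedure_v3 == 18 and nfs.status != 0"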

In that sysrq trace, it shows two df processes stuck in the exact same place. They
both appear to be trying to do a buffer allocation, so I'm wondering if
this is a memory issue... To find out, I would like to increase the verbosity
of the kernel logging by doing a sysctl -w kernel.printk=7.

Then, when this happens again, I would like you to turn on some
sunrpc debugging by doing 'echo 2 > /proc/sys/sunrpc/rpc_debug'
to see if the df process is spinning trying to get memory...

Finally, when this happens, use the sysrq-m command to dump the
memory usage, and I'll have Larry take a look at it to see
what he thinks...
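
Put together, the suggested debugging steps look like this (a sketch; all
commands must be run as root):

    sysctl -w kernel.printk=7             # raise kernel log verbosity
    echo 2 > /proc/sys/sunrpc/rpc_debug   # enable sunrpc call debugging
    echo m > /proc/sysrq-trigger          # sysrq-m: dump memory usage to the log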




Comment 4 Chris Walker 2005-09-23 18:12:12 UTC
I'll do a more specific trace, but that is the entire trace. On a machine where
the first NFS filesystem is the problematic one, no packets are sent, yet strace
shows that statfs() was called.
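
One way to watch this directly (a sketch; the mount point path is hypothetical):

    # df limited to the hung mount; statfs() blocks with no packets on the wire
    strace -e trace=statfs df /mnt/netapp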

This was printed in /var/log/messages after turning on sunrpc debugging and
running df:

Sep 23 11:07:00 u6354 kernel: RPC: 34382 call_start nfs3 proc 18 (sync)
Sep 23 11:07:00 u6354 kernel: RPC: 34382 call_reserve

I think the "nfs_statfs: statfs error = 512" error might be a red herring. This
error appears after I hit ctrl-C when df is hung. This is expected behavior when
ctrl-C interrupts an NFS operation.

Comment 7 Steve Dickson 2005-09-26 13:13:22 UTC
Ok... let's assume the statfs error is a red herring, since
ERESTARTSYS (error 512) is returned for interrupts, but this
does not answer the question of why the df is hanging in
the first place.

In the network trace, also look for either RPC or TCP
retransmits. It's a long shot, but maybe the server or network
is dropping packets under load.... Plus, that SysRq-m
output could show if there are memory issues...

Finally, would it be possible to try a more recent kernel for testing
purposes? As the guy from Angstrom points out, there has been a
lot of work done on both the NFS and RPC code which would be
included in a later FC3 or even FC4 kernel. Please note it is very
possible to run a later kernel on an older installed base. Currently I'm
running an FC5 kernel on an FC3 installed base without problems...


Comment 8 Chris Walker 2005-09-26 16:17:58 UTC
Will do. We have a 2.6.12 kernel that's just coming out of testing here. (I
think it might be from FC4). We'll try this on some of the affected machines and
see if it helps. Do you know of any specific changes between 2.6.8 and 2.6.12
that would address this problem?

Comment 9 Steve Dickson 2005-09-27 15:05:47 UTC
No... Although I do monitor all the upstream patches that are added
to both the -mm and main upstream kernels, as well as scan web
sites like http://client.linux-nfs.org/ for patches that have not yet
made it (or never will) into an upstream kernel (to "cherry pick" for
RHEL releases), I really don't pay too much attention to which patches
are or are not in the FC stream, since it's always pretty close to
the current upstream kernel....

Comment 10 Chris Walker 2005-10-11 00:36:04 UTC
It looks like 2.6.12-1.1372_FC3 resolves this in our limited test group. We're
rolling this out to a larger group, but it looks promising. Can you keep this
bug open until we're sure it's fixed?

Comment 11 Steve Dickson 2005-10-11 08:22:10 UTC
Good! It's good to hear you're making progress... As far as closing
this bug, do it whenever you feel it's necessary.

Comment 12 Chris Walker 2005-10-13 16:06:39 UTC
Spoke too soon, unfortunately. After rolling 2.6.12-1.1372_FC3 to several
machines, we've now seen this behavior on those machines as well. What can we
try next?

Comment 13 Steve Dickson 2005-10-14 19:47:52 UTC
Boy, you're not making my day... :)

Could you please post a new sysrq-t process trace as
well as a sysrq-m memory trace?

Comment 14 Chris Walker 2005-10-14 20:42:15 UTC
Created attachment 119998 [details]
sysrq-m memory trace and sysrq-t process trace (bzipped) on a 2.6.12 kernel machine

Comment 16 Steve Dickson 2005-10-15 10:35:51 UTC
Boy, it sure appears the NFS client is fat, dumb, and happy, waiting for
a response from the server... Is that nfs_statfs failure still being logged
to /var/log/messages? Are there any other types of errors being logged?

In the previous ethereal trace, it showed the nfs_fsstat being responded
to, but it is not clear whether those responses were for the dfs that worked or
the dfs that got hung... so once you're in this state, I want you to:
1) echo 16 > /proc/sys/sunrpc/nfs_debug, which will log NFS requests
   and responses to /var/log/messages,
2) start up a 'tethereal -w /tmp/data.pcap host server' on the client, and
3) try another df (which should also hang), as in the sketch below.
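
Put together (a sketch; "server" stands in for the filer's hostname):

    echo 16 > /proc/sys/sunrpc/nfs_debug        # log NFS requests/responses
    tethereal -w /tmp/data.pcap host server &   # capture traffic to/from the filer
    df                                          # expected to hang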


Comment 17 Chris Walker 2005-10-17 23:15:05 UTC
I'm attaching the chunk of /var/log/messages with debugging on during the df and
another packet trace. This time I only ran df on the hanging filesystem, so
there shouldn't be any successful statfs() calls.

Comment 18 Chris Walker 2005-10-17 23:17:25 UTC
Created attachment 120101 [details]
excerpt of /var/log/messages with nfs_debug set to 16 during an NFS hang

Comment 19 Chris Walker 2005-10-17 23:18:40 UTC
Created attachment 120102 [details]
packet trace during NFS hang

Comment 20 Steve Dickson 2005-10-18 11:24:46 UTC
Well, unfortunately that was a bit unfruitful.... I didn't see any
statfs() in either the network trace or the debug log...

When the df hangs, do all the other processes hang as well?
Also, are there any types of errors or warnings being logged
on the server?

Comment 21 Chris Walker 2005-10-18 18:01:20 UTC
Yeah, I don't think statfs is getting sent. Any process that tries to access the
affected mount will hang. There don't seem to be any errors logged.

Comment 23 Chris Walker 2005-11-07 18:26:11 UTC
We've also noticed that it usually takes about 10 days (after boot) for machines
to find themselves in this state.

Comment 25 Chris Walker 2005-12-02 19:38:10 UTC
I think we may have solved this. The Intel Baseboard Management Controller that
we have on our machines shares the IP address of the host and uses port 623/tcp
for some things. Apparently, this port (which falls in the reserved range) is
occasionally picked by NFS as well and can cause these problems. I've put in some
xinetd config files to occupy those ports (sketched below), and the hanging seems
to have stopped.

We had run into this problem with 623/udp and NIS earlier and solved the problem
the same way. I wish I'd thought of it sooner....
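
For illustration, a config of roughly this shape would occupy 623/tcp (a sketch;
the service and file names are hypothetical, since the actual files were not posted):

    # /etc/xinetd.d/bmc-port-tcp
    service bmc-port-tcp
    {
            type            = UNLISTED
            port            = 623
            socket_type     = stream
            protocol        = tcp
            wait            = no
            user            = root
            server          = /bin/false
    }

The 623/udp case is analogous (socket_type = dgram, wait = yes).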

Please leave this open for a few days so we can ensure that the problem is solved.

Comment 26 Steve Dickson 2005-12-05 13:25:46 UTC
Interesting... 623 (and other reserved ports) are used by both rpc.mountd and the
nfs client during mounts.... But I don't understand why this could cause a
hang after the mount occurs.... Unless the df is being done on an autofs-controlled
mount point, which means we could be hanging in autofs land...

Comment 27 Chris Walker 2005-12-05 15:47:27 UTC
That is interesting. Sometimes the affected mount is an autofs mount, but many
times it isn't. I'm not sure how to explain it. After five or so days, the
results are undeniable.

Comment 28 Dave Jones 2006-01-16 22:11:37 UTC
This is a mass-update to all currently open Fedora Core 3 kernel bugs.

Fedora Core 3 support has transitioned to the Fedora Legacy project.
Due to the limited resources of this project, typically only
updates for new security issues are released.

As this bug isn't security related, it has been migrated to a
Fedora Core 4 bug.  Please upgrade to this newer release, and
test if this bug is still present there.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

Thank you.


Comment 29 Dave Jones 2006-02-03 07:09:01 UTC
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.


Comment 30 John Thacker 2006-05-05 21:20:00 UTC
Closing per previous comment.

