From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030120

Description of problem:
I use am-utils to get /net/<hostname> to mount all NFS filesystems offered by hostname. When I access /net/free on host free itself, it creates a link to /.automount/free/root and, in there, it mounts /l (I'd have hoped for a --bind mount, but NFS should work nevertheless).

If I do some heavy access on this filesystem, both reads and writes (say, building a toolchain), it eventually hangs the entire filesystem, such that not even accesses that don't go through the NFS mount work any more. Processes accessing the filesystem, both through the NFS mount and otherwise, start to hang in I/O wait and become unkillable. The only solution is a hard reset of the system.

I know it is not the disk subsystem that dies, because I can still log in as root on the console. The root filesystem is not affected; only access to the /l filesystem hangs.

I hadn't observed this with 2.4.20-2.2 (phoebe1), even though I did a lot of builds that used /net/free on it. Since updating to phoebe2 I hadn't run such a build again; only today, after upgrading the kernel to 2.4.20-2.24, did I need to run one, and I ran into the problem twice. I can't tell whether 2.4.20-2.21 has the same bug, but I'll certainly know tomorrow, when I try the build again with it.

Apparently no filesystem corruption occurs, and since the root filesystem doesn't hang I can look at /var/log/messages, but there's nothing interesting there.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Enable am-utils and start the amd service.
2. Export a filesystem other than root (say /l) for NFS access, and create a tmp directory in it.
3. Enter /net/<hostname>/l/tmp, copy a toolchain source tree into it, and try to build it.
Actual Results:
At some point the build halts, and from then on you can no longer access /l or /net/<hostname>/l.

Expected Results:
Shouldn't happen.

Additional info:
I've done some NFS read and write access to this filesystem from another 8.0 box in the last few days, even after the upgrade to 2.4.20-2.24, so it doesn't look like it's the server side.

As for the client side, I've also run into an odd behavior when mounting a filesystem from another phoebe2 box running 2.4.20-2.24: I had run `ls' in a directory on the NFS client just as the NFS server removed some files from it. The client got handles for the files, and they became stale immediately (ls reported such errors), but the client wouldn't let go of the handles, still reporting the inability to access the inodes of the removed files even hours after they'd been removed. This indicates quite clearly that something is wrong with the NFS client, though the two problems may be entirely unrelated.

Come to think of it, I'm not sure I'd already rebooted the client onto 2.4.20-2.24 when this stale-handle problem happened, but I'm pretty sure the server had 2.4.20-2.24. The client might still have been on 2.4.20-2.21.
Just got it with 2.4.20-2.21 too :-( Going back to 2.4.20-2.2 to see whether it fixes the problem.
2.4.20-2.2 fixes the problem, indeed. Or at least it got way past the point where 2.4.20-2.2[14] would hang.
Could you please send me the output of nfsstat from the client (before it hangs, obviously :-)) and from the server? Also, is there anything in /var/log/messages that is amd- or NFS-related?
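For reference, the two sets of counters requested above can be captured with nfsstat from nfs-utils. A minimal sketch, assuming the tool is installed on both machines (the /tmp output paths are arbitrary):

```shell
# Save the client- and server-side NFS counters separately;
# -c selects client statistics, -s server statistics.
if command -v nfsstat >/dev/null 2>&1; then
    nfsstat -c > /tmp/nfsstat-client.txt
    nfsstat -s > /tmp/nfsstat-server.txt
else
    echo "nfsstat not installed"
fi
```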
The machine doesn't really hang; it is only access to the affected filesystem that does, so here's the output of nfsstat after the hang. The client and the server are actually the same machine, running 2.4.20-2.24:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
344085     0          0          0          0

Server nfs v2:
null       getattr    setattr    root       lookup     readlink
1    100%  0      0%  0      0%  0      0%  0      0%  0      0%
read       wrcache    write      create     remove     rename
0      0%  0      0%  0      0%  0      0%  0      0%  0      0%
link       symlink    mkdir      rmdir      readdir    fsstat
0      0%  0      0%  0      0%  0      0%  0      0%  0      0%

Server nfs v3:
null       getattr    setattr    lookup     access     readlink
105    0%  247746 72% 8423   2%  22332  6%  12     0%  9      0%
read       write      create     mkdir      symlink    mknod
3018   0%  44543  12% 7848   2%  392    0%  4      0%  0      0%
remove     rmdir      rename     link       readdir    readdirplus
687    0%  170    0%  275    0%  7      0%  548    0%  0      0%
fsstat     fsinfo     pathconf   commit
4      0%  4      0%  0      0%  7957   2%

Client rpc stats:
calls      retrans    authrefrsh
3073121    281        0

Client nfs v2:
null       getattr    setattr    root       lookup     readlink
0      0%  1449057 99% 0     0%  0      0%  92     0%  4      0%
read       wrcache    write      create     remove     rename
0      0%  0      0%  0      0%  0      0%  0      0%  0      0%
link       symlink    mkdir      rmdir      readdir    fsstat
0      0%  0      0%  0      0%  0      0%  2      0%  1      0%

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0      0%  1367841 84% 8417  0%  124425 7%  295    0%  12     0%
read       write      create     mkdir      symlink    mknod
60000  3%  44549  2%  7850   0%  392    0%  4      0%  0      0%
remove     rmdir      rename     link       readdir    readdirplus
689    0%  170    0%  275    0%  7      0%  1051   0%  0      0%
fsstat     fsinfo     pathconf   commit
15     0%  15     0%  0      0%  7958   0%

As usual, it hung while installing a file. In this case:

/usr/bin/install -c /l/tmp/glibc-build/iconvdata/IBM932.so /net/<hostname>/l/tmp/glibc-install/usr/lib/gconv/IBM932.so.new

/var/log/messages doesn't have much interesting stuff.
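One sanity check on the counters above (my observation, not part of the original report): the client rpc line shows 281 retransmissions out of 3073121 calls, a rate well under 0.1%, so plain packet loss is an unlikely explanation for the hang. The arithmetic:

```shell
# Retransmit rate from the client rpc stats above: retrans / calls.
echo "3073121 281" | awk '{printf "%.4f%%\n", $2 / $1 * 100}'
# prints 0.0091%
```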
Here's the tail, from the time when the glibc build started until a few minutes after it hung:

Jan 29 23:10:13 libero amd[2792]: recompute_portmap: NFS version 3
Jan 29 23:10:13 libero amd[2792]: Using MOUNT version: 3
Jan 29 23:13:24 libero amd[29869]: amfs_host_fmount: NFS version 3
Jan 29 23:13:24 libero amd[29869]: fetch_fhandle: NFS version 3
Jan 29 23:13:24 libero amd[29869]: mount_nfs_fh: NFS version 3
Jan 29 23:13:24 libero amd[29869]: mount_nfs_fh: using NFS transport udp
Jan 29 23:13:24 libero amd[2792]: "/net/free" on /.automount/free/root still active
Jan 29 23:17:24 libero amd[5350]: amfs_host_fmount: NFS version 3
Jan 29 23:17:24 libero amd[5350]: fetch_fhandle: NFS version 3
Jan 29 23:17:24 libero amd[5350]: mount_nfs_fh: NFS version 3
Jan 29 23:17:24 libero amd[5350]: mount_nfs_fh: using NFS transport udp
Jan 29 23:17:24 libero amd[2792]: "/net/free" on /.automount/free/root still active
Jan 29 23:23:24 libero amd[22464]: amfs_host_fmount: NFS version 3
Jan 29 23:23:24 libero amd[22464]: fetch_fhandle: NFS version 3
Jan 29 23:23:24 libero amd[22464]: mount_nfs_fh: NFS version 3
Jan 29 23:23:24 libero amd[22464]: mount_nfs_fh: using NFS transport udp
Jan 29 23:23:24 libero amd[2792]: "/net/free" on /.automount/free/root still active
Jan 29 23:31:24 libero amd[2314]: amfs_host_fmount: NFS version 3
Jan 29 23:31:24 libero amd[2314]: fetch_fhandle: NFS version 3
Jan 29 23:31:24 libero amd[2314]: mount_nfs_fh: NFS version 3
Jan 29 23:31:24 libero amd[2314]: mount_nfs_fh: using NFS transport udp
Jan 29 23:31:24 libero amd[2792]: "/net/free" on /.automount/free/root still active
Jan 29 23:37:20 libero kernel: nfs: server libero not responding, still trying
Jan 29 23:39:24 libero amd[6332]: amfs_host_fmount: NFS version 3
Jan 29 23:39:24 libero amd[6332]: fetch_fhandle: NFS version 3
Jan 29 23:39:24 libero amd[6332]: mount_nfs_fh: NFS version 3
Jan 29 23:39:24 libero amd[6332]: mount_nfs_fh: using NFS transport udp
Jan 29 23:39:24 libero amd[2792]: "/net/free" on /.automount/free/root still active

I see that one of the two amd processes is blocked on I/O, and so are 3 of the 8 nfsds, 1 of the 6 kjournalds, the install process, an updatedb, and the less program that was monitoring the build-and-install log file. None of these can be taken out of this state.

ls /l & doesn't complete, and neither does ls /net/libero &, or even ls /net &.

Restarting the nfs service doesn't work either: "Shutting down NFS daemon" fails, and "Shutting down NFS services" blocks forever. When I do it, /var/log/messages says:

Jan 29 23:47:58 libero rpc.mountd: Caught signal 15, un-registering and exiting.
Jan 29 23:47:58 libero nfs: rpc.mountd shutdown succeeded
Jan 29 23:48:02 libero nfs: nfsd shutdown failed
Jan 29 23:48:02 libero nfs: rpc.rquotad shutdown succeeded
Jan 29 23:48:19 libero amd[2792]: file server libero.redhat.lsd.ic.unicamp.br, type nfs, state not responding
Jan 29 23:48:22 libero amd[2792]: file server libero.redhat.lsd.ic.unicamp.br, type nfs, state not responding
Jan 29 23:48:22 libero amd[2792]: file server libero.redhat.lsd.ic.unicamp.br, type nfs, state is down
Jan 29 23:48:31 libero amd[2792]: No fs type specified (key = "/defaults", map = ""root"")
Jan 29 23:49:02 libero last message repeated 3 times

Restarting amd completes, but the new process isn't really functional, even though it now claims to mount from host libero (on which I'm running the tests this time) using NFS version 2. A newly started ls /net & still fails to complete.

The only way out is to reset the machine. Please let me know if you need any additional information. The problem is very easy to reproduce: there hasn't been a single time I've failed to trigger it when installing a glibc built in /l/tmp/glibc-build into install_root /net/<local host name>/l/tmp/glibc-install.
Created attachment 89748 [details] A crash trace that shows nfsd hung up in ext3 code

It appears the NFS server is getting stuck in ext3 land...
FWIW, I haven't run into this problem since the upgrade to phoebe3. Maybe it's fixed?
Possibly. Also, there's a definite ext3 ACL problem, discovered recently, that can cause NFS hangs. Please reopen if this comes back with .53/.54 or later.