From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030120

Description of problem:
I use am-utils to get /net/<hostname> to mount all NFS filesystems offered by hostname. When I access /net/free on host free itself, it creates a link to /.automount/free/root and, in there, it mounts /l (I'd have hoped for a --bind mount, but NFS should work nevertheless).

If I do some heavy access on this filesystem, both reads and writes (say, building a toolchain), it eventually hangs the entire filesystem, such that not even accesses that don't go through the NFS mount work any more. Processes accessing the filesystem, both through the NFS mount and otherwise, start to hang in I/O wait and become unkillable. The only solution is a hard reset of the system.

I know it is not the disk subsystem that dies, because I can still log in as root on the console. The root filesystem is not affected; only access to the /l filesystem hangs.

I hadn't observed this with 2.4.20-2.2 (phoebe1), even though I did a lot of builds that used /net/free on it. Since updating to phoebe2 I hadn't run such a build again; only today, after upgrading the kernel to 2.4.20-2.24, did I need to run one, and I ran into the problem twice. I can't tell whether 2.4.20-2.21 has the same bug, but I'll certainly know tomorrow, when I try the build again with it.

Apparently no filesystem corruption occurs, and since the root filesystem doesn't hang I can look at /var/log/messages, but there's nothing interesting there.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Enable am-utils and start the amd service.
2. Export a filesystem other than root (say /l) for NFS access, and create a tmp directory in it.
3. Enter /net/<hostname>/l/tmp, copy a toolchain source tree into it, and try to build it.
Actual Results:
At some point the build halts, and from then on you can no longer access /l or /net/<hostname>/l.

Expected Results:
Shouldn't happen.

Additional info:
I've done some NFS read and write access to this filesystem from another 8.0 box in the last few days, even after the upgrade to 2.4.20-2.24, so it doesn't look like it's the server side.

As for the client side, I've also run into an odd behavior when mounting a filesystem from another phoebe2 box running 2.4.20-2.24: I had run `ls' in a directory on the NFS client just as the NFS server removed some files from it. The client got handles for the files, and they became stale immediately (ls reported such errors), but the client wouldn't let go of the handles, still reporting the inability to access the inodes of the removed files even hours after they'd been removed. This indicates quite clearly that something is wrong with the NFS client, though the two problems may be entirely unrelated.

Come to think of it, I'm not sure I'd already rebooted the client onto 2.4.20-2.24 when this stale-handle problem happened, but I'm pretty sure the server had 2.4.20-2.24. The client might still have been on 2.4.20-2.21.
Just got it with 2.4.20-2.21 too :-( Going back to 2.4.20-2.2 to see whether it fixes the problem.
2.4.20-2.2 fixes the problem, indeed. Or at least it got way past the point where 2.4.20-2.2[14] would hang.
Could you please send me the output of nfsstat from the client (before it hangs, obviously :-)) and from the server? Also, is there anything in /var/log/messages that is amd- or NFS-related?
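For reference, the two sets of counters requested above can be captured with nfsstat from nfs-utils. A minimal sketch, assuming the tool is installed on both machines (the /tmp output paths are arbitrary):

```shell
# Save the client- and server-side NFS counters separately;
# -c selects client statistics, -s server statistics.
if command -v nfsstat >/dev/null 2>&1; then
    nfsstat -c > /tmp/nfsstat-client.txt
    nfsstat -s > /tmp/nfsstat-server.txt
else
    echo "nfsstat not installed"
fi
```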
The machine doesn't really hang; it is only access to the affected filesystem that does, so here's the output of nfsstat after the hang. The client and the server are actually the same machine, running 2.4.20-2.24:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
344085     0          0          0          0

Server nfs v2:
null       getattr    setattr    root       lookup     readlink
1    100%  0      0%  0      0%  0      0%  0      0%  0      0%
read       wrcache    write      create     remove     rename
0      0%  0      0%  0      0%  0      0%  0      0%  0      0%
link       symlink    mkdir      rmdir      readdir    fsstat
0      0%  0      0%  0      0%  0      0%  0      0%  0      0%

Server nfs v3:
null       getattr    setattr    lookup     access     readlink
105    0%  247746 72% 8423   2%  22332  6%  12     0%  9      0%
read       write      create     mkdir      symlink    mknod
3018   0%  44543  12% 7848   2%  392    0%  4      0%  0      0%
remove     rmdir      rename     link       readdir    readdirplus
687    0%  170    0%  275    0%  7      0%  548    0%  0      0%
fsstat     fsinfo     pathconf   commit
4      0%  4      0%  0      0%  7957   2%

Client rpc stats:
calls      retrans    authrefrsh
3073121    281        0

Client nfs v2:
null       getattr    setattr    root       lookup     readlink
0      0%  1449057 99% 0     0%  0      0%  92     0%  4      0%
read       wrcache    write      create     remove     rename
0      0%  0      0%  0      0%  0      0%  0      0%  0      0%
link       symlink    mkdir      rmdir      readdir    fsstat
0      0%  0      0%  0      0%  0      0%  2      0%  1      0%

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0      0%  1367841 84% 8417  0%  124425 7%  295    0%  12     0%
read       write      create     mkdir      symlink    mknod
60000  3%  44549  2%  7850   0%  392    0%  4      0%  0      0%
remove     rmdir      rename     link       readdir    readdirplus
689    0%  170    0%  275    0%  7      0%  1051   0%  0      0%
fsstat     fsinfo     pathconf   commit
15     0%  15     0%  0      0%  7958   0%

As usual, it hung while installing a file. In this case:

/usr/bin/install -c /l/tmp/glibc-build/iconvdata/IBM932.so /net/<hostname>/l/tmp/glibc-install/usr/lib/gconv/IBM932.so.new

/var/log/messages doesn't have much interesting stuff.
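One sanity check on the counters above (my observation, not part of the original report): the client rpc line shows 281 retransmissions out of 3073121 calls, a rate well under 0.1%, so plain packet loss is an unlikely explanation for the hang. The arithmetic:

```shell
# Retransmit rate from the client rpc stats above: retrans / calls.
echo "3073121 281" | awk '{printf "%.4f%%\n", $2 / $1 * 100}'
# prints 0.0091%
```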
Here's the tail, from the time when the glibc build started until a few minutes after it hung:

Jan 29 23:10:13 libero amd[2792]: recompute_portmap: NFS version 3
Jan 29 23:10:13 libero amd[2792]: Using MOUNT version: 3
Jan 29 23:13:24 libero amd[29869]: amfs_host_fmount: NFS version 3
Jan 29 23:13:24 libero amd[29869]: fetch_fhandle: NFS version 3
Jan 29 23:13:24 libero amd[29869]: mount_nfs_fh: NFS version 3
Jan 29 23:13:24 libero amd[29869]: mount_nfs_fh: using NFS transport udp
Jan 29 23:13:24 libero amd[2792]: "/net/free" on /.automount/free/root still active
Jan 29 23:17:24 libero amd[5350]: amfs_host_fmount: NFS version 3
Jan 29 23:17:24 libero amd[5350]: fetch_fhandle: NFS version 3
Jan 29 23:17:24 libero amd[5350]: mount_nfs_fh: NFS version 3
Jan 29 23:17:24 libero amd[5350]: mount_nfs_fh: using NFS transport udp
Jan 29 23:17:24 libero amd[2792]: "/net/free" on /.automount/free/root still active
Jan 29 23:23:24 libero amd[22464]: amfs_host_fmount: NFS version 3
Jan 29 23:23:24 libero amd[22464]: fetch_fhandle: NFS version 3
Jan 29 23:23:24 libero amd[22464]: mount_nfs_fh: NFS version 3
Jan 29 23:23:24 libero amd[22464]: mount_nfs_fh: using NFS transport udp
Jan 29 23:23:24 libero amd[2792]: "/net/free" on /.automount/free/root still active
Jan 29 23:31:24 libero amd[2314]: amfs_host_fmount: NFS version 3
Jan 29 23:31:24 libero amd[2314]: fetch_fhandle: NFS version 3
Jan 29 23:31:24 libero amd[2314]: mount_nfs_fh: NFS version 3
Jan 29 23:31:24 libero amd[2314]: mount_nfs_fh: using NFS transport udp
Jan 29 23:31:24 libero amd[2792]: "/net/free" on /.automount/free/root still active
Jan 29 23:37:20 libero kernel: nfs: server libero not responding, still trying
Jan 29 23:39:24 libero amd[6332]: amfs_host_fmount: NFS version 3
Jan 29 23:39:24 libero amd[6332]: fetch_fhandle: NFS version 3
Jan 29 23:39:24 libero amd[6332]: mount_nfs_fh: NFS version 3
Jan 29 23:39:24 libero amd[6332]: mount_nfs_fh: using NFS transport udp
Jan 29 23:39:24 libero amd[2792]: "/net/free" on /.automount/free/root still active

I see that one of the two amd processes is blocked on I/O, and so are 3 of the 8 nfsds, 1 of the 6 kjournalds, the install process, an updatedb, and the less program that was monitoring the build-and-install log file. None of these can be taken out of this state.

ls /l & doesn't complete, and neither does ls /net/libero &, or even ls /net &.

Restarting the nfs service doesn't work either: "Shutting down NFS daemon" fails, and "Shutting down NFS services" blocks forever. When I do it, /var/log/messages says:

Jan 29 23:47:58 libero rpc.mountd: Caught signal 15, un-registering and exiting.
Jan 29 23:47:58 libero nfs: rpc.mountd shutdown succeeded
Jan 29 23:48:02 libero nfs: nfsd shutdown failed
Jan 29 23:48:02 libero nfs: rpc.rquotad shutdown succeeded
Jan 29 23:48:19 libero amd[2792]: file server libero.redhat.lsd.ic.unicamp.br, type nfs, state not responding
Jan 29 23:48:22 libero amd[2792]: file server libero.redhat.lsd.ic.unicamp.br, type nfs, state not responding
Jan 29 23:48:22 libero amd[2792]: file server libero.redhat.lsd.ic.unicamp.br, type nfs, state is down
Jan 29 23:48:31 libero amd[2792]: No fs type specified (key = "/defaults", map = ""root"")
Jan 29 23:49:02 libero last message repeated 3 times

Restarting amd completes, but the new process isn't really functional, even though it now claims to mount from host libero (on which I'm running the tests this time) using NFS version 2. A newly started ls /net & still fails to complete.

The only way out is to reset the machine. Please let me know if you need any additional information. The problem is very easy to reproduce: there hasn't been a single time I've failed to trigger it when installing a glibc built in /l/tmp/glibc-build into install_root /net/<local host name>/l/tmp/glibc-install.
Created attachment 89748 [details] A crash trace that shows nfsd hung up in ext3 code

It appears the NFS server is getting stuck in ext3 land...
FWIW, I haven't run into this problem since the upgrade to phoebe3. Maybe it's fixed?
Possibly. Also, there's a definite ext3 ACL problem, discovered recently, that can cause NFS hangs. Please reopen if this comes back with .53/.54 or later.