From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1b) Gecko/20020722

Description of problem:
Using the RedHat 7.3 2.4.18-5 kernel on an i686 client, and mounting from a server running RawHide 2.4.18-7.80 on an i686, NFS accesses lock up the nfs daemon on the server and the process accessing the NFS mount on the client. The nfsd processes on the server get stuck in a "D" state and so does the client process. The server cannot be shut down properly, as the nfsd processes get stuck and can't be killed by the shutdown scripts. Therefore the bug is marked as "severe".

Using 2.4.18-5 on the server works fine, as does running standard 2.4.19-rc2 (Linus kernel). We also tried 2.4.18-7.86 on the server, but this failed to install (depmod dependency problems).

Both machines use eepro100 cards on 100Mbps ethernet. Mount options are: rw,rsize=4096,wsize=4096,hard,intr (rsize,wsize=8192 also tried)

Version-Release number of selected component (if applicable):

How reproducible: Always

Steps to Reproduce:
1. Run 2.4.18-7.80 on a machine exporting nfs mounts
2. Mount export on client running 2.4.18-5
3. ls /mnt/nfs

Actual Results: Client process hangs. Server processes hang.

Expected Results: Client should be able to access nfs export.

Additional info:
This is also broken with RedHat 2.4.18-7.93 (updated modutils to install this).
Sorry for more email, but this is also broken with 2.4.18-7.94, which has most of the nfs patchset in.
Could you obtain a backtrace of the stuck processes via Ctrl-Scroll Lock on the console? That will give us an idea where the processes are getting stuck to help track down the problem.
Okay, I'll attach the trace I managed to grab from dmesg after pressing ctrl+scroll lock. Basically the stuck nfsd processes are like:

nfsd D F68108E0 5976 1141 1 1140 1142 (L-TLB)
Call Trace: [<c0107f7a>] __down [kernel] 0x6a (0xf6afbde4))
[<c01080d4>] __down_failed [kernel] 0x8 (0xf6afbe08))
[<f8824aa0>] ext3_readdir [ext3] 0x0 (0xf6afbe10))
[<c014fdce>] .text.lock.readdir [kernel] 0x5 (0xf6afbe18))
[<f89997e3>] nfsd_readdir [nfsd] 0xc3 (0xf6afbe38))
[<f89a1070>] nfs3svc_encode_entry_plus [nfsd] 0x0 (0xf6afbe40))
[<f8837ca0>] ext3_dir_operations [ext3] 0x0 (0xf6afbe80))
[<f899ee6e>] nfsd3_proc_readdirplus [nfsd] 0xde (0xf6afbef0))
[<f89a1070>] nfs3svc_encode_entry_plus [nfsd] 0x0 (0xf6afbf04))
[<f89a61c4>] nfsd_procedures3 [nfsd] 0x264 (0xf6afbf24))
[<f89935c0>] nfsd_dispatch [nfsd] 0xd0 (0xf6afbf30))
[<f89a5898>] nfsd_version3 [nfsd] 0x0 (0xf6afbf44))
[<f89754cc>] svc_process_R6eda96b1 [sunrpc] 0x43c (0xf6afbf50))
[<f89a61c4>] nfsd_procedures3 [nfsd] 0x264 (0xf6afbf78))
[<f89a58b8>] nfsd_program [nfsd] 0x0 (0xf6afbf7c))
[<f89933b0>] nfsd [nfsd] 0x1d0 (0xf6afbf98))
[<c010765e>] kernel_thread [kernel] 0x2e (0xf6afbff0))
[<f89931e0>] nfsd [nfsd] 0x0 (0xf6afbff8))
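For anyone else trying to confirm the same hang: the "D" in the nfsd line above is the uninterruptible-sleep state, and the affected processes can be picked out of ps output without a console. This is only a rough sketch, assuming a procps-style ps; the helper name d_state_only is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: keep the header line plus every row whose STAT
# column (second field) starts with "D" (uninterruptible sleep).
# It reads ps-style output on stdin, so it can be run on saved output too.
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

Typical use on the server or the client would be something like: ps -eo pid,stat,wchan:32,comm | d_state_only. The wchan column gives a hint of where each process is sleeping, though it is no substitute for the full call trace.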
Created attachment 69705: most of traces on system
Created attachment 69706: tcpdump -vv -s0 on traffic when trying to mount export
The above tcpdump output is from trying to mount the export on a 2.4.18-5 system (server running rawhide), using tcpdump -vv -s0. Mount command fails this time with:

[root@xpc1 jss]# mount -t nfs xserv1.ast.cam.ac.uk:/soft3 /mnt/nfs
mount: RPC: Timed out
As the trace suggests, this looks like an ext3-nfs interaction bug. I remounted the exported partition as ext2, restarted the nfs daemon, and it worked fine. Remounting as ext3 provoked the bug again.
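In case it helps others reproduce the ext2 test, the remount amounts to roughly the following. This is a sketch and not the exact commands that were run: the device name /dev/sdXN is a placeholder, /soft3 as the server-side mount point is assumed from the export name, and "service nfs" is assumed to be how the daemon is controlled on this system. By default it only prints the commands.

```shell
# Placeholder values (assumptions): set DEV and EXPORT for the real system.
DEV=${DEV:-/dev/sdXN}
EXPORT=${EXPORT:-/soft3}

# Print each command unless DRY_RUN=0, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Stop nfsd, remount the exported partition as ext2, start nfsd again.
remount_as_ext2() {
    run service nfs stop
    run umount "$EXPORT"
    run mount -t ext2 "$DEV" "$EXPORT"
    run service nfs start
}
```

Calling remount_as_ext2 shows the four commands; with DRY_RUN=0 it runs them, which needs root. An ext3 filesystem only mounts as ext2 if it was unmounted cleanly, so the umount step has to succeed first.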
Reproduced; workaround found, now for the real bugfix.
Workaround or fixes in kernel-2.4.18-10.98 solve the problem here.