Description of problem:
Heavy NFS read access to many small files causes the NFS client to hang on all
future accesses to that partition from this client machine. Access to that
partition, even by df, is no longer possible for this client. Access by this
client to other partitions on the same or other servers works, and access from
other clients to the server partition that this particular client is having
trouble with also works.
This never happens for Red Hat 9 clients but does happen for Fedora Core 3
clients and Fedora Core 4 test 3 clients.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Have many files on a server in multiple subdirectories (~1MB per file, ~1000
files per subdirectory, ~1000 subdirectories).
2. Have a script on the client read these files and do some small computation,
like searching for a string in each file, printing to the screen on entering
each new subdirectory.
3. Within a few to 10 minutes, the script stops running even though it is
far from done (usually before 300 subdirectories). A df run in another
window will not complete, and neither will other attempts to access files on
that partition from this client machine.
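A scaled-down sketch of building a test tree like the one in step 1. The
testtree path, file counts, and sizes here are placeholders of my own, far
smaller than the ~1000 dirs x ~1000 files x ~1MB layout described; scale the
variables up to approximate the real workload.

```shell
#!/bin/sh
# Build a miniature version of the reported layout: NDIR subdirectories,
# NFILE files each, SIZE_KB kilobytes per file.
NDIR=3
NFILE=5
SIZE_KB=64
for i in $(seq 0 $((NDIR - 1))); do
    d=$(printf "testtree/%03d" "$i")   # matches the [0-9][0-9][0-9] dir names
    mkdir -p "$d"
    for f in $(seq 1 "$NFILE"); do
        # random contents so reads are real I/O, not trivially cached zeros
        dd if=/dev/urandom of="$d/file$f.fc" bs=1024 count="$SIZE_KB" 2>/dev/null
    done
done
```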
Works for Red Hat 9, fails for Fedora Core 3 and Fedora Core 4 test 3.
Actual results:
The script stops before completion, and access to that partition is no longer
possible from that client until after a reboot.
Access to that partition is still possible from other machines, so it doesn't
appear to be a problem on the server.
Expected results:
The script completes, and access to that partition remains possible without
rebooting the client machine.
When the hang occurs, could you get a system trace by
doing a "echo t > /proc/sysrq-trigger"?
(Note: you may have to enable System Request debugging by
"echo 1 > /proc/sys/kernel/sysrq" or by editing /etc/sysctl.conf.)
Also, if it's feasible, could you post a bzip2-compressed binary tethereal
trace (i.e. tethereal -w /tmp/data.pcap) of the problem?
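For reference, the /etc/sysctl.conf route mentioned above is a one-line
fragment (applied with "sysctl -p" or on the next boot):

```
# /etc/sysctl.conf - enable Magic SysRq so "echo t > /proc/sysrq-trigger" works
kernel.sysrq = 1
```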
Created attachment 114647 [details]
tracing enabled, machine rebooted, script run which causes nfs client to hang
for the partition being accessed (hang occurred)
It appears you may not have SysRq debugging enabled, since
there were no backtraces in the messages file you posted.
Please enable the debugging (see comment #) and try
it again. Note you should see output similar to:
lockd S 00000000 112 3236 1 3780 3235 (L-TLB)
Call Trace: [<c0124124>] schedule [kernel] 0x2f4 (0xd342bed8)
[<c01390f5>] __group_send_sig_info [kernel] 0x325 (0xd342bee8)
[<c01351dc>] schedule_timeout [kernel] 0xbc (0xd342bf1c)
[<e098726a>] svc_recv_Rsmp_4f7fb8d3 [sunrpc] 0x3ca (0xd342bf54)
[<e09a0912>] .rodata.str1.1 [lockd] 0x121 (0xd342bfac)
[<e09983cf>] lockd [lockd] 0x15f (0xd342bfc4)
[<e0998270>] lockd [lockd] 0x0 (0xd342bfe0)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xd342bff0)
cupsd R 00000000 0 3780 1 3236 (NOTLB)
Call Trace: [<c01591c4>] __alloc_pages [kernel] 0xb4 (0xdbe35e8c)
[<c0124124>] schedule [kernel] 0x2f4 (0xdbe35ea0)
[<c0135185>] schedule_timeout [kernel] 0x65 (0xdbe35ee4)
[<c0135110>] process_timeout [kernel] 0x0 (0xdbe35f04)
[<c017986b>] do_select [kernel] 0x13b (0xdbe35f1c)
[<c0179d0e>] sys_select [kernel] 0x34e (0xdbe35f60)
[<c01ffa1b>] audit_intercept [kernel] 0x4b (0xdbe35fac)
when the debugging is configured correctly.
Created attachment 114731 [details]
more debugging info
This appears to be a system trace of the server, and the NFS server threads
(i.e. the nfsds) seem to be doing fine, but I cannot see where the client
processes are hung. So could you post a client system trace that would allow
me to see where those processes are hung?
But there are a couple of anomalies I see. The first one is:
rpc.idmapd: nfsdreopen: Opening 'ib/nfs/rpc_pipefs/nfs' failed: errno 2 (No such
file or directory)
rpc.idmapd: nfsdreopen: Opening '' failed: errno 2 (No such file or directory)
Which is ok since idmapd has to start before the nfs server, but what I don't see
is the nfs init script sending idmapd a signal telling it to reopen those files
because the server just started. What nfs-utils version are you using?
The second thing I see is the process scsi_eh_1 is in the scsi_error_handler
scsi_eh_1 S C4120160 3676 436 1 437 423 (L-TLB)
f7927f84 00000046 c011b2ea c4120160 f7b75020 c0401200 00000002 00000020
f4e8f1c0 000f4200 00000000 00000000 f791a674 f791a550 c035ac00 c4118160
f4f83400 000f4200 00000000 f7927000 c014c6ff f7af0ebc 00000000 f7ba9ea8
[<f884a214>] .text.lock.scsi_error+0x3b/0x47 [scsi_mod]
[<f8849c5f>] scsi_error_handler+0x0/0x18c [scsi_mod]
Which may or may not be normal...
scsi_eh_1 is an error handler thread, currently idle. This is normal. You will
see these on any 2.6 system that has a SCSI adapter. It does not indicate a SCSI
error has occurred. Best to look in /var/log/messages for those indications.
I modified the csh script that performs the NFS accesses that cause nfs to
hang on the client machine. I added the trigger line, echo t > /proc/sysrq-
trigger, before the line that reads the data. I tested it several times and
the script now completes (although it runs slower and creates a
massive /var/log/messages file). If I remove the line, it runs faster, but
after a while nfs will hang on the client. What should I do to help you get
the debugging info you need?
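One option, instead of firing SysRq before every single read, is to trigger it
only once, at the moment the script stops making progress. A rough sketch:
the wrapper function below, its log path, and the polling interval are my own
assumptions, not part of the original report, and on a real run the trigger
file would be /proc/sysrq-trigger (which requires root).

```shell
#!/bin/sh
# Run a command in the background and watch its log; if the log stops
# growing for one interval, write 't' to the trigger file once and stop.
# On the real client the trigger file is /proc/sysrq-trigger.
watch_for_hang() {
    cmd=$1; log=$2; interval=$3; trigger=$4
    sh -c "$cmd" > "$log" 2>&1 &
    pid=$!
    last=-1
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        size=$(wc -c < "$log")
        if [ "$size" -eq "$last" ]; then
            echo "no progress in ${interval}s - requesting one sysrq trace" >&2
            echo t > "$trigger"
            kill "$pid" 2>/dev/null
            break
        fi
        last=$size
    done
}

# A real invocation might look like (script name is hypothetical):
#   watch_for_hang ./check_chips.csh /tmp/reader.log 60 /proc/sysrq-trigger
```

This keeps /var/log/messages small, since exactly one trace is captured, and
it is captured at the point where the hang is first detected.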
There are a little over a thousand files in each directory, and each file
is around 1MB.
foreach d (/project/nas1a/UHRR_products/chips/[0-9][0-9][0-9])
    echo "Checking $d ..."
    foreach f ($d/*.fc)
        echo t > /proc/sysrq-trigger
        set n = `hexdump -n 64 $f | grep "0000 ffc0" | wc -l`
        if ($n > 0) then
            echo "NaN found in $f"
        endif
    end
end