Bug 168897 - NFS clients hang with a RHEL3 U5 nfs server
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-09-21 01:13 UTC by Yoshihiro Tsuchiya
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-10-26 22:57:12 UTC
Target Upstream Version:
Embargoed:


Attachments
sysrq-t output (10.95 KB, application/octet-stream)
2005-09-26 06:36 UTC, Yoshihiro Tsuchiya

Description Yoshihiro Tsuchiya 2005-09-21 01:13:50 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.7.8) Gecko/20050517 Firefox/1.0.4 (Debian package 1.0.4-2)

Description of problem:
NFS clients (I tried Solaris 5.8 and Red Hat EL3 U4) hang
during performance tests with iozone. iozone never finishes.

The NFS server is running Red Hat EL3 U5.
It worked fine with EL3 U4.
 


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Mount the NFS export with rsize/wsize set to 32KB or 16KB
2. Run the iozone benchmark
3. Wait a few hours
  

Actual Results:  iozone does not finish.
The 'df' command hangs and never returns.

vmstat shows the server CPU is 100% idle.
The EL3 U4 client's CPU was at 100% "wait".
tcpdump shows no traffic between the client and the server.

"rpcinfo -p" run from the NFS client and from other machines sometimes shows
the NFS services (mountd, nfsd, ...) and sometimes does not (only portmapper, ypbind, etc.).
After I restart the portmapper with '/etc/init.d/portmap restart' on the NFS
server, it answers correctly.
The NFS server can still be mounted from other machines, and the files are visible.


Expected Results:  iozone finishes normally.

Additional info:

The iozone command is something like:
/opt/iozone/bin/iozone -a -i0 -i1 -i2 -s 16g -r 64k  -f fileA
or
/opt/iozone/bin/iozone -i0 -i1 -i2 -s 2g -r 64k -t 8 -F file1 file2 ... file8


The NFS mount options are:
-o hard,intr,rw,vers=3,rsize=$RWSIZE,wsize=$RWSIZE,timeo=11,retrans=5

In both cases below (32KB and 16KB), it hangs easily, typically within a few hours.
RWSIZE=32768
RWSIZE=16384

With the default of 8192, I am not sure. I hit a hang once, but it is not easy to
reproduce.


I had changed NFSSVC_MAXBLKSIZE (include/linux/nfsd/const.h) to 32KB.


The NFS server is a Red Hat EL3 U5 machine.
It is a single-CPU machine running the SMP kernel. I have not tried the
UP kernel, though I could.

The number of nfsd daemons is set to 168.

Comment 1 Steve Dickson 2005-09-22 11:47:52 UTC
Could you please post a SysRq-t system backtrace
of the server when this hang occurs?

Comment 2 Yoshihiro Tsuchiya 2005-09-26 06:36:20 UTC
Created attachment 119244
sysrq-t output

Processes whose names start with CL belong to the clustering software.
xlan is something like the bonding driver. I am not sure
whether they are involved in the NFS hang, though they worked
fine with Red Hat EL3 U4.

It looks like some processes are sleeping in __alloc_pages?
I wonder how they end up in schedule() from there.

Comment 3 Steve Dickson 2005-09-26 12:51:30 UTC
How much memory is on this server? By increasing the max block size to 32k
and increasing the number of nfsd threads, you can easily run the machine out of
memory, since each nfsd allocates a 32K buffer for incoming messages.
Please post the output of a SysRq-M, which will show the state of memory.
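
To put a rough number on the estimate above, here is a minimal sketch that assumes exactly one receive buffer of the maximum block size per nfsd thread, using the 168 threads and 32KB block size from the description. The function name is hypothetical, and the real kernel footprint per thread may well be larger than this single buffer.

```python
# Rough estimate only: assumes one max-block-size receive buffer per nfsd thread.
# nfsd_buffer_bytes is an illustrative helper, not a kernel or nfs-utils API.

def nfsd_buffer_bytes(threads: int, max_blksize: int) -> int:
    """Total bytes reserved if every nfsd thread holds one buffer of max_blksize."""
    return threads * max_blksize


total = nfsd_buffer_bytes(threads=168, max_blksize=32 * 1024)
print(f"{total} bytes = {total / (1024 * 1024):.2f} MiB")
```

The total alone is modest; whether it matters depends on how the buffers are allocated (for example, whether they need physically contiguous low memory), which is what the SysRq-M output helps to check.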

Comment 4 Yoshihiro Tsuchiya 2005-09-26 22:29:01 UTC
It has 2 gigabytes.

             total       used       free     shared    buffers     cached
Mem:       2055444     486108    1569336          0     146644      94768
-/+ buffers/cache:     244696    1810748
Swap:      2040244          0    2040244



------

Mem-info:
Zone:DMA freepages:  2892 min:     0 low:     0 high:     0
Zone:Normal freepages:  1922 min:  1278 low:  4543 high:  6303
Zone:HighMem freepages:  1236 min:   255 low:  4606 high:  6909
Free pages:        6050 (  1236 HighMem)
( Active: 23097/354529, inactive_laundry: 97873, inactive_clean: 8512, free: 6050 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2892
  aa:0 ac:14036 id:136504 il:36767 ic:4190 fr:1922
  aa:3742 ac:5319 id:218025 il:61106 ic:4322 fr:1236
2*4kB 3*8kB 5*16kB 4*32kB 3*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11568kB)
46*4kB 84*8kB 75*16kB 14*32kB 3*64kB 3*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 7688kB)
32*4kB 26*8kB 16*16kB 24*32kB 26*64kB 7*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 4944kB)
Swap cache: add 0, delete 0, find 0/0, race 0+0
20448 pages of slabcache
482 pages of kernel stacks
0 lowmem pagetables, 428 highmem pagetables
32 bounce buffer pages, 32 are on the emergency list
Free swap:       2040244kB
524272 pages of RAM
294896 pages of HIGHMEM
10411 reserved pages
384100 pages shared
0 pages swap cached
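
As a reading aid for the Mem-info dump above, the following sketch parses the three "Zone:" lines and lists the zones whose free page count is below the "low" watermark. The regular expression and the helper name are assumptions made for this illustration, and the comparison is only a simplified view, not the kernel's actual reclaim logic.

```python
import re

# The three zone lines copied from the SysRq-M output above.
MEM_INFO = """\
Zone:DMA freepages:  2892 min:     0 low:     0 high:     0
Zone:Normal freepages:  1922 min:  1278 low:  4543 high:  6303
Zone:HighMem freepages:  1236 min:   255 low:  4606 high:  6909
"""

# Hypothetical pattern matching the line layout shown in this report.
ZONE_RE = re.compile(
    r"Zone:(\w+)\s+freepages:\s*(\d+)\s+min:\s*(\d+)\s+low:\s*(\d+)\s+high:\s*(\d+)"
)


def zones_below_low(text: str) -> list:
    """Return the names of zones whose free pages are below the 'low' watermark."""
    below = []
    for name, free, _min, low, _high in ZONE_RE.findall(text):
        if int(free) < int(low):
            below.append(name)
    return below


print(zones_below_low(MEM_INFO))
```

In this dump the Normal and HighMem zones are both under their "low" watermarks but still above "min", while the free(1) output shows plenty of memory available overall.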

Comment 7 Yoshihiro Tsuchiya 2005-10-26 06:26:25 UTC
I think the problem was caused by the switching hub.
With a new hub, NFS works fine.

Thank you, Steve. I appreciate your help.
In any case, I will use TCP rather than UDP with the 32KB buffer.
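
One reason large transfer sizes over UDP are sensitive to a lossy network device is IP fragmentation: a single lost fragment costs the whole datagram. The sketch below counts fragments per request for the rsize/wsize values tried in this report. It assumes a standard Ethernet MTU of 1500 bytes and a 20-byte IPv4 header without options, neither of which is stated in this report, and it ignores the RPC and NFS headers, so real counts can be slightly higher. The function name is illustrative.

```python
import math

# Assumptions (not from this report): MTU 1500, IPv4 header 20 bytes, UDP header 8 bytes.
# RPC/NFS header overhead is ignored, so these are lower bounds.

def udp_fragments(payload_bytes: int, mtu: int = 1500,
                  ip_header: int = 20, udp_header: int = 8) -> int:
    """Number of IPv4 fragments needed to carry one UDP datagram of payload_bytes."""
    # Fragment payloads (except the last) must be a multiple of 8 bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((payload_bytes + udp_header) / per_fragment)


for size in (8192, 16384, 32768):
    print(size, udp_fragments(size))
```

With TCP the transport segments and retransmits the data itself, so losing one packet does not force the entire 32KB request to be resent.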

Comment 8 Ernie Petrides 2005-10-26 20:00:17 UTC
Hello, Yoshihiro.  Should this bug report be closed as NOTABUG?

Comment 9 Yoshihiro Tsuchiya 2005-10-26 22:57:12 UTC
Hi, Ernie. I am going to post the status change with this message.

