Bug 218146

Summary:	NFS throughput low, high NFS util locks screen updates for a few minutes
Product:	[Fedora] Fedora	Reporter:	Saikat Guha <sg266>
Component:	nfs-utils	Assignee:	Steve Dickson <steved>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Ben Levenson <benl>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	rawhide
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	rawhide	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-03-31 09:39:03 UTC	Type:	---

Description Saikat Guha 2006-12-02 01:06:14 UTC

On a Rawhide NFS client (kernel-2.6.18-1.2849.fc6 and .2798.fc6), I am unable to
achieve more than 2 MB/s write throughput to an NFS server. Older FC hosts (FC5)
can achieve 10 MB/s (network saturated). Furthermore, when NFS utilization is
high, Xorg locks up for 3 minutes at a stretch and then is responsive again for a
couple of minutes before locking up again, until the NFS utilization subsides.

The host, however, responds to network logins while Xorg is locked -- CPU is
idle, and there is little or no IO wait (even though there should be some, since
NFS utilization is high). However, running ls, cp, mv, tab-expansion etc. on the
NFS volume blocks for a long time.

NFS is running over TCP (to the server on an FC5 host). Dmesg is clean. NFS
options in fstab are "tcp,defaults,soft".


[root@sioux ~]# nfsstat -rc; sleep 10;  nfsstat -rc
Client rpc stats:
calls      retrans    authrefrsh
1080783    0          0

Client rpc stats:
calls      retrans    authrefrsh
1080824    0          0
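
The two samples above can be turned into an RPC call rate. This is only a minimal sketch of the arithmetic, using the numbers pasted above; the variable names are made up for illustration, and the report does not say whether a large copy was in progress while the samples were taken.

```shell
# Values copied from the two `nfsstat -rc` samples above.
calls_before=1080783
calls_after=1080824
interval_s=10   # matches the `sleep 10` between the two samples

# Difference in the "calls" counter, then calls per second.
delta=$((calls_after - calls_before))
rate=$(LC_ALL=C awk -v d="$delta" -v t="$interval_s" 'BEGIN { printf "%.1f", d / t }')

echo "RPC calls in interval: $delta"
echo "RPC calls per second:  $rate"
```

The retrans counter stays at 0 in both samples, so the client did not retransmit any RPCs during the interval.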


Last few lines of tethereal capture (roughly 11 seconds, only ~1000 packets per
second on a 100 Mbps network):
 11.078425 xxx.yy.zzz.152 -> xxx.yy.aaa.36 TCP nfs > netviewdm3 [ACK] Seq=160200
Ack=13390152 Win=501 Len=0 TSV=384699903 TSER=4297056
 11.078643 xxx.yy.zzz.152 -> xxx.yy.aaa.36 TCP nfs > netviewdm3 [ACK] Seq=160200
Ack=13392720 Win=501 Len=0 TSV=384699903 TSER=4297056
 11.091453 xxx.yy.zzz.152 -> xxx.yy.aaa.36 NFS V2 WRITE Reply (Call In 10908)
 11.091474 xxx.yy.aaa.36 -> xxx.yy.zzz.152 NFS V2 WRITE Call, FH:0x6c027d0e
BeginOffset:2277376 Offset:2277376 TotalCount:8192[Unreassembled Packet
[incorrect TCP checksum]]
 11.091480 xxx.yy.aaa.36 -> xxx.yy.zzz.152 RPC Continuation
 11.091485 xxx.yy.aaa.36 -> xxx.yy.zzz.152 RPC Continuation
 11.091914 xxx.yy.zzz.152 -> xxx.yy.aaa.36 TCP nfs > netviewdm3 [ACK] Seq=160300
Ack=13395616 Win=501 Len=0 TSV=384699906 TSER=4297071
 11.092159 xxx.yy.zzz.152 -> xxx.yy.aaa.36 TCP nfs > netviewdm3 [ACK] Seq=160300
Ack=13398512 Win=501 Len=0 TSV=384699906 TSER=4297071
 11.092378 xxx.yy.zzz.152 -> xxx.yy.aaa.36 TCP nfs > netviewdm3 [ACK] Seq=160300
Ack=13401080 Win=501 Len=0 TSV=384699906 TSER=4297071
11 packets dropped
11102 packets captured
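
A rough throughput estimate can be read off the capture. The sketch below assumes that the Ack values printed by tethereal are relative sequence numbers, so that the last Ack approximates the number of bytes the server acknowledged since the capture started; that is an assumption about the capture, not something the report confirms. Variable names are illustrative only.

```shell
# Numbers copied from the capture above.
last_ack_bytes=13401080    # Ack= field of the final packet (assumed relative)
elapsed_s=11.092378        # timestamp of the final packet
packets=11102              # "packets captured"

# Bytes acknowledged per second, expressed in MB/s and Mbit/s.
mb_per_s=$(LC_ALL=C awk -v b="$last_ack_bytes" -v t="$elapsed_s" \
    'BEGIN { printf "%.2f", b / t / 1000000 }')
mbit_per_s=$(LC_ALL=C awk -v b="$last_ack_bytes" -v t="$elapsed_s" \
    'BEGIN { printf "%.2f", b * 8 / t / 1000000 }')
# Average packet rate over the capture.
pkts_per_s=$(LC_ALL=C awk -v p="$packets" -v t="$elapsed_s" \
    'BEGIN { printf "%.0f", p / t }')

echo "approx. write throughput: $mb_per_s MB/s ($mbit_per_s Mbit/s)"
echo "approx. packet rate:      $pkts_per_s packets/s"
```

Under that assumption the figures are in line with the "no more than 2 MB/s" and "~1000 packets per second" observations, and well below what a 100 Mbps link can carry.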


Cpu(s):  0.0%us,  0.2%sy,  0.0%ni, 99.2%id,  0.0%wa,  0.2%hi,  0.5%si,  0.0%st


The Xorg freeze behavior is always reproducible under high NFS utilization on my
setup. The low throughput to the NFS server is always reproducible.

Comment 1 Steve Dickson 2006-12-07 14:49:13 UTC

hmm... [incorrect TCP checksum] is a bit worrisome.... I would
guess that's the cause of the slowdown... So this is between
an FC6 (or rawhide) client and an FC5 server?

Comment 2 Saikat Guha 2006-12-07 22:03:31 UTC

Correct. Between a rawhide client and a FC5 server.
Switching NFS from TCP to UDP results in the same symptoms -- low throughput,
intermittent display lockups, lots of idle CPU etc.

Comment 3 Steve Dickson 2006-12-11 13:50:30 UTC

yeah on a congested network, UDP would be worse...

Comment 4 Saikat Guha 2006-12-11 14:21:29 UTC

True, however, I can achieve the full network bandwidth using SCP or TTCP/IPERF
etc. NFS seems to be achieving a factor of 5 less.

In addition, intense NFS activity completely locks up Xorg for tens of seconds,
sometimes several minutes -- no mouse cursor update, no panel clock updates, no
system monitor graph updates etc. 

On the system, /home is mounted from a _different_ NFS server than the server to
which the "high"-bandwidth transfer is taking place. 

/ -- local
/home -- FC1 NFS server A      <---- home directory
/mnt/nfs2 -- FC5 NFS server B  <---- destination of large file copy

The large background file copy to B
should not cause Xorg/gnome etc. to freeze for minutes.

The freeze is not observed when the copy to B is performed using SCP. Also, as
mentioned, SCP bandwidth is much higher.

If there is any sort of diagnostics you'd like me to run please let me know. Thanks.

Comment 5 Steve Dickson 2006-12-15 11:27:45 UTC

> True, however, I can achieve the full network bandwidth using SCP or TTCP/IPERF
> etc. NFS seems to be achieving a factor of 5 less.
Well, there will always be much less protocol overhead with streams like that.
Plus the NFS client can go wire speed... I've seen it...

I just noticed "NFS V2 WRITE Call" Why are you using v2? How does V3 using
TCP work? 

> In addition, intense NFS activity completely locks up Xorg for tens of 
> seconds, sometimes several minutes -- no mouse cursor update, no panel clock 
> updates, no system monitor graph updates etc. 
Although NFS may contribute to it.... it is very rare that NFS (or any
other filesystem) causes mice and displays to lock up... Try opening
up another console terminal (Ctrl-Alt-F2) and run top to see who is
grabbing your CPU...

Comment 6 Saikat Guha 2006-12-15 12:15:31 UTC

(In reply to comment #5)
> Well, there will always be much less protocol overhead with streams like that.
> Plus the NFS client can go wire speed... I've seen it...
I've seen NFS at wirespeed as well (from the very same rawhide box to an FC1 NFS
server for example).

> I just noticed "NFS V2 WRITE Call" Why are you using v2? How does V3 using
> TCP work? 
Hmmm. I don't recall setting it to V2; should be using the default. Will try to
force it to V3.
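
Forcing V3 could look roughly like the sketch below. The server name and export path are hypothetical placeholders (the report does not give them); only the mount point /mnt/nfs2 and the tcp and soft options come from this report. The command is printed rather than executed so it can be reviewed first.

```shell
# Hypothetical server name and export path; substitute the real ones.
server="serverB"
export_path="/export"
mountpoint="/mnt/nfs2"          # mount point mentioned earlier in this report

# Existing options from fstab, plus nfsvers=3 to request NFS version 3.
opts="tcp,soft,nfsvers=3"

# Dry run: print the mount command instead of running it.
cmd="mount -t nfs -o ${opts} ${server}:${export_path} ${mountpoint}"
echo "$cmd"
```

After remounting, `nfsstat -m` or the matching line in /proc/mounts should show which version the mount is actually using.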

> Although NFS may contribute to it.... it is very rare that NFS (or any
> other filesystem) causes mice and displays to lock up... Try opening
> up another console terminal (Ctrl-Alt-F2) and run top to see who is
> grabbing your CPU...
Can't Ctrl-Alt-F2 (console is completely stuck) but can log into the stuck host
from another host as mentioned in the original post; top shows 0% cpu and 0% io
wait.

Comment 7 Saikat Guha 2008-03-31 09:39:03 UTC

Seems to be working well these last few months.
Closing this bug for now.