Bug 869260

Summary: NFS is very slow
Product: Fedora
Component: kernel
Version: 17
Hardware: Unspecified
OS: Linux
Severity: medium
Priority: unspecified
Status: CLOSED WONTFIX
Reporter: Edouard Bourguignon <madko>
Assignee: nfs-maint
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: bfields, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-08-01 05:37:38 UTC

Description Edouard Bourguignon 2012-10-23 12:27:09 UTC
Description of problem:
When mounting an NFS share from a NAS, the client is extremely slow on certain operations, especially untarring archives.

untarring the Linux kernel sources on the local disk of the NAS: 6s
untarring the same archive through the NFS client: over 1 hour

Version-Release number of selected component (if applicable):
kernel 3.6.2-4.fc17.x86_64

How reproducible:
Always (happens every time)

Steps to Reproduce:
1. export the NFS share: /srv/stockage/devel	192.168.2.0/24(rw,insecure,fsid=0)
2. mount it with -t nfs4 on the client (see the example commands below)
3. untar an archive on the mount
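
For reference, the client-side commands would look something like this (a sketch using the server name, mount point, and archive from this report):

  mount -t nfs4 stockage:/ /mnt/stockage
  cd /mnt/stockage
  time tar zxf linux-2.6.32.9-apc.tar.gz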
  
Actual results:
Extraction is extremely slow: the untar takes over an hour.


Expected results:
almost as fast as the local disk


Additional info:
When I uncompress an archive onto my NFS mount, iotop on the client shows a write rate of 80 KB/s at most, with 50% to 100% in the IO> column (I/O wait?):

20715 be/4 edouard     0.00 B/s   70.99 K/s  0.00 % 17.83 % tar zxf linux-2.6.32.9-apc.tar.gz
20715 be/4 edouard     0.00 B/s   50.97 K/s  0.00 % 39.00 % tar zxf linux-2.6.32.9-apc.tar.gz
20715 be/4 edouard     0.00 B/s   58.78 K/s  0.00 % 37.35 % tar zxf linux-2.6.32.9-apc.tar.gz
20715 be/4 edouard     0.00 B/s   19.61 K/s  0.00 % 37.81 % tar zxf linux-2.6.32.9-apc.tar.gz
20715 be/4 edouard     0.00 B/s  117.63 K/s  0.00 % 36.78 % tar zxf linux-2.6.32.9-apc.tar.gz
20715 be/4 edouard     0.00 B/s   70.58 K/s  0.00 % 30.27 % tar zxf linux-2.6.32.9-apc.tar.gz
20715 be/4 edouard     0.00 B/s  176.51 K/s  0.00 % 38.20 % tar zxf linux-2.6.32.9-apc.tar.gz
20715 be/4 edouard     0.00 B/s   39.22 K/s  0.00 % 28.67 % tar zxf linux-2.6.32.9-apc.tar.gz
20715 be/4 edouard     0.00 B/s   31.37 K/s  0.00 % 28.69 % tar zxf linux-2.6.32.9-apc.tar.gz

On the server side, some nfsd threads do some I/O, but jbd2/md2-8 is at close to 100% I/O wait:

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s    0.00 B/s  0.00 % 58.05 % [jbd2/md2-8]
 1702 be/4 root        0.00 B/s  276.69 K/s  0.00 %  0.00 % [nfsd]
Total DISK READ: 5.52 M/s | Total DISK WRITE: 499.34 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s    3.87 K/s  0.00 % 99.99 % [jbd2/md2-8]
 1702 be/4 root        0.00 B/s   23.23 K/s  0.00 % 13.14 % [nfsd]
  483 be/3 root        0.00 B/s    7.74 K/s  0.00 %  5.02 % [jbd2/dm-0-8]
Total DISK READ: 6.12 M/s | Total DISK WRITE: 464.52 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s    3.87 K/s  0.00 % 99.99 % [jbd2/md2-8]
 1702 be/4 root        0.00 B/s   30.97 K/s  0.00 % 14.22 % [nfsd]
Total DISK READ: 5.23 M/s | Total DISK WRITE: 537.80 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s   33.97 K/s  0.00 % 97.43 % [jbd2/md2-8]
 1708 be/4 root        0.00 B/s   86.80 K/s  0.00 % 21.05 % [nfsd]
Total DISK READ: 6.71 M/s | Total DISK WRITE: 576.86 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s   38.72 K/s  0.00 % 96.89 % [jbd2/md2-8]
 1702 be/4 root        0.00 B/s  104.53 K/s  0.00 %  7.19 % [nfsd]
 1708 be/4 root        0.00 B/s   19.36 K/s  0.00 %  5.84 % [nfsd]
Total DISK READ: 6.28 M/s | Total DISK WRITE: 488.07 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s   19.14 K/s  0.00 % 99.99 % [jbd2/md2-8]
 1702 be/4 root        0.00 B/s   45.94 K/s  0.00 % 12.44 % [nfsd]
Total DISK READ: 5.60 M/s | Total DISK WRITE: 642.64 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s   11.61 K/s  0.00 % 99.90 % [jbd2/md2-8]
 1702 be/4 root        0.00 B/s  147.11 K/s  0.00 % 11.63 % [nfsd]
 1018 be/3 root        0.00 B/s    3.87 K/s  0.00 %  3.28 % [jbd2/dm-2-8]
Total DISK READ: 6.76 M/s | Total DISK WRITE: 543.90 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s   15.48 K/s  0.00 % 95.35 % [jbd2/md2-8]
 1702 be/4 root        0.00 B/s  135.49 K/s  0.00 % 11.20 % [nfsd]
  483 be/3 root        0.00 B/s    3.87 K/s  0.00 %  2.11 % [jbd2/dm-0-8]
Total DISK READ: 5.85 M/s | Total DISK WRITE: 509.07 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s    7.74 K/s  0.00 % 99.99 % [jbd2/md2-8]
 1702 be/4 root        0.00 B/s   54.20 K/s  0.00 %  8.87 % [nfsd]
Total DISK READ: 6.52 M/s | Total DISK WRITE: 530.46 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 1021 be/3 root        0.00 B/s    7.74 K/s  0.00 % 94.30 % [jbd2/md2-8]
 1704 be/4 root        0.00 B/s   85.18 K/s  0.00 %  7.22 % [nfsd]
 1702 be/4 root        0.00 B/s    7.74 K/s  0.00 %  5.92 % [nfsd]

== NFS Server ==

kernel 2.6.32-279.11.1.el6.x86_64

Here is my /etc/exports entry:
/srv/stockage/devel	192.168.2.0/24(rw,insecure,fsid=0)

vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000

A plain dd write inside the exported directory shows ~100 MB/s.
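
(The exact dd command isn't recorded here; a typical streaming write test would be something like the following, with a throwaway file name:)

  dd if=/dev/zero of=/srv/stockage/devel/ddtest bs=1M count=1000 conv=fsync
  rm /srv/stockage/devel/ddtest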

== NFS Client ==

kernel 3.6.2-4.fc17.x86_64

On my client, here are my mount options:
stockage:/ on /mnt/stockage type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.2.30,local_lock=none,addr=192.168.2.111)

vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500

Workaround: export the NFS share with the async option.
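
For reference, that amounts to something like this on the server (a sketch based on the export line above; note that with async the server acknowledges writes before they reach disk, so a crash can lose data):

  # /etc/exports
  /srv/stockage/devel    192.168.2.0/24(rw,insecure,fsid=0,async)

  # re-export without restarting the NFS server
  exportfs -ra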

Comment 1 J. Bruce Fields 2012-10-23 15:19:12 UTC
File or directory creation is a synchronous operation under NFS: the client can't create a new object without first hearing back from the server, and the server can't respond to the client's create request without first committing the new object to disk.

The Linux kernel source has 40,000-some files and directories.  If each create requires a disk seek, and the average disk seek takes 10 milliseconds, then creating all of them would take at least 400 seconds.  In practice it takes more, but an hour is surprising.  Write latency is what matters most here: what kind of device is your exported filesystem on?
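
One way to get a rough number for that write latency is to time synchronous writes directly on the exported filesystem on the server, e.g. (just a sketch, with an arbitrary test file name):

  # oflag=dsync forces each 4k write to disk before the next one starts,
  # so the elapsed time divided by 100 approximates the per-write latency
  dd if=/dev/zero of=/srv/stockage/devel/latency-test bs=4k count=100 oflag=dsync
  rm /srv/stockage/devel/latency-test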

You can work around the problem with async exports, but that is likely to result in data loss if, for example, there is a power failure at the wrong moment.

Comment 2 Edouard Bourguignon 2012-10-23 21:19:25 UTC
The NFS server is an HP ProLiant N40L with 8 GiB of RAM; the exported filesystem is ext4 on two HDDs in software RAID 1.
The NFS client is an Intel i5 650 with 8 GiB of RAM.

The same extraction takes only 1m22s on the same share exported via Samba, on the same client.

Comment 3 J. Bruce Fields 2012-10-25 01:29:48 UTC
1m22s likely isn't possible with that hardware.  I'm not sure how SMB works, but it may simply not be designed to survive reboots the way NFS is, in which case it can be less careful about committing changes to disk before replying.  I don't know whether it's any safer than an "async" NFS export.

For best performance you'll need something with very low write latency, like an array with a battery-backed write cache.

For comparison, on ext4 over software RAID0 across two SATA drives on an F16 box, I'm getting:

  $ time tar -xjf linux-3.7-rc2.tar.bz2
  real    5m53.545s
  ...

I'd expect RAID1 to be a little slower, but if it's taking an hour then there's a bug.

In my case if I do

  sync; time for f in $(seq 1 100); do touch $f; sync; done

on the server, in the exported filesystem, I find it takes about a second, so about 10ms per create.  A 40,000-file tarball would therefore be expected to take about 400 seconds, which is in about the same ballpark as the time I saw above.  That's kind of a dumb benchmark and there may be something better (see the sketch below), but you might try something similar to make sure there isn't a problem with the performance of synchronous write operations on the exported filesystem.
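
A slightly less blunt variant, as a sketch, would sync only the file just written rather than every filesystem, e.g. by letting dd fsync its output file:

  time for f in $(seq 1 100); do dd if=/dev/zero of=$f bs=1k count=1 conv=fsync 2>/dev/null; done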

Comment 4 Edouard Bourguignon 2012-10-25 18:11:55 UTC
On the server:
$ time (tar zxf linux-2.6.32.9-apc.tar.gz && sync); du -sh linux-2.6.32.9-apc

real	0m18.648s
user	0m5.242s
sys	0m3.627s
426M	linux-2.6.32.9-apc


On the client (CIFS/NFS async):
$ sync;time (tar zxf linux-2.6.32.9-apc.tar.gz && sync)

real	1m47.177s
user	0m6.476s
sys	0m14.490s

$ sync; time for f in $(seq 1 100); do touch $f; sync; done

real    0m2.362s
user    0m0.028s
sys     0m1.978s


On the client (NFS sync):
$ sync;time (tar zxf linux-2.6.32.9-apc.tar.gz && sync)

real    53m15.908s
user    0m6.510s
sys     0m11.541s

$  sync; time for f in $(seq 1 100); do touch $f; sync; done

real    0m4.287s
user    0m0.046s
sys     0m2.282s


There are about 33,000 files in this tar. bz2 uses more CPU; maybe that explains some of the differences?  With NCQ the seek time could be around 1ms instead of 10ms (if the tracks are adjacent), so less than 2 minutes might be realistic?

Comment 5 J. Bruce Fields 2012-10-25 20:26:32 UTC
$  sync; time for f in $(seq 1 100); do touch $f; sync; done

real    0m4.287s
user    0m0.046s
sys     0m2.282s

This was done on the client?  Actually I was curious what that would look like done on the server side.

Anyway, if that's right, it's taking almost 40ms to create a file, about 4 times slower than in my case.  But your untar is taking about 10 times longer.

I doubt CPU time is the issue, but you could try watching in top.  And if it's mostly synchronous operations, I doubt NCQ would make a big difference, but I don't know.

What does "ping server" on the client report as the round-trip time to the server?

Output of "mountstats --nfs" and "mountstats --rpc" after the untar might also be interesting.

Comment 6 Edouard Bourguignon 2012-10-26 06:47:00 UTC
Oops sorry forgot the one on the server side:

$ sync; time for f in $(seq 1 100); do touch $f; sync; done

real	0m16.336s
user	0m0.061s
sys	0m0.243s

That's 160ms per file, so roughly 88 minutes for 33,000 files... not good. That could explain why it's so slow on the client. I will try to tune this part first.

I will post the mountstats as soon as possible. 

The ping RTT should be fine, since there is only a single Gigabit switch between the server and the client.

Thanks for your direction and time.

Comment 7 J. Bruce Fields 2012-10-26 11:39:28 UTC
Note that one of the reasons I call that kind of a lousy benchmark is that "sync" syncs everything on every filesystem, so any other activity happening at the same time would throw off the measurement.  You might want to run it a few times to make sure that one 16s measurement wasn't a fluke (see the sketch below).
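
Something like this would collect a few samples in one go (just a sketch):

  for run in 1 2 3; do sync; time for f in $(seq 1 100); do touch $f; sync; done; done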

But yeah, if that's right, it would definitely be a problem.

Comment 8 Fedora End Of Life 2013-07-04 01:18:35 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 17 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to change the
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 9 Fedora End Of Life 2013-08-01 05:37:45 UTC
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.