Bug 790232

Summary: Untarring incredibly slow over NFS even worse with BTRFS export
Product: Fedora    Reporter: Colin.Simpson
Component: kernel    Assignee: nfs-maint
Status: CLOSED NEXTRELEASE    QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 16    CC: bfields, covex, gansalmon, itamar, jbacik, jjardon, jonathan, kernel-maint, madhu.chinakonda, madko, zab
Last Closed: 2012-10-23 18:07:32 UTC
Attachments:
/proc/self/mountstats BEFORE and AFTER untar
/proc/self/mountstats BEFORE and AFTER untar btrfs
C test program

Description Colin.Simpson 2012-02-14 00:17:02 UTC
Description of problem:

When you untar a file onto an NFS export hosted by Fedora, it takes a remarkably long time.

Here are my times untarring a VLC bzip2 source tarball (tar file and destination both on the export):

F16 to RHEL5 - 0m 28.170s
F16 to F16 ext4 -  4m 12.450s
F16 to F16 btrfs - 14m 31.252s
RHEL5 to F16 ext4 - 4m 17.166s

ext4 is bad, but btrfs is gruesome; I'm not sure why it's so much worse.

RHEL5 is using NFSv3, so this isn't purely an NFSv4 issue.

Interestingly RHEL6 seems to have a similar issue.


Version-Release number of selected component (if applicable):


How reproducible:
Every time


Steps to Reproduce:
1. cd /mountonF16
2. time tar xjvf vlc-1.1.11.tar.bz2

Comment 1 Dave Jones 2012-02-14 16:35:05 UTC
This might be an NFS or a btrfs bug (or both!), so I'm going to assign it to the NFS guys to begin with. I've cc'd Josef for the btrfs angle just in case.

Comment 2 J. Bruce Fields 2012-02-14 21:27:02 UTC
Mainly hand-waving, but:

NFS file and directory creates are synchronous operations: before the create can return, the client must get a reply from the server saying not only that it has created the new object, but that the create has actually hit the disk.

This makes something like this a worst-case for NFS, since it's basically just a long series of serialized creates.

vlc-1.1.11.tar.bz2 has 2870 files and 231 directories.  Doesn't look like most of the files are very big.  So the time is probably dominated by the time to create those ~3100 objects.

So the time required to do the untar is at least roughly 3100 * (ping-time to server + time for fs to do the create and sync the result to disk).

Well, and then there will be at least one write per file (which the NFS protocol requires to be sync'd to disk before close can return), so it's probably at least double that.

NFSv4 has an on-the-wire close, which might explain the few seconds' difference between the RHEL5 and F16 clients--that could be about 2870 * (ping-time-to-server).

I think RHEL5 may have been sloppy about disk barriers, which might explain its better performance as a server?

The difference between filesystems is probably explained by the amount of time the filesystem takes to do a synchronous create--maybe Josef could explain in this case.

You could try to confirm all this conjecture by looking at the timing the client gives you in /proc/self/mountstats, by timing creates on the server (I'm not sure how--would timing "touch foo; sync" be about right?), measuring the ping time, and doing some back-of-the-envelope calculations.

You may be able to improve things by using lower-latency storage, like an array with a battery-backed cache, or an SSD.

Trond has also proposed a protocol extension: http://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01

Comment 3 Colin.Simpson 2012-02-15 11:52:37 UTC
Thanks for getting back

Not sure I understand how to interpret /proc/self/mountstats to understand what I should be looking for.

The touch and sync takes between 0.2s and 0.5s on average, sometimes 2.5s. There doesn't seem to be a massive difference between this on ext4 or btrfs (maybe there is on average).

The ping time I have is on average 0.165ms. 


So if I calculate this out:

3100 x 0.165ms / 1000 = 0.5s Network time

3100 x 0.5s = 1500s (fairly bad case I'd have thought)

I get 25m, which is worse than the btrfs case. 

So maybe your hypothesis holds (if my calculation methodology is correct) given this would be an upper bound. 

But wouldn't this only apply to NFSv4, or does NFSv3 on F16 also expect to have written to the drive before returning? I.e. why would RHEL5 to F16 be so terrible (using NFSv3)?

This may be a worst case, but it's quite a common one, e.g. compiling this VLC over NFS would be unusable (as a compile will create a lot more files).

Comment 4 J. Bruce Fields 2012-02-15 13:50:36 UTC
(In reply to comment #3)
> Thanks for getting back
> 
> Not sure I understand how to interpret /proc/self/mountstats to understand what
> I should be looking for.

Yeah, it's tricky.  The code that prints each line of the rpc statistics is in the kernel in net/sunrpc/stats.c:rpc_print_iostats.  I think we'd be interested in om_ops and om_rtt (the first and next-to-last number on each row), which give total number of calls and total time spent (from network send to receive) in those calls, in milliseconds.  (That's just based on a quick look at the stats.c code.)

If those rtt times add up to about the wall-clock time you measured then they should give us some idea what we were spending our time doing.
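
For example, something along these lines (an untested sketch; the column layout is assumed from that quick look at stats.c rather than verified) would print the per-op call counts and rtt totals from /proc/self/mountstats and add them up:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/mountstats", "r");
    char line[512], op[64];
    /* assumed columns: ops ntrans timeouts sent recv queue rtt execute */
    unsigned long long v[8], total_ops = 0, total_rtt = 0;
    int in_ops = 0;

    if (!f) {
        perror("/proc/self/mountstats");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "device ", 7) == 0) {
            in_ops = 0;            /* start of the next mount's stats */
            continue;
        }
        if (strstr(line, "per-op statistics")) {
            in_ops = 1;
            continue;
        }
        if (in_ops &&
            sscanf(line, " %63[^:]: %llu %llu %llu %llu %llu %llu %llu %llu",
                   op, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]) == 9 &&
            v[0] > 0) {
            printf("%-16s ops=%llu rtt=%llums\n", op, v[0], v[6]);
            total_ops += v[0];
            total_rtt += v[6];
        }
    }
    printf("total: %llu calls, %llu ms from send to receive\n", total_ops, total_rtt);
    fclose(f);
    return 0;
}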

> The touch and sync takes between 0.2s and 0.5s on average, sometimes 2.5s.

That sounds really high, so perhaps we're not measuring the right thing.

Maybe a C program that created a file, fsync'd it, then fsync'd the parent directory would be a better test.

And, yes, you'd probably want an average over a bunch of those.
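
Something along these lines, maybe (an untested sketch; the file names and the count of 3100 are just placeholders):

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define NFILES 3100

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";
    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    struct timespec start, end;
    char name[64];
    double secs;
    int i;

    if (dirfd < 0) {
        perror("open dir");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < NFILES; i++) {
        int fd;

        snprintf(name, sizeof(name), "testfile.%d", i);
        fd = openat(dirfd, name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
            perror("create");
            return 1;
        }
        if (write(fd, "x\n", 2) != 2) {
            perror("write");
            return 1;
        }
        fsync(fd);      /* push the file data and metadata to disk */
        close(fd);
        fsync(dirfd);   /* push the new directory entry to disk */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%d creates in %.3fs (%.3f ms per create)\n",
           NFILES, secs, secs * 1000 / NFILES);
    return 0;
}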

> There doesn't seem to be a massive difference between this on ext4 or btrfs
> (maybe there is on average).
> 
> The ping time I have is on average 0.165ms. 
> 
> 
> So if I calculate this out:
> 
> 3100 x 0.165ms / 1000 = 0.5s Network time
> 
> 3100 x 0.5s = 1500s (fairly bad case I'd have thought)
> 
> I get 25m, which is worse than the btrfs case. 
> 
> So maybe your hypothesis holds (if my calculation methodology is correct) given
> this would be an upper bound. 

I'd expect this to be a *lower* bound, so something's wrong.

> But wouldn't this only apply to NFSv4, or does NFSv3 on F16 also expect to have
> written to the drive before returning? I.e. why would RHEL5 to F16 be so terrible
> (using NFSv3)?

v3 has the same requirement, yep.

The small v3/v4 difference is probably due to the v4 protocol being a little chattier--having to send over-the-wire OPEN and CLOSE, for example.

> This may be a worst case, but it's quite a common one, e.g. compiling this VLC
> over NFS would be unusable (as a compile will create a lot more files).

That isn't as likely to be IO-bound, and it's also usually easier to parallelize.  (It's usually easy to do a make -j<threads>, so other threads can keep compiling while some are waiting on IO; tar doesn't have the same option.)

Comment 5 Josef Bacik 2012-02-15 15:59:05 UTC
fwiw doing a fsync on btrfs over and over is going to suck hard compared to ext4 atm, so it doesn't surprise me we're slower.

Comment 6 Colin.Simpson 2012-02-15 19:09:01 UTC
Created attachment 562308 [details]
/proc/self/mountstats BEFORE and AFTER untar

Comment 7 Colin.Simpson 2012-02-15 19:11:23 UTC
Created attachment 562310 [details]
/proc/self/mountstats BEFORE and AFTER untar btrfs

Comment 8 J. Bruce Fields 2012-02-15 19:42:18 UTC
(In reply to comment #5)
> fwiw doing a fsync on btrfs over and over is going to suck hard compared to
> ext4 atm, so it doesn't surprise me we're slower.

That would explain the difference, yep.  It'll need to be fixed at some point if we want btrfs to be any good for NFS exports.

Comment 9 Colin.Simpson 2012-02-20 14:02:55 UTC
Created attachment 564433 [details]
C test program

Comment 10 Colin.Simpson 2012-02-20 14:03:45 UTC
I have attempted to write a quick test C program. This opens the directory, opens a file, writes a single line, fsyncs it, closes the file, and fsyncs the directory. It repeats this for 3100 files. I have run this locally on the test F16 server.

This takes very little time to run on btrfs or ext4, between 0.095s and 0.140s (for all 3100 files). So I'm thinking my test is invalid?

I'll attach the C program I quickly put together. I'm sure I'm probably missing something.

Comment 11 Colin.Simpson 2012-07-11 16:49:55 UTC
This seems to be better on F17.

Untarring the same file as before, F17 NFSv4-exported btrfs is now faster than NFSv4-exported ext4 (for me at least):

F17 to F17 btrfs - 3m18.065s
F17 to F17 ext4 - 4m21.422s

Comment 12 Dave Jones 2012-10-23 15:39:16 UTC
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).

Comment 13 Colin.Simpson 2012-10-23 17:06:22 UTC
As I said before, this is better on F17.

My times with 3.6.1-1 on both ends are:

F17 to F17 btrfs - 4m20.496s
F17 to F17 ext4 - 4m21.422s


The slowdown compared to previous OS versions is now, I'd imagine, down to NFS protocol changes, disk barriers, etc.

So I'd suggest this can be closed as fixed in the current release.

Comment 14 J. Bruce Fields 2012-10-23 18:07:32 UTC
OK, we'll assume for now there was a btrfs bug fixed between F16 and F17; closing.  Thanks for following up.