Bug 730139 - iozone + fsync with IObarriers on performs poorly with small files (up to 4MB) especially for read operations
Summary: iozone + fsync with IObarriers on performs poorly with small files (up to 4MB...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: rc
: 6.2
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-08-11 21:06 UTC by Barry Marson
Modified: 2013-06-06 15:38 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-06-06 15:38:03 UTC


Attachments (Terms of Use)
Iozone + fsync analysis comparing rhel 6.0, 6.1, and a 2.6.32-171.bz725816 kernels (17.96 KB, text/plain)
2011-08-11 21:06 UTC, Barry Marson
no flags Details
tarball of all 6 runs, includes full raw iozone data plus more info about the system (214.69 KB, application/x-gzip)
2011-08-12 13:47 UTC, Barry Marson
no flags Details

Description Barry Marson 2011-08-11 21:06:36 UTC
Created attachment 517894 [details]
Iozone + fsync analysis comparing rhel 6.0, 6.1, and a 2.6.32-171.bz725816 kernels

Description of problem:

Performing additional testing because of regressions noted with bz 725816 have shown that iozone with small files ( up to 4MB ) perform poorly when barriers are enabled even to enterprise based storage.  Disabling barriers makes it all go away.  This was tested with ext4, but I'm fairly sure I have data that shows it occurs on multiple file systems.

Deadline scheduler was used always.

The typical iozone matrix operations which perform the worst are re-reads followed by random reads.  Performance is upwards of 95+% off.

I tried the fix in bz 725816 but saw no difference.

The configuration used was perf7, a relatively old AMD server with a FC-HBA presented LUN from an HP MSA2000.  dm multipath is being used but quite loosely.

# multipath -ll
mpathe (3600c0ff000d52efb3bbf094901000000) dm-7 HP,MSA2212fc
size=140G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 10:0:4:5 sdg 8:96  active ready running
  `- 10:0:5:5 sdn 8:208 active ready running
...

The file system is rebuilt for each run.

Attaching the iozone analysis for a 6.0, 6.1, and the kernel from bz 725816.  One thing seems clear.  This problem probably has little or nothing to do with barrier changes in 6.1 or the newer changes for 6.2 since the attachment shows issues with 6.0.

Barry

Version-Release number of selected component (if applicable):
2.6.32-71
2.6.32-131.0.15
2.6.32-171.bz725816

How reproducible:
every time

Steps to Reproduce:
1. run iozone to file system with/without barriers enabled with option to include fsync at end.
2.
3.
  
Actual results:
see attachement

Expected results:


Additional info:

Just in case the problem was related to tons of metadata activity, I tried a run with noatime,nodiratime but saw no big change with the reads.

Comment 2 Eric Sandeen 2011-08-11 22:33:35 UTC
(In reply to comment #0)
> This was tested with ext4, but I'm fairly sure I have data that shows
> it occurs on multiple file systems.

It'd be great if you could provide that data, too.

Comment 3 Eric Sandeen 2011-08-11 22:44:19 UTC
(In reply to comment #0)

> Steps to Reproduce:
> 1. run iozone to file system with/without barriers enabled with option to
> include fsync at end.

I guess that means -e ?

Which version of iozone do you use?  For iozone3_397, that option doesn't even come into play for read tests (as I'd expect).

Perhaps the storage is flushing cache internally after fsync returns, slowing down the subsequent reads?  This doesn't make much sense to me at all...

Comment 4 Dave Chinner 2011-08-12 02:13:43 UTC
I'm assumming these are the overall numbers being compared (chopped off after read tests for easier reading):

I/O  FILE & REC      ALL  INIT   RE             RE   RANDOM RANDOM BACKWD
BARR SIZES (KB)      IOS  WRITE  WRITE   READ   READ   READ  WRITE   READ
=====--------------------------------------------------------------------
 OFF       ALL       328     61     68    524   1873   1695     57    504
 ON        ALL       315     64     70    576   1101   1024     60    556
           ALL         .   +5.2      .  +10.0  -41.2  -39.5      .  +10.3


That to me looks like the read from disk is 10% faster with barriers on, but cache hit workloads (reread and random read) are significantly slower with barriers on.

I say "cache hit workloads" because I see no reason for sequential re-reads being 3x faster than the initial read unless it is coming out of cache. Hence I can't see how barriers affect these results at all. Changes in cache hit ranges are usually caused by changes in VM alogrithms, not barriers.

Indeed - why is the initial write rate only ~60MB/s, yet the read rates almost an order of magnitude faster @ 575MB/s? Is the initial read also partially hitting the cache?

These numbers from the -131 number (once again trimmed):

     FILE            ALL  INIT   RE             RE   RANDOM RANDOM BACKWD  
     SIZE (KB)       IOS  WRITE  WRITE   READ   READ   READ  WRITE   READ  
=====----------------------------------------------------------------------
             4     -53.8  -17.6  -21.4  -33.3  -98.3  -97.3  -21.8  -27.5  
             8     -51.9  -19.6  -23.6  -30.0  -97.5  -95.9  -23.2  -31.7  
            16     -47.8  -23.6   -6.5  -28.9  -96.0  -94.7  -11.6  -18.6  
            32     -40.9      .  -10.9   -5.9  -93.3  -91.8   -5.1  -45.1  

Tend to imply that the re-read speed is 20-30x slower than the barrier-off case, which if we are doing 1500MB/s with barriers off then we're doing 50-60 MB/s with barriers on which, according to the initial write rate is about disk speed. IOWs, the difference between a 100% cache hit workload and a 100% cache miss workload.

How big is the data set being tested? Does it fit in memory, or a significant portion of it fit in memory? I can see if the file set size is roughly the same or smaller than memory then that adding barriers might change the order of IO completion and hence potentially the order of pages being reclaimed from the page cache off the LRU. That would then affect the subsequent read rates as that then depends on what stayed resident in cache and what hasn't. And that then affects the subsequent re-read, because some pages would now be on the active LRU rather than the inactive LRU, so reclaim patterns would change again and hence more of the file might need to be re-read from disk...

Comment 5 Barry Marson 2011-08-12 04:03:56 UTC
Eric,

Im having trouble locating the multi file system testing with barrier/nobarrier testing.  I wasn't the one to run it.  I may have to rerun it to verify.

As far as the raw iozone data, I'll post that in a tarball ASAP.

The version of iozone is 3.327 obviously not the newest but what we have been using for a while.  The latest code does do an fsync even after a set of reads.

Dave,

As for file sizes ... The analysis shows files up to 2GB for the entire test.  This represents approx. half of memory.  Indeed it is an in cache test but forces the fsync to measure time time to commit to stable storage.  The intent of re-read is to measure cached behaviour and hence should be near memory speed, but the barrier run does less.

Reads were better with barriers on with the -173-bz kernel but it's not totally clear that it's the patch from the bz or even recent barrier code updates.

I ran the release kernel comparisons mostly to see if the re-read/random read was a regression vs always been in at least RHEL6.

Barry

Comment 6 Barry Marson 2011-08-12 13:47:46 UTC
Created attachment 518031 [details]
tarball of all 6 runs, includes full raw iozone data plus more info about the system

Comment 7 Eric Sandeen 2011-08-12 19:08:44 UTC
So, I retested with a smaller matrix, and got similar results:

OPTIONS="-az -f /RHTSspareLUN1/iozone-ext4 -n 128k -g 1024k -y 1k -q 128k -e"

read & reread were way down with barriers.

But we're not hitting the disk for reads in either case:

Total (nobarrier-trace):
 Reads Queued:           4,       16KiB	 Writes Queued:      21,399,   85,596KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        4,       16KiB	 Writes Completed:   21,399,   85,596KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 IO unplugs:         1,182        	 Timer unplugs:           0

Total (barrier-trace):
 Reads Queued:           4,       16KiB	 Writes Queued:      22,167,   85,596KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        4,       16KiB	 Writes Completed:   22,167,   85,596KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 IO unplugs:         1,174        	 Timer unplugs:           0

I also retested the same above options w/ no "-e" and the slowdown on read/reread goes away, in fact it speeds up.

So, um ....?

Comment 8 Jeff Moyer 2011-08-12 19:27:40 UTC
If you're testing reads, then it makes zero sense to fsync.  What's the rationale for that?

Comment 9 Ric Wheeler 2011-08-12 19:40:13 UTC
Does iozone issue fsync() calls on files it reads during the read phase?

Comment 10 Eric Sandeen 2011-08-12 19:44:51 UTC
It does, oddly enough (at least the one bmarson runs does)

5860  open("/RHTSspareLUN1/iozone-ext4", O_RDONLY) = 3
5860  fsync(3)                          = 0
5860  read(3, "        \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
5860  lseek(3, 0, SEEK_SET)             = 0
5860  gettimeofday({1313176642, 871297}, NULL) = 0
5860  read(3, "        \0\0\0\0\0\0\0\0\0\0\0\0\0\\0\0\0"..., 131072) = 131072
...
5860  fsync(3)                          = 0
5860  gettimeofday({1313176642, 872652}, NULL) = 0
5860  fsync(3)                          = 0
5860  close(3)                          = 0

Comment 11 Barry Marson 2011-08-12 20:08:39 UTC
iozone does an fsync on flush to make sure that all metadata is synchronized ...  At least thats what the authors told me years back when I asked then.

Barry

Comment 12 Ric Wheeler 2011-08-12 20:39:57 UTC
The pattern makes no sense when using noatime (should be no dirty data?) unless the file it is reading is still dirty from the initial write and still cached?

Comment 13 chellwig@redhat.com 2011-08-12 21:35:06 UTC
Note that fsync will always flush the cache, as that is required for proper O_DIRECT write semantics.  On extN I also wouldn't be surprised it it flushes out metadata not really required.

Comment 14 Barry Marson 2011-08-12 22:01:12 UTC
well the man page claims it will flush out metadata.

Note we are not testing noatime.  That was exercised to make sure it wasn't related to the performance problem (data not posted).

Remember iozone gets run on many different platforms and modes ...

I wonder why reads of a cached file are so much slower than the rereads.  The data is already cached from the rewrite and it was fsynced.  Is this the cost potentially of the vm having to flip the page cache pages to read only ?  I've seen this before in different scenarios configs and even OS's.

Barry

Comment 15 chellwig@redhat.com 2011-08-12 22:13:02 UTC
if you are not running with noatime fsync will flush out the actual inode updates, which now actually hit the platter.

Comment 16 chellwig@redhat.com 2011-08-12 22:13:47 UTC
and the rereads are faster because the default norelatime mode won't flush the atime on the second update, unless the ctime changed.

Comment 17 Eric Sandeen 2011-08-12 22:25:43 UTC
Mounting with noatime, reads & rereads are both down about 50% in my testing, when barriers are on.

Without noatime, read is down 96%, reread down about 50%.

(this is on a single slow spindle, FWIW)

Comment 18 Dave Chinner 2011-08-15 00:28:54 UTC
(In reply to comment #10)
> It does, oddly enough (at least the one bmarson runs does)
> 
> 5860  open("/RHTSspareLUN1/iozone-ext4", O_RDONLY) = 3
> 5860  fsync(3)                          = 0
> 5860  read(3, "        \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
> 5860  lseek(3, 0, SEEK_SET)             = 0
> 5860  gettimeofday({1313176642, 871297}, NULL) = 0
> 5860  read(3, "        \0\0\0\0\0\0\0\0\0\0\0\0\0\\0\0\0"..., 131072) = 131072
> ...
> 5860  fsync(3)                          = 0
> 5860  gettimeofday({1313176642, 872652}, NULL) = 0
> 5860  fsync(3)                          = 0
> 5860  close(3)                          = 0

So for a small file, say 4k, there's 3 fsync() calls for one 4k read? If fsync() is doing any extra work even when the inode is clean, then that's going to make a huge difference to performance....

Does anyone really do this in a performance critical loop?

Comment 19 RHEL Product and Program Management 2011-10-07 15:44:20 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 20 Ric Wheeler 2012-08-05 20:26:30 UTC
It seems that we came to an understanding of this issue and can close out this BZ.

Probably worth summarizes the above in a more widely viewed place for users to get at.

Comment 21 Ric Wheeler 2013-06-06 15:38:03 UTC
Closing this out - a lot of code has changed since this was originally opened.

If this is still an issue, let's reopen it with refreshed detail.

Thanks!


Note You need to log in before you can comment on or make changes to this bug.