Bug 750551

Summary: IOzone incache write performance drop on xfs file system
Product: Red Hat Enterprise Linux 6
Reporter: Kamil Kolakowski <kkolakow>
Component: kernel
Assignee: Dave Chinner <dchinner>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: unspecified
Version: 6.2
CC: bmarson, bsarathy, dchinner, esandeen, lczerner, perfbz, rwheeler
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2012-05-07 05:19:08 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 717224
Attachments:
- Comparison 6.1GA vs SNAP4
- IOzone xfs regression comparing 6.1 vs 6.2RC1
- multifile simzone runs tarball with src and test runs from perf4
- gnuplot script for viewing simzone output

Description Kamil Kolakowski 2011-11-01 14:56:13 UTC
Description of problem:
I see a 6-9% performance drop on WRITE on xfs. My baseline is RHEL6.1GA; I first saw the regression in RHEL6.2 SNAP2.

I have run this test on different boxes as well, but I see the performance drop only on this hardware. The drop is visible in the incache and direct I/O tests.

HW summary:
Testing drive: Intel SSD, no RAID, no LVM, cfq scheduler.


Test environment
  
Test                                      = /performance/iozone_devel_with_library/certification
Date                                      = Tue Nov 1 00:57:47 CET 2011

Hostname                                  = ibm-x3650m3-01.lab.eng.brq.redhat.com
Arch                                      = x86_64
Distro                                    = RHEL6.2-20111026.2
Kernel                                    = 2.6.32-214.el6.x86_64
SElinux mode                              = Permissive

CPU count : speeds MHz                    = 12 : 12 @ 2793.000
CPU model name                            = Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
CPU cache size                            = 12288 KB
BIOS Information   Version : Date         = -D6E145FUS-1.07- : 04/26/2010
Total Memory                              = 11899 (MB)
NUMA is Enabled.  # of nodes              = 1 nodes (0)

Tuned profile                             = default

RHTS information
  RHTS JOBID                              = 149436
  RHTS RECIPEID                           = 311443
  RHTS RECIPETESTID                       = 3418324

FILESYSTEM configuration
  Filesystems to test (requested)         = xfs
  Filesystems to test (actual)            = xfs
  Mount point of filesystem under test    = /RHTSspareLUN1
  LUN for filesystem under test           = /dev/sdc1
  readahead for LUN above                 = 128 KB
  Speed test: hdparm -tT /dev/sdc1
 Timing buffered disk reads => 249.05  MB/sec
 Timing cached reads => 8737.88  MB/sec
Free Diskspace on /RHTSspareLUN1          = 20GB
I/O scheduler on testing device           = cfq

IOZONE version                            = 3.327
  Page Size [smallest file to work on]    = 4 (KB)
  90% of Free disk space available        = 17066 (MB)
  Command line options                    = -r 3 --rsync --eatmem --swap --iozone_umount --iozone_rec_size --incache --directio --outofcache -f xfs
  In Cache test maximum file size         = 4096 (MB)
  Out of Cache test minimum file size     = 512 (MB)
  Out of Cache test maximum file size     = 2048 (MB)
  Free memory after running eatmem        = 512 (MB)
  Direct I/O test maximum file size       = 4096 (MB)
  Number of sequential runs               = 3

Comment 1 Kamil Kolakowski 2011-11-01 14:57:16 UTC
Created attachment 531146 [details]
Comparison 6.1GA vs SNAP4

Comment 3 RHEL Program Management 2011-11-07 06:47:35 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 4 Dave Chinner 2011-11-11 02:31:13 UTC
(In reply to comment #0)
> Description of problem:
> I see a 6-9% performance drop on WRITE on xfs. My baseline is RHEL6.1GA;
> I first saw the regression in RHEL6.2 SNAP2.
> 
> I have run this test on different boxes as well, but I see the performance
> drop only on this hardware. The drop is visible in the incache and direct
> I/O tests.

Given the known problems with using these IOZone scripts for performance regression testing, your experience is typical - the results are so platform-specific that it's not a benchmark that can be used for tracking down performance regressions.

I've already wasted a week recently trying (unsuccessfully) to reproduce a similar report on a completely different set of hardware. The closest anyone got to working out what the problem might be related to was that using the noatime mount option made some of the regression go away on the specific machine that was seeing it. Perhaps you might try that and see if it has the same effect on your test box.

If you can reproduce a similar regression using just dd, it might be much easier to understand than trying to use iozone to reproduce the problem.
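A dd-based reproduction along those lines might look like the following sketch; the file path, sizes, and record sizes here are illustrative placeholders, not values taken from this report:

```shell
# Sketch of a dd-based in-cache write check; compare the reported MB/s
# across kernels.  Path and sizes are illustrative, not from the report.
TESTFILE=/tmp/xfs-write-test.bin

# In-cache buffered write: file small enough to stay in page cache, no
# fsync, so this measures write()+memcpy speed rather than disk speed.
dd if=/dev/zero of="$TESTFILE" bs=4k count=25600 2>&1 | tail -1

# Direct I/O variant (bypasses the page cache); will fail on filesystems
# without O_DIRECT support, e.g. tmpfs.
dd if=/dev/zero of="$TESTFILE" bs=64k count=1600 oflag=direct 2>&1 | tail -1

rm -f "$TESTFILE"
```

Running the buffered variant with a range of `bs=` values would roughly map onto the record-size axis of the iozone results.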

Comment 5 Barry Marson 2011-11-23 15:47:53 UTC
We have systems which have demonstrated great repeatability using iozone.  To carry the torch further, I wrote a small program which effectively runs the first four columns of iozone and reports in the same fashion.  To handle timer granularity (which is only an issue on the very smallest files), this program has been enhanced to work on many files serially.  Nevertheless, this BZ's problem happens on larger files where timer granularity is not an issue.

We see virtually the same regression Kamil sees, with HBA Fibre Channel attached storage.  For us it's not specific to smaller files; it hits all files that fit into memory (in our case up to 1GB file size) before synchronous writes (dirty_ratio) kick in.  In our case it is more prevalent with smaller record sizes (up to 32KB), which doesn't surprise me given the code enhancements in 6.2.


The real question becomes: is this regression acceptable given the benefits it brings to xfs?  i.e. how does xfs compare to ext4 doing the same workload/config?

On our testbed, even with the regression, the overall performance of xfs for all write related operations notably exceeds ext4 (even the 6.1 results).

Barry

Comment 6 Dave Chinner 2011-11-24 01:52:57 UTC
(In reply to comment #5)
> We have systems which have demonstrated great repeatability using iozone.  To
> carry the torch further, I wrote a small program which effectively runs the
> first four columns of iozone and reports in the same fashion.

Can you share that test and the results you've seen, Barry? If you are getting reliable results, can you also run the same tests with noatime to see if that makes the regression go away?
 
> The real question becomes, is this regression acceptable given the benefits it
> brings to xfs?  ie how does xfs compare to ext4 doing the same workload/config. 
> 
> On our testbed, even with the regression, the overall performance of xfs for
> all write related operations notably exceeds ext4 (even the 6.1 results).

I don't think in-cache write speeds are all that important as most people care about how fast the data gets to disk, not to memory. So my question really is how much relevance does this benchmark have to real application performance? i.e. have any other application benchmarks shown regressions that can be traced to this same problem?

Cheers,

Dave.

Comment 7 Ric Wheeler 2011-11-28 17:25:57 UTC
I would argue that a lot of high-performance apps spend time trying to write to disk only when they need to - i.e., they spend a lot of time working in DRAM and write back when they commit state that needs to persist.

That means that both in core and write to disk performance are critical :)

Comment 8 Barry Marson 2011-11-28 17:57:27 UTC
Created attachment 537579 [details]
IOzone xfs regression comparing 6.1 vs 6.2RC1

IOzone xfs regression on perf4 (intel based) comparing 6.1 vs 6.2RC1.txt

This is the reference iozone run ttracy ran on our system that regularly produces rock-solid results.

While I'm not concerned so much with the smaller-file performance, there are regressions which exist with larger ones as well.  Again, it's all small record size I/O ( < 16KB ).

Additionally, I concur with rwheeler.  I have a key ISV (SAS) who makes significant use of the page cache for performance, and not only relies on great streaming I/O performance from xfs, but also requires great handling of huge temp files (many GBs).

Comment 9 Barry Marson 2011-11-28 18:05:45 UTC
Created attachment 537580 [details]
multifile simzone runs tarball with src and test runs from perf4.

This attachment has a version of my simzone test (which mimics iozone's first 4 columns).  It was originally written to provide sample code with far less code than iozone.

The modifications in this version allow it to work on several files at once to reduce, if not eliminate, timer granularity effects on small files.  To keep the code short, it doesn't take args (yet), and is configured to work on 128 files at a time with file sizes up to 16MB (a total of 2GB of files).

Essentially it opens all the files, then works on each serially, then fsyncs and closes all the files.  This example does not measure the fsync or close operations by design (but it easily could).

These runs were done on a filesystem with the noatime flag enabled.  Note that barriers just happened to be enabled for both.

Barry

Comment 10 Dave Chinner 2011-11-30 02:05:10 UTC
So the numbers from the simzone tests make for interesting 3d graphs. What is clearly apparent is that for record sizes of 16k and less, the -131 kernel is faster than the -220 kernel for all file sizes and all operations.

For write, -220 is faster when the record size goes over 16k and the file size over 256-512k.

For rewrite, -220 is faster for all record sizes at 16MB file size, and faster for record sizes above 64k and file sizes above 1MB.

For read, -220 is faster for record sizes above 64k and file sizes of 256-512k, and re-read is very similar.

These observations match with the IOZone results which show all of the regressions at record sizes of less than 64k, and all but 3 at less than or equal to 16k.

So, it appears at first glance that there is a regression here. However, digging deeper it's not clear that it's a regression we should ever care about. Basically, all the surface maps have roughly the same shape, with -220 being faster in the same corner of the test space on all operations. So, the behaviour observed between the -131 and -220 kernels is not specific to the write path.

Further, the shape of the surfaces indicates that CPU cache residency of the userspace IO buffers has a -significant- impact on performance. On the read tests, read rates for 4k files with 4k record size reads are topping 4.8GB/s, whilst the cached read rate for 16M files with 1M record sizes is only half that. Given that there are 2 orders of magnitude more syscalls being made for the 4k/4k test, that means the memcpy time must be at least 2 orders of magnitude faster than for the 16M/1M test. That implies that at least one of the memcpy buffers is L1 data cache resident in the 4k/4k test to be that much faster than memory speed.

Indeed, looking at the simzone source code, the userspace buffer is a static 16MB buffer that is reused on every syscall without being touched in between calls. That means the same 4k of memory is being used by the test as the destination for every read. It will definitely be L2 cache resident, and quite possibly L1 cache resident. Hence all the tests of file/record sizes from 1k/1k through to 256k/64k show faster-than-memory-speed throughput - they all have a buffer that is resident in the L1/L2 CPU caches.

It's not until the combination of record/file size gets large enough that the buffer no longer fits in L2 cache - and things like page faults on the user buffer during memcpy start to occur - that speeds return to something resembling memory speeds.

This behaviour alone indicates the likely cause of the "regression" - the change in code has changed the cache footprint, and so now each syscall is taking extra cache misses and is running a little slower. Some maths:

4k/4k reread speed on -131 = 4830188k/s
reread syscall rate on -131 = 4830188 / 4 = 1207547/s
time per syscall on -131 = 830nS

From anandtech, x5670 L3 cache latency is measured at 53 clocks - roughly 20nS for the CPU on the test machine these numbers were taken from. So if we assume we have an extra L1/L2 cache miss that hits L3 cache, then we have:

rough approximation of CPU L3 cache hit stall time = 20ns

time per syscall w/ extra miss = 850nS
reread syscall rate w/ extra miss ~= 1176470/s
4k/4k reread speed w/ extra miss = 4705882k/s

4k/4k reread speed w/ 2 extra misses = 4597701k/s

4k/4k reread speed on -220 = 4612612k/s

So, realistically, the difference between -131 and -220 on these tests is somewhere between 1 and 2 extra L1/L2 CPU cache misses, most likely due to a changed cache footprint. Hell, it might not even be a cache miss - a branch misprediction has roughly the same cost - but either way, it's something to do with the CPU stalling while running flat out because it is operating on the caches closest to the CPU.

The only reason that the test has data that sits resident in the L1/L2 cpu caches is that the syscall loop uses the same buffer for every syscall and does not touch it or do any other work between IO syscalls. Real applications do work between syscalls or read/write to different buffer locations when doing multiple sequential read/write syscalls. Either way, the user buffer is -very unlikely- to be resident in the L1/L2 cpu caches in any meaningful, performance impacting manner.

And regardless of the IO buffer size, if you are writing GBs of data, that has to be pulled from memory -somewhere- because applications don't write GBs of identical data manufactured from thin air. That means we will -always- take cache misses when doing IO on the scale of SAS and other enterprise workloads.
Hence this "hot in the cpu cache IO" really isn't a factor - the performance will reflect the larger file/record size results where -220 is actually faster than -131....

IOWs, this "regression" does not appear to be an "in page cache" performance problem, it appears to be a "in CPU cache" performance regression. As a result, I think this sort of regression is way below our care factor for filesystem performance because, in the real world, "in-CPU cache" IO simply never happens.

Comment 11 Dave Chinner 2011-11-30 02:08:53 UTC
Created attachment 538362 [details]
gnuplot script for viewing simzone output

Change the filenames in the script to the two result files you want to compare, make sure the header line of each result file starts with a "#", and run the script like:

$ gnuplot surface.plot

Pressing <enter> will cycle through the different operations in the order initial write, rewrite, read, reread. You can rotate the 3d plot with the mouse to look at it from different angles.

Comment 13 RHEL Program Management 2012-05-03 05:38:48 UTC
Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.