Bug 574285 - 25% performance regression of concurrent O_DIRECT writes.
Summary: 25% performance regression of concurrent O_DIRECT writes.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jeff Moyer
QA Contact: Igor Zhang
URL:
Whiteboard:
Depends On:
Blocks: 588219
 
Reported: 2010-03-17 03:17 UTC by Lachlan McIlroy
Modified: 2018-10-27 14:19 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When two or more writes using the O_DIRECT file access interface were run simultaneously on two separate partitions of the same disk, the performance of both writes was reduced. This could cause a write slowdown of approximately 25% when running two simultaneous "dd oflag=direct" commands on two different partitions. This update fixes the regression so that concurrent O_DIRECT writes no longer incur this penalty.
Clone Of:
Environment:
Last Closed: 2011-01-13 21:18:22 UTC
Target Upstream Version:
Embargoed:


Attachments
Set the sync/async flag properly for O_DIRECT write requests (591 bytes, patch)
2010-03-17 19:01 UTC, Jeff Moyer
honor WRITE_SYNC in cfq (522 bytes, patch)
2010-03-18 02:09 UTC, Jeff Moyer
block: introduce the rq_is_sync macro (1.21 KB, patch)
2010-03-24 17:34 UTC, Jeff Moyer
block: Propagate down request sync flag (7.21 KB, patch)
2010-03-24 17:34 UTC, Jeff Moyer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Lachlan McIlroy 2010-03-17 03:17:47 UTC
Description of problem:
  This is a regression and is related to Issue #187283 (BZ#456181, 570814).

  When two or more O_DIRECT writes run simultaneously on different partitions
 of a single disk, the performance of each write drops.

 The results of running multiple dd commands are as follows:

              | number of dds |
              |   2   |   4   |
 -------------+-------+-------+
 RHEL5.5 beta |  2.88 |  2.73 |
 RHEL5.4      |  3.54 |  3.64 |
             write speed (MB/s)

 We investigated this problem and identified the patch that causes it:

 linux-2.6-block-cfq-iosched-get-rid-of-cfqq-hash.patch

 When we removed this patch from the 5.5 beta kernel, the write regression
 disappeared and the fix for the dump problem of Issue #187283 remained in
 effect.

 We request that this problem be fixed in RHEL 5.5 GA.


Version-Release number of selected component:

 Red Hat Enterprise Linux Version Number: 5
 Release Number: 5 beta
 Architecture: x86_64
 Kernel Version: 2.6.18-186.el5
 Related Package Version:
 Related Middleware / Application: None

Drivers or hardware or architecture dependency:
 None

How reproducible:
 Always

Steps to Reproduce:
 Run the following 2 or 4 dd commands concurrently, each on a different
 partition of a single disk (see the concurrent invocation sketch after the
 NOTE below).

 # dd oflag=direct if=/dev/zero of=<device file> bs=4096 count=100000

 <e.g.>
 # dd oflag=direct if=/dev/zero of=/dev/sdb1 bs=4096 count=100000
 # dd oflag=direct if=/dev/zero of=/dev/sdb2 bs=4096 count=100000
 ...

 NOTE:
 This test destroys the data on the target disk. Therefore, please use a
 blank disk when testing.
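
 A concurrent invocation sketch (assuming a bash shell and the same example
 partitions as above; each dd reports its own transfer rate when it exits):

 # dd oflag=direct if=/dev/zero of=/dev/sdb1 bs=4096 count=100000 &
 # dd oflag=direct if=/dev/zero of=/dev/sdb2 bs=4096 count=100000 &
 # wait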

Actual Results:
 Write performance is approximately 25% slower than on RHEL 5.4.

Expected Results:
 At least the same performance as RHEL 5.4.

Summary of actions taken to resolve issue:
 None.

Location of diagnostic data:
 None.

Hardware configuration:
 Model: PRIMERGY RX300 S5
 CPU Info: Xeon(R) 2.27GHz x2
 Memory Info: 4GB
 Hardware Component Information: None
 Configuration Info: None
 Guest Configuration Info: None

Business Impact:
 Common database systems routinely perform multiple concurrent O_DIRECT
 writes. If this problem is not fixed in RHEL 5.5, it will directly affect
 such customers' business operations; for example, transactions may fail to
 complete in time, causing various damage.

Additional Info:
 This problem is related to Issue #187283, so we tested a RHEL 5.5 beta
 kernel with the patch that fixes BZ#570814 applied, but it did not resolve
 this write problem.

Comment 2 Lachlan McIlroy 2010-03-17 03:31:51 UTC
I've reproduced this problem using one dd process per partition on the same disk, and also compared the results to the latest kernel.

                            |           number of dds           |
                            |     1     |     2     |     4     |
 ---------------------------+-----------+-----------+-----------+
 2.6.18-186                 | 23.4 MB/s |  8.0 MB/s |  4.7 MB/s |
 2.6.18-186 (w/o cfq patch) | 23.2 MB/s |  9.9 MB/s |  5.3 MB/s |
 2.6.18-193                 | 22.3 MB/s |  7.9 MB/s |  4.7 MB/s |

From this table we can see that removing the cfq patch from -186 gives a 25% boost in throughput for the 2 dd case, and a smaller boost for the 4 dd case.

Comment 3 Jeff Moyer 2010-03-17 14:05:09 UTC

*** This bug has been marked as a duplicate of bug 570814 ***

Comment 4 Jeff Moyer 2010-03-17 14:09:20 UTC
Sorry, I missed the comment stating you tried the fix for 570814 and it didn't resolve this problem.

Comment 5 Jeff Moyer 2010-03-17 14:11:26 UTC
There is one additional CFQ fix that went into -194.el5 that may address this issue.  Could you retest with that kernel?  I'll try to reproduce the problem myself as well.

Comment 6 Jeff Moyer 2010-03-17 15:04:35 UTC
(In reply to comment #2)
> I've reproduced this problem using one dd process per partition on the same
> disk.  Also compared to latest kernel.
> 
>                             |           number of dds           |
>                             |     1     |     2     |     4     |
>  ---------------------------+-----------+-----------+-----------+
>  2.6.18-186                 | 23.4 MB/s |  8.0 MB/s |  4.7 MB/s |
>  2.6.18-186 (w/o cfq patch) | 23.2 MB/s |  9.9 MB/s |  5.3 MB/s |
>  2.6.18-193                 | 22.3 MB/s |  7.9 MB/s |  4.7 MB/s |
> 
> From this table we can see that by removing the cfq patch from -186 we gained a
> 25% boost in throughput for the 2 dd case and to a lesser degree for the 4 dd
> case.    

I'm not sure how many runs you did, or what hardware you're using.  If we look at a single rotational disk, then I can definitely reproduce the problem, though an average of 10 runs shows the difference to be 18%, not 25%.  I'll investigate this further, but I'd like to point out that databases often do greater than 4k I/O, and that they often use an asynchronous I/O engine, which will have much different performance characteristics than two or four dd processes.  Further, the storage systems used in databases are typically much faster than the storage used for testing in this bug report.  In short, I'm not convinced that this test translates directly to the workload outlined as being ripe for regression.  In fact, we performed targeted testing for database workloads and saw no such regression.
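
As a rough illustration of that point, a reproducer somewhat closer to database-style I/O would at least use a larger request size; driving queue depths deeper than 1 would additionally require an AIO-capable tool such as aio-stress (used later in this bug).  A sketch, assuming the same example partition as in the original report and roughly the same total data volume:

  # dd oflag=direct if=/dev/zero of=/dev/sdb1 bs=64k count=6400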

Having said that, I will work to fix this regression.  A fix will not be included in RHEL 5.5 GA, however, as that is just about out the door.

Comment 7 Jeff Moyer 2010-03-17 15:05:23 UTC
I should have made it clear that I tested the -194 kernel and still saw the regression.

Comment 8 Ric Wheeler 2010-03-17 17:12:11 UTC
Let's target 5.6 and consider it for Z stream once we figure out the issue.

Comment 9 Jeff Moyer 2010-03-17 19:01:57 UTC
Created attachment 400859 [details]
Set the sync/async flag properly for O_DIRECT write requests

I haven't even compiled this fix yet, but I'm pretty sure it will address the problem.  I'll test it out, but am attaching it early in case others have a chance to verify it before I do.

Comment 10 Jeff Moyer 2010-03-17 19:46:54 UTC
OK, that patch does not work as the bio may be null.  I'll post another patch when I have a tested fix.

Comment 11 Jeff Moyer 2010-03-18 02:09:19 UTC
Created attachment 400934 [details]
honor WRITE_SYNC in cfq

Before the referenced patch, WRITE_SYNC I/O was treated by CFQ as async I/O and was put into the async cfqq.  The old code had only a single async cfqq.  The new code has two cfqq's per io context: one for sync I/O and one for async I/O.  So, in the test case above, instead of a single queue competing for the device, we now have one per dd process.  Further, if this had been AIO batching sequential O_DIRECT writes, there would be no front merging, because the front merge code does treat sync bios properly and so would look in the wrong place to find merge candidates.  The dd case should not be affected by this, as dd only issues one request at a time.

Now, I changed the logic to treat WRITE_SYNC's properly, and I still observed the slow down.  The issue is just that we are now idling on each cfqq in turn, causing us to issue I/O to the device more slowly. To prove this, I switched to using aio-stress as a test tool.  It has the ability to run single threaded on multiple files, or one thread per file.  I ran both configurations on the kernel with my latest patch, and then on the 164 kernel that ships with RHEL 5.4.

The results show pretty much what I would expect.  For the 164 kernel, there should be no difference between the single thread and two thread cases as they both will be serving requests from a single cfqq.  This is what we find:

number of runs: 5

     1 thread  2 threads
avg: 12.062    12.138
std: 0.56791   0.9351
var: 0.32252   0.87442

Now, when we run the kernel with my latest patch, I would expect to see the same performance degradation we saw before for the 2 thread case.  As a reminder, that was roughly 18%.  What we see is a 19% regression, so that's about right.  Then, when comparing a single thread versus two threads, I would expect a single thread to perform better.  In fact, this is what we see.  What's more, it outperforms the 164 kernel:

     1 thread* 2 threads
avg: 12.8575   9.864
std: 0.55265   0.29846
var: 0.30543   0.08908

* for the single threaded run, I only took 4 data samples

If you have a workload that does the equivalent of the two threaded case above, then it would be best to disable slice idling.  By doing so, cfq will switch queues more quickly and get your performance back up to what it was in the previous kernel.  (I verified this empirically.)  Now, I would not expect anyone to actually have to make such an adjustment, as most databases drive queue depths deeper than 1 when doing sequential I/O.  If you can point out a real-world workload that is negatively affected by the changes introduced in this kernel, then I would be more than happy to address it.  However, for now, it appears that there is a net gain for applications that are written using best practices.
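
For reference, slice idling is exposed as a CFQ sysfs tunable and can be turned off at run time.  A minimal sketch, assuming the test disk is /dev/sdb and CFQ is the active elevator (sample output shown; the setting does not persist across reboots):

  # cat /sys/block/sdb/queue/scheduler
  noop anticipatory deadline [cfq]
  # echo 0 > /sys/block/sdb/queue/iosched/slice_idle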

In summary, I will push this latest patch into an async erratum for RHEL 5.5 as a bug fix.  If you can provide a real-world application that is negatively affected by the changes present in the patched 5.5 kernel, then I will be happy to revisit this problem.  For now, I am unconvinced that it will be an issue in the field.  Please understand that I am open to hearing dissenting opinions on this.

Finally, thanks for the bug report.

Comment 12 Jeff Moyer 2010-03-18 13:48:55 UTC
Looking back through the upstream changelog, it seems I should have backported this commit:

commit 7749a8d423c483a51983b666613acda1a4dd9c1b
Author: Jens Axboe <jens.axboe>
Date:   Wed Dec 13 13:02:26 2006 +0100

    [PATCH] Propagate down request sync flag
    
    We need to do this, otherwise the io schedulers don't get access to the
    sync flag. Then they cannot tell the difference between a regular write
    and an O_DIRECT write, which can cause a performance loss.
    
    Signed-off-by: Jens Axboe <jens.axboe>
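
For anyone who wants to review the full upstream change, it can be viewed from a clone of the mainline Linux git tree using the hash quoted above:

  # git show 7749a8d423c483a51983b666613acda1a4dd9c1b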

Comment 14 Jeff Moyer 2010-03-24 17:34:02 UTC
Created attachment 402374 [details]
block: introduce the rq_is_sync macro

Comment 15 Jeff Moyer 2010-03-24 17:34:44 UTC
Created attachment 402375 [details]
block: Propagate down request sync flag

Comment 25 Jarod Wilson 2010-05-03 16:53:51 UTC
in kernel-2.6.18-198.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.
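
A minimal verification sketch, assuming the x86_64 test kernel RPM has been downloaded from the URL above (the exact package file name is illustrative) and the original reproducer is rerun after booting into the new kernel:

  # rpm -ivh kernel-2.6.18-198.el5.x86_64.rpm
  (reboot into the new kernel and confirm with "uname -r")
  # dd oflag=direct if=/dev/zero of=/dev/sdb1 bs=4096 count=100000 &
  # dd oflag=direct if=/dev/zero of=/dev/sdb2 bs=4096 count=100000 &
  # wait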

Comment 31 Douglas Silas 2010-06-28 20:15:40 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When two or more writes using the O_DIRECT file access interface were run simultaneously on two separate partitions of the same disk, the performance of both writes was reduced. This could cause a write slowdown of approximately 25% when running two simultaneous "dd oflag=direct" commands on two different partitions. This update fixes the regression so that concurrent O_DIRECT writes no longer incur this penalty.

Comment 35 errata-xmlrpc 2011-01-13 21:18:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

