Description of problem:
This is a regression and is related to Issue #187283 (BZ#456181, 570814). When two or more O_DIRECT writes run simultaneously on different partitions of a single disk, the performance of each write degrades. The result of multiple dd commands is as follows:

               | number of dds |
               |   2   |   4   |
  -------------+-------+-------+
  RHEL5.5 beta |  2.88 |  2.73 |
  RHEL5.4      |  3.54 |  3.64 |
  write speed (MB/s)

We investigated this problem and found the patch causing it:

  linux-2.6-block-cfq-iosched-get-rid-of-cfqq-hash.patch

When we removed this patch from the 5.5 beta, the write problem disappeared and the dump problem of Issue #187283 remained fixed. We request that this problem be fixed in RHEL5.5 GA.

Version-Release number of selected component:
Red Hat Enterprise Linux Version Number: 5
Release Number: 5 beta
Architecture: x86_64
Kernel Version: 2.6.18-186.el5
Related Package Version:
Related Middleware / Application: None
Drivers or hardware or architecture dependency: None

How reproducible: Always

Steps to Reproduce:
Run the following 2 or 4 dd commands on different partitions of a single disk.

  # dd oflag=direct if=/dev/zero of=<device file> bs=4096 count=100000

e.g.:

  # dd oflag=direct if=/dev/zero of=/dev/sdb1 bs=4096 count=100000
  # dd oflag=direct if=/dev/zero of=/dev/sdb2 bs=4096 count=100000
  ...

NOTE: This test destroys the data on the target disk, so please use a blank disk when testing.

Actual Results: Write performance is 25% slower than 5.4.

Expected Results: At least the same performance as 5.4.

Summary of actions taken to resolve issue: None.

Location of diagnostic data: None.

Hardware configuration:
Model: PRIMERGY RX300 S5
CPU Info: Xeon(R) 2.27GHz x2
Memory Info: 4GB
Hardware Component Information: None
Configuration Info: None
Guest Configuration Info: None

Business Impact:
Common database systems usually perform multiple O_DIRECT writes. If this problem is not fixed in RHEL5.5, it will directly harm such customers' business operations; for example, transactions may fail to finish in time, resulting in various damages.

Additional Info:
This problem is related to Issue #187283, so we tested a RHEL5.5 beta kernel with the patch that fixed BZ#570814 applied, but this write problem was not fixed.
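The reproduction steps above can be wrapped in a small script that launches one dd per target in parallel and waits for all of them. This is only a sketch: the DD_FLAGS and BLOCKS variables and the run_dds helper are illustrative names, and the targets should point at partitions of a scratch disk (remember that the test destroys their contents).

```shell
#!/bin/sh
# Launch one dd per target in parallel, as in the reproduction steps.
# DD_FLAGS defaults to oflag=direct; it can be overridden, e.g. set it
# empty when writing to a filesystem that does not support O_DIRECT.
: "${DD_FLAGS=oflag=direct}"
: "${BLOCKS=100000}"

run_dds() {
    for target in "$@"; do
        dd $DD_FLAGS if=/dev/zero of="$target" bs=4096 count="$BLOCKS" &
    done
    wait   # block until every dd has finished
}

# Example (DESTROYS data on these partitions):
# run_dds /dev/sdb1 /dev/sdb2
```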
I've reproduced this problem using one dd process per partition on the same disk. Also compared to the latest kernel.

                             |          number of dds            |
                             |     1     |     2     |     4     |
  ---------------------------+-----------+-----------+-----------+
  2.6.18-186                 | 23.4 MB/s |  8.0 MB/s |  4.7 MB/s |
  2.6.18-186 (w/o cfq patch) | 23.2 MB/s |  9.9 MB/s |  5.3 MB/s |
  2.6.18-193                 | 22.3 MB/s |  7.9 MB/s |  4.7 MB/s |

From this table we can see that by removing the cfq patch from -186 we gained a 25% boost in throughput for the 2 dd case, and to a lesser degree for the 4 dd case.
*** This bug has been marked as a duplicate of bug 570814 ***
Sorry, I missed the comment stating you tried the fix for 570814 and it didn't resolve this problem.
There is one additional CFQ fix that went into -194.el5 that may address this issue. Could you retest with that kernel? I'll try to reproduce the problem myself as well.
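Before rerunning the dd test, it's worth confirming that the test kernel is actually booted and that the disk is using the cfq scheduler. A minimal check, assuming /dev/sdb is the test disk (substitute your own device):

```shell
# Print the running kernel version; after rebooting into the test kernel
# this should report 2.6.18-194.el5.
uname -r

# Show the active I/O scheduler for the test disk; the bracketed entry is
# the one in use, e.g. "noop anticipatory deadline [cfq]".
# The device may not exist on every machine, hence the fallback.
cat /sys/block/sdb/queue/scheduler 2>/dev/null || true
```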
(In reply to comment #2)
> I've reproduced this problem using one dd process per partition on the same
> disk. Also compared to latest kernel.
>
>                              |          number of dds            |
>                              |     1     |     2     |     4     |
>   ---------------------------+-----------+-----------+-----------+
>   2.6.18-186                 | 23.4 MB/s |  8.0 MB/s |  4.7 MB/s |
>   2.6.18-186 (w/o cfq patch) | 23.2 MB/s |  9.9 MB/s |  5.3 MB/s |
>   2.6.18-193                 | 22.3 MB/s |  7.9 MB/s |  4.7 MB/s |
>
> From this table we can see that by removing the cfq patch from -186 we gained a
> 25% boost in throughput for the 2 dd case and to a lesser degree for the 4 dd
> case.

I'm not sure how many runs you did, or what hardware you're using. If we look at a single rotational disk, then I can definitely reproduce the problem, though an average of 10 runs shows the difference to be 18%, not 25%.

I'll investigate this further, but I'd like to point out that databases often do greater than 4k I/O, and that they often use an asynchronous I/O engine, which has much different performance characteristics than two or four dd processes. Further, the storage systems used with databases are typically much faster than the storage used for testing in this bug report. In short, I'm not convinced that this test translates directly to the workload outlined as being ripe for regression. In fact, we perform targeted testing for database workloads and saw no such regression.

Having said that, I will work to fix this regression. A fix will not be included in RHEL 5.5 GA, however, as that release is just about out the door.
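For reference, per-run throughput samples can be averaged with a small awk helper like the sketch below. The function name and the sample values are illustrative, not measurements from this report:

```shell
# avg_throughput: read one throughput sample (MB/s) per line on stdin and
# print the mean and sample standard deviation.
avg_throughput() {
    awk '{ n++; sum += $1; sumsq += $1 * $1 }
         END {
             mean = sum / n
             # sample variance: (sum of squares - n * mean^2) / (n - 1)
             var = (sumsq - n * mean * mean) / (n - 1)
             printf "avg: %.3f  std: %.3f\n", mean, sqrt(var)
         }'
}

# Example with made-up samples:
printf '%s\n' 8.0 7.9 8.2 | avg_throughput
# -> avg: 8.033  std: 0.153
```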
I should have made it clear that I tested the -194 kernel and still saw the regression.
Let's target 5.6 and consider it for Z stream once we figure out the issue.
Created attachment 400859 [details] Set the sync/async flag properly for O_DIRECT write requests I haven't even compiled this fix yet, but I'm pretty sure it will address the problem. I'll test it out, but am attaching it early in case others have a chance to verify it before I do.
OK, that patch does not work as the bio may be null. I'll post another patch when I have a tested fix.
Created attachment 400934 [details]
honor WRITE_SYNC in cfq

Before the referenced patch, WRITE_SYNC I/O was treated by CFQ as async I/O and was put into the async cfqq. The old code had only a single async cfqq. The new code has two cfqqs per io context: one for sync I/O and one for async I/O. So, in the test case above, instead of a single queue competing for the device, we now have one per dd process.

Further, if this had been AIO batching sequential O_DIRECT writes, there would be no front merging going on: the front merge code does treat sync bios properly, so it would look in the wrong place to find merge candidates. The dd case is not affected by this, as dd only issues one request at a time.

Now, I changed the logic to treat WRITE_SYNCs properly, and I still observed the slowdown. The issue is simply that we now idle on each cfqq in turn, causing us to issue I/O to the device more slowly.

To prove this, I switched to using aio-stress as a test tool. It has the ability to run single-threaded on multiple files, or one thread per file. I ran both configurations on the kernel with my latest patch, and then on the -164 kernel that ships with RHEL 5.4.

The results show pretty much what I would expect. For the -164 kernel, there should be no difference between the single-thread and two-thread cases, as both serve requests from a single cfqq. This is what we find:

  number of runs: 5

         1 thread   2 threads
  avg:   12.062     12.138
  std:    0.56791    0.9351
  var:    0.32252    0.87442

Now, when we run the kernel with my latest patch, I would expect to see the same performance degradation we saw before for the 2-thread case. As a reminder, that was roughly 18%. What we see is a 19% regression, so that's about right. Then, when comparing a single thread versus two threads, I would expect a single thread to perform better. In fact, this is what we see.
What's more, it outperforms the -164 kernel:

         1 thread*  2 threads
  avg:   12.8575     9.864
  std:    0.55265    0.29846
  var:    0.30543    0.08908

  * for the single-threaded run, I only took 4 data samples

If you have a workload that does the equivalent of the two-threaded case above, then it would be best to disable slice idling. By doing so, cfq will switch queues more quickly and get your performance back up to what it was in the previous kernel. (I verified this empirically.) I would not expect anyone to actually have to make such an adjustment, though, as most databases drive queue depths deeper than 1 when doing sequential I/O.

If you can point out a real-world workload that is negatively affected by the changes introduced in this kernel, then I would be more than happy to address it. For now, it appears that there is a net gain for applications written using best practices.

In summary, I will push this latest patch into an async erratum for RHEL 5.5 as a bug fix. If you can provide a real-world application that is negatively affected by the changes present in the patched 5.5 kernel, then I will be happy to revisit this problem. For now, I am unconvinced that it will be an issue in the field. Please understand that I am open to hearing dissenting opinions on this.

Finally, thanks for the bug report.
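Disabling slice idling can be done through the CFQ sysfs tunable; a sketch, using /dev/sdb as an example device (the setting does not persist across reboots):

```shell
# Show the current idle slice (in ms) for cfq on sdb.
cat /sys/block/sdb/queue/iosched/slice_idle

# Disable idling: cfq will switch between queues immediately instead of
# waiting for further I/O from the current queue.
echo 0 > /sys/block/sdb/queue/iosched/slice_idle
```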
Looking back through the upstream changelog, it seems I should have backported this commit:

  commit 7749a8d423c483a51983b666613acda1a4dd9c1b
  Author: Jens Axboe <jens.axboe>
  Date:   Wed Dec 13 13:02:26 2006 +0100

      [PATCH] Propagate down request sync flag

      We need to do this, otherwise the io schedulers don't get access to
      the sync flag. Then they cannot tell the difference between a regular
      write and an O_DIRECT write, which can cause a performance loss.

      Signed-off-by: Jens Axboe <jens.axboe>
Created attachment 402374 [details] block: introduce the rq_is_sync macro
Created attachment 402375 [details] block: Propagate down request sync flag
in kernel-2.6.18-198.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When two or more writes using the O_DIRECT file access interface ran simultaneously on separate partitions of the same disk, the performance of both writes was reduced. Running two simultaneous "dd oflag=direct" commands on two different partitions could slow writes by approximately 25%. This regression has been fixed in this update, so that O_DIRECT writes no longer incur this performance penalty.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html