Description of problem:
In RHEL 5.3, __blockdev_direct_IO() allocates the dio struct using kzalloc() (instead of kmalloc() as in RHEL 5.2), which causes a ~0.5% OLTP performance regression.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Measure RHEL 5.3 OLTP performance.
2. Apply the attached patch to revert __blockdev_direct_IO() to using kmalloc() for the dio struct.
3. Measure the patched kernel's OLTP performance delta.

Actual results:

Expected results:

Additional info:
RHEL 5.2 uses kmalloc() to allocate the dio and initializes selected dio fields in the downstream direct_io_worker() function. RHEL 5.3's kzalloc() uses memset to clear the whole dio struct, which creates a longer code path and more cache eviction.
This was introduced by the following commit. I'll discuss this with Zach.

commit 848c4dd5153c7a0de55470ce99a8e13a63b4703f
Author: Zach Brown <zach.brown>
Date:   Mon Aug 20 17:12:01 2007 -0700

    dio: zero struct dio with kzalloc instead of manually

    This patch uses kzalloc to zero all of struct dio rather than manually
    trying to track which fields we rely on being zero.  It passed aio+dio
    stress testing and some bug regression testing on ext3.

    This patch was introduced by Linus in the conversation that lead up to
    Badari's minimal fix to manually zero .map_bh.b_state in commit:

      6a648fa72161d1f6468dabd96c5d3c0db04f598a

    It makes the code a bit smaller.  Maybe a couple fewer cachelines to
    load, if we're lucky:

       text    data     bss     dec     hex filename
    3285925  568506 1304616 5159047  4eb887 vmlinux
    3285797  568506 1304616 5158919  4eb807 vmlinux.patched

    I was unable to measure a stable difference in the number of cpu cycles
    spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
    ~340MB/s.

    So the resulting intent of the patch isn't a performance gain but to
    avoid exposing ourselves to the risk of finding another field like
    .map_bh.b_state where we rely on zeroing but don't enforce it in the
    code.

    Signed-off-by: Zach Brown <zach.brown>
    Signed-off-by: Linus Torvalds <torvalds>
"0.5%" doesn't tell us how stable the measurement is. Is that over a really long run? Is it averaged from a lot of short runs? Standard deviation?

If it is indeed a consistent regression then we should fix it. I imagine the regression would come from zeroing the page pointer array -- 64 pointers. I could see that amount of additional writes overwhelming the advantage of the code size savings that the patch introduced.

I would rather that we not just revert the patch, though. We could take the opportunity to initialize the fields in memory order and perhaps introduce a helper which initializes a buffer_head. And we'd need a giant comment saying that people are actually spending effort watching the performance impact of this tiny corner of the kernel.
Could Intel please comment on Zach's observations, please? If we do come up with a patch, will you be able to test it for us?
The scaled OLTP database takes some time to prepare for each run. We are doing about 140k reads/100k writes per second. I opted for a longer run (1 hour of measurement period in this case) instead of many short runs.

We measured again with EL5.4 (-155.el5), comparing kzalloc(dio) against zeroing only limited dio fields. Zeroing limited dio fields gave about +0.47% performance in EL5.4. Hope this addresses the consistent-regression concern.

Direct I/O performance has a big impact on the OLTP workload, as we are going to higher and higher IOPS with more powerful servers. We will be watching closely with each OS release. We are happy to test any patch related to this issue. Thanks.
Thanks for your response! I'll see about putting a patch together for this for upstream and RHEL 5.5.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
(In reply to comment #5)
> I'll see about putting a patch together for this for upstream

Thanks Jeff. Let me know if I can help.

We might be able to make a small change that makes everyone happy. We could move the pages[] array, and its few associated fields, to the end of the struct. We can allocate with kmalloc() and clear with an explicit memset() which uses offsetof(struct dio, pages) to clear everything but the array of page pointers.

It'd have the added benefit of moving the cachelines with the last few page pointers -- which are almost never referenced -- to the end of the struct instead of having them in the middle between fields that are always referenced :).
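The layout change proposed above can be sketched in userspace C. The struct and field names are illustrative stand-ins rather than the real fs/direct-io.c definitions, and malloc() stands in for kmalloc(); the point is the offsetof()-bounded memset, which clears every field that must start zeroed while skipping the page pointer array at the tail.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DIO_PAGES 64

/* Illustrative layout only: the page pointer array sits at the very end
 * of the struct, so a single memset can clear everything before it. */
struct dio_sketch {
	int flags;
	int rw;
	long block_in_file;
	/* ... all the always-referenced, must-be-zeroed fields ... */
	unsigned head;               /* first valid entry in pages[] */
	unsigned tail;               /* one past the last valid entry */
	void *pages[DIO_PAGES];      /* only pages[head..tail-1] is ever
	                              * read, so it stays uninitialized */
};

/* Userspace analog of the proposed kmalloc()-plus-partial-memset path. */
static struct dio_sketch *dio_alloc(void)
{
	struct dio_sketch *dio = malloc(sizeof(*dio));

	if (dio)
		memset(dio, 0, offsetof(struct dio_sketch, pages));
	return dio;
}
```

Note that head and tail are kept on the cleared side of the boundary: zeroing them makes the valid window of pages[] start out empty, which is exactly why the array itself never needs zeroing.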
Yeah, that will work for upstream. For RHEL I think we can get away with just zeroing out the buffer head. I'll whip up a couple of patches.
@Intel, @Oracle, We need to confirm that there is commitment to test for the resolution of this request during the RHEL 5.5 Test Phase, if it is accepted into the release. Please post a confirmation before Oct 23rd, 2009, including the contact information for testing engineers.
Chinang Ma (chinang.ma) will be the Intel engineer testing OLTP workload. Please let him know where to download the RHEL 5.5 test kernel once it is available.
Created attachment 366357 [details]
Don't zero out the page array in the struct dio.

Is a patch sufficient for testing, or do you require a built kernel rpm? If the latter, please let me know which architecture you'd like to test. Cheers!
A patch for EL5.4 will be good for performance evaluation. We can measure the performance delta after the patch is applied. We are running the x86_64 kernel.
OK, the attached patch (in comment #12) will apply cleanly to a RHEL 5.4 kernel. Thanks a ton for helping to test this!
> Created an attachment (id=366357)
> Don't zero out the page array in the struct dio.

For what it's worth, that patch looks good. If the testing shows that it does make a difference for performance, and you end up sending it on to mainline, feel free to add my 'Acked-by: Zach Brown <zach.brown>'.
The patch was verified using the OLTP workload on RHEL 5.4 (2.6.18-164.el5). It gives a ~0.6% performance improvement over the RHEL 5.4 baseline result.
Wow, that was a fast turn-around. Thanks! So, how does the performance of this patch compare to the 5.2 baseline you had reported previously? In other words, is this still a regression, and are you happy with the current performance?
Yes. kzalloc caused a ~0.5% performance regression (RHEL 5.3 relative to RHEL 5.2). This patch gives +0.68%. I am very happy that it recovers all of the regression related to kzalloc. Thanks.
in kernel-2.6.18-172.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html