The RHEL 6 kernel still suffers from this problem:
+++ This bug was initially created as a clone of Bug #488161 +++
Description of problem:
In RHEL 5.3, __blockdev_direct_IO() allocates the dio struct using kzalloc() (instead of kmalloc(), as in RHEL 5.2), which causes a 0.5% OLTP performance regression.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Measure RHEL5.3 OLTP performance
2. Apply the attached patch to revert __blockdev_direct_IO() to using kmalloc() for the dio struct
3. Measure the patched kernel's OLTP performance delta
Actual results:
Expected results:
Additional info:
RHEL 5.2 uses kmalloc to allocate the dio and initializes selected dio fields in the downstream __directio_worker() function. In RHEL 5.3, kzalloc uses memset to clear the whole dio struct, which creates a longer code path and more cache eviction.
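For illustration, a minimal sketch of the two allocation styles being compared; only map_bh and pages[] are real struct dio members here, and these fragments are simplified stand-ins rather than the actual fs/direct-io.c code:

    /* RHEL 5.2 style: plain allocation, then hand-initialize only the
     * fields the code relies on starting out zeroed. */
    dio = kmalloc(sizeof(*dio), GFP_KERNEL);
    if (!dio)
            return -ENOMEM;
    dio->map_bh.b_state = 0;   /* one field that must start out zeroed */
    /* ... remaining fields are set explicitly where they are used ... */

    /* RHEL 5.3 style: kzalloc() clears the entire structure, including
     * the large pages[] array, lengthening the setup path and touching
     * more cache lines. */
    dio = kzalloc(sizeof(*dio), GFP_KERNEL);
    if (!dio)
            return -ENOMEM;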
--- Additional comment from jmoyer on 2009-03-04 12:38:26 EST ---
This was introduced by the following commit. I'll discuss this with Zach.
commit 848c4dd5153c7a0de55470ce99a8e13a63b4703f
Author: Zach Brown <zach.brown>
Date: Mon Aug 20 17:12:01 2007 -0700
dio: zero struct dio with kzalloc instead of manually
This patch uses kzalloc to zero all of struct dio rather than manually
trying to track which fields we rely on being zero. It passed aio+dio
stress testing and some bug regression testing on ext3.
This patch was introduced by Linus in the conversation that led up to
Badari's minimal fix to manually zero .map_bh.b_state in commit:
6a648fa72161d1f6468dabd96c5d3c0db04f598a
It makes the code a bit smaller. Maybe a couple fewer cachelines to
load, if we're lucky:
   text    data     bss     dec    hex filename
3285925  568506 1304616 5159047 4eb887 vmlinux
3285797  568506 1304616 5158919 4eb807 vmlinux.patched
I was unable to measure a stable difference in the number of cpu cycles
spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
~340MB/s.
So the resulting intent of the patch isn't a performance gain but to
avoid exposing ourselves to the risk of finding another field like
.map_bh.b_state where we rely on zeroing but don't enforce it in the
code.
Signed-off-by: Zach Brown <zach.brown>
Signed-off-by: Linus Torvalds <torvalds>
--- Additional comment from zach.brown on 2009-03-04 13:42:09 EST ---
"0.5%" doesn't tell us how stable the measurement is. Is that over a really long run? Is it averaged from a lot of short runs? Standard deviation?
If it is indeed a consistent regression then we should fix it. I imagine the regression would come from zeroing the page pointer array -- 64 pointers. I could see that amount of additional writes overwhelming the advantage of the code size savings that the patch introduced.
I would rather that we not just revert the patch, though. We could take the opportunity to initialize the fields in memory order and perhaps introduce a helper which initializes a buffer_head.
And we'd need a giant comment saying that people are actually spending effort watching the performance impact of this tiny corner of the kernel.
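For scale, assuming the upstream DIO_PAGES value of 64 and 8-byte pointers on x86_64, the page pointer array alone is 64 * 8 = 512 bytes, i.e. roughly eight additional 64-byte cache lines written for every direct I/O setup.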
--- Additional comment from jmoyer on 2009-09-30 13:20:10 EDT ---
Could Intel please comment on Zach's observations? If we do come up with a patch, will you be able to test it for us?
--- Additional comment from chinang.ma on 2009-10-01 12:58:38 EDT ---
The scaled OLTP database takes some time to prepare for each run. We are doing about 140k reads / 100k writes per second. I opted for one long run (a 1-hour measurement period in this case) instead of many short runs. We measured again on EL5.4 (-155.el5), comparing kzalloc(dio) against zeroing only selected dio fields. Zeroing the limited set of dio fields gave about +0.47% performance on EL5.4. I hope this addresses the concern about whether the regression is consistent.
Direct I/O performance has a big impact on the OLTP workload, as we are driving higher and higher IOPS with more powerful servers. We will be watching it closely with each OS release.
We are happy to test any patch related to this issue. Thanks.
--- Additional comment from jmoyer on 2009-10-01 13:10:29 EDT ---
Thanks for your response! I'll see about putting a patch together for this for upstream and RHEL 5.5.
--- Additional comment from pm-rhel on 2009-10-01 14:05:01 EDT ---
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
--- Additional comment from zach.brown on 2009-10-01 15:39:15 EDT ---
(In reply to comment #5)
> I'll see about putting a patch together for this for upstream
Thanks Jeff. Let me know if I can help.
We might be able to make a small change that makes everyone happy. We could move the pages[] array, and its associated few fields, to the end of the struct. We can allocate with kmalloc() and clear with an explicit memset() which uses offsetof(struct dio, pages) to clear everything but the array of page pointers.
It'd have the added benefit of moving the cachelines with the last few page pointers -- which are almost never referenced -- to the end of the struct instead of having them in the middle between fields that are always referenced :).
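A rough sketch of that idea, assuming the struct dio layout of the time; dio_alloc() is a hypothetical helper added here for illustration, not the eventual patch:

    struct dio {
            /* ... fields that must start out zeroed ... */
            struct buffer_head map_bh;      /* cached block-mapping state */
            /* ... other frequently referenced fields ... */

            /* Moved to the end: large, rarely fully used, and excluded
             * from the initial clear. */
            struct page *pages[DIO_PAGES];
    };

    static struct dio *dio_alloc(void)
    {
            struct dio *dio = kmalloc(sizeof(*dio), GFP_KERNEL);

            if (!dio)
                    return NULL;
            /* Zero everything up to, but not including, pages[]. */
            memset(dio, 0, offsetof(struct dio, pages));
            return dio;
    }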
--- Additional comment from jmoyer on 2009-10-01 15:50:52 EDT ---
Yeah, that will work for upstream. For RHEL I think we can get away with just zeroing out the buffer head. I'll whip up a couple of patches.
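A minimal sketch of what that RHEL-only shortcut might look like, assuming the allocation goes back to kmalloc() and only the embedded buffer_head is cleared (the actual attached patch is not reproduced in this thread):

    dio = kmalloc(sizeof(*dio), GFP_KERNEL);
    if (!dio)
            return -ENOMEM;
    /* Clear just the embedded buffer_head (covers map_bh.b_state)
     * instead of zeroing the whole struct dio. */
    memset(&dio->map_bh, 0, sizeof(dio->map_bh));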
--- Additional comment from emcnabb on 2009-10-16 16:20:48 EDT ---
Chris, can we request OtherQA from Intel on this bug, as was discussed in comment #3 / comment #4? That and/or pull in our internal performance folks to help test?
--- Additional comment from cward on 2009-10-19 07:08:32 EDT ---
@Intel, @Oracle,
We need to confirm that there is commitment to test
for the resolution of this request during the RHEL 5.5 Test
Phase, if it is accepted into the release.
Please post a confirmation before Oct 23rd, 2009,
including the contact information for testing engineers.
--- Additional comment from chinang.ma on 2009-10-19 12:39:40 EDT ---
Chinang Ma (chinang.ma) will be the Intel engineer testing the OLTP workload. Please let him know where to download the RHEL 5.5 test kernel once it is available.
--- Additional comment from jmoyer on 2009-10-27 19:30:06 EDT ---
Created attachment 366357
Don't zero out the page array in the struct dio.
Is a patch sufficient for testing, or do you require a built kernel rpm? If the latter, please let me know which architecture you'd like to test.
Cheers!
--- Additional comment from chinang.ma on 2009-10-27 19:35:57 EDT ---
A patch for EL5.4 will be good for performance evaluation. We can measure the performance delta after this patch is applied. We are running the x86_64 kernel.
--- Additional comment from jmoyer on 2009-10-27 19:48:46 EDT ---
OK, the attached patch (in comment #12) will apply cleanly to a RHEL 5.4 kernel. Thanks a ton for helping to test this!
--- Additional comment from zach.brown on 2009-10-28 17:36:35 EDT ---
> Created an attachment (id=366357)
> Don't zero out the page array in the struct dio.
For what it's worth, that patch looks good. If the testing shows that it does make a difference for performance, and you end up sending it on to mainline, feel free to add my 'Acked-By: Zach Brown <zach.brown>'.
--- Additional comment from chinang.ma on 2009-10-29 22:19:13 EDT ---
The patch was verified using the OLTP workload on RHEL 5.4 (2.6.18-164.el5). It gave a ~0.6% performance improvement over the RHEL 5.4 baseline result.
--- Comment 1 from RHEL Program Management on 2011-07-25 14:20:23 UTC ---
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2011-1530.html