Bug 725444 - (direct_io) __blockdev_direct_IO calls kzalloc for dio struct causes OLTP performance regression
Summary: (direct_io) __blockdev_direct_IO calls kzalloc for dio struct causes OLTP performance regression
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: All
OS: Other
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jeff Moyer
QA Contact: Eryu Guan
URL:
Whiteboard:
Duplicates: 739747 (view as bug list)
Depends On: 488161
Blocks:
 
Reported: 2011-07-25 14:06 UTC by Jeff Moyer
Modified: 2011-12-06 13:54 UTC (History)
CC List: 10 users

Fixed In Version: kernel-2.6.32-183.el6
Doc Type: Bug Fix
Doc Text:
Clone Of: 488161
Environment:
Last Closed: 2011-12-06 13:54:12 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1530 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux 6 kernel security, bug fix and enhancement update 2011-12-06 01:45:35 UTC

Description Jeff Moyer 2011-07-25 14:06:12 UTC
The RHEL 6 kernel still suffers from this problem:

+++ This bug was initially created as a clone of Bug #488161 +++

Description of problem:
In RHEL 5.3, __blockdev_direct_IO() allocates the dio struct with kzalloc() (instead of kmalloc() as in RHEL 5.2), which causes a 0.5% OLTP performance regression.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Measure RHEL 5.3 OLTP performance
2. Apply the attached patch to revert __blockdev_direct_IO() to using kmalloc for the dio struct
3. Measure the patched kernel's OLTP performance delta
  
Actual results:


Expected results:


Additional info:
RHEL 5.2 uses kmalloc to allocate the dio and initializes selected dio fields in the downstream __directio_worker() function. RHEL 5.3's kzalloc uses memset to clear the whole dio struct, which creates a longer code path and more cache eviction.
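As a rough illustration only (simplified, with hypothetical surrounding code; not the actual fs/direct-io.c source), the two allocation styles differ like this:

    /* Illustrative sketch, not the actual RHEL code. */

    /* RHEL 5.2 style: plain allocation, only selected fields initialized. */
    dio = kmalloc(sizeof(*dio), GFP_KERNEL);
    if (!dio)
            return -ENOMEM;
    dio->rw = rw;                /* selected fields set explicitly */
    dio->inode = inode;
    dio->map_bh.b_state = 0;     /* the field the earlier upstream fix had to zero */

    /* RHEL 5.3 style: kzalloc() memsets the entire struct, including the
     * large pages[] array, before the same fields are assigned. */
    dio = kzalloc(sizeof(*dio), GFP_KERNEL);
    if (!dio)
            return -ENOMEM;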

--- Additional comment from jmoyer on 2009-03-04 12:38:26 EST ---

This was introduced by the following commit.  I'll discuss this with Zach.

commit 848c4dd5153c7a0de55470ce99a8e13a63b4703f
Author: Zach Brown <zach.brown>
Date:   Mon Aug 20 17:12:01 2007 -0700

    dio: zero struct dio with kzalloc instead of manually
    
    This patch uses kzalloc to zero all of struct dio rather than manually
    trying to track which fields we rely on being zero.  It passed aio+dio
    stress testing and some bug regression testing on ext3.
    
    This patch was introduced by Linus in the conversation that led up to
    Badari's minimal fix to manually zero .map_bh.b_state in commit:
    
      6a648fa72161d1f6468dabd96c5d3c0db04f598a
    
    It makes the code a bit smaller.  Maybe a couple fewer cachelines to
    load, if we're lucky:
    
       text    data     bss     dec     hex filename
    3285925  568506 1304616 5159047  4eb887 vmlinux
    3285797  568506 1304616 5158919  4eb807 vmlinux.patched
    
    I was unable to measure a stable difference in the number of cpu cycles
    spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
    ~340MB/s.
    
    So the resulting intent of the patch isn't a performance gain but to
    avoid exposing ourselves to the risk of finding another field like
    .map_bh.b_state where we rely on zeroing but don't enforce it in the
    code.
    
    Signed-off-by: Zach Brown <zach.brown>
    Signed-off-by: Linus Torvalds <torvalds>

--- Additional comment from zach.brown on 2009-03-04 13:42:09 EST ---

"0.5%" doesn't tell us how stable the measurement is.  Is that over a really long run?  Is it averaged from a lot of short runs?  Standard deviation?

If it is indeed a consistent regression then we should fix it.  I imagine the regression would come from zeroing the page pointer array -- 64 pointers.  I could see that amount of additional writes overwhelming the advantage of the code size savings that the patch introduced.
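For rough scale (assuming 8-byte pointers and 64-byte cachelines, neither of which is stated above): 64 page pointers are 64 * 8 = 512 bytes, i.e. 8 extra cachelines written on every dio allocation just to clear the array.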

I would rather that we not just revert the patch, though.  We could take the opportunity to initialize the fields in memory order and perhaps introduce a helper which initializes a buffer_head.

And we'd need a giant comment saying that people are actually spending effort watching the performance impact of this tiny corner of the kernel.

--- Additional comment from jmoyer on 2009-09-30 13:20:10 EDT ---

Could Intel please comment on Zach's observations?  If we do come up with a patch, will you be able to test it for us?

--- Additional comment from chinang.ma on 2009-10-01 12:58:38 EDT ---

The scaled OLTP database takes some time to prepare for each run. We are doing about 140k reads / 100k writes per second. I opted for a longer run (a 1-hour measurement period in this case) instead of many short runs. We measured again on EL5.4 (-155.el5), comparing kzalloc(dio) against zeroing only limited dio fields. Zeroing the limited dio fields gave about +0.47% performance in EL5.4. Hope this addresses the consistent-regression concern.

Direct I/O performance has a big impact on the OLTP workload, as we are pushing higher and higher IOPS with more powerful servers. We will be watching closely with each OS release.
We are happy to test any patch related to this issue. Thanks.

--- Additional comment from jmoyer on 2009-10-01 13:10:29 EDT ---

Thanks for your response!  I'll see about putting a patch together for this for upstream and RHEL 5.5.

--- Additional comment from pm-rhel on 2009-10-01 14:05:01 EDT ---

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

--- Additional comment from zach.brown on 2009-10-01 15:39:15 EDT ---

(In reply to comment #5)
> I'll see about putting a patch together for this for upstream

Thanks Jeff.  Let me know if I can help.

We might be able to make a small change that makes everyone happy.  We could move the pages[] array, and its associated few fields, to the end of the struct.  We can allocate with kmalloc() and clear with an explicit memset() which uses offsetof(struct dio, pages) to clear everything but the array of page pointers.

It'd have the added benefit of moving the cachelines with the last few page pointers -- which are almost never referenced -- to the end of the struct instead of having them in the middle between fields that are always referenced :).
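A minimal sketch of that idea, with simplified and assumed field names (the real struct dio has many more members than shown):

    /* Hypothetical layout sketch -- not the merged upstream patch. */
    struct dio {
            int rw;
            struct inode *inode;
            struct buffer_head map_bh;
            /* ... other frequently referenced fields ... */

            /* Rarely referenced page array moved to the very end. */
            struct page *pages[DIO_PAGES];
    };

    dio = kmalloc(sizeof(*dio), GFP_KERNEL);
    if (dio)
            /* Zero everything up to, but not including, pages[]. */
            memset(dio, 0, offsetof(struct dio, pages));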

--- Additional comment from jmoyer on 2009-10-01 15:50:52 EDT ---

Yeah, that will work for upstream.  For RHEL I think we can get away with just zeroing out the buffer head.  I'll whip up a couple of patches.
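In code terms, "just zeroing out the buffer head" would look roughly like this (an illustrative sketch only, not taken from the patch that was eventually attached):

    dio = kmalloc(sizeof(*dio), GFP_KERNEL);
    if (!dio)
            return -ENOMEM;
    /* Clear only the embedded buffer_head instead of the whole struct. */
    memset(&dio->map_bh, 0, sizeof(dio->map_bh));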

--- Additional comment from emcnabb on 2009-10-16 16:20:48 EDT ---

Chris, can we request OtherQA from Intel on this bug, as was discussed in comment #3 / comment #4? Could we do that and/or pull in our internal performance folks to help test?

--- Additional comment from cward on 2009-10-19 07:08:32 EDT ---

@Intel, @Oracle,

We need to confirm that there is commitment to test 
for the resolution of this request during the RHEL 5.5 Test
Phase, if it is accepted into the release. 

Please post a confirmation before Oct 23rd, 2009, 
including the contact information for testing engineers.

--- Additional comment from chinang.ma on 2009-10-19 12:39:40 EDT ---

Chinang Ma (chinang.ma) will be the Intel engineer testing OLTP workload. Please let him know where to download the RHEL 5.5 test kernel once it is available.

--- Additional comment from jmoyer on 2009-10-27 19:30:06 EDT ---

Created attachment 366357 [details]
Don't zero out the page array in the struct dio.

Is a patch sufficient for testing, or do you require a built kernel rpm?  If the latter, please let me know which architecture you'd like to test.

Cheers!

--- Additional comment from chinang.ma on 2009-10-27 19:35:57 EDT ---

A patch for EL5.4 will be good for performance evaluation. We can measure the performance delta after this patch is applied. We are running the x86_64 kernel.

--- Additional comment from jmoyer on 2009-10-27 19:48:46 EDT ---

OK, the attached patch (in comment #12) will apply cleanly to a RHEL 5.4 kernel.  Thanks a ton for helping to test this!

--- Additional comment from zach.brown on 2009-10-28 17:36:35 EDT ---

> Created an attachment (id=366357)
> Don't zero out the page array in the struct dio.

For what it's worth, that patch looks good.  If the testing shows that it does make a difference for performance, and you end up sending it on to mainline, feel free to add my 'Acked-By: Zach Brown <zach.brown>'.

--- Additional comment from chinang.ma on 2009-10-29 22:19:13 EDT ---

The patch was verified using the OLTP workload on RHEL 5.4 (2.6.18-164.el5). It gives an ~0.6% performance improvement over the RHEL 5.4 baseline result.

Comment 1 RHEL Program Management 2011-07-25 14:20:23 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 4 Kyle McMartin 2011-08-09 12:18:45 UTC
Patch(es) available on kernel-2.6.32-183.el6

Comment 8 Eryu Guan 2011-09-15 10:31:02 UTC
Confirmed patch c783a17cd5c617b98136630b7f220592de364576 is applied correctly.
Also passed the aiodio and aio-stress tests from LTP on ext3, ext4, and xfs.

Comment 9 Jeff Moyer 2011-09-28 17:49:59 UTC
*** Bug 739747 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2011-12-06 13:54:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html

