Description of problem:
In RHEL 5.3, __blockdev_direct_IO() allocates the dio struct using kzalloc() (instead of kmalloc() as in RHEL 5.2), which causes a ~0.5% OLTP performance regression.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Measure RHEL 5.3 OLTP performance.
2. Apply the attached patch to revert __blockdev_direct_IO() to using kmalloc() for the dio struct.
3. Measure the patched kernel's OLTP performance delta.

Actual results:

Expected results:

Additional info:
RHEL 5.2 uses kmalloc() to allocate the dio and initializes selected dio fields in the downstream direct_io_worker() function. RHEL 5.3's kzalloc() uses memset to clear the whole dio struct, which creates a longer code path and more cache eviction.
This was introduced by the following commit. I'll discuss this with Zach.

commit 848c4dd5153c7a0de55470ce99a8e13a63b4703f
Author: Zach Brown <zach.brown>
Date:   Mon Aug 20 17:12:01 2007 -0700

    dio: zero struct dio with kzalloc instead of manually

    This patch uses kzalloc to zero all of struct dio rather than manually
    trying to track which fields we rely on being zero.  It passed aio+dio
    stress testing and some bug regression testing on ext3.

    This patch was introduced by Linus in the conversation that lead up to
    Badari's minimal fix to manually zero .map_bh.b_state in commit:

      6a648fa72161d1f6468dabd96c5d3c0db04f598a

    It makes the code a bit smaller.  Maybe a couple fewer cachelines to
    load, if we're lucky:

       text    data     bss     dec     hex filename
    3285925  568506 1304616 5159047  4eb887 vmlinux
    3285797  568506 1304616 5158919  4eb807 vmlinux.patched

    I was unable to measure a stable difference in the number of cpu cycles
    spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
    ~340MB/s.

    So the resulting intent of the patch isn't a performance gain but to
    avoid exposing ourselves to the risk of finding another field like
    .map_bh.b_state where we rely on zeroing but don't enforce it in the
    code.

    Signed-off-by: Zach Brown <zach.brown>
    Signed-off-by: Linus Torvalds <torvalds>
"0.5%" doesn't tell us how stable the measurement is. Is that over a really long run? Is it averaged from a lot of short runs? Standard deviation?

If it is indeed a consistent regression then we should fix it. I imagine the regression would come from zeroing the page pointer array -- 64 pointers. I could see that amount of additional writes overwhelming the advantage of the code size savings that the patch introduced.

I would rather that we not just revert the patch, though. We could take the opportunity to initialize the fields in memory order and perhaps introduce a helper which initializes a buffer_head. And we'd need a giant comment saying that people are actually spending effort watching the performance impact of this tiny corner of the kernel.
Could Intel please comment on Zach's observations, please? If we do come up with a patch, will you be able to test it for us?
The scaled OLTP database takes some time to prepare for each run. We are doing about 140k reads/100k writes per second. I opted for a longer run (1 hour of measurement period in this case) instead of many short runs.

We measured again with EL5.4 (-155.el5), comparing kzalloc(dio) against zeroing only limited dio fields. Zeroing limited dio fields gave about +0.47% performance in EL5.4. Hope this addresses the consistent-regression concern.

Direct I/O performance has a big impact on the OLTP workload, as we are going to higher and higher IOPS with more powerful servers. We will be watching closely with each OS release. We are happy to test any patch related to this issue. Thanks.
Thanks for your response! I'll see about putting a patch together for this for upstream and RHEL 5.5.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
(In reply to comment #5)
> I'll see about putting a patch together for this for upstream

Thanks Jeff. Let me know if I can help.

We might be able to make a small change that makes everyone happy. We could move the pages[] array, and its few associated fields, to the end of the struct. We can allocate with kmalloc() and clear with an explicit memset() which uses offsetof(struct dio, pages) to clear everything but the array of page pointers.

It'd have the added benefit of moving the cachelines with the last few page pointers -- which are almost never referenced -- to the end of the struct instead of having them in the middle between fields that are always referenced :).
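The layout change proposed above can be sketched in userspace C. The struct and field names are illustrative stand-ins rather than the real fs/direct-io.c definitions, and malloc() stands in for kmalloc(); the point is the offsetof()-bounded memset, which clears every field that must start zeroed while skipping the page pointer array at the tail.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DIO_PAGES 64

/* Illustrative layout only: the page pointer array sits at the very end
 * of the struct, so a single memset can clear everything before it. */
struct dio_sketch {
	int flags;
	int rw;
	long block_in_file;
	/* ... all the always-referenced, must-be-zeroed fields ... */
	unsigned head;               /* first valid entry in pages[] */
	unsigned tail;               /* one past the last valid entry */
	void *pages[DIO_PAGES];      /* only pages[head..tail-1] is ever
	                              * read, so it stays uninitialized */
};

/* Userspace analog of the proposed kmalloc()-plus-partial-memset path. */
static struct dio_sketch *dio_alloc(void)
{
	struct dio_sketch *dio = malloc(sizeof(*dio));

	if (dio)
		memset(dio, 0, offsetof(struct dio_sketch, pages));
	return dio;
}
```

Note that head and tail are kept on the cleared side of the boundary: zeroing them makes the valid window of pages[] start out empty, which is exactly why the array itself never needs zeroing.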
Yeah, that will work for upstream. For RHEL I think we can get away with just zeroing out the buffer head. I'll whip up a couple of patches.
@Intel, @Oracle, We need to confirm that there is commitment to test for the resolution of this request during the RHEL 5.5 Test Phase, if it is accepted into the release. Please post a confirmation before Oct 23rd, 2009, including the contact information for testing engineers.
Chinang Ma (chinang.ma) will be the Intel engineer testing OLTP workload. Please let him know where to download the RHEL 5.5 test kernel once it is available.
Created attachment 366357 [details]
Don't zero out the page array in the struct dio.

Is a patch sufficient for testing, or do you require a built kernel rpm? If the latter, please let me know which architecture you'd like to test. Cheers!
A patch for EL5.4 will be good for performance evaluation. We can measure the performance delta after the patch is applied. We are running the x86_64 kernel.
OK, the attached patch (in comment #12) will apply cleanly to a RHEL 5.4 kernel. Thanks a ton for helping to test this!
> Created an attachment (id=366357)
> Don't zero out the page array in the struct dio.

For what it's worth, that patch looks good. If the testing shows that it does make a difference for performance, and you end up sending it on to mainline, feel free to add my 'Acked-by: Zach Brown <zach.brown>'.
The patch was verified using the OLTP workload on RHEL 5.4 (2.6.18-164.el5). It gives a ~0.6% performance improvement over the RHEL 5.4 baseline result.
Wow, that was a fast turn-around. Thanks! So, how does the performance of this patch compare to the 5.2 baseline you had reported previously? In other words, is this still a regression, and are you happy with the current performance?
Yes. kzalloc caused a ~0.5% performance regression (RHEL 5.3 relative to RHEL 5.2). This patch gives +0.68%. I am very happy that it recovers all of the regression related to kzalloc. Thanks.
in kernel-2.6.18-172.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html