Bug 463652 - [LTC 6.0 FEAT] 201300:Thread scalability issues with TPC-C
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: alpha
Target Release: 6.0
Assigned To: James Takahashi
QA Contact: Red Hat Kernel QE team
Keywords: FutureFeature, OtherQA
Depends On:
Blocks: 356741 RHEL6Kernel2.6.27 531073 554559 555224
Reported: 2008-09-24 01:10 EDT by IBM Bug Proxy
Modified: 2010-11-15 09:08 EST (History)
CC List: 7 users

See Also:
Fixed In Version: kernel-2.6.31-1
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2010-11-15 09:08:19 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description IBM Bug Proxy 2008-09-24 01:10:58 EDT
=Comment: #1=================================================
Emily J. Ratliff <emilyr@us.ibm.com> - 2008-09-19 13:42 EDT
1. Feature Overview:
Feature Id:	[201300]
a. Name of Feature:	Thread scalability issues with TPC-C
b. Feature Description
Improve thread scalability for TPC-C benchmarking, in particular via reduction
of DIO induced mmap_sem contention and lock contention in follow_hugetlb_page().

Additional Comments:	RHEL 5.3 integration is being tracked in RHBZ
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=447649 (in MODIFIED state as
of 9/16/2008), so this will be a validation-only request if it makes the 5.3 release.

2. Feature Details:
Sponsor:	LTC

Arch Specificity: Both
Affects Core Kernel: Yes
Delivery Mechanism: Direct from community
Category:	Kernel
Request Type:	Kernel - Performance Enhancement from Upstream
d. Upstream Acceptance:	Accepted
Sponsor Priority	1
f. Severity: High
IBM Confidential:	no
Code Contribution:	3rd party code
g. Component Version Target:	2.6.27
Performance Assistance:	yes

3. Business Case
New threaded-performance issues are being discovered today by high-end TPC-C
benchmarks.  Addressing these types of bugs as early as possible saves money and
positions the distro to be as scalable as possible during its lifetime, over which we
expect a proliferation of many-core processors and threaded
applications. This is also key for DB2 customers running the new threaded model. The
performance impact varies depending on the configuration. This feature could
boost performance for the upcoming 2-node Dunnington TPC-C publish by up to 10%.
In 5.3 early test kernels we measured just over a 2% gain.

4. Primary contact at Red Hat: 
John Jarvis

5. Primary contacts at Partner:
Project Management Contact:
Michael Hohnbaum, hbaum@us.ibm.com, 503-578-5486

Technical contact(s):
Badari Pulavarty, badari@us.ibm.com
Vaidyanathan Srinivasan, svaidyan@in.ibm.com

IBM Manager:
Pat Gaughen, gaughen@us.ibm.com
Comment 1 Bill Nottingham 2008-10-02 16:59:29 EDT
Validation-only request - setting as MODIFIED.

The feature requested has already been accepted into the upstream code base
planned for the next major release of Red Hat Enterprise Linux.

When the next milestone release of Red Hat Enterprise Linux 6 is available,
please verify that the feature requested is present and functioning as expected.
Comment 2 IBM Bug Proxy 2008-10-06 20:45:59 EDT
Changing the bug owner on the IBM side to shaggy@us.ibm.com
Comment 3 IBM Bug Proxy 2009-03-02 16:00:28 EST
upstream in 2.6.27
sha1 id: ce0ad7f0952581ba75ab6aee55bb1ed9bb22cf4f
Comment 4 John Jarvis 2009-11-10 15:57:34 EST
Is IBM planning to run RHEL 6 through TPC-C testing prior to release such that you would be able to provide feedback on this feature?
Comment 5 IBM Bug Proxy 2009-11-11 16:51:21 EST
------- Comment From slpratt@us.ibm.com 2009-11-11 16:46 EDT-------
Have done testing of the OLTP workload on the RHEL6 Alpha1 base, as well as moving up to 2.6.31-rc kernel levels.

Base RHEL6 is 16% regressed from sles10/rhel5.
Disabling some of the debug options in the kernel config reduces the regression to 10%.
Changing from SLUB to SLAB reduces the regression to 7%.

Most of the remaining regression appears to be caused by higher CPU consumption in scheduler functions.

An option to revert to the O(1) process scheduler would be good.
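The SLUB-to-SLAB change mentioned above is a kernel build-time choice. A minimal sketch of making that switch in a kernel source tree, assuming the `scripts/config` helper present in kernels of this era (the same choice can be made under "General setup -> Choose SLAB allocator" in `make menuconfig`):

```shell
# In the kernel source tree: select the SLAB allocator instead of SLUB,
# then confirm the resulting .config before rebuilding.
scripts/config --disable SLUB --enable SLAB
grep -E '^CONFIG_SL[AU]B=' .config   # expect CONFIG_SLAB=y and no CONFIG_SLUB=y
```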
Comment 6 IBM Bug Proxy 2009-11-13 09:52:23 EST
------- Comment From slpratt@us.ibm.com 2009-11-13 09:41 EDT-------
Setting CONFIG_SCHED_DEBUG, which is required to expose the CFS tunables, results in a 2% degradation.
Comment 7 IBM Bug Proxy 2009-11-15 19:10:36 EST
------- Comment From yeohc@au1.ibm.com 2009-11-15 19:09 EDT-------
A couple of things to try:

- Turn off SD_BALANCE_NEWIDLE if it's on.

- Try the patch that Anton posted a while back (http://osdir.com/ml/linux-kernel/2009-08/msg06325.html),
but only the second chunk, not the first, to see if it makes any difference. If it does, we will need to find
something smaller than INT_MAX.
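With CONFIG_SCHED_DEBUG enabled, the scheduler-domain flags can be inspected and the newidle-balance bit cleared at runtime along these lines. This is a sketch only: the bit value for SD_BALANCE_NEWIDLE (0x02 here) should be verified against include/linux/sched.h for the exact kernel in use.

```shell
# Requires CONFIG_SCHED_DEBUG and root.
# Assumption: SD_BALANCE_NEWIDLE is bit 0x02 (check include/linux/sched.h).
NEWIDLE=2
for f in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; do
    cur=$(cat "$f")
    # Clear only the newidle-balance bit, leaving the other domain flags intact.
    echo $(( cur & ~NEWIDLE )) > "$f"
done
```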
Comment 8 IBM Bug Proxy 2009-11-16 11:00:36 EST
------- Comment From slpratt@us.ibm.com 2009-11-16 10:51 EDT-------
Some comments on the TPC-C workload:

All results here are for a 2-socket Nehalem EP with 48GB of RAM.

No Java.
High thread count (the DB2 process has 1300-1400 threads).
Mostly random memory access.
~40GB of shared memory pool.
Lots of IO (300,000 IO/sec).
Moderate network traffic (2 x 1Gb links).
Comment 10 IBM Bug Proxy 2009-11-16 11:30:54 EST
------- Comment From balbir@in.ibm.com 2009-11-16 11:27 EDT-------
(In reply to comment #15)

Could you check the dirty limit on SLES10 SP2 versus RHEL (it might not be relevant right now, but just checking)? I'll take a look at the URL you pointed to as well.
Comment 11 IBM Bug Proxy 2009-11-17 00:00:24 EST
------- Comment From bharata@linux.vnet.ibm.com 2009-11-16 23:52 EDT-------
If turning off CONFIG_CGROUPS helps, then it would be interesting to see whether turning off just CONFIG_GROUP_SCHED gives the same benefit, instead of turning off cgroups entirely.
Comment 12 IBM Bug Proxy 2009-11-17 23:00:44 EST
------- Comment From bharata@linux.vnet.ibm.com 2009-11-17 22:52 EDT-------
OLTP has been found to be sensitive to sched_shares_ratelimit. Could you try increasing it if you haven't already?

Does OLTP have any realtime threads? If so, could you try setting
/proc/sys/kernel/sched_rt_runtime_us to -1?
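Both knobs suggested above live under /proc/sys/kernel and can be adjusted at runtime. A sketch, assuming a kernel that exposes sched_shares_ratelimit (it requires CONFIG_SCHED_DEBUG in this era); the 10000000 ns value is an illustrative choice, not a recommendation from this report:

```shell
# Run as root. Values are illustrative.
cat /proc/sys/kernel/sched_shares_ratelimit              # current value (ns)
echo 10000000 > /proc/sys/kernel/sched_shares_ratelimit  # raise the ratelimit
echo -1 > /proc/sys/kernel/sched_rt_runtime_us           # disable RT throttling
```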
Comment 13 IBM Bug Proxy 2009-12-07 11:51:13 EST
------- Comment From slpratt@us.ibm.com 2009-12-07 11:43 EDT-------
Oprofile was only run during a small portion of the run. We see no real impact from oprofile in the overall score.
Comment 14 John Shakshober 2009-12-07 13:06:06 EST
You can disable the cgroup memory function on stock RHEL6 alpha3 and beta1 kernels by specifying cgroup_disable=memory on the kernel line in grub.conf:

 kernel /vmlinuz-2.6.32-0.54.el6.x86_64 ro root=/dev/mapper/vg_perf4 rhgb cgroup_disable=memory quiet 3

Also note: the beta1 kernel will enable performance optimizations which have been set to debug in the rhel6 alpha kernels to date. We assume you are already disabling the up-to-70 different debug parameters if you are evaluating RHEL6 performance?
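Whether the boot parameter took effect can be confirmed after reboot; a sketch, assuming the standard /proc/cgroups column layout (subsys_name, hierarchy, num_cgroups, enabled):

```shell
# Confirm the parameter was passed at boot.
grep -o 'cgroup_disable=[^ ]*' /proc/cmdline    # should print cgroup_disable=memory

# Column 4 of /proc/cgroups is the "enabled" flag; 0 means disabled at boot.
awk '$1 == "memory" { print "memory controller enabled:", $4 }' /proc/cgroups
```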

Comment 16 IBM Bug Proxy 2010-02-01 23:20:37 EST
------- Comment From balbir@in.ibm.com 2010-02-01 23:10 EDT-------
On 2.6.32, the disable is not required.

Commit 0c3e73e84fe3f64cf1c2e8bb4e91e8901cbcdc38 in 2.6.32 fixes the memory cgroup regression. The changelog is below.

Author: Balbir Singh <balbir@linux.vnet.ibm.com>
Date:   Wed Sep 23 15:56:42 2009 -0700

memcg: improve resource counter scalability

Reduce the resource counter overhead (mostly spinlock) associated with the
root cgroup.  This is a part of the several patches to reduce mem cgroup
overhead.  I had posted other approaches earlier (including using percpu
counters).  Those patches will be a natural addition and will be added
iteratively on top of these.

The patch stops resource counter accounting for the root cgroup.  The data
for display is derived from the statistics we maintain via
mem_cgroup_charge_statistics (which is more scalable).  What happens today
is that, we do double accounting, once using res_counter_charge() and once
using memory_cgroup_charge_statistics().  For the root, since we don't
implement limits any more, we don't need to track every charge via
res_counter_charge() and check for limit being exceeded and reclaim.

The main mem->res usage_in_bytes can be derived by summing the cache and
rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE).  However, for memsw->res usage_in_bytes, we need
additional data about swapped out memory.  This patch adds a
MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE to derive the memsw data.  This data is computed
recursively when hierarchy is enabled.

The test results I see on a 24-way system show that

1. The lock contention disappears from /proc/lock_stats
2. The results of the test are comparable to running with the config turned off

Data from Prarit (kernel compile with make -j64 on a 64-CPU/32G machine)

For a single run

Without patch

real 27m8.988s
user 87m24.916s
sys 382m6.037s

With patch

real    4m18.607s
user    84m58.943s
sys     50m52.682s

With config turned off

real    4m54.972s
user    90m13.456s
sys     50m19.711s

Please look at http://www.mail-archive.com/fedora-kernel-list@redhat.com/msg02057.html as well.
Comment 17 IBM Bug Proxy 2010-05-05 17:51:25 EDT
------- Comment From shaggy@linux.vnet.ibm.com 2010-05-05 17:45 EDT-------
I don't have the resources to run the benchmarks, but I can verify that the RHEL6 kernel does contain the patches.  No surprise, since the code has been in the upstream kernel.
Comment 18 IBM Bug Proxy 2010-07-08 14:31:02 EDT
------- Comment From shaggy@linux.vnet.ibm.com 2010-07-08 14:29 EDT-------
Closing.  The mmap_sem contention has been fixed.  Any additional performance issues are outside the scope of this feature.
Comment 19 releng-rhel@redhat.com 2010-11-15 09:08:19 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
