Red Hat Bugzilla – Bug 463652
[LTC 6.0 FEAT] 201300:Thread scalability issues with TPC-C
Last modified: 2010-11-15 09:08:19 EST
Emily J. Ratliff <firstname.lastname@example.org> - 2008-09-19 13:42 EDT
1. Feature Overview:
Feature Id: 
a. Name of Feature: Thread scalability issues with TPC-C
b. Feature Description
Improve thread scalability for TPC-C benchmarking, in particular by reducing
DIO-induced mmap_sem contention and lock contention in follow_hugetlb_page().
Additional Comments: RHEL 5.3 integration is being tracked in RHBZ
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=447649 in MODIFIED state as
of 9/16/2008 so this will be a validation only request if it makes the 5.3 release.
2. Feature Details:
Arch Specificity: Both
Affects Core Kernel: Yes
Delivery Mechanism: Direct from community
Request Type: Kernel - Performance Enhancement from Upstream
d. Upstream Acceptance: Accepted
Sponsor Priority 1
f. Severity: High
IBM Confidential: no
Code Contribution: 3rd party code
g. Component Version Target: 2.6.27
Performance Assistance: yes
3. Business Case
New threaded performance issues are being discovered today by high end TPC-C
benchmarks. Addressing these types of bugs as early as possible saves money and
positions the distro to be as scalable as possible during its lifetime, over
which we expect a proliferation of many-core processors and threaded
applications. This is also key for DB2 customers running the new threaded model.
The performance impact varies depending on the configuration. This feature could
boost performance for the upcoming 2 node Dunnington TPC-C publish by up to 10%.
In 5.3 early test kernels we measured just over a 2% gain.
4. Primary contact at Red Hat:
5. Primary contacts at Partner:
Project Management Contact:
Michael Hohnbaum, email@example.com, 503-578-5486
Badari Pulavarty, firstname.lastname@example.org
Vaidyanathan Srinivasan, email@example.com
Pat Gaughen, firstname.lastname@example.org
Validation-only request - setting as MODIFIED.
The feature requested has already been accepted into the upstream code base
planned for the next major release of Red Hat Enterprise Linux.
When the next milestone release of Red Hat Enterprise Linux 6 is available,
please verify that the feature requested is present and functioning as
Changing the bug owner on the IBM side to email@example.com
upstream in 2.6.27
sha1 id: ce0ad7f0952581ba75ab6aee55bb1ed9bb22cf4f
Is IBM planning to run RHEL 6 through TPC-C testing prior to release such that you would be able to provide feedback on this feature?
------- Comment From firstname.lastname@example.org 2009-11-11 16:46 EDT-------
Have done testing of OLTP workload on RHEL6 Alpha1 base 18.104.22.168r5 as well as moving up to 2.6.31-rc kernel levels.
Base RHEL6 is 16% regressed from SLES10/RHEL5.
Disabling some of the debug options in the kernel config reduces the regression to 10%.
Changing from SLUB to SLAB reduces the regression to 7%.
Most of the remaining regression appears to be caused by higher CPU consumption in scheduler functions.
An option to revert the process scheduler to O(1) would be good.
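
The figures in the comment above can be broken down step by step. The following is only an arithmetic sketch of the numbers quoted there; the labels and the helper name recovered_points are illustrative, and the differences are percentage points relative to the SLES10/RHEL5 baseline.

```python
# Regression figures quoted in the comment above, in percent relative to
# the SLES10/RHEL5 baseline. Labels are illustrative only.
steps = [
    ("base RHEL6 Alpha1 kernel", 16.0),
    ("some debug options disabled", 10.0),
    ("SLUB replaced with SLAB", 7.0),
]


def recovered_points(steps):
    """Return (label, percentage points recovered versus the previous step)."""
    out = []
    for (_, prev), (label, cur) in zip(steps, steps[1:]):
        out.append((label, prev - cur))
    return out


for label, gain in recovered_points(steps):
    print(f"{label}: {gain:.0f} points recovered")
print(f"still unexplained: {steps[-1][1]:.0f} points")
```

Read this way, the debug options account for 6 points, the allocator change for 3, and the 7 points attributed mostly to scheduler overhead remain.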
------- Comment From email@example.com 2009-11-13 09:41 EDT-------
Setting CONFIG_SCHED_DEBUG, which is required to expose the CFS tunables, results in a 2% degradation.
------- Comment From firstname.lastname@example.org 2009-11-15 19:09 EDT-------
A couple of things to try:
- Turn off SD_BALANCE_NEWIDLE if it's on
- Try this patch that Anton posted a while back http://osdir.com/ml/linux-kernel/2009-08/msg06325.html
but only the second chunk, not the first, to see if it makes any difference. If it does, we will need to find
something smaller than INT_MAX
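
For the first suggestion, a minimal sketch of clearing one bit in a sched-domain flags value. It assumes that SD_BALANCE_NEWIDLE is bit 0x2 (as I recall it from include/linux/sched.h in kernels of that era; please verify against the kernel under test) and that the flags are read from and written back to a per-domain file such as /proc/sys/kernel/sched_domain/cpuN/domainM/flags, which is also an assumption and depends on CONFIG_SCHED_DEBUG. The code only manipulates the integer and does not touch /proc.

```python
# ASSUMPTION: SD_BALANCE_NEWIDLE is bit 0x2; check include/linux/sched.h
# for the kernel actually in use before relying on this value.
SD_BALANCE_NEWIDLE = 0x2


def is_newidle_set(flags: int, bit: int = SD_BALANCE_NEWIDLE) -> bool:
    """Report whether the newidle-balance bit is set in a flags value."""
    return bool(flags & bit)


def without_newidle(flags: int, bit: int = SD_BALANCE_NEWIDLE) -> int:
    """Return the flags value with the newidle-balance bit cleared."""
    return flags & ~bit


# Hypothetical flags value, for illustration only.
flags = 0x2F
if is_newidle_set(flags):
    flags = without_newidle(flags)
print(hex(flags))
```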
------- Comment From email@example.com 2009-11-16 10:51 EDT-------
Some comments on the TPC-C workload:
All results here are for a 2-socket Nehalem EP with 48GB
High Thread count (DB2 process has 1300-1400 threads)
Mostly random memory access
~40GB of shared memory pool
Lots of IO (300,000 io/sec)
Moderate Network traffic (2 x 1GB links)
------- Comment From firstname.lastname@example.org 2009-11-16 11:27 EDT-------
(In reply to comment #15)
Could you check the dirty limit on SLES10 SP2 versus RHEL (might not be relevant right now, but just checking)? I'll take a look at the URL you pointed to as well.
------- Comment From email@example.com 2009-11-16 23:52 EDT-------
If turning off CONFIG_CGROUPS helps, then it would be interesting to see whether turning off just CONFIG_GROUP_SCHED gives the same benefit, instead of turning cgroups off entirely.
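
To compare which of these options a given kernel build has enabled, a small sketch that parses kernel config text. The sample text below is hypothetical; on a real system the text would typically come from a file such as /boot/config-<release>, which is an assumption about the installation rather than something stated in this report.

```python
def parse_kernel_config(text: str) -> dict:
    """Map CONFIG_* names to their values; '... is not set' maps to 'n'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


# Hypothetical config fragment, for illustration only.
sample = """\
CONFIG_CGROUPS=y
CONFIG_GROUP_SCHED=y
# CONFIG_SCHED_DEBUG is not set
"""

opts = parse_kernel_config(sample)
for name in ("CONFIG_CGROUPS", "CONFIG_GROUP_SCHED", "CONFIG_SCHED_DEBUG"):
    print(name, opts.get(name, "absent"))
```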
------- Comment From firstname.lastname@example.org 2009-11-17 22:52 EDT-------
OLTP has been found to be sensitive to sched_shares_ratelimit. Could you try increasing it if you haven't already?
Does OLTP have any realtime threads? If so, could you try setting
/proc/sys/kernel/sched_rt_runtime_us to -1?
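
A minimal sketch of reading and changing the tunable mentioned above. The helper names are illustrative. Writing requires root, so the write is left commented out. The same helpers could be pointed at sched_shares_ratelimit, assuming the kernel under test exposes it under /proc/sys/kernel/, which should be checked first.

```python
from pathlib import Path

RT_RUNTIME = Path("/proc/sys/kernel/sched_rt_runtime_us")


def read_tunable(path: Path) -> int:
    """Read a single-integer tunable from a procfs-style file."""
    return int(path.read_text().strip())


def write_tunable(path: Path, value: int) -> None:
    """Write a single-integer tunable; needs root for files under /proc/sys."""
    path.write_text(f"{value}\n")


if __name__ == "__main__":
    if RT_RUNTIME.exists():
        print("sched_rt_runtime_us =", read_tunable(RT_RUNTIME))
        # -1 removes the realtime runtime limit; uncomment to apply as root.
        # write_tunable(RT_RUNTIME, -1)
```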
------- Comment From email@example.com 2009-12-07 11:43 EDT-------
Oprofile was only run during a small portion of the run. We see no real impact from oprofile in the overall score.
You can disable the cgroup memory function on stock RHEL6 alpha3 and beta1 kernels by specifying cgroup_disable=memory on the kernel line in grub.conf:
kernel /vmlinuz-2.6.32-0.54.el6.x86_64 ro root=/dev/mapper/vg_perf4 rhgb cgroup_disable=memory quiet 3
Also note - the beta1 kernel will enable performance optimizations which have been set to debug in the RHEL6 alpha kernels to date. We assume you are already disabling up to 70 different debug parameters if you are already evaluating RHEL6 performance?
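
To confirm that a boot actually picked up cgroup_disable=memory, a sketch that checks a kernel command line string. On a running system the string would come from /proc/cmdline; here it is the option portion of the grub.conf line quoted above. The function name is illustrative, and treating the value as a comma-separated list is my reading of how the parameter is parsed.

```python
def cgroup_disabled(cmdline: str, controller: str = "memory") -> bool:
    """Return True if cmdline carries cgroup_disable= naming the controller."""
    for token in cmdline.split():
        if token.startswith("cgroup_disable="):
            if controller in token.split("=", 1)[1].split(","):
                return True
    return False


# Option portion of the grub.conf kernel line quoted in the comment above.
example = "ro root=/dev/mapper/vg_perf4 rhgb cgroup_disable=memory quiet 3"
print(cgroup_disabled(example))
```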
------- Comment From firstname.lastname@example.org 2010-02-01 23:10 EDT-------
On 2.6.32, the disable is not required.
Commit 0c3e73e84fe3f64cf1c2e8bb4e91e8901cbcdc38 in 2.6.32 fixes the memory cgroup regression. The changelog is below.
Author: Balbir Singh <email@example.com>
Date: Wed Sep 23 15:56:42 2009 -0700
memcg: improve resource counter scalability
Reduce the resource counter overhead (mostly spinlock) associated with the
root cgroup. This is a part of the several patches to reduce mem cgroup
overhead. I had posted other approaches earlier (including using percpu
counters). Those patches will be a natural addition and will be added
iteratively on top of these.
The patch stops resource counter accounting for the root cgroup. The data
for display is derived from the statistics we maintain via
mem_cgroup_charge_statistics (which is more scalable). What happens today
is that, we do double accounting, once using res_counter_charge() and once
using memory_cgroup_charge_statistics(). For the root, since we don't
implement limits any more, we don't need to track every charge via
res_counter_charge() and check for limit being exceeded and reclaim.
The main mem->res usage_in_bytes can be derived by summing the cache and
rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need
additional data about swapped out memory. This patch adds a
MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed
recursively when hierarchy is enabled.
The tests results I see on a 24 way show that
1. The lock contention disappears from /proc/lock_stats
2. The results of the test are comparable to running with
Data from Prarit (kernel compile with make -j64 on a 64
For a single run
With config turned off
Please look at http://firstname.lastname@example.org/msg02057.html as well.
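
The derivation described in the changelog above can be modeled roughly as follows. This is a toy model and not kernel code: the Group class, the field names, the assumption that the counters are kept in pages, and the 4096-byte page size are all illustrative. It only shows usage being summed from RSS and cache statistics, with swapped-out pages added for the memsw figure and children included when hierarchy is enabled.

```python
from dataclasses import dataclass, field
from typing import List

PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB pages, counters kept in pages


@dataclass
class Group:
    """Toy stand-in for a memory cgroup's statistics (not the kernel struct)."""
    rss_pages: int = 0
    cache_pages: int = 0
    swapout_pages: int = 0
    children: List["Group"] = field(default_factory=list)


def usage_in_bytes(group: Group, include_swap: bool = False,
                   hierarchy: bool = True) -> int:
    """Derive usage from statistics instead of a separate charge counter."""
    pages = group.rss_pages + group.cache_pages
    if include_swap:  # memsw-style figure also counts swapped-out pages
        pages += group.swapout_pages
    total = pages * PAGE_SIZE
    if hierarchy:  # recurse into children when hierarchy is enabled
        for child in group.children:
            total += usage_in_bytes(child, include_swap, hierarchy)
    return total


root = Group(rss_pages=10, cache_pages=5, swapout_pages=2,
             children=[Group(rss_pages=1, cache_pages=1, swapout_pages=1)])
print(usage_in_bytes(root), usage_in_bytes(root, include_swap=True))
```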
------- Comment From email@example.com 2010-05-05 17:45 EDT-------
I don't have the resources to run the benchmarks, but I can verify that the RHEL6 kernel does contain the patches. No surprise, since the code has been in the upstream kernel.
------- Comment From firstname.lastname@example.org 2010-07-08 14:29 EDT-------
Closing. The mmap_sem contention has been fixed. Any additional performance issues are outside the scope of this feature.
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.