Red Hat Bugzilla – Bug 459520
Performance degradation on OLTP workload
Last modified: 2008-11-13 15:23:22 EST
KARL RISTER <email@example.com> - 2008-08-14 21:30 EDT
On a large OLTP benchmark, a 4.5% performance degradation was observed when
upgrading the kernel from the RHEL5.2 kernel (2.6.18-92.el5) to a pre-release
RHEL 5.3 kernel (2.6.18-103.el5).
The pre-release kernel was built by installing the source rpm and then running
rpmbuild (source was used instead of a binary because this was being done in
anticipation of testing a patch).
Contact Information = Karl Rister (firstname.lastname@example.org) / Steve Pratt
---Additional Hardware Info---
2 node x3950 M2
8 x six-core processors (48 cores, 48 threads)
Large Disk Setup
80 block devices (each a 24 disk RAID 0)
8 Dual Port Fiber Channel Adapters
Linux itcopus83.austin.ibm.com 2.6.18-103.el5 #1 SMP Tue Aug 12 13:27:11 CDT
2008 x86_64 x86_64 x86_64 GNU/Linux
Machine Type = x3950 M2 4RZ-7141
---Steps to Reproduce---
This is a large specialized setup.
KARL RISTER <email@example.com> - 2008-08-14 21:31 EDT
sysctl -a output
KARL RISTER <firstname.lastname@example.org> - 2008-08-15 11:28 EDT
The previous kernel was the RHEL5.2 GA distribution binary. I am working on
getting detailed profiling information to make comparisons.
KARL RISTER <email@example.com> - 2008-08-18 10:54 EDT
Here is a diff of the sysctl output from 2.6.18-92.el5 to 2.6.18-103.el5. The
most obvious changes that I see are some new blocks of NFS-related parameters.
We do have some NFS interaction in our test because that is where the binaries
we are loading are located, but any data should be faulted in before the
measurement period begins.
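Comparing two `sysctl -a` dumps can be scripted rather than eyeballed; a minimal sketch (not the tooling actually used here), assuming each dump line has the usual `key = value` form:

```python
def diff_sysctl(old: str, new: str):
    """Return (added, removed, changed) keys between two `sysctl -a` dumps."""
    def parse(text):
        params = {}
        for line in text.strip().splitlines():
            if " = " in line:
                key, _, val = line.partition(" = ")
                params[key.strip()] = val.strip()
        return params

    a, b = parse(old), parse(new)
    added = sorted(set(b) - set(a))            # e.g. new NFS parameter blocks
    removed = sorted(set(a) - set(b))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return added, removed, changed
```

Added keys (such as the new NFS blocks noted above) show up in the first list; keys whose values differ between the kernels show up in the third.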
Created attachment 314561 [details]
sysctl -a output
Created attachment 314562 [details]
Can we get the exact test being run and some before and after kernel upgrade output?
If I remember right, from what I first heard about this, "TPC-C" was the test, but it would be good to get exact details about the test run and the results seen rather than just a high-level problem definition. We need to be able to reproduce this in house in order to fix it and verify any possible fix.
The test being run is on a just-disclosed system using Intel Dunnington
processors. A result was published on tpc.org today for a 1,200,632 tpmC score.
The published score is a 3 tier run. For the bug, I have been running in 2 tier
mode which will allow the system to run faster than what the database was built
for. This was done to facilitate testing of the fastgup patch. While testing
fastgup I made a baseline score on the RHEL 5.2 kernel and a baseline score on
the RHEL 5.3 kernel before building the 5.3 kernel with the fastgup patch. The
scores for the 2 baseline runs are:
Kernel                           2 tier TPC-C Score
RHEL 5.2 (2.6.18-92.el5)         1,209,812.48
RHEL 5.3 test (2.6.18-103.el5)   1,157,443.24
The degradation from 5.2 to 5.3 here is measured as 4.3%. The 4.3% number is
lower than the 4.5% reported earlier because I re-ran the test after applying a
small patch to enable Oprofile support for Dunnington. I will attach that patch soon.
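The 4.3% figure follows directly from the two baseline scores; a quick check of the arithmetic:

```python
# Relative degradation between the two baseline TPC-C runs.
base = 1_209_812.48   # RHEL 5.2 (2.6.18-92.el5)
test = 1_157_443.24   # RHEL 5.3 test (2.6.18-103.el5)

degradation_pct = (base - test) / base * 100
print(f"{degradation_pct:.1f}%")  # → 4.3%
```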
The most noticeable difference I have found in the profiling data so far is that
the 5.3 configuration has more idle time (shown in vmstat output as iowait)
indicating that the system is not able to drive as hard. I will also attach
vmstat output for both configs.
Note that the vmstat data includes the ramp-up period before the measurements are taken.
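The iowait comparison can be pulled out of the attached vmstat logs with a small helper; a sketch, assuming the standard vmstat interval-mode column layout (this is illustrative, not the analysis actually used):

```python
def mean_iowait(vmstat_text: str) -> float:
    """Average the 'wa' (iowait) column of interval-mode `vmstat` output.

    Assumes the usual layout: a banner line, then a header line naming
    the columns (r b swpd ... id wa ...), then one sample per interval.
    """
    rows = [line.split() for line in vmstat_text.strip().splitlines()]
    header = next(r for r in rows if "wa" in r)   # the column-name line
    wa = header.index("wa")
    samples = [int(r[wa]) for r in rows if r[0].isdigit() and len(r) > wa]
    return sum(samples) / len(samples)
```

Running this over the 5.2 and 5.3 logs (ideally excluding the ramp-up samples) would quantify the extra idle/iowait time seen on 2.6.18-103.el5.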
Created attachment 314580 [details]
dunnington oprofile support
Created attachment 314581 [details]
RHEL 5.2 (2.6.18-92.el5) vmstat
Created attachment 314582 [details]
RHEL 5.3 test (2.6.18-103.el5) vmstat
Created attachment 314583 [details]
RHEL 5.2 (2.6.18-92.el5) oprofile symbols
Created attachment 314584 [details]
RHEL 5.2 (2.6.18-92.el5) oprofile binaries
Created attachment 314585 [details]
RHEL 5.3 (2.6.18-103.el5) oprofile symbols
Created attachment 314586 [details]
RHEL 5.3 (2.6.18-103.el5) oprofile binaries
It turns out that this is not a RHEL 5.3 kernel bug. A bug in a script that
tunes the scheduling priority of the DB2 log writer thread is at fault. When
the thread is tuned correctly the performance returns to where it should be. It
is unclear why this happened when the kernel was changed, but the logic in the
script is sufficiently broken for me to say that this is not a kernel problem.
I am changing the status to not a bug and will wait a while to close it out in
case anyone has additional comments.
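The faulty tuning script itself is not attached; for illustration only, the kind of priority adjustment such a script performs can be sketched with os.setpriority (the lookup of the DB2 log writer's PID is omitted and would be site-specific):

```python
import os

def renice(pid: int, niceness: int) -> int:
    """Set the nice value of `pid` and return the value now in effect.

    A log-writer tuning script would pass the DB2 log writer's PID and a
    negative niceness (which requires root) so the scheduler favors it;
    if the script mis-identifies the PID, the tuning silently does nothing.
    """
    os.setpriority(os.PRIO_PROCESS, pid, niceness)
    return os.getpriority(os.PRIO_PROCESS, pid)
```

Called with pid=0 it acts on the calling process; raising the nice value (lowering priority) needs no privileges, while lowering it does.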