Bug 90116 - unusual performance degredation between RHL 8.0 and RHL 9
unusual performance degredation between RHL 8.0 and RHL 9
Status: CLOSED RAWHIDE
Product: Red Hat Linux
Classification: Retired
Component: kernel (Show other bugs)
9
All Linux
medium Severity medium
: ---
: ---
Assigned To: Dave Jones
Brian Brock
:
Depends On:
Blocks: CambridgeBlocker
  Show dependency treegraph
 
Reported: 2003-05-02 14:55 EDT by Chris Ricker
Modified: 2015-01-04 17:02 EST (History)
9 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2003-10-21 16:08:08 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Chris Ricker 2003-05-02 14:55:01 EDT
We have a classroom full of identical machines. These machines are physically
identical, and have been verified to have same BIOS settings, etc.

On RHL 8.0, the machines all performed similarly. When we upgraded the classroom
to RHL 9 (kickstart install -- all machines get the same configuration), we
discovered that some of the machines now perform much more slowly than others.
On "slow" machines, logging into gdm takes minutes longer than on "fast"
machines, for example.

A good test case we've found is timing "ls -al /etc > /dev/null" ; there's
roughly an order of magnitude difference between "slow" machines and "fast"
machines.

Here's a representative run from a "fast" machine in runlevel 1 running the
latest errata kernel (2.4.20-9.i686) and latest errata glibc (2.3.2-27.9.i686):

execve("/bin/ls", ["ls", "-al", "/etc"], [/* 25 vars */]) = 0
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 64.86    0.012815          44       290           lstat64
 12.68    0.002505          63        40        13 open
  5.49    0.001085          60        18           readlink
  5.04    0.000996          40        25           read
  2.40    0.000474          21        23           munmap
  1.89    0.000374          53         7           getdents64
  1.67    0.000330          15        22           mmap2
  1.11    0.000220           7        31           fcntl64
  1.09    0.000216           7        30           close
  0.80    0.000158          16        10           old_mmap
  0.69    0.000137          69         2         2 connect
  0.69    0.000136           5        28           fstat64
  0.68    0.000134          67         2           socket
  0.52    0.000103           6        17           brk
  0.18    0.000036           7         5           write
  0.07    0.000014           5         3         2 rt_sigaction
  0.06    0.000012           4         3         3 ioctl
  0.05    0.000010          10         1           uname
  0.02    0.000004           4         1           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.019759                   558        20 total

Here's a representative run from a "slow" machine in runlevel 1 running the
latest errata kernel (2.4.20-9.i686) and latest errata glibc (2.3.2-27.9.i686):

execve("/bin/ls", ["ls", "-al", "/etc"], [/* 10 vars */]) = 0
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.85    0.017806          61       290           lstat64
 11.24    0.004565         198        23           read
  8.00    0.003250         125        26         1 open
  6.02    0.002443         136        18           readlink
  5.90    0.002396         109        22           munmap
  4.63    0.001882         269         7           getdents64
  4.55    0.001848          92        20           mmap2
  3.43    0.001392          45        31           fcntl64
  2.69    0.001092          39        28           close
  2.46    0.000998          59        17           brk
  1.66    0.000676          26        26           fstat64
  1.25    0.000509          46        11           old_mmap
  1.15    0.000468         234         2           socket
  0.98    0.000397          79         5           write
  0.93    0.000379         126         3         3 ioctl
  0.71    0.000288         144         2         2 connect
  0.48    0.000193         193         1           set_thread_area
  0.03    0.000012          12         1           uname
  0.02    0.000010           3         3         2 rt_sigaction
  0.01    0.000004           4         1           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.040608                   537         8 total

It appears to be a kernel issue for a couple of reasons:

* If I boot the RHL 8 kernels on those machines, both the "fast" and the "slow"
machines are fast
* If I boot the RHL 9 kernels with the nosysinfo (to disable NPTL) option, the
"fast" machine is fast and the "slow" machine is slow

These machines are older 500 MHz Celeron boxes with 256 megs RAM.

CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 128K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU:     After generic, caps: 0183f9ff 00000000 00000000 00000000
CPU:             Common caps: 0183f9ff 00000000 00000000 00000000
CPU: Intel Celeron (Mendocino) stepping 05

I've also tried the latest Rawhide kernel on these boxes (2.4.20-1.1976) and the
"slow" machines are still slow and the "fast" machines are still fast

Let me know what other information we can supply....
Comment 1 Arjan van de Ven 2003-05-02 14:59:42 EDT
nosysinfo doesn't disable nptl; it just makes glibc use another thread library. 

Could it be that userspace just got bigger? 
Comment 2 Dax Kelson 2003-05-02 15:02:19 EDT
On the "slow" machines, installing the the RHL8.0 kernel fixes the problem.
Comment 3 Chris Ricker 2003-05-02 15:05:28 EDT
In case it's not clear, all machines are running RHL 9. RHL 9 userspace + RHL 9
kernel is slow on "slow" boxes and fast on "fast" boxes. RHL 9 userspace + RHL
8.0 kernel is fast on both "slow" and "fast" boxes
Comment 4 Dax Kelson 2003-05-02 15:08:49 EDT
"slow" boxes and "fast" boxes are identical hardware and software wise (RHL9).

The "slow" boxes are slow because their system calls are 4x-10x slower than the
other boxes.

In a room of 14 systems, about 50% are slow and %50 are fast.

Installing the RHL8.0 2.4.18-14 kernel on the "slow" boxes makes their system
run at a normal speed.

Maybe we should have said "slow" and "normal" (instead of fast).
Comment 5 Arjan van de Ven 2003-05-02 15:13:27 EDT
please try the 2.4.20-9.5 kernel at
http://people.redhat.com/arjanv/testkernels
it has some tweaks for performance
Comment 6 Dax Kelson 2003-05-02 17:16:30 EDT
No change.

# uname -a
Linux station11.example.com 2.4.20-9.5 #1 Thu May 1 12:55:16 EDT 2003 i686 i686
i386 GNU/Linux

# strace -c ls -al /etc > /dev/null

execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 41.08    0.018893          65       290           lstat64
 11.13    0.005118         205        25           read
  8.93    0.004105         103        40        13 open
  6.53    0.003002         131        23           munmap
  5.91    0.002719         124        22           mmap2
  5.38    0.002473          80        31           fcntl64
  4.12    0.001894         105        18           readlink
  3.63    0.001667         238         7           getdents64
  3.32    0.001525          51        30           close
  2.34    0.001078          98        11           old_mmap
  1.95    0.000897          32        28           fstat64
  1.42    0.000654          38        17           brk
  1.04    0.000476         238         2           socket
  1.01    0.000466         233         2         2 connect
  0.90    0.000414          83         5           write
  0.83    0.000380         127         3         3 ioctl
  0.43    0.000196         196         1           uname
  0.03    0.000014           5         3         2 rt_sigaction
  0.02    0.000010          10         1           set_thread_area
  0.01    0.000005           5         1           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.045986                   560        20 total

Just to clarify further, the -8 and -9 kernels are OK on about 50% of the
machines. On those "normal" machines, that third column doesn't have any values
higher than 70.  So this isn't a problem that manifests itself consistently.
Quite a head scratcher since the machines are precisely identical hardware and
software.

BTW, the "ls -al /etc" is just a test case that quantifies the problem. Trying
to use the machine for any work is nearly pointless due to the slowness.

Installing the 2.4.18-14 kernel makes the "slow" machines run with normal speed.
Comment 7 Chris Ricker 2003-05-14 16:48:54 EDT
kernel-2.4.20-13.9.i686 doesn't solve the problem. system calls on "slow" boxes
are still markedly slower
Comment 8 Chris Ricker 2003-05-14 17:00:41 EDT
Hmm, this is interesting

kernel-BOOT-2.4.20-13.9.i386.rpm solves the problem. On both "slow" and "fast"
systems, system calls operate at "fast" speeds (of course, the network cards
don't work, but hey ;-)
Comment 9 Chris Ricker 2003-05-14 19:19:17 EDT
I recompiled kernel-2.4.20-13.9.src.rpm specifically for i386. The
2.4.20-13.9.i386.rpm kernel is slow on the "slow" machines and fast on the
"fast" machines

It appears that something that's different between kernel-BOOT and kernel (and
not the i386 / i686 difference between kernel-BOOT-2.4.20-13.9.i386.rpm and
kernel-2.4.20-13.9.i686.rpm) is what's breaking on these "slow" machines
Comment 10 Alan Cox 2003-06-05 12:10:18 EDT
Try booting with "apm=off" ? and also maybe comparing the bios settings for
power manglement ?
Comment 11 Dax Kelson 2003-06-05 12:37:26 EDT
Booting with apm=off resulted in no change.

I believe the BIOS settings, including APM settings are identical between
machines. We checked them closely when we opened the bug. I'm on the road this
week, so I'm not there in person to check again.

Just inspecting the differences in the strace -c output, the only system class
that don't seem to be impacted are getttimeofday and rt_sigaction.

To recap, RHL8.0 kernels don't exhibit the slowdown, RHL9 -boot kernels don't
exhibit the slowdown.  Some computers in the classroom have no slow down with
RHL9 kernels. All computers are identical hardware wise. 
Comment 12 Dax Kelson 2003-06-05 12:40:09 EDT
Arghhh. I meant "system call" not "system class".
Comment 13 Dax Kelson 2003-09-04 01:31:21 EDT
more data:

Latest RHL9 errata kernel (still) has the slowdown
Latest RHL8.0 errata kernel (kernel-2.4.20-20.8) has the slowdown
Original RHL8.0 kernel (2.4.18-14) *NO SLOWDOWN*
ftp.kernel.org vanilla 2.4.20 *NO SLOWDOWN*

I compiled the vanilla 2.4.20 kernel using:
/usr/src/linux-2.4.20-8/configs/kernel-2.4.20-i686.config

stats from vanilla 2.4.20 kernel:

[root@station9 root]# uname -r
2.4.20
[root@station9 root]# strace -c ls -al /etc > /dev/null
execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 41.25    0.002895          10       289           lstat64
 12.92    0.000907          36        25           read
 12.61    0.000885          22        41        14 open
  5.34    0.000375          21        18           readlink
  5.27    0.000370          12        31           old_mmap
  5.14    0.000361          52         7           getdents64
  5.06    0.000355          15        23           munmap
  2.82    0.000198           7        30           close
  2.11    0.000148           5        28           fstat64
  1.44    0.000101           3        31           fcntl64
  1.44    0.000101          51         2           socket
  1.42    0.000100          50         2         2 connect
  1.27    0.000089           5        17           brk
  0.80    0.000056          56         1           mmap2
  0.57    0.000040           8         5           write
  0.19    0.000013           4         3         2 rt_sigaction
  0.17    0.000012           4         3         3 ioctl
  0.13    0.000009           9         1           uname
  0.06    0.000004           4         1           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.007019                   558        21 total

So the slow down comes from one of the RH kernel patches?
Comment 14 Dax Kelson 2003-09-04 01:43:12 EDT
results from vanilla ftp.kernel.org 2.4.21 kernel *NO SLOWDOWN*

[root@station9 root]# uname -r; strace -c ls -al /etc > /dev/null
2.4.21
execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.13    0.002485           9       289           lstat64
 14.52    0.000922          37        25           read
 12.52    0.000795          19        41        14 open
  5.84    0.000371          21        18           readlink
  5.81    0.000369          53         7           getdents64
  5.53    0.000351          11        31           old_mmap
  4.94    0.000314          14        23           munmap
  2.63    0.000167           6        30           close
  1.97    0.000125           4        28           fstat64
  1.75    0.000111          56         2           socket
  1.56    0.000099          50         2         2 connect
  1.34    0.000085           3        31           fcntl64
  1.26    0.000080           5        17           brk
  0.47    0.000030           6         5           write
  0.20    0.000013          13         1           mmap2
  0.17    0.000011           4         3         2 rt_sigaction
  0.16    0.000010           3         3         3 ioctl
  0.14    0.000009           9         1           uname
  0.06    0.000004           4         1           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.006351                   558        21 total
Comment 15 Dax Kelson 2003-09-04 01:46:08 EDT
results from vanilla ftp.kernel.org 2.4.22 kernel *NO SLOWDOWN*

[root@station9 root]# uname -r; strace -c ls -al /etc > /dev/null
2.4.22
execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.11    0.003011          10       289           lstat64
 12.76    0.000891          22        41        14 open
 12.37    0.000864          35        25           read
  5.33    0.000372          53         7           getdents64
  5.18    0.000362          20        18           readlink
  5.03    0.000351          11        31           old_mmap
  4.42    0.000309          13        23           munmap
  2.78    0.000194           6        30           close
  2.12    0.000148           5        28           fstat64
  1.56    0.000109           4        31           fcntl64
  1.50    0.000105          53         2         2 connect
  1.42    0.000099          50         2           socket
  1.23    0.000086           5        17           brk
  0.49    0.000034           7         5           write
  0.19    0.000013           4         3         2 rt_sigaction
  0.16    0.000011           4         3         3 ioctl
  0.16    0.000011          11         1           mmap2
  0.13    0.000009           9         1           uname
  0.07    0.000005           5         1           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.006984                   558        21 total
Comment 16 Dax Kelson 2003-09-04 02:05:48 EDT
Latest RHL9 errata kernel *SLOWDOWN PRESENT*

Take a look at that third column.

[root@station9 root]# uname -r; strace -c ls -al /etc > /dev/null
2.4.20-20.9
execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.57    0.019390          67       289           lstat64
 10.88    0.004841         194        25           read
  9.82    0.004369         109        40        13 open
  5.84    0.002601         145        18           readlink
  5.78    0.002574         112        23           munmap
  4.51    0.002007          72        28           fstat64
  4.17    0.001857         265         7           getdents64
  3.12    0.001387          45        31           fcntl64
  2.53    0.001124          37        30           close
  2.42    0.001078          98        11           old_mmap
  1.87    0.000834          49        17           brk
  1.07    0.000475         238         2         2 connect
  1.06    0.000473         237         2           socket
  1.00    0.000446          20        22           mmap2
  0.91    0.000405          81         5           write
  0.52    0.000233          78         3         2 rt_sigaction
  0.44    0.000197          66         3         3 ioctl
  0.44    0.000197         197         1           uname
  0.02    0.000011          11         1           set_thread_area
  0.01    0.000003           3         1           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.044502                   559        20 total
Comment 17 Dax Kelson 2003-09-04 13:08:16 EDT
Culprit found!! While I was in the shower this morning I was pondering why the
-boot kernel (which has the RH patches) had no slowdown while the normal does. I
concluded that it was likely one of the kernel modules.

Using trial and error I narrowed it down to the lm_sensors modules. The
ftp.kernel.org kernels don't have lm_sensors (hence, no slowdown with them). 

We have the following in our rc.local:

modprobe i2c-isa; modprobe w83781d

This has worked fine ever since RH has included lm_sensors until the RH 2.4.20
based kernels.

Without those loaded at boot time all the "slow" machines return to "normal"
speed and the strace test confirms it. 

In testing I found that unloading lm_sensors via rmmod is *not sufficient* to
return to the machine to normal speed. Once loaded (and even unloaded) the
slowdown is in effect until a reboot.

To reiterate, lm_sensors in RH 2.4.18 kernels OK, lm_sensors in RH 2.4.20
kernels NOT OK.

The funny thing is that *some* of these "identical" computers are OK with the
lm_sensors in RH 2.4.20 kernels.
Comment 18 Dax Kelson 2003-09-04 15:13:44 EDT
I suspect that the normal-speed-with-RH-2.4.20 kernel boxes actually have a new
rev of motherboard. Since over the years we've RMA'd some of the motherboards
and gotten replacements. 

The replacement, while same model, were probably a newer revision.

I'm just guessing.

I also note that the lm_sensors shipped in RH kernels is several revisions
out-of-date.
Comment 19 Dr Eric del Guardia 2003-09-04 17:19:12 EDT
I observed similar behaviour on an Asus CUSL2 with an Intel Celeron 850.

With once loaded i2c-i801 -> slow, without normal speed.

Once i installed an Intel Pentium3 866 it went away. I never figured out what 
the problem was.
Comment 20 Dave Jones 2003-09-05 08:20:46 EDT
Thanks for tracking this down.  Could you grab a newer lm_sensors, and see if
that has the same problem ? If it does, working with upstream to fix that would
be the easiest way forward, as you have the hardware to reproduce and chase this.

If this is fixed already in upstream, I'll look at getting it in for an errata
update.

Thanks again.
Comment 21 Michael K. Johnson 2003-09-05 09:33:29 EDT
Dax said:
>I also note that the lm_sensors shipped in RH kernels is several revisions
>out-of-date.

It is not appropriate for Red Hat to go randomly pulling updates of lots of
different drivers into our kernel.  The right place for things to go when
they are updated is the upstream kernel, which we then pull from.

If a driver isn't good enough for the upstream kernel, why should we think
that it is good enough for a Red Hat product?
Comment 22 Dave Jones 2003-09-05 09:41:29 EDT
lm_sensors isn't in mainline 2.4
it's not so much a 'not good enough' situation, it was just never pushed for
inclusion. It's now in 2.6
Comment 23 Chris Ricker 2003-09-05 09:45:13 EDT
> If a driver isn't good enough for the upstream kernel, why should we think
> that it is good enough for a Red Hat product?

This statement makes no sense. You ship lots of drivers that aren't in the
upstream kernel (including the lm_sensors stuff in question here)
Comment 24 Michael K. Johnson 2003-09-05 10:58:25 EDT
Sorry, mea culpa.  I forgot that this was one of the demonstrations of
why we have that policy of following upstream.  Spoke too early in the
morning, I guess...  My deep apologies.  I'll go crawl back into my
hole now...
Comment 25 Dax Kelson 2003-09-05 13:21:10 EDT
I tested lm_sensors 2.8.0 (the most current version), the slowdown is still there.

I installed the following RPMs to test:

http://atrpms.physik.fu-berlin.de/dist/rh9/atrpms/atrpms-13-16.rh9.at.noarch.rpm
http://atrpms.physik.fu-berlin.de/dist/rh9/lm_sensors/lm_sensors-2.8.0-1_15.rh9.at.i386.rpm
http://atrpms.physik.fu-berlin.de/dist/rh9/kernel/kernel-2.4.20-20_23.1.rh9.at.i686.rpm

I have narrowed down specifically which lm_sensor kernel module causes the slowdown.

These modules are OK, no slowdown:

eeprom
i2c-proc
i2c-isa
i2c-piix4
i2c-core
dmi_scan

When I load the following kernel module is when the slowdown occurs:

w83781d

Unloading it with rmmod does not fix the slowdown, a reboot is required.
Comment 26 Richard Allen 2003-09-06 00:32:38 EDT
Im feeling the same thing on my workstations.  Im not using lm_sensors.
Its gotten to the point where I cant run Quake3 while mozilla is open on a page
that has a animated gif on it.  (These are fully updated RH 9 machines)

Comment 27 petrosyan 2003-09-09 15:11:29 EDT
I also have performance degradation and I don't use lm_sensors. XMMS starts
skipping OGG/MP3 files if I browse Internet or do any other work on the computer
( PIII 700 ).
Comment 28 petrosyan 2003-09-26 16:55:04 EDT
this bug has been fixed in Fedora Core release 0.94
kernel-2.4.22-1.2061.nptl

Note You need to log in before you can comment on or make changes to this bug.