Bug 114553

Summary:	Bad performance with Q1 update kernel (-9EL)
Product:	Red Hat Enterprise Linux 3	Reporter:	Stephen Drye <sdrye>
Component:	kernel	Assignee:	Larry Woodman <lwoodman>
Status:	CLOSED ERRATA	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	3.0	CC:	bark, dstewart, jbs, jr-redhatbugs2, mcrawford, nmurray, petrides, riel, terjekv, wms
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-05-12 01:08:24 UTC	Type:	---

Description Stephen Drye 2004-01-29 15:27:10 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5)
Gecko/20031007

Description of problem:
Using this kernel in a workstation setup is unusably slow.  Switching
back to the 4.0.2.EL kernel corrects the problem, so it is version
specific.

Symptoms:
- Incredibly slow loading of apps, and anything that involves memory
management (switching from one app to another, attaching from one
process to another for debugging).
- 100% reproducible by running Mozilla, Eclipse with Sun JDK 1.4.2_02,
and using it to debug our app server also running on JDK 1.4.2_03 (you
can probably use Red Hat's Java app server to get the same effect).

Using "top" with each kernel (9 and 4.0.2) shows exactly the same
memory usage for the apps and for the whole system.  IMHO, it looks
and acts like something is wrong with the memory page management system :)

During this problem there's limited if any disk activity.

P4, 1G memory, IDE, using the motherboard video (i845) (Dell Optiplex
GX260)

Version-Release number of selected component (if applicable):
kernel-2.4.21-9.EL

How reproducible:
Always

Steps to Reproduce:
1. Start Eclipse
2. Start Mozilla
3. Start the Java app server
4. Use Eclipse to develop the app for a while, and Mozilla to test it.
 Debug every once in a while.  It's really un-missable.
    

Actual Results:  Stupid-slow performance.  1-2 minutes to change from
one app to another, "hangs" that take upwards of 10 minutes to clear
(particularly when attaching the Eclipse debugger to the app server).
 Occasional app crashes on permanent non-responsiveness.

Expected Results:  It should perform at least as well as kernel
2.4.21-4.0.2.EL does.

Additional info:

Comment 1 Larry Woodman 2004-01-29 16:08:44 UTC

Stephen, can you please try one quick thing to see if it helps
relieve the performance problems you are seeing?

Try "echo 100 > /proc/sys/vm/inactive_clean_percent".  No reboot
is necessary; this will hopefully eliminate the sluggishness you
are seeing on your workstation running this kernel.

Thanks, Larry Woodman
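
For anyone trying this, a minimal sketch of the check-and-set step. The helper names show_tunable and set_tunable are made up for illustration, and the 0-100 range check is an assumption based only on the parameter being called a percent. The path is an argument so the helpers can be tried on a scratch file first; on an affected system it would be /proc/sys/vm/inactive_clean_percent, and writing to it needs root.

```shell
#!/bin/sh
# Hypothetical helpers; only the /proc path itself comes from this report.
TUNABLE=${TUNABLE:-/proc/sys/vm/inactive_clean_percent}

# Print the current value of the tunable (first argument overrides the path).
show_tunable() {
    cat "${1:-$TUNABLE}"
}

# Write a new value to the tunable and print the old and new values.
# Usage: set_tunable VALUE [PATH]
set_tunable() {
    value=$1
    path=${2:-$TUNABLE}
    # Reject anything that is not a whole number.
    case $value in
        ''|*[!0-9]*) echo "value must be a whole number" >&2; return 1 ;;
    esac
    # Assumption: a "percent" tunable only makes sense from 0 to 100.
    if [ "$value" -gt 100 ]; then
        echo "value must be between 0 and 100" >&2
        return 1
    fi
    old=$(cat "$path") || return 1
    echo "$value" > "$path" || return 1
    echo "$path: $old -> $(cat "$path")"
}
```

Running "set_tunable 100" as root would then be equivalent to the echo command above, with the previous value printed so it can be restored later.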

Comment 3 Doug Stewart 2004-01-30 16:26:26 UTC

I too experience extreme lag and slowness with the update kernel,
particularly when switching virtual desktops.  Windows (particularly
web browsers) will take a very long time to redraw themselves.

Also, I have installed Xine for viewing DVDs.  When I used to watch
DVDs under RedHat 9, there was never a trace of slowdown or
flickering, no matter what I was doing.  Now, when I load/reload a
webpage or do anything involving moving large windows around, I get
extreme stuttering and frame loss in Xine.

I have a dual AthlonMP 1900+ workstation with SCSI disks and 1GB of
RAM.  Performance of ANY previous RH distro has never been an issue.

Comment 4 Jordan Russell 2004-02-01 08:16:10 UTC

I've also encountered poor VM-related performance. Under 2.4.21-9.EL, 
htdig would take anywhere from 2 to 8 hours to rebuild its database 
(it varied wildly despite the system load being constant), compared 
to a consistent 45 minutes with vanilla 2.4.24.

Good news is that "echo 100 > /proc/sys/vm/inactive_clean_percent" 
definitely helped my particular case. Database rebuilds on 
2.4.21-9.EL with inactive_clean_percent=100 take roughly 60 minutes. 
Not quite as good as vanilla 2.4.24 but more within the realm of 
acceptability.

Are there any bad side effects to changing inactive_clean_percent? I 
take it setting it to 100 must disable something?

Comment 5 Jordan Russell 2004-02-02 07:35:05 UTC

Just a followup to my previous post:
I got around to testing 2.4.21-4.0.2.EL, and the results are very 
similar to 2.4.21-9.EL with inactive_clean_percent=100.

Comment 6 William Shubert 2004-03-24 20:41:14 UTC

I can't believe that this bug was first reported almost 2 months ago,
and still there is no patch. It hit me on my new RHEL3 system. If you
look at the forums on my ISP (servermatrix), you'll see that people
are really disgusted with RHEL - and from the reports, it sounds like
most of the people are upset because of this very bug!!! The ISP
itself has even adopted a policy of replacing the RHEL kernel with the
Fedora kernel whenever people report that their system becomes so slow
that it is unusable.

So this tweak makes Oracle run a few percent faster. On a lot of
systems, it makes them freeze up regularly and become totally
unstable. It seems like the only reasonable thing is to release a
kernel with the default value set to 100 ASAP and put out a memo
saying "if you run Oracle, try setting this to 5 or 30, but be aware
it has been seen to cause reliability issues."

Seriously, do you care about reliability at all? Isn't that the main
point of RHEL over Fedora and the old 7.3/8/9 releases? That RHEL is
supposed to be *reliable*?

Sorry if I sound mad, but I wasted about 2 days fighting this bug (and
my customers saw at least 24 hours of an essentially unusable system
from me). A lot of other people are wasting their time too. And
meanwhile Red Hat is sitting there wondering if 5 or 30 is best - just
shut the thing off and release a patch ASAP, then leave it off until
you know exactly how to make sure it won't go berserk!

Comment 7 Ernie Petrides 2004-03-24 21:29:37 UTC

The default value for "inactive_clean_percent" has been changed to
30 in RHEL 3 Update 2, which begins its external beta period soon.
No patch is necessary, since a system administrator can alter the
value with a simple user-level command.  The value is strictly a
VM performance tuning parameter - some workloads do better with a
lower value and others do better with a higher value.

If you find that the Update 1 kernel (2.4.21-9.EL) performs well
with inactive_clean_percent manually assigned to 30, then you'll
find that the Update 2 kernel performs similarly "out of the box".

Please let us know whether 30 works okay for you.  Thanks.
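
A value written under /proc does not survive a reboot. One common way to make it stick is an entry in /etc/sysctl.conf. The sketch below is only an illustration: persist_sysctl is a hypothetical helper, and the key name vm.inactive_clean_percent is assumed from the usual mapping of /proc/sys paths to sysctl names rather than taken from this report.

```shell
#!/bin/sh
# Hypothetical helper: record KEY = VALUE in a sysctl.conf-style file.
# Any existing assignment to KEY is dropped and the new one is appended,
# so the file ends up with exactly one line for KEY. The file must exist.
# Usage: persist_sysctl KEY VALUE [FILE]
persist_sysctl() {
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}
    tmp=$(mktemp) || return 1
    # Copy every line except existing assignments to KEY.
    awk -v key="$key" '
        { line = $0; sub(/^[ \t]+/, "", line) }
        index(line, key) == 1 && substr(line, length(key) + 1) ~ /^[ \t]*=/ { next }
        { print }
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    echo "$key = $value" >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

As root, "persist_sysctl vm.inactive_clean_percent 30" followed by "sysctl -p" would record the value and load it; commented-out lines and other keys in the file are left alone.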

Comment 8 William Shubert 2004-03-25 03:50:40 UTC

H'm, you say "The value is strictly a VM performance tuning
parameter," which indicates to me you don't know how severe this bug
is when it hits. If "ls" on a barely-loaded system takes 15 seconds,
and everything else takes minutes, would you call that "performance
tuning" or "broken"?

Also, nobody I've exchanged messages with knows about this parameter.
The problem is that the default setting leaves many systems broken,
and there is no way to get the word out to all the system
administrators that it can be fixed, so most system administrators who
experience this problem are pulling their hair out wondering what is
wrong.

I haven't tried a value of 30 (don't have a spare system, and I can't
afford to risk it on my production system), but I saw a message from
somebody who said that with a value of 30 it took longer for the
problem to appear, but it showed up all the same. I'll ask them to add
a comment to this bug directly, so that they can give a firsthand account.

Again, when this hits, it isn't a performance tuning issue, it makes
the system *BROKEN*, with no hint on how to fix it unless the sysadmin
realizes he must search Red Hat's bugzilla database.

Comment 9 Jordan Russell 2004-03-28 20:49:05 UTC

I've run my server with it set to 30 for several days and haven't 
(yet) experienced any of the extreme slowness/unresponsiveness that I 
saw with it set to 5.

However, my htdig test (see previous post above) runs roughly 10-15% 
slower (a difference of about 10 minutes) with it set to 30 as 
opposed to 100.
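
A comparison like this can be scripted so each setting gets the same treatment. This is a rough sketch, not the procedure actually used for the htdig numbers: compare_settings is a hypothetical helper, it measures wall-clock time only, and whatever is left in the page cache from one run will influence the next.

```shell
#!/bin/sh
# Hypothetical harness: run one workload under several settings of a tunable
# and print "VALUE SECONDS" for each run.
# Usage: compare_settings TUNABLE_PATH "VALUES" COMMAND [ARGS...]
compare_settings() {
    path=$1
    values=$2
    shift 2
    # Remember the current setting so it can be restored afterwards.
    saved=$(cat "$path") || return 1
    for v in $values; do
        echo "$v" > "$path" || return 1
        start=$(date +%s)
        "$@" > /dev/null 2>&1 || echo "workload failed with value $v" >&2
        end=$(date +%s)
        echo "$v $((end - start))"
    done
    # Put the original setting back when done.
    echo "$saved" > "$path"
}
```

For example, as root: compare_settings /proc/sys/vm/inactive_clean_percent "30 100" ./rebuild-index.sh, where rebuild-index.sh stands in for whatever command rebuilds the database.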

Comment 10 Rik van Riel 2004-03-28 23:37:58 UTC

Jordan, performance differences like that are to be expected.  Some
workloads run slower with the inactive_clean_percent higher, others
with it lower.  We suspect that 30% is a decent middle ground; some
people will be able to improve things by tuning it, but it won't give
disastrous performance for anyone (unlike 5% or 100%).

Comment 11 Jeffrey Siegal 2004-03-28 23:41:28 UTC

Is there any documentation on tuning this?  I could read the source
code if I have to (I'm assuming that would help but even that isn't a
given) but I'm sure a lot of system administrators can't.

Comment 12 Rik van Riel 2004-03-29 00:59:04 UTC

There is documentation in progress, almost ready for publication. I
have reviewed the whitepaper in question a few times now and it looks
fine to me.

Norm Murray will probably know at what date his whitepaper will be
published.  Norm ? ;)

Comment 13 Neil Horman 2004-03-29 17:31:20 UTC

The latest version of the whitepaper (this will also be published in
the April edition of Wide Open) is available here:
http://people.redhat.com/nhorman/papers/papers.html

Comment 14 Ernie Petrides 2004-03-29 23:33:36 UTC

Stephen Drye (sdrye), could you please confirm whether
setting "inactive_clean_percent" to 30 resolves the performance
problem that you originally reported?

Thanks.  -ernie

Comment 15 Stephen Drye 2004-03-30 18:42:05 UTC

Yes, 30 performs better than 100 for my workload, oddly enough.  30
"feels" similar to the performance that 4.0.2 had.

Comment 16 Ernie Petrides 2004-03-31 22:09:38 UTC

Thanks for the info, Stephen.  At this point, this bug should be
considered resolved, because we've changed the default value to 30
in RHEL3 U2.  So, I'm changing this bug's state to "modified".

Comment 17 John Flanagan 2004-05-12 01:08:24 UTC

An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-188.html

Comment 18 Terje Kvernes 2004-06-06 14:25:45 UTC

even with /proc/sys/vm/inactive_clean_percent set to 30 we still have
problems with our machines feeling extremely slow.  the first couple of
days it felt better, before the systems regressed to where they were
before the tuning started.

switching desktops if one has a large (>50MiB) application on the
desktop takes half a minute and a good five to ten seconds for a
desktop of just xterms.

also, while this wait occurs, applications requiring a steady supply
of data (multimedia applications like xmms et al) will stutter, freeze
and skip.  in addition, copying files between machines over a 100MB
network will produce the same effects for multimedia applications.

this happens on hardware from PIII 800MHz with 256MiB RAM up to FX-51s,
Pentium Extremes and everything in between.  our users complain
daily about their machines being too slow to be usable, and that they
can't listen to music in any way while working, due to skipping every
time they switch desktops.

for my personal desktop (AMD Athlon 64 3400+, 1GiB RAM), I have turned
off swap, which helps quite a bit, but even then I can easily produce
1-3 second long skips in xmms by switching between a couple of
desktops quickly.

these problems are the same with the RHEL-kernels we have tested
(2.4.21-9.0.[13]) and we have tried rebuilding them with different
options with regards to CPU et al to see if it helps, which it hasn't.  

these problems do _not_ occur with any vanilla 2.4-kernel we have
tested, but since ypbind segfaults with a vanilla kernel (bug 122528,
closed, RedHat doesn't support self-built kernels) we don't have that
option.  any hints or tips on how to get our machines back into a
usable state is greatly appreciated.

Comment 19 Baard Kristiansen 2004-06-08 09:30:50 UTC

We have now tried upgrading to 2.4.21-15.EL without any success.  The
machines are still dead slow.  I recommend re-opening this bug.

My Pentium 4 2.66 GHz with 1 GiB RAM is really sloppy.  Sound/video
stutters and freezes with the smallest amount of IO happening.  After
being idle for some time, the machine takes 30-60 seconds to "wake
up".  Everything is swapped out and it takes too long for it to get
"normal" (well, as normal as it gets...).

As mentioned before, it gets a little bit better when turning off swap
and when we renice the processes that cause slow behaviour.

FYI: 
$ cat /proc/sys/vm/inactive_clean_percent
30
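
Since swap keeps coming up in these reports, it may help to record how much of it is in use alongside the setting above. A minimal sketch, assuming /proc/meminfo carries "SwapTotal:" and "SwapFree:" lines in kB; swap_used_kb is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print swap in use, in kB, from a meminfo-style file.
# Assumes "SwapTotal:" and "SwapFree:" lines whose second field is in kB.
# Usage: swap_used_kb [FILE]
swap_used_kb() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END { print total - free }
    ' "${1:-/proc/meminfo}"
}
```

Calling swap_used_kb before and after the machine has sat idle would show whether it really is being swapped out in the meantime.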

Comment 20 Sudhir Srinivasan 2004-08-10 03:13:15 UTC

We have been tracking this issue and have a load test that really 
beats on the system from an I/O perspective and with this test, we 
have been able to successfully hang the system (memory starvation) 
with any setting of /proc/sys/vm/inactive_clean_percent, all the
way up to 100. The same test has been running for a week on a 
stock 2.4.25 kernel.

Is this bug going to be reopened as recommended in Comment #19?

Comment 21 Terje Kvernes 2004-08-25 13:49:49 UTC

one of our terminal servers hangs on occasion as well, I don't know if
it's related, but it will get 2.6 this weekend.  we're in the process
of moving machines over to 2.6 after a couple of months of testing.
the EL-kernels just aren't usable for us, be it for desktop or
servers, no matter how much I adjust the VM from reading the pdf given
above.

Comment 22 Larry Woodman 2004-08-25 14:21:49 UTC

Terje, is there any chance you could collect some stats for me running
the latest RHEL3 kernel before moving on to the 2.6 kernel?  If
possible, please get me several "AltSysrq M" outputs when the terminal
servers hang up on you.  The latest RHEL3 kernel can be found here:

http://people.redhat.com/~lwoodman/.RHEL3/


Thanks, Larry
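
If the console keyboard is out of reach when a server hangs, the same memory dump can usually be requested from a shell that is still responding. The sketch below assumes the running kernel provides /proc/sysrq-trigger, which is not confirmed anywhere in this report; collect_sysrq_m is a hypothetical helper, and the trigger path is an argument so it can be tried against a scratch file.

```shell
#!/bin/sh
# Hypothetical helper: request COUNT SysRq-M memory dumps, INTERVAL seconds
# apart, by writing "m" to the sysrq trigger file (needs root).
# Usage: collect_sysrq_m COUNT INTERVAL_SECONDS [TRIGGER]
collect_sysrq_m() {
    count=$1
    interval=$2
    trigger=${3:-/proc/sysrq-trigger}   # assumption: present on this kernel
    i=0
    while [ "$i" -lt "$count" ]; do
        echo m > "$trigger" || return 1
        i=$((i + 1))
        # Pause between dumps, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

For example, "collect_sysrq_m 3 10" as root asks for three dumps ten seconds apart; the output goes to the kernel log, so it would be collected afterwards from dmesg or the system log.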