Bug 114553
| Summary: | Bad performance with Q1 update kernel (-9EL) | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Stephen Drye <sdrye> |
| Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
| Status: | CLOSED ERRATA | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.0 | CC: | bark, dstewart, jbs, jr-redhatbugs2, mcrawford, nmurray, petrides, riel, terjekv, wms |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | i686 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2004-05-12 01:08:24 UTC | Type: | --- |
Stephen, can you please try one quick thing to see if it helps relieve the performance problems you are seeing? Try "echo 100 > /proc/sys/vm/inactive_clean_percent". No reboot is necessary, and this may eliminate the sluggishness you are seeing on your workstation running this kernel. Thanks, Larry Woodman

I too experience extreme lag and slowness with the update kernel, particularly when switching virtual desktops. Windows (particularly web browsers) will take a very long time to redraw themselves. Also, I have installed Xine for viewing DVDs. When I used to watch DVDs under RedHat 9, there was never a trace of slowdown or flickering, no matter what I was doing. Now, when I load/reload a webpage or do anything involving moving large windows around, I get extreme stuttering and frame loss in Xine. I have a dual AthlonMP 1900+ workstation with SCSI disks and 1GB of RAM. Performance of ANY previous RH distro has never been an issue.

I've also encountered poor VM-related performance. Under 2.4.21-9.EL, htdig would take anywhere from 2 to 8 hours to rebuild its database (it varied wildly despite the system load being constant), compared to a consistent 45 minutes with vanilla 2.4.24. Good news is that "echo 100 > /proc/sys/vm/inactive_clean_percent" definitely helped my particular case. Database rebuilds on 2.4.21-9.EL with inactive_clean_percent=100 take roughly 60 minutes. Not quite as good as vanilla 2.4.24, but more within the realm of acceptability. Are there any bad side effects to changing inactive_clean_percent? I take it setting it to 100 must disable something?
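The suggestion above amounts to writing a number into a file under /proc/sys. As a minimal sketch (not part of the original thread), the helper below wraps that read and write with a basic sanity check. The function names, the `TUNABLE` override variable, and the 0 to 100 range check are assumptions made for illustration; only the /proc path comes from the report.

```shell
#!/bin/sh
# Sketch: read or set the VM tunable discussed in this report.
# TUNABLE defaults to the path quoted in the thread; the override exists
# only so the helper can be tried against an ordinary file (assumption).
TUNABLE="${TUNABLE:-/proc/sys/vm/inactive_clean_percent}"

# Print the current value of the tunable.
get_inactive_clean_percent() {
    cat "$TUNABLE"
}

# Set the tunable to an integer percentage.
# The 0..100 range is assumed from the name "percent"; nothing is written
# and a non-zero status is returned if the argument fails the check.
set_inactive_clean_percent() {
    value="$1"
    case "$value" in
        ''|*[!0-9]*)
            echo "not a non-negative integer: $value" >&2
            return 1
            ;;
    esac
    if [ "$value" -gt 100 ]; then
        echo "percentage out of range: $value" >&2
        return 1
    fi
    echo "$value" > "$TUNABLE"
}
```

Run as root, `set_inactive_clean_percent 100` does the same thing as the echo command quoted above, and `get_inactive_clean_percent` confirms the result. A value written this way does not survive a reboot. By the usual sysctl naming convention, `sysctl -w vm.inactive_clean_percent=100` should be equivalent.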
Just a followup to my previous post: I got around to testing 2.4.21-4.0.2.EL, and the results are very similar to 2.4.21-9.EL with inactive_clean_percent=100.

I can't believe that this bug was first reported almost 2 months ago, and still there is no patch. It hit me on my new RHEL3 system. If you look at the forums on my ISP (servermatrix), you'll see that people are really disgusted with RHEL - and from the reports, it sounds like most of the people are upset because of this very bug!!! The ISP itself has even adopted a policy of replacing the RHEL kernel with the Fedora kernel whenever people report that their system becomes so slow that it is unusable. So this tweak makes Oracle run a few percent faster. On a lot of systems, it makes them freeze up regularly and become totally unstable. It seems like the only reasonable thing is to release a kernel with the default value set to 100 ASAP and put out a memo saying "if you run Oracle, try setting this to 5 or 30, but be aware it has been seen to cause reliability issues." Seriously, do you care about reliability at all? Isn't that the main point of RHEL over Fedora and the old 7.3/8/9 releases? That RHEL is supposed to be *reliable*? Sorry if I sound mad, but I wasted about 2 days fighting this bug (and my customers saw at least 24 hours of an essentially unusable system from me). A lot of other people are wasting their time too. And meanwhile Red Hat is sitting there wondering if 5 or 30 is best - just shut the thing off and release a patch ASAP, then leave it off until you know exactly how to make sure it won't go berserk!

The default value for "inactive_clean_percent" has been changed to 30 in RHEL 3 Update 2, which begins its external beta period soon. No patch is necessary, since a system administrator can alter the value with a simple user-level command. The value is strictly a VM performance tuning parameter - some workloads do better with a lower value and others do better with a higher value.
If you find that the Update 1 kernel (2.4.21-9.EL) performs well with inactive_clean_percent manually assigned to 30, then you'll find that the Update 2 kernel performs similarly "out of the box". Please let us know whether 30 works okay for you. Thanks.

H'm, you say "The value is strictly a VM performance tuning parameter," which indicates to me you don't know how severe this bug is when it hits. If "ls" on a barely-loaded system takes 15 seconds, and everything else takes minutes, would you call that "performance tuning" or "broken"? Also, nobody I've exchanged messages with knows about this parameter. The problem is that the default setting leaves many systems broken, and there is no way to get the word out to all the system administrators that it can be fixed, so most system administrators who experience this problem are pulling their hair out wondering what is wrong. I haven't tried a value of 30 (don't have a spare system, and I can't afford to risk it on my production system), but I saw a message from somebody who said that with a value of 30 it took longer for the problem to appear, but it showed up all the same. I'll ask them to add a comment to this bug directly, so that they can give a firsthand account. Again, when this hits, it isn't a performance tuning issue, it makes the system *BROKEN*, with no hint on how to fix it unless the sysadmin realizes he must search Red Hat's Bugzilla database.

I've run my server with it set to 30 for several days and haven't (yet) experienced any of the extreme slowness/unresponsiveness that I saw with it set to 5. However, my htdig test (see previous post above) runs roughly 10-15% slower (a difference of about 10 minutes) with it set to 30 as opposed to 100.

Jordan, performance differences like that are to be expected. Some workloads run slower with the inactive_clean_percent higher, others with it lower.
We suspect that 30% is a decent middle ground; some people will be able to improve things by tuning it, but it won't give disastrous performance for anyone (unlike 5% or 100%).

Is there any documentation on tuning this? I could read the source code if I have to (I'm assuming that would help but even that isn't a given) but I'm sure a lot of system administrators can't.

There is documentation in progress, almost ready for publication. I have reviewed the whitepaper in question a few times now and it looks fine to me. Norm Murray will probably know at what date his whitepaper will be published. Norm? ;)

The latest version of the whitepaper (this will also be published in the April edition of Wide Open) is available here: http://people.redhat.com/nhorman/papers/papers.html

Stephen Drye (sdrye), could you please confirm whether setting "inactive_clean_percent" to 30 resolves the performance problem that you originally reported? Thanks. -ernie

Yes, 30 performs better than 100 for my workload, oddly enough. 30 "feels" similar to the performance that 4.0.2 had.

Thanks for the info, Stephen. At this point, this bug should be considered resolved, because we've changed the default value to 30 in RHEL3 U2. So, I'm changing this bug's state to "modified".

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-188.html

even with /proc/sys/vm/inactive_clean_percent set to 30 we still have problems with our machines feeling extremely slow. the first couple of days it felt better, before the systems regressed to where they were before the tuning started.
switching desktops if one has a large (>50MiB) application on the desktop takes half a minute and a good five to ten seconds for a desktop of just xterms. also, while this wait occurs, applications requiring a steady supply of data (multimedia applications like xmms et al) will stutter, freeze and skip. in addition, copying files between machines over a 100MB network will produce the same effects for multimedia applications. this happens on hardware from PIII 800MHz with 256MiB RAM up to FX-51s, Pentium Extremes and everything in between. our users complain daily about their machines being too slow to be usable, and that they can't listen to music in any way while working, due to skipping every time they switch desktops. for my personal desktop (AMD Athlon 64 3400+, 1GiB RAM), I have turned off swap, which helps quite a bit, but even then I can easily produce 1-3 second long skips in xmms by switching between a couple of desktops quickly. these problems are the same with the RHEL-kernels we have tested (2.4.21-9.0.[13]) and we have tried rebuilding them with different options with regards to CPU et al to see if it helps, which it hasn't. these problems do _not_ occur with any vanilla 2.4-kernel we have tested, but since ypbind segfaults with a vanilla kernel (bug 122528, closed, RedHat doesn't support self-built kernels) we don't have that option. any hints or tips on how to get our machines back into a usable state are greatly appreciated.

We have now tried upgrading to 2.4.21-15.EL without any success. The machines are still dead slow.

I recommend re-opening this bug. My Pentium 4 2.66 GHz with 1 GiB RAM is really sloppy. Sound/video stutters and freezes with the smallest amount of IO happening. After being idle for some time, the machine takes 30-60 seconds to "wake up". Everything is swapped out and it takes too long for it to get "normal" (well, as normal as it gets...).
As mentioned before, it gets a little bit better when turning off swap and when we renice the processes that cause slow behaviour. FYI:
$ cat /proc/sys/vm/inactive_clean_percent
30

We have been tracking this issue and have a load test that really beats on the system from an I/O perspective. With this test, we have been able to successfully hang the system (memory starvation) with any setting of /proc/sys/vm/inactive_clean_percent, all the way up to 100. The same test has been running for a week on a stock 2.4.25 kernel. Is this bug going to be reopened as recommended in Comment #19?

one of our terminal servers hangs on occasion as well, I don't know if it's related, but it will get 2.6 this weekend. we're in the process of moving machines over to 2.6 after a couple of months of testing. the EL-kernels just aren't usable for us, be it for desktop or servers, no matter how much I adjust the VM from reading the pdf given above.

Terje, is there any chance you could collect some stats for me running the latest RHEL3 kernel before moving on to the 2.6 kernel? If possible, please get me several "AltSysrq M" outputs when the terminal servers hang up on you. The latest RHEL3 kernel can be found here: http://people.redhat.com/~lwoodman/.RHEL3/ Thanks, Larry
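The SysRq memory report requested above is normally produced from the console keyboard. As a rough sketch only, the same request can be issued from a shell on kernels that expose /proc/sysrq-trigger; whether the RHEL3 kernels in this report provide that file is an assumption, as are the function name and the two override variables. On a machine that is already hung, the keyboard combination is likely to be the more dependable route.

```shell
#!/bin/sh
# Sketch: ask the kernel for a SysRq memory report without the keyboard.
# Both paths are standard Linux /proc files; their availability on the
# kernels discussed here is assumed, not confirmed by the thread.
SYSRQ_ENABLE="${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Enable the SysRq facility, then send the 'm' (memory info) command.
# The kernel writes the report to its log, which dmesg can display.
request_memory_dump() {
    echo 1 > "$SYSRQ_ENABLE" || return 1
    echo m > "$SYSRQ_TRIGGER"
}
```

Run as root, `request_memory_dump && dmesg | tail -n 60` requests the report and then shows the end of the kernel log. Repeating this a few times while the slowdown is in progress would give the several outputs asked for.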
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5) Gecko/20031007

Description of problem:
Using this kernel in a workstation setup is unusably slow. Switching back to the 4.0.2.EL kernel corrects the problem, so it is version specific.

Symptoms:
- Incredibly slow loading of apps, and anything that involves memory management (switching from one app to another, attaching from one process to another for debugging).
- 100% reproducible by running Mozilla, Eclipse with Sun JDK 1.4.2_02, and using it to debug our app server also running on JDK 1.4.2_03 (you can probably use RedHat's Java app server to get the same effect).

Using "top" with each kernel (9 and 4.0.2) shows exactly the same memory usage for the apps and for the whole system. IMHO, it looks and acts like something is wrong with the memory page management system :) During this problem there's limited if any disk activity.

P4, 1G memory, IDE, using the motherboard video (i845) (Dell Optiplex GX260)

Version-Release number of selected component (if applicable):
kernel-2.4.21-9.EL

How reproducible:
Always

Steps to Reproduce:
1. Start Eclipse
2. Start Mozilla
3. Start the Java app server
4. Use Eclipse to develop the app for a while, and Mozilla to test it. Debug every once in a while. It's really un-missable.

Actual Results:
Stupid-slow performance. 1-2 minutes to change from one app to another, "hangs" that take upwards of 10 minutes to clear (particularly when attaching the Eclipse debugger to the app server). Occasional app crashes on permanent non-responsiveness.

Expected Results:
It should perform at least as well as kernel 2.4.21-4.0.2.EL does.

Additional info: