From Bugzilla Helper: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040803 Firefox/0.9.3

Description of problem:
We've seen a dramatic drop in throughput for our database server when going from RHAS 2.1 to RHAS 3.0 on the same hardware. After some investigation, we found that the OS was swapping heavily during the test run, even though there should be more than enough physical memory. This did not happen when the same servers were running RHAS 2.1.

REPRODUCTION:
We have done some experiments using a simple test program (see below). It allocates a large buffer, fills it, and then accesses it randomly at full speed. On a machine with 1 GB of memory, this ought to work fine with a 750 MB buffer. But if we first copy some large files between local file systems in order to use up memory for disk cache, and THEN start the test program (after waiting a minute for flushing), we see problems.

The program quickly steals from the cache until (according to vmstat) the cache is down to about 250 MB. After that, the cache is freed only very slowly, and the test program starts swapping heavily. It takes several minutes to free up another 35 MB. Also, if we do the file copying while the test is running, it actually steals memory for cache from the active process, which again kills performance.

In our database testing, it looks like some amount of cache is permanently reserved, with the result that the server *never* gets all the memory it needs and keeps swapping. We had to reduce the server's memory usage by at least 200 MB to avoid swapping.

The problem is seen on RHAS 3.0 Update 2 as well as on Update 3 (beta) (kernels 2.4.21-15.ELsmp and 2.4.21-17.ELsmp respectively). In the latter case, on one occasion vmstat showed the OS swapping *out* but not in, on the order of several hundred KB per second, for 10 minutes after the cache had shrunk to its "final" size.

The same tests run on RHAS 2.1 did not show this problem; the test program gets all the memory it needs and there is very little swapping.
All tests were run on machines with 1 GB of memory, using 750 MB for the test program. C source code follows:

---
/* Trivial memory user. Argument: number of MB to allocate */
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
    int max = atoi (argv[1]) * 1048576 / sizeof(int);
    int *num = malloc (max * sizeof(int));
    int i;

    for (i = 0; i < max; i++) {
        num[i] = i;
    }
    srand ((int)getpid());
    for (;;) {
        i = (int) rand() % max;
        num[i] = rand();
    }
    return 0;
}
---

Version-Release number of selected component (if applicable):
kernel 2.4.21-17.ELsmp and 2.4.21-15.ELsmp

How reproducible: Always

Steps to Reproduce: See Description
Actual Results: See Description
Expected Results: See Description
Additional info:
Yngve, can you get several "Alt-SysRq-M" outputs when you see the system in the state that you are describing? Larry Woodman
Created attachment 103678 [details] Alt-SysRq-M outputs This is output from Alt-SysRq-M taken repeatedly starting just before the test program was started, while the test program ran and until the system started to calm down, i.e. when the disk cache size was starting to destabilize. We observed frantic swapping, as reported by vmstat. The test program was run with argument "750".
Sorry, I meant "stabilize", not "destabilize" in the previous comment. Freudian slip, I guess :-)
I have been working on a patch that helps the system reclaim pagecache memory more effectively when the pagecache is over pagecache.maxpercent. What this patch does is reactivate anonymous inactive dirty pages when the active pagecache pages exceed pagecache.maxpercent. This will further prevent the system from swapping when the majority of memory is in the pagecache.

************************************************************************
@@ -292,7 +310,14 @@ int launder_page(zone_t * zone, int gfp_
 	BUG_ON(!PageInactiveDirty(page));
 	del_page_from_inactive_dirty_list(page);
-	add_page_to_inactive_laundry_list(page);
+
+	/* if pagecache is over max dont reclaim anonymous pages */
+	if (cache_ratio(zone) > cache_limits.max && page_anon(page) && free_min(zone) < 0) {
+		add_page_to_active_list(page, INITIAL_AGE);
+		return 0;
+	} else {
+		add_page_to_inactive_laundry_list(page);
+	}
 	/* store the time we start IO */
 	page->age = (jiffies/HZ)&255;
 	/*
************************************************************************

Please try out the appropriate kernel and let me know how it works ASAP:
>>>http://people.redhat.com/~lwoodman/.RHEL3pagecachefix/

Thanks, Larry Woodman
We did some tests with this patched kernel and the trivial test program in the original report, and unfortunately the behaviour has not improved; it may actually be slightly worse. BTW: Will this get copied to the Issue Tracker case that was opened for this report?
Yngve, can you rerun the test after a reboot and "echo 1 10 15 > /proc/sys/vm/pagecache" Larry
A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.11.EL).
Yngve, can you grab and test the latest kernel for me? I have made several changes to minimize swapping when file caching is involved. If you still have problems with this kernel, please get several Alt-SysRq-M outputs when the system is swapping heavily. The latest i686 smp kernel is here:
>>>http://people.redhat.com/~lwoodman/.for_sun/

Thanks for your help, Larry Woodman
Very interesting. Running this kernel, we are almost back to the performance of RHAS 2.1. These are TPC-B figures from our database server product running on the new kernel:

Kernel version      1 client    4 clients   16 clients
2.4.9-e.34:         124 tps     231 tps     245 tps
2.4.21-20.EL:        65 tps      94 tps     102 tps
2.4.21-22.ELsmp:    107 tps     215 tps     224 tps

This means we are still about 10% away from the performance of 2.1, but this is beginning to look acceptable. Two questions:

1: There is a mystery here. We still see about the same amount of swapping, but it seems that the pages we need aren't getting swapped out as often as before. Do you have any comments on that?

2: What is the status of these optimizations? Will they make it into an official release, or are they more of the experimental sort?
We have conducted somewhat more thorough testing, and we see the same pattern as above. Part of the remaining 10% degradation compared with 2.1 may stem from us running an SMP kernel on a uniprocessor machine. So, in addition to the two questions above, I'd like to ask you for a non-SMP version of 2.4.21-22.EL.
Yngve,

1.) Can you explain more about the "swapping mystery" you are seeing? I am not following what you are trying to say.
2.) This kernel is the actual RHEL3-U4 beta kernel; everything in this kernel will stay.
3.) >>>http://people.redhat.com/~lwoodman/.for_sun/ now contains a UP kernel.
1: We were still seeing fairly intensive swapping with this kernel when running the usemem program quoted in the original bug report, as compared to the behaviour on RHAS 2.1 U3. However, swapping "died down" far more quickly than it did on previous RHAS 3.0 kernels. More importantly, we are not seeing excessive swapping during actual test runs of our database server, so this is probably not a concern for us.

2: Excellent. What is the projected release date for U4? We are probably unable to have RHAS 3.0 as a supported platform until U4 arrives, so this date is of importance to us.

3: Thanks. We'll be doing some more testing runs to confirm the data we have, and we also need to go hunting for the remaining 10% performance drop versus 2.1 (perhaps it is because we build on 2.1 and run on both 2.1 AND 3.0?), but I think we are fairly close to calling this issue "resolved".
*** Bug 137984 has been marked as a duplicate of this bug. ***
RHEL3 U4 is expected to be released in 2-3 weeks. In the meantime, the RHN beta channel contains the 2.4.21-25.EL kernel, which includes all known VM-related fixes in U4.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html
Swapping is still present, though to a much lesser degree. I have a 10-slot Counter-Strike server and a 20-slot TeamSpeak server going on a P4 1.4 GHz with 384 MB RAM. Here is my free output:

             total       used       free     shared    buffers     cached
Mem:        382472     374256       8216          0      47784     237396
-/+ buffers/cache:      89076     293396
Swap:       522072        680     521392

I have not tried the echo commands yet. Will try them and report back if it is fixed.
I rebooted and tried "echo 1 10 15 > /proc/sys/vm/pagecache". Minor swapping is still occurring:

             total       used       free     shared    buffers     cached
Mem:        382472     376464       6008          0      49464     229372
-/+ buffers/cache:      97628     284844
Swap:       522072        732     521340