Bug 124576
Summary:      Swapping even when there is enough free memory, causing performance problems.

Product:      Red Hat Enterprise Linux 3
Component:    kernel
Version:      3.0
Hardware:     x86_64
OS:           Linux
Status:       CLOSED ERRATA
Severity:     high
Priority:     medium
Reporter:     Vivek Rajan <vrajan>
Assignee:     Larry Woodman <lwoodman>
CC:           bfox, bwthomas, johnsond, keith_fish, kevins, mcole, milan.kerslager, nakul, petrides, riel, sct, tao, vanhoof
Doc Type:     Bug Fix
Bug Blocks:   123574
Last Closed:  2004-12-20 20:55:18 UTC
Description (Vivek Rajan, 2004-05-27 17:09:08 UTC)
Larry, would we happen to have a test kernel ready with the rmap fixes? Guess we should also test the latest patches I made...

Vivek, can you please get me a few AltSysrq-M outputs when the system is swapping heavily? Thanks, Larry Woodman

By the way, is the machine running a 64-bit (x86-64) or 32-bit (x86) kernel?

The kernel is a 64-bit (x86-64) kernel. The application is also 64-bit. Here's the SysRq output when it's swapping heavily:

May 27 18:40:16 aproloaner3 kernel: SysRq : Show Memory
May 27 18:40:16 aproloaner3 kernel: Mem-info:
May 27 18:40:16 aproloaner3 kernel: Zone:DMA freepages: 0 min: 0 low: 0 high: 0
May 27 18:40:16 aproloaner3 kernel: Zone:Normal freepages: 63928 min: 1278 low: 9213 high: 13308
May 27 18:40:16 aproloaner3 kernel: Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
May 27 18:40:16 aproloaner3 kernel: Zone:DMA freepages: 1274 min: 1056 low: 1088 high: 1120
May 27 18:40:16 aproloaner3 kernel: Zone:Normal freepages: 221760 min: 1279 low: 17342 high: 25501
May 27 18:40:16 aproloaner3 kernel: Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
May 27 18:40:16 aproloaner3 kernel: Free pages: 286962 ( 0 HighMem)
May 27 18:40:16 aproloaner3 kernel: ( Active: 805700/113892, inactive_laundry: 23959, inactive_clean: 10375, free: 286962 )
May 27 18:40:16 aproloaner3 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
May 27 18:40:16 aproloaner3 kernel: aa:366080 ac:4159 id:53588 il:12196 ic:3972 fr: 63928
May 27 18:40:17 aproloaner3 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
May 27 18:40:17 aproloaner3 kernel: aa:940 ac:128 id:55 il:25 ic:38 fr:1274
May 27 18:40:17 aproloaner3 kernel: aa:421557 ac:12836 id:60249 il:11738 ic:6365 fr: 221760
May 27 18:40:17 aproloaner3 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
May 27 18:40:17 aproloaner3 kernel: 3294*4kB 7099*8kB 6025*16kB 2314*32kB 161*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 255712kB)
May 27 18:40:17 aproloaner3 kernel: Swap cache: add 1559554, delete 1463414, find 617271/959812, race 0+0
May 27 18:40:17 aproloaner3 kernel: 5063 pages of slabcache
May 27 18:40:17 aproloaner3 kernel: 244 pages of kernel stacks
May 27 18:40:17 aproloaner3 kernel: 1910 lowmem pagetables, 2160 highmem pagetables
May 27 18:40:17 aproloaner3 kernel: Free swap: 115228kB
May 27 18:40:17 aproloaner3 kernel: 1572862 pages of RAM
May 27 18:40:17 aproloaner3 kernel: 295103 free pages
May 27 18:40:17 aproloaner3 kernel: 304206 reserved pages
May 27 18:40:17 aproloaner3 kernel: 627146 pages shared
May 27 18:40:17 aproloaner3 kernel: 96140 pages swap cached
May 27 18:40:17 aproloaner3 kernel: Buffer memory: 10380kB
May 27 18:40:17 aproloaner3 kernel: Cache memory: 2725288kB
May 27 18:40:17 aproloaner3 kernel: CLEAN: 150 buffers, 588 kbyte, 82 used (last=149), 0 locked, 0 dirty 0 delay
May 27 18:40:17 aproloaner3 kernel: LOCKED: 32 buffers, 128 kbyte, 32 used (last=32), 0 locked, 0 dirty 0 delay
May 27 18:40:17 aproloaner3 kernel: DIRTY: 82 buffers, 328 kbyte, 82 used (last=82), 0 locked, 64 dirty 0 delay

Created attachment 100649 [details]
better page allocation balancing
It appears that with the standard kernel, the system fills up
the first zone too far and starts swapping pages out before it
starts allocating from the second zone.
This patch tries to improve the balance by falling back to
the second zone sooner.
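To make the fallback idea concrete, here is a rough userspace model of watermark-based zone selection. It is only a sketch of the behavior described above, not the actual patch: the struct, the alloc_zone() helper and the two-pass policy are illustrative, with the free/min/low numbers borrowed from the first SysRq dump.

******************************************************************
/* Sketch: prefer a zone only while an allocation keeps it above its
 * "low" watermark; fall back to the next zone before swapping. */
#include <stdio.h>

struct zone { const char *name; long free, min, low; };

static struct zone *alloc_zone(struct zone *z, int n, long pages)
{
	/* first pass: zones that stay above "low" after the allocation */
	for (int i = 0; i < n; i++)
		if (z[i].free - pages >= z[i].low)
			return &z[i];
	/* second pass: allow dipping toward "min" before paging starts */
	for (int i = 0; i < n; i++)
		if (z[i].free - pages >= z[i].min)
			return &z[i];
	return NULL;	/* nothing left: reclaim/swap territory */
}

int main(void)
{
	struct zone zones[] = {
		{ "Normal(node0)",  63928, 1278,  9213 },
		{ "Normal(node1)", 221760, 1279, 17342 },
	};
	struct zone *z = alloc_zone(zones, 2, 60000);
	printf("allocate from: %s\n", z ? z->name : "none -> swap");
	return 0;
}
******************************************************************

With these numbers, a 60000-page request skips node0 (it would fall below node0's low mark of 9213) and is satisfied from node1, which is exactly the earlier fallback the patch is after.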
I applied the patch and recompiled the kernel. With the new kernel, swapping doesn't occur until most of the memory is used up, so it's a lot better than the previous kernel. But we are still having swapping issues. Here's what we are seeing:

After a clean reboot, about 260MB of memory (on a 7GB machine) is used up. Loading the application uses another 100MB, so there is about 6.6GB of free memory after that. Then we try to load a 4GB seismic volume from disk. The application allocates a 4GB array (in shared memory) and starts loading the data from disk into the array. By the time the whole file is loaded, all of the memory is used up and some of the swap space is also used. That's something we aren't able to explain: since there was 6.6GB of free memory, a user would expect approximately 2.6GB to still be available even after loading a 4GB volume.

If we wait a couple of minutes after loading the volume, the memory usage automatically comes down to the expected level, but pages that were moved to swap still remain in swap. It seems like the memory is being used by some cache and the application is unable to use it. Is there an explanation for this behavior? Are there any parameters we can change to control it?

Vivek: can you get us a few more AltSysrq-M outputs when the system is in the state that you described above?

I suspect what is happening is that you use ~4.5GB between the kernel, applications and the 4GB array, leaving ~2.5GB free. Then you start reading a 4GB file into the array. Once it's ~2/3 (actually 2.5/4) done, the system runs out of memory and reclaims some of the application and array pages because they are the oldest/least recently used in the system. Can you try "echo 1 10 15 > /proc/sys/vm/pagecache" to force the system to reclaim pagecache memory before anonymous memory and see if that eliminates the swapping? Larry

"echo 1 10 15 > /proc/sys/vm/pagecache" didn't help very much.
Here is the SysRq output while swapping with pagecache set to "1 10 15":

Jun 7 19:23:16 aproloaner3 kernel: SysRq : Show Memory
Jun 7 19:23:16 aproloaner3 kernel: Mem-info:
Jun 7 19:23:16 aproloaner3 kernel: Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Jun 7 19:23:16 aproloaner3 kernel: Zone:Normal freepages: 1826 min: 1278 low: 9213 high: 13308
Jun 7 19:23:16 aproloaner3 kernel: Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Jun 7 19:23:16 aproloaner3 kernel: Zone:DMA freepages: 1065 min: 1056 low: 1088 high: 1120
Jun 7 19:23:16 aproloaner3 kernel: Zone:Normal freepages: 1684 min: 1279 low: 17342 high: 25501
Jun 7 19:23:16 aproloaner3 kernel: Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Jun 7 19:23:16 aproloaner3 kernel: Free pages: 4575 ( 0 HighMem)
Jun 7 19:23:16 aproloaner3 kernel: ( Active: 993213/188273, inactive_laundry: 28947, inactive_clean: 27690, free: 4575 )
Jun 7 19:23:16 aproloaner3 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun 7 19:23:16 aproloaner3 kernel: aa:385856 ac:13414 id:75335 il:11368 ic:11316 fr: 1826
Jun 7 19:23:16 aproloaner3 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun 7 19:23:16 aproloaner3 kernel: aa:1058 ac:41 id:240 il:33 ic:54 fr:1065
Jun 7 19:23:16 aproloaner3 kernel: aa:571636 ac:21208 id:112698 il:17546 ic:16320 fr: 1684
Jun 7 19:23:16 aproloaner3 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun 7 19:23:16 aproloaner3 kernel: 0*4kB 1*8kB 0*16kB 0*32kB 14*64kB 12*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 7304kB)
Jun 7 19:23:16 aproloaner3 kernel: Swap cache: add 147725, delete 58630, find 22609/ 28303, race 0+0
Jun 7 19:23:16 aproloaner3 kernel: 5012 pages of slabcache
Jun 7 19:23:16 aproloaner3 kernel: 226 pages of kernel stacks
Jun 7 19:23:17 aproloaner3 kernel: 2288 lowmem pagetables, 1691 highmem pagetables
Jun 7 19:23:17 aproloaner3 kernel: Free swap: 1562992kB
Jun 7 19:23:17 aproloaner3 kernel: 1572862 pages of RAM
Jun 7 19:23:17 aproloaner3 kernel: 12074 free pages
Jun 7 19:23:17 aproloaner3 kernel: 304199 reserved pages
Jun 7 19:23:17 aproloaner3 kernel: 923263 pages shared
Jun 7 19:23:17 aproloaner3 kernel: 89095 pages swap cached
Jun 7 19:23:17 aproloaner3 kernel: Buffer memory: 35720kB
Jun 7 19:23:17 aproloaner3 kernel: Cache memory: 3940964kB
Jun 7 19:23:17 aproloaner3 kernel: CLEAN: 205 buffers, 790 kbyte, 43 used (last=203), 0 locked, 0 dirty 0 delay
Jun 7 19:23:17 aproloaner3 kernel: DIRTY: 11 buffers, 44 kbyte, 11 used (last=11), 0 locked, 1ub-page count 00000001, of page 00000000c98ce000(100000000000008).

A fix for this problem has just been committed to the RHEL3 U3 patch pool this evening (in kernel version 2.4.21-15.10.EL).

Is there a kernel that Landmark could test, now that the fix has been put into place? Is there a possibility of this kernel releasing before U3?

The kernel with this fix can be downloaded for test purposes only from here: http://people.redhat.com/~lwoodman/.for_ibm/ Larry

Thanks Larry. Is the kernel patch included here the patch from comment #7, or an improved patch after the feedback from comment #10?

The current patch is pretty much what's in comment #7. I have a patch available that should fix the problem further, but it is important that the first patch (the one from comment #7) gets tested by itself first, in order to prevent any regressions from entering the tree. Once we have a few more test results on how that patch behaves, we can confidently add a new patch into the mix.

Cool. We'll see what we can do to look at it.
Do you happen to have a test kernel for AMD64, since that's the heart of the issue?

Thanks Larry. Could you please add the kernel source rpm too? We need to install nvidia drivers for our application to run, and that requires the kernel source.

Vivek, all set. The source rpm is in: http://people.redhat.com/~lwoodman/.for_ibm/ Larry

Larry, here is some feedback from our testing. Sorry it's taken a bit longer than I wanted. We installed the new kernel and it's very similar to the kernel we compiled with the patch provided in comment #13. The kernel solves the problem of swapping to a certain extent, i.e. swapping doesn't start until all of the memory is used up. But the problem now is that when we load a volume of, say, 2GB, then 2GB of memory is used up by the volume and another 2GB is used up by the disk cache! So if we try loading a volume that's more than half the available memory, we see the swapping problem. So my question is: is there a way to keep this disk cache from taking so much memory? I'm monitoring the disk cache using the Info Center tool (RedHat -> System Tools -> Info Center -> Memory).

Mark, you can "echo 1 5 15 > /proc/sys/vm/pagecache"; that should help. If it doesn't help enough, let me know.

We tried setting the pagecache parameter to "1 5 15" and it doesn't seem to help very much. Do you have any other suggestions to control the disk cache?

Created attachment 101307 [details]
evict pages in page cache faster
This patch may help the system do what you want. OTOH, it's still somewhat
experimental and might make the pagecache eviction too aggressive...
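For intuition, the idea behind evicting page cache faster can be modeled as an ordering rule during inactive-list scanning: unmapped pagecache pages go first, while anonymous or mapped pages get another trip around the lists. This is a guess at the general approach, not the attached patch; should_evict() and the page flags here are invented for the sketch.

******************************************************************
/* Sketch: reclaim ordering that favors evicting unmapped pagecache
 * pages while the cache is over its limit. Flags are illustrative. */
#include <stdio.h>
#include <stdbool.h>

struct page { int id; bool anon; bool mapped; };

static bool should_evict(const struct page *p, bool cache_over_limit)
{
	if (cache_over_limit)
		return !p->anon && !p->mapped; /* unmapped pagecache first */
	return true;	/* otherwise, plain LRU order */
}

int main(void)
{
	struct page inactive[] = {
		{ 1, false, false },	/* clean unmapped pagecache   */
		{ 2, true,  true  },	/* anonymous page of the app  */
		{ 3, false, true  },	/* pagecache mapped by a task */
	};
	for (int i = 0; i < 3; i++)
		printf("page %d: %s\n", inactive[i].id,
		       should_evict(&inactive[i], true) ? "evict" : "keep");
	return 0;
}
******************************************************************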
Sorry for taking so long to update... I was tied up most of yesterday. I'm trying to put together a very standard reproducible test so that we can give you better feedback on this problem and also track our progress.

I tested this on an IBM Apro configured with 8Gbytes of physical memory, using GeoProbe to load and display two volumes -- a 4Gb volume and a 2Gb volume -- for a total of 6Gb. I used the SystemMonitor to display the amount of allocated and cache memory. The pagecache parameter was set to "" in both cases.

First I tested using the 2.4.21-15.10 kernel from Larry Woodman, without the patch from Rik van Riel applied.

-> Before loading the volumes, the System Monitor shows:
   Memory Used: 526Mb Total: 6.8Gb
   Swap Used: 0Mb Total: 2.0Gb
-> After loading the 2Gb volume the SystemMonitor shows:
   Memory Used: 5.2Gb Total: 6.8Gb
   Swap Used: 0Mb Total: 2.0Gb
-> After loading the 4Gb volume (6Gb loaded) the SystemMonitor shows:
   Memory Used: 6.8Gb Total: 6.8Gb
   Swap Used: 1.5Gb Total: 2.0Gb

At this point GP performance is unpredictable. Sometimes good, then sometimes GP stops for a LONG time and then suddenly interactivity returns.

Next I tested using the 2.4.21-15.10 kernel from Larry Woodman, but this time WITH the patch from Rik van Riel applied.

-> Before loading the volumes, the System Monitor shows:
   Memory Used: 633Mb Total: 6.8Gb
   Swap Used: 0Mb Total: 2.0Gb
-> After loading the 2Gb volume the SystemMonitor shows:
   Memory Used: 5.3Gb Total: 6.8Gb
   Swap Used: 0Mb Total: 2.0Gb
-> After loading the 4Gb volume (6Gb loaded) the SystemMonitor shows:
   Memory Used: 6.8Gb Total: 6.8Gb
   Swap Used: 1.7Gb Total: 2.0Gb

Same problems with interactivity, as above.

--> Deleted both volumes from memory
--> Loaded 4Gb volume
    Memory Used: 6.8Gb --> then drops to 5.1Gb Total: 6.8Gb
    Swap Used: 205Mb Total: 2.0Gb
--> Performance OK
--> Attempt to load 1.5Gb volume (for a total of 5.5Gb loaded)
    Memory Used: 6.7Gb --> then drops to 5.7Gb Total: 6.8Gb
    Swap Used: 2.0Gb --> then drops to 1.7Gb Total: 2.0Gb
--> Performance flaky (as above) -- the system is obviously swapping, but there is a big piece of physical memory unused.
--> Detach 1.5Gb volume (TAKES FOREVER) and remove from memory (quick)
    Memory Used: 3.7Gb --> then drops to 5.7Gb Total: 6.8Gb
    Swap Used: 1.7Gb --> then drops to 1.7Gb Total: 2.0Gb
--> Remember the 4Gb volume is still loaded... so now I try to move the probe around to "touch" all the parts of the volume... This is REALLY slow and painful, with the application appearing to hang for minutes at a time, but the system monitor records some changes... CPU usage is low (<15%)... nothing else going on... swap usage is decreasing at ~100Mb/minute! At one point GeoProbe doesn't refresh the screen for 3 minutes!!! Most users would have power-cycled at that point... finally the memory numbers stabilize...
    Memory Used: 4.4Gb Total: 6.8Gb
    Swap Used: 987Mb Total: 2.0Gb

Remember, only the 4Gb volume is still in memory... but now our performance is really good again -- after ~30 minutes of PATIENT fiddling.

--> Attempt to load 1Gb volume (for a total of 5Gb loaded)
    Memory Used: 6.2Gb Total: 6.8Gb
    Swap Used: 1.2Gb --> then drops to 1.0Gb Total: 2.0Gb

So, what have we learned?
(1) The new kernel + the patch gives us some improvement.
(2) Exceeding 5.0Gb of loaded volumes on an 8.0Gb machine is probably a bad idea (this is better than 3.8Gb without the two fixes).
(3) You should strive to load your data upfront and never need to swap.
(4) Recovering from having volume data placed in swap is PAINFUL.

So... better, but not ideal...
I'll let our usability/testing folks weigh in with their opinions.

Mary, good to hear that the patch helps some. I could try something more radical, but then the risk of regressions is too big, so I'd prefer to do the improvements in smaller, lower-risk steps. I'll take your data point to the other developers here to argue for the patch. Once this patch is well tested and accepted we can move on to the next step.

Rik, any progress on this? How are the conversations going with the developers on including this in an update? Do you have a patch that we could test that would help you get data on some of the more radical changes?

Bradley, the "evict page cache faster" patch didn't make it in time for RHEL3 Update 3. I want to convince the other developers that the patch is harmless, but there is a call for more data points from users...

Rik, we are very interested in the "evict page cache faster" patch -- however, I'm not sure it goes far enough (as discussed in my testing log above). Even with the patch, recovery is slow and painful -- despite having >1Gb of physical memory free. What sort of data do we need to provide you? Would it be helpful for us to send the GeoProbe application and a large sample dataset that will enable you to replicate the test described above? If we do get to an acceptable fix -- what sort of user datapoints do you need?

We have decided to incorporate a variation of Rik's patch in comment #23 that preserves existing VM page eviction behavior by default, but allows the system administrator to switch to the more aggressive page eviction strategy through a new system tuning parameter (sysctl). In order to override the default manually, one can do the following:

echo 1 > /proc/sys/vm/skip_mapped_pages

Alternatively, one can add the following line in /etc/sysctl.conf to adopt the new strategy automatically upon reboot:

vm.skip_mapped_pages = 1

The patch that implements this has just been committed to the RHEL3 U3 patch pool this evening (in kernel version 2.4.21-18.EL).

Do we have an update yet from Landmark on their testing of this patch? Rik, do you have any further patches that you are working on that we would want to target for Update 4?

Bradley, I've got no further patches queued at this moment, mostly because I'd like to know for sure that the current ones are the right direction for everyone before continuing further in this same direction. Details, details :).

Thanks Rik. Hopefully we will have an update as to how the patches are working soon.

Rik, thanks for the work to get these patches into the release... I loaded the BETA (and put vm.skip_mapped_pages = 1 in /etc/sysctl.conf).
Unfortunately -- while this is an improvement over the behavior
without ANY patches... it doesn't go far enough -- and the BIG ISSUE
is that if we ever exceed physical memory and start swapping,
recovery is slow and takes a LONG time... here are my notes from testing
using our GeoProbe 3.1.1 application and a demo dataset. If you're
curious to replicate this in your shop, I would be delighted to
provide you with the application, demo license and dataset.
On Aproloaner3 (IBM APRO - SIT preproduction hardware) which is
configured with 12Gb of RAM
o The memory hole is 1.4Gb
o After loading 4GbAmp.vol, I have
Memory Used: 8.7Gb Total: 10.6
Swap Used: 0 bytes Total: 1.9Gb
o I then load 4GbFreq.vol, and I see...
Memory Used: 10.6Gb Total: 10.6
Swap Used: 1.5 Gb Total: 1.9Gb
-> No problems, until we try to actually visualize both volumes by
creating a second probe... then it slows down considerably. The system
takes >5 minutes to respond after selecting 4GbAmp.vol for the 2nd
probe.
o Delete 4GbFreq.vol, (Detach, then Attach/Remove)
Memory Used: 6.4 Gb Total: 10.6
Swap Used: 1.2 Gb Total: 1.9Gb
-> Performance is poor when I try to access the 2nd probe (presumably
it is using memory that is still "swapped out")... The system takes
>5 minutes to come back after trying to select the 2nd probe.
-> However, if we wait long enough (30 minutes), we can move the 2nd
probe around and bring that memory back from swap -- but most users
would have rebooted their machine after the first 5-minute lapse.
o After recovery...
Memory Used: 7.5Gb Total: 10.6
Swap Used: 100 Mb Total: 1.9Gb
Is each probe in its own process, or are they all loaded into the same process? The reason I'd like to know is to decide a direction in which to go with further improvements...

The volumes are stored in shared memory and accessed by multiple lightweight processes (pthreads). In the case tested (single graphics window), the probes are all in the same process, but there are multiple threads performing data loading, computation, etc.

Created attachment 102574 [details]
This testcase eats into swap when it should not need to.
The attachment is a "shar -V" archive; "sh SB.sh" will unpack it.
Run "more SWAPBUG/runit.sh" to see the description.
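For reference, the access pattern this whole bug revolves around -- allocate a large shared-memory array, then stream a file into it so the pagecache grows alongside the allocation -- looks roughly like the following. The actual SWAPBUG testcase may work differently; the segment size and file name here are placeholders.

******************************************************************
/* Sketch of the reproduction pattern: a SysV shared memory segment
 * filled from a large file. Every read also lands in the pagecache,
 * which then competes with the array itself for memory. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SEG_SIZE (1UL << 30)	/* 1 GB here; the bug used 2-6 GB */

int main(void)
{
	int id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
	if (id < 0) { perror("shmget"); return 1; }

	char *seg = shmat(id, NULL, 0);
	if (seg == (void *)-1) { perror("shmat"); return 1; }
	shmctl(id, IPC_RMID, NULL);	/* segment dies with the process */

	FILE *f = fopen("volume.dat", "rb");	/* placeholder data file */
	if (!f) { perror("fopen"); return 1; }

	/* stream the file into the segment in 1MB chunks */
	size_t off = 0, n;
	while (off < SEG_SIZE &&
	       (n = fread(seg + off, 1, 1 << 20, f)) > 0)
		off += n;
	fclose(f);

	printf("loaded %zu bytes into shm\n", off);
	return 0;
}
******************************************************************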
The problem here is that the DMA zone for the second pgdat is exhausted down below min, and all of the pages are obviously wired by the kernel, because there are practically none on the active or inactive page lists. Something is leaking wired DMA memory!

aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:385856 ac:13414 id:75335 il:11368 ic:11316 fr:1826
aa:0 ac:0 id:0 il:0 ic:0 fr:0
>>>aa:1058 ac:41 id:240 il:33 ic:54 fr:1065
aa:571636 ac:21208 id:112698 il:17546 ic:16320 fr:1684
aa:0 ac:0 id:0 il:0 ic:0 fr:0

I can't seem to reproduce this problem here at Red Hat. Please install the latest RHEL3-U3 kernel from here: http://people.redhat.com/~lwoodman/.RHEL3/

1.) Install the kernel and reboot your system.
2.) Get an "AltSysrq M" right after boot.
3.) cat /proc/meminfo
4.) Run the test that causes the problem.
5.) Get another "AltSysrq M".
6.) cat /proc/meminfo

Attach all outputs to this bug. Thanks, Larry Woodman

I see the problem here: while sizing memory and initializing the paging thresholds, the DMA zone's min, low and high water marks are absurdly high. This is due to:

******************************************************************
static int zone_extrafree_max[MAX_NR_ZONES] __initdata = { 1024, 1024, 0, };
******************************************************************

This basically sets the DMA zone's low target to >25% of the zone size, which results in premature paging!

**************************************************************
Zone:DMA freepages: 1065 min: 1056 low: 1088 high: 1120
**************************************************************

I'll work up a patch to fix this for all conditions and architectures. Larry

Mary, can you run a quick test for me? Please add "numa=off" to the command line, reboot, and "echo 1 10 15 > /proc/sys/vm/pagecache", then rerun the program that you are experiencing trouble with and let me know how it goes. Please use the same kernel you have installed for this test. Thanks, Larry Woodman

Larry, I would have to say (guardedly -- I'd like to do more tests) that this is an improvement. To confirm: I am using the RHWS 3.0 Update 3 Beta (2.4.21-17ELsmp) with "vm.skip_mapped_pages = 1" in /etc/sysctl.conf. I added "numa=off" to the boot command and rebooted, then did "echo 1 10 15 > /proc/sys/vm/pagecache". Here are the results when I follow the test procedure (outlined above):

On Aproloaner3 (IBM APRO - SIT preproduction hardware), which is configured with 12Gb of RAM:

o The memory hole is 1.4Gb
o Before loading 4GbAmp.vol, I have
  Memory Used: 316Mb Total: 10.6
  Swap Used: 0 bytes Total: 1.9Gb
o After loading 4GbAmp.vol, I have
  Memory Used: 9.8Gb Total: 10.6 (!WORSE than 8.7 in previous test!)
  Swap Used: 0 bytes Total: 1.9Gb
o I then load 4GbFreq.vol, and I see...
  Memory Used: 10.6Gb Total: 10.6
  Swap Used: 800MB Total: 1.9Gb (!BETTER than 1.5Gb in previous!)

WHOO HOOO... I can use two probes to visualize the two volumes without a PRONOUNCED SLOWDOWN!!!! (Moving the probes seems, qualitatively, a bit slower than when we only have one dataset loaded, but the system doesn't take 5-minute breaks.) THIS IS A DEFINITE IMPROVEMENT -- what are we losing here? So... shall we go for 10GB?...

o I then load 2GbAmp.vol, and I see...
  Memory Used: 10.6Gb Total: 10.6
  Swap Used: 1.9Gb Total: 1.9Gb

BAD IDEA... can you say "swapping fool"...
I'll just send this rather than waiting for responsiveness to return (10 minutes and counting). Thanks for the effort here!

To follow up... I left it overnight, and this morning I was able to remove the 2Gb volume from memory, leaving the two 4Gb volumes loaded. The numbers are:

  Memory Used: 8.7Gb Total: 10.6
  Swap Used: 1.4Gb Total: 1.9Gb

As I move the probe around the 4GbAmp.vol volume, we have the same issues we had previously, with EXTREME slowness in paging data back into memory from swap. The program ends up halting for 1-2 minutes each time we move the probe to a part of the volume that must be paged back in. I am assuming that the kernel is not giving high enough priority to keeping process data pages resident. Is there a /proc tunable, or a kernel source tweak, that will cause the kernel to give much higher priority to keeping process data pages (heap, shm) resident, versus file system buffer pages?

I tried an experiment with (guessed values) DMA zone extrafree max @255 and extrafree ratio @4097. I can load a 6G volume on an 8G RAM system (EM64T). I can leave it all night on a relatively quiet system, and it is immediately usable the next morning (almost no paging in accessing the entire 6G shared-memory volume). If I "cksum 3GbyteFile" and then try to access the entire 6G volume, I get expected paging as I 'slice' through the volume initially. However, what I don't expect is that it takes 3 round trips through all the volume data (accessing every page ~6 times) before all the pages stay resident. Simplistically, I'd expect an LRU for keeping pages resident -- it seems the kernel is doing something different, since pages recently accessed appear to have been swapped out in favor of something that has not been accessed nearly as recently.

If I set vm.pagecache=1 10 15, then the initial load and access of the 6G volume is not quite as smooth, but it is still acceptable; however, accessing the 6G volume after a 3G file is cksum'd behaves quite well.

Keith, please try setting the pagecache to 1 15 20. This will set the limit at which anonymous pages are swapped out to a higher value (20% vs 15%). Larry

vm.pagecache=1 10 15 performed as well as or better than vm.pagecache=1 15 20 for the initial load+slicing of the 6G volume. Both vm.pagecache settings performed similarly (both were good) for slicing after the "cksum 3GbyteFile" operation.
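For anyone following along with the tuning: /proc/sys/vm/pagecache takes three percentages (min, borrow, max). Below is a rough model of how thresholds like these could steer reclaim -- the names and the exact policy per band are assumptions for illustration, not the kernel's actual code.

******************************************************************
/* Sketch of a three-threshold pagecache reclaim policy. The band
 * behaviors are assumed, roughly: above max, reclaim pagecache
 * only; above borrow, prefer pagecache over anonymous memory;
 * below min, leave the pagecache alone. */
#include <stdio.h>

struct pagecache_limits { int min, borrow, max; };

static const char *reclaim_policy(int cache_pct, struct pagecache_limits l)
{
	if (cache_pct > l.max)    return "reclaim pagecache only";
	if (cache_pct > l.borrow) return "prefer pagecache over anonymous";
	if (cache_pct > l.min)    return "balanced reclaim";
	return "protect pagecache";
}

int main(void)
{
	struct pagecache_limits l = { 1, 15, 20 };  /* "echo 1 15 20 > ..." */
	for (int pct = 0; pct <= 100; pct += 10)
		printf("pagecache at %3d%% -> %s\n", pct, reclaim_policy(pct, l));
	return 0;
}
******************************************************************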
I have been working on a patch that helps the system reclaim pagecache
memory more effectively when the pagecache is over pagecache.maxpercent.
What this patch does is reactivate anonymous inactive dirty pages of
memory when the active pagecache pages exceed pagecache.maxpercent.
This will further prevent the system from swapping when the majority
of memory is in the pagecache.
************************************************************************
@@ -292,7 +310,14 @@ int launder_page(zone_t * zone, int gfp_
BUG_ON(!PageInactiveDirty(page));
del_page_from_inactive_dirty_list(page);
- add_page_to_inactive_laundry_list(page);
+
+ /* if pagecache is over max dont reclaim anonymous pages */
+ if (cache_ratio(zone) > cache_limits.max && page_anon(page) &&
+     free_min(zone) < 0) {
+ add_page_to_active_list(page, INITIAL_AGE);
+ return 0;
+ } else {
+ add_page_to_inactive_laundry_list(page);
+ }
/* store the time we start IO */
page->age = (jiffies/HZ)&255;
/*
********************************************************************
Please try out the appropriate kernel and let me know how it works ASAP:
>>>http://people.redhat.com/~lwoodman/.RHEL3pagecachefix/
Thanks, Larry Woodman
Larry, thanks. This patch looks good to me -- it helps a lot with default vm.pagecache settings (ia32e kernel, dual em64t + 8G RAM system). I get about the same performance loading and then initially slicing through the 6G volume with the default vm.pagecache="1 15 100" as with the "1 5 10" setting. So far, it seems stable slicing 2Kx2K in Z, 2Kx1K in X & Y, while also moderately feeding high-end graphics. 'top' shows no kscand/kswapd activity, and "swap used" is about 1/3 of what it was previously. I still have my version of the extra-free-zone tweak applied, as well as vm.skip_mapped_pages=1. Is there any reason to try this patch without the extra-free-zone tweak?

A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.11.EL).

An additional fix specific to x86_64 has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-21.EL).

I see a lot of comments about 2.4.21-20+ (2.4.21-21) as the testing version for U4. Is there a chance to get and test it? The latest I found is 21-20.8... (I even tried RHN).

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html