Bug 124576

Summary: Swapping even when there is enough memory free causing performance problems.
Product: Red Hat Enterprise Linux 3
Reporter: Vivek Rajan <vrajan>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED ERRATA
Severity: high
Priority: medium
Version: 3.0
CC: bfox, bwthomas, johnsond, keith_fish, kevins, mcole, milan.kerslager, nakul, petrides, riel, sct, tao, vanhoof
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-12-20 20:55:18 UTC
Bug Blocks: 123574
Attachments:
  better page allocation balancing (flags: none)
  evict pages in page cache faster (flags: none)
  This testcase eats into swap when it should not need to. (flags: none)

Description Vivek Rajan 2004-05-27 17:09:08 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/124 (KHTML, like Gecko) Safari/125.1

Description of problem:
System specification:

Dual processor Opteron box
8GB RAM
nVidia Quadro FX 1000 (driver 1.0-6096)

When our application (GeoProbe) utilizes more than 5 GB of memory, the system starts swapping even though there is about 2GB of memory available (free, top and the GNOME system monitor all report 2GB of free memory). This swapping causes the performance of the application to degrade drastically.

On a suggestion from a Red Hat engineer, we compiled a custom kernel with the CONFIG_HIGHMEM option turned on (otherwise the same as the smp kernel). With this kernel the swapping doesn't really occur until we use up all of the available memory.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-15.EL

How reproducible:
Always

Steps to Reproduce:
Load an application that allocates 5GB or more memory on an 8GB machine.
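
A minimal stand-in for such an application (a hypothetical sketch for illustration only, not the actual GeoProbe workload; the chunk size and fill pattern are arbitrary) could look like this:

************************************************************************
/* Hypothetical reproducer (not the actual application): allocate roughly
 * 5 GB of anonymous memory and touch every page so it has to stay resident.
 * Build as a 64-bit binary, e.g. gcc -O2 -o eatmem eatmem.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t total = 5UL * 1024 * 1024 * 1024;   /* ~5 GB target */
    const size_t chunk = 256UL * 1024 * 1024;        /* allocate 256 MB at a time */
    size_t done;

    for (done = 0; done < total; done += chunk) {
        char *p = malloc(chunk);
        if (p == NULL) {
            fprintf(stderr, "allocation failed after %lu MB\n",
                    (unsigned long)(done >> 20));
            return 1;
        }
        memset(p, 0xAA, chunk);   /* touch every page; never freed on purpose */
    }
    printf("allocated and touched ~5 GB; watch free/vmstat for swap activity\n");
    pause();                      /* hold the memory until the process is killed */
    return 0;
}
************************************************************************

On the affected kernel described in this report, swap-out would be expected to start well before the full 5 GB has been touched, even though free memory remains.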
    

Actual Results:  Swapping occurs even though there is enough memory available.


Expected Results:  No swapping should occur when there is enough memory remaining.

Additional info:

Comment 2 Rik van Riel 2004-05-27 19:19:28 UTC
Larry, would we happen to have a test kernel ready with the rmap fixes?

Guess we should also test the latest patches I made...

Comment 3 Larry Woodman 2004-05-27 19:25:20 UTC
Vivek, can you please get me a few AltSysrq M outputs when the 
system is swapping heavily?

Thanks, Larry Woodman


Comment 4 Rik van Riel 2004-05-27 19:47:24 UTC
Btw, is the system running a 64-bit (x86-64) or a 32-bit (x86) kernel?

Comment 5 Vivek Rajan 2004-05-27 23:43:19 UTC
The kernel is a 64-bit (x86-64) kernel. The application is also 64-bit.

Comment 6 Vivek Rajan 2004-05-27 23:45:55 UTC
Here's the Sysrq output when it's swapping heavily:

May 27 18:40:16 aproloaner3 kernel: SysRq : Show Memory
May 27 18:40:16 aproloaner3 kernel: 
May 27 18:40:16 aproloaner3 kernel: Mem-info:
May 27 18:40:16 aproloaner3 kernel: Zone:DMA freepages:     0 min:     0 low:     0 high:     0
May 27 18:40:16 aproloaner3 kernel: Zone:Normal freepages: 63928 min:  1278 low:  9213 high: 13308
May 27 18:40:16 aproloaner3 kernel: Zone:HighMem freepages:     0 min:     0 low:     0 high:     0
May 27 18:40:16 aproloaner3 kernel: Zone:DMA freepages:  1274 min:  1056 low:  1088 high:  1120
May 27 18:40:16 aproloaner3 kernel: Zone:Normal freepages:221760 min:  1279 low: 17342 high: 25501
May 27 18:40:16 aproloaner3 kernel: Zone:HighMem freepages:     0 min:     0 low:     0 high:     0
May 27 18:40:16 aproloaner3 kernel: Free pages:      286962 (     0 HighMem)
May 27 18:40:16 aproloaner3 kernel: ( Active: 805700/113892, inactive_laundry: 23959, inactive_clean: 10375, free: 286962 )
May 27 18:40:16 aproloaner3 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:0
May 27 18:40:16 aproloaner3 kernel:   aa:366080 ac:4159 id:53588 il:12196 ic:3972 fr:63928
May 27 18:40:17 aproloaner3 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:0
May 27 18:40:17 aproloaner3 kernel:   aa:940 ac:128 id:55 il:25 ic:38 fr:1274
May 27 18:40:17 aproloaner3 kernel:   aa:421557 ac:12836 id:60249 il:11738 ic:6365 fr:221760
May 27 18:40:17 aproloaner3 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:0
May 27 18:40:17 aproloaner3 kernel: 3294*4kB 7099*8kB 6025*16kB 2314*32kB 161*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 255712kB)
May 27 18:40:17 aproloaner3 kernel: Swap cache: add 1559554, delete 1463414, find 617271/959812, race 0+0
May 27 18:40:17 aproloaner3 kernel: 5063 pages of slabcache
May 27 18:40:17 aproloaner3 kernel: 244 pages of kernel stacks
May 27 18:40:17 aproloaner3 kernel: 1910 lowmem pagetables, 2160 highmem pagetables
May 27 18:40:17 aproloaner3 kernel: Free swap:       115228kB
May 27 18:40:17 aproloaner3 kernel: 1572862 pages of RAM
May 27 18:40:17 aproloaner3 kernel: 295103 free pages
May 27 18:40:17 aproloaner3 kernel: 304206 reserved pages
May 27 18:40:17 aproloaner3 kernel: 627146 pages shared
May 27 18:40:17 aproloaner3 kernel: 96140 pages swap cached
May 27 18:40:17 aproloaner3 kernel: Buffer memory:    10380kB
May 27 18:40:17 aproloaner3 kernel: Cache memory:   2725288kB
May 27 18:40:17 aproloaner3 kernel:   CLEAN: 150 buffers, 588 kbyte, 82 used (last=149), 0 locked, 0 dirty 0 delay
May 27 18:40:17 aproloaner3 kernel:  LOCKED: 32 buffers, 128 kbyte, 32 used (last=32), 0 locked, 0 dirty 0 delay
May 27 18:40:17 aproloaner3 kernel:   DIRTY: 82 buffers, 328 kbyte, 82 used (last=82), 0 locked, 64 dirty 0 delay



Comment 7 Rik van Riel 2004-05-28 01:31:30 UTC
Created attachment 100649 [details]
better page allocation balancing

It appears that with the standard kernel, the system fills up
the first zone too far and starts swapping pages out before it
starts allocating from the second zone.

This patch tries to improve the balance by falling back to
the second zone sooner.

Comment 8 Vivek Rajan 2004-06-01 22:14:56 UTC
I applied the patch and recompiled the kernel. 

With the new kernel swapping doesn't occur until most of the memory is used up. So it's a 
lot better than the previous kernel. But we are still having swapping issues. Here's what we 
are seeing:

After a clean reboot about 260MB of memory (on a 7GB machine) is used up. Loading up 
the application uses another 100MB. So there is about 6.6GB of free memory after that. 
Then we are trying to load a seismic volume of 4GB from disk. The application allocates a 
4GB array (in shared memory) and starts loading the data from disk into the array. By the 
time the whole file is loaded, all of the memory is used up and some of the swap space is 
also used. That's something we aren't able to explain. Since there was 6.6GB of free memory, a user would expect approximately 2.6GB to be available even after loading a 4GB volume.

After loading the volume, if we wait for a couple of minutes the memory usage automatically comes down to an expected level. But pages that were moved to swap still remain in swap. It seems like the memory is being used by some cache and the application is unable to use that memory.

Is there an explanation for this behavior? Are there any parameters that we can change to control this behavior?
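
For reference, here is a rough sketch (assumed; the real application code is not attached to this bug, and the file name and sizes are hypothetical) of the load pattern described above. The point it illustrates is that read()ing the 4GB file into the shared-memory array also populates the page cache, so the load can consume roughly twice the volume size:

************************************************************************
/* Sketch of the described load pattern: a ~4 GB System V shared memory
 * segment filled by read()ing a data file.  Every byte read from disk also
 * passes through the page cache, so total memory use grows by roughly the
 * volume size plus the cached file.  kernel.shmmax may need to be raised
 * for a segment this large. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const size_t vol_size = 4UL * 1024 * 1024 * 1024;       /* ~4 GB volume */
    int shmid = shmget(IPC_PRIVATE, vol_size, IPC_CREAT | 0600);
    char *vol;
    int fd;
    size_t off = 0;

    if (shmid < 0) { perror("shmget"); return 1; }
    vol = shmat(shmid, NULL, 0);
    if (vol == (void *)-1) { perror("shmat"); return 1; }
    fd = open("seismic_volume.dat", O_RDONLY);               /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    while (off < vol_size) {
        ssize_t n = read(fd, vol + off, 1 << 20);            /* 1 MB at a time */
        if (n <= 0)
            break;                                           /* EOF or error */
        off += (size_t)n;
    }
    printf("loaded %lu MB into shared memory\n", (unsigned long)(off >> 20));
    pause();                                                 /* keep the segment mapped */
    return 0;
}
************************************************************************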


Comment 9 Larry Woodman 2004-06-03 14:22:28 UTC
Vivek: can you get us a few more AltSysrq M outputs when the system is
in the state that you described above?

I suspect what is happening is that you use ~4.5GB between the kernel,
applications and the 4GB array, leaving ~2.5GB free.  Then you start
reading a 4GB file into the array.  Once it's ~2/3 (actually 2.5/4) done,
the system runs out of memory and reclaims some of the application and
array pages because they are the oldest/least recently used in the
system.  Can you try "echo 1 10 15 > /proc/sys/vm/pagecache" to force
the system to reclaim pagecache memory before anonymous memory and see
if that eliminates the swapping?

Larry


Comment 10 Vivek Rajan 2004-06-08 00:31:47 UTC
"echo 1 10 15 > /proc/sys/vm/pagecache" didn't help very much.

Here is the SysRq output while swapping with pagecache set to "1 10 15"

Jun  7 19:23:16 aproloaner3 kernel: SysRq : Show Memory
Jun  7 19:23:16 aproloaner3 kernel:
Jun  7 19:23:16 aproloaner3 kernel: Mem-info:
Jun  7 19:23:16 aproloaner3 kernel: Zone:DMA freepages:     0 min:     0 low: 0 high:     0
Jun  7 19:23:16 aproloaner3 kernel: Zone:Normal freepages:  1826 min:  1278 low:  9213 high: 13308
Jun  7 19:23:16 aproloaner3 kernel: Zone:HighMem freepages:     0 min:     0 low:     0 high:     0
Jun  7 19:23:16 aproloaner3 kernel: Zone:DMA freepages:  1065 min:  1056 low: 1088 high:  1120
Jun  7 19:23:16 aproloaner3 kernel: Zone:Normal freepages:  1684 min:  1279 low: 17342 high: 25501
Jun  7 19:23:16 aproloaner3 kernel: Zone:HighMem freepages:     0 min:     0 low:     0 high:     0
Jun  7 19:23:16 aproloaner3 kernel: Free pages:        4575 (     0 HighMem)
Jun  7 19:23:16 aproloaner3 kernel: ( Active: 993213/188273, inactive_laundry: 28947, inactive_clean: 27690, free: 4575 )
Jun  7 19:23:16 aproloaner3 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun  7 19:23:16 aproloaner3 kernel:   aa:385856 ac:13414 id:75335 il:11368 ic:11316 fr:1826
Jun  7 19:23:16 aproloaner3 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun  7 19:23:16 aproloaner3 kernel:   aa:1058 ac:41 id:240 il:33 ic:54 fr:1065
Jun  7 19:23:16 aproloaner3 kernel:   aa:571636 ac:21208 id:112698 il:17546 ic:16320 fr:1684
Jun  7 19:23:16 aproloaner3 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun  7 19:23:16 aproloaner3 kernel: 0*4kB 1*8kB 0*16kB 0*32kB 14*64kB 12*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 7304kB)
Jun  7 19:23:16 aproloaner3 kernel: Swap cache: add 147725, delete 58630, find 22609/28303, race 0+0
Jun  7 19:23:16 aproloaner3 kernel: 5012 pages of slabcache
Jun  7 19:23:16 aproloaner3 kernel: 226 pages of kernel stacks
Jun  7 19:23:17 aproloaner3 kernel: 2288 lowmem pagetables, 1691 highmem pagetables
Jun  7 19:23:17 aproloaner3 kernel: Free swap:       1562992kB
Jun  7 19:23:17 aproloaner3 kernel: 1572862 pages of RAM
Jun  7 19:23:17 aproloaner3 kernel: 12074 free pages
Jun  7 19:23:17 aproloaner3 kernel: 304199 reserved pages
Jun  7 19:23:17 aproloaner3 kernel: 923263 pages shared
Jun  7 19:23:17 aproloaner3 kernel: 89095 pages swap cached
Jun  7 19:23:17 aproloaner3 kernel: Buffer memory:    35720kB
Jun  7 19:23:17 aproloaner3 kernel: Cache memory:   3940964kB
Jun  7 19:23:17 aproloaner3 kernel:   CLEAN: 205 buffers, 790 kbyte, 43 used (last=203), 0 locked, 0 dirty 0 delay
Jun  7 19:23:17 aproloaner3 kernel:   DIRTY: 11 buffers, 44 kbyte, 11 used (last=11), 0 locked, 1ub-page count 00000001, of page 00000000c98ce000(100000000000008).



Comment 11 Ernie Petrides 2004-06-12 08:00:12 UTC
A fix for this problem has just been committed to the RHEL3 U3
patch pool this evening (in kernel version 2.4.21-15.10.EL).


Comment 12 Bradley Thomas 2004-06-14 14:27:56 UTC
Is there a kernel that Landmark could test, now that the fix has been 
put into place?

Is there a possibility of this kernel releasing before U3?

Comment 13 Larry Woodman 2004-06-14 14:44:33 UTC
The kernel with this fix can be downloaded for test purposes only from
here:

http://people.redhat.com/~lwoodman/.for_ibm/


Larry



Comment 14 Bradley Thomas 2004-06-14 14:50:11 UTC
Thanks Larry.  Is the kernel patch included in here the patch from 
comment #7, or an improved patch after the feedback from comment #10?

Comment 15 Rik van Riel 2004-06-14 14:59:48 UTC
The current patch is pretty much what's in comment #7.

I have a patch available that should fix the problem further, but it
is important that the first patch (the one from comment #7) gets
tested by itself first, in order to prevent any regressions from
entering the tree.

Once we have a few more test results on how that patch behaves, we can
confidently add a new patch into the mix.

Comment 16 Bradley Thomas 2004-06-14 15:43:19 UTC
Cool.  We'll see what we can do to look at it.  Happen to have a test 
kernel for AMD64, since that's the heart of the issue?

Comment 17 Vivek Rajan 2004-06-15 22:43:10 UTC
Thanks Larry. Could you please add the kernel source RPM too? We need to install the 
nvidia drivers for our application to run, and that requires the kernel source.

Comment 18 Larry Woodman 2004-06-16 13:44:04 UTC
Vivek, all set.  The source rpm is in:

http://people.redhat.com/~lwoodman/.for_ibm/


Larry


Comment 19 Bradley Thomas 2004-06-18 18:01:58 UTC
Larry, here is some feedback from our testing.  Sorry it's taken a 
bit longer than I wanted.

We installed the new kernel and it's very similar to the kernel we 
compiled with the patch provided in comment #13. The kernel solves the 
problem of swapping to a certain extent, i.e. swapping doesn't start 
until all of the memory is used up. But the problem now is that when 
we load a volume of, say, 2GB, then 2GB of memory is used up by the 
volume and another 2GB is used up by disk cache! So if we try loading 
a volume that's more than half the available memory we see the 
swapping problem.

So my question is, is there a way to keep this disk cache from 
taking up so much memory? I'm monitoring the disk cache using the 
Info Center tool (RedHat->System Tools->Info Center->Memory).

Comment 20 Mary Cole 2004-06-18 18:59:24 UTC
We installed the new kernel and it's very similar to the kernel we 
compiled with the patch provided in comment #13. The kernel solves the 
problem of swapping to a certain extent, i.e. swapping doesn't start 
until all of the memory is used up. But the problem now is that when 
we load a volume of, say, 2GB, then 2GB of memory is used up by the 
volume and another 2GB is used up by disk cache! So if we try loading 
a volume that's more than half the available memory we see the 
swapping problem.

So my question is, is there a way to keep this disk cache from 
taking up so much memory? I'm monitoring the disk cache using the 
Info Center tool (RedHat->System Tools->Info Center->Memory).


Comment 21 Rik van Riel 2004-06-18 19:55:55 UTC
Mary, you can "echo 1 5 15 > /proc/sys/vm/pagecache"; that should
help. If it doesn't help enough, let me know.

Comment 22 Mary Cole 2004-06-21 18:40:28 UTC
We tried setting the pagecache parameter to "1 5 15" and it doesn't
seem to help very much. Do you have any other suggestion to control
the disk cache?

Comment 23 Rik van Riel 2004-06-21 21:26:51 UTC
Created attachment 101307 [details]
evict pages in page cache faster

This patch may help the system do what you want.  OTOH, it's still somewhat
experimental and might make the pagecache eviction too aggressive...

Comment 24 Mary Cole 2004-06-23 23:54:13 UTC
Sorry for taking long to update... I was tied up most of yesterday.  

I'm trying to put together a very standard reproducible test so that 
we can give you better feedback on this problem and also track our 
progress.

I tested this on an IBM Apro configured with 8Gbytes of physical 
memory, using GeoProbe to load and display two volumes -- a 4Gb 
volume and a 2Gb volume -- for a total of 6Gb.  I used the 
SystemMonitor to display the amount of allocated and cache memory.  
The pagecache parameter was set to "" in both cases.

First I tested using the 2.4.21-15.10 kernel from Larry Woodman, 
without the patch from Rik van Riel applied.
-> Before loading the volumes, the System Monitor shows: 
Memory Used: 526Mb  Total: 6.8Gb
Swap Used:     0Mb  Total: 2.0Gb
-> After loading the 2Gb volume the SystemMonitor shows:
Memory Used: 5.2Gb  Total: 6.8Gb
Swap Used:     0Mb  Total: 2.0Gb
-> After loading the 4Gb volume (6Gb loaded) the SystemMonitor shows:
Memory Used: 6.8Gb  Total: 6.8Gb
Swap Used:   1.5Gb  Total: 2.0Gb
At this point GP performance is unpredictable. Sometimes good, then 
sometimes GP stops for a LONG time and then suddenly interactivity 
returns.

Next I tested using the 2.4.21-15.10 kernel from Larry Woodman, but 
this time WITH the patch from Rik van Riel applied.
-> Before loading the volumes, the System Monitor shows: 
Memory Used: 633Mb  Total: 6.8Gb
Swap Used:     0Mb  Total: 2.0Gb
-> After loading the 2Gb volume the SystemMonitor shows:
Memory Used: 5.3Gb  Total: 6.8Gb
Swap Used:     0Mb  Total: 2.0Gb
-> After loading the 4Gb volume (6Gb loaded) the SystemMonitor shows:
Memory Used: 6.8Gb  Total: 6.8Gb
Swap Used:   1.7Gb  Total: 2.0Gb
Same problems with interactivity, as above. 
--> Deleted both volumes from memory
--> Loaded 4Gb Volume
Memory Used: 6.8Gb --> then drops to 5.1Gb  Total: 6.8Gb
Swap Used:   205Mb  Total: 2.0Gb
--> Performance OK
--> Attempt to load 1.5Gb volume (for a total of 5.5Gb loaded)
Memory Used: 6.7Gb --> then drops to 5.7Gb Total: 6.8Gb
Swap Used:   2.0Gb --> then drops to 1.7Gb Total: 2.0Gb
--> Performance flakey (as above)-- system is obviously swapping, but
there is a big piece of physical memory unused.
--> Detach 1.5Gb volume (TAKES FOREVER) and remove from memory (quick)
Memory Used: 3.7Gb --> then drops to 5.7Gb Total: 6.8Gb
Swap Used:   1.7Gb --> then drops to 1.7Gb Total: 2.0Gb
--> remember the 4Gb volume is still loaded... so now I try to move 
the probe around to "touch" all the parts of the volume... This is 
REALLY slow and painful with the application appearing to hang for 
minutes at a time, but the system monitor records some changes..cpu 
usage is low <15%... nothing else going on... swap usage is 
decreasing at ~100Mb/minute!  At one point GeoProbe doesn't refresh 
the screen for 3 minutes!!! most users would have power-cycled at that
point...finally the memory numbers stabilize... 
Memory Used: 4.4Gb  Total: 6.8Gb
Swap Used:   987Mb  Total: 2.0Gb
Remember only the 4Gb volume is still in memory... but now our 
performance is really good again -- after ~30 minutes of PATIENT 
fiddling.
--> Attempt to load 1Gb volume (for a total of 5Gb loaded)
Memory Used: 6.2Gb  Total: 6.8Gb
Swap Used:   1.2Gb --> then drops to 1.0Gb  Total: 2.0Gb

So, what have we learned? 

(1) The new kernel + the patch gives us some improvement.
(2) Exceeding 5.0Gb of loaded volumes on an 8.0Gb machine is probably 
a bad idea (this is better than 3.8Gb without the two fixes). 
(3) You should strive to load your data upfront and never need to
swap.
(4) Recovering from having Volume data placed in swap is PAINFUL. 

So... better, but not ideal...  I'll let our usability/testing folks 
weigh in with their opinions.

Comment 25 Rik van Riel 2004-06-24 01:10:10 UTC
Mary, good to hear that the patch helps some.  I could try something
more radical, but then the risk of regressions is too big, so I'd
prefer to do the improvements in smaller, lower risk steps.

I'll take your data point to the other developers here to argue for
the patch. Once this patch is well tested and accepted we can move on
to the next step.

Comment 26 Bradley Thomas 2004-07-07 16:21:33 UTC
Rik, any progress on this?  How goes the conversations with the 
developers on including this into an update?

Do you have a patch that we could test that would help you get data 
on some of the more radical changes?

Comment 27 Rik van Riel 2004-07-07 17:18:23 UTC
Bradley, the "evict page cache faster" patch didn't make it in time
for RHEL3 Update 3.  I want to convince the other developers that the
patch is harmless, but there is a call for more data points from users...

Comment 28 Mary Cole 2004-07-07 17:38:21 UTC
Rik, We are very interested in the "evict page cache faster" patch -- 
however I'm not sure it goes far enough (as discussed in my testing 
log above).  Even with the patch -- recovery is slow and painful -- 
despite having >1Gb of physical memory free.

What sort of data do we need to provide you? 

Would it be helpful for us to send the GeoProbe application and a 
large sample dataset that will enable you to replicate the test 
described above?

If we do get to an acceptable fix -- what sort of user datapoints do 
you need?

Comment 35 Ernie Petrides 2004-07-22 07:28:33 UTC
We have decided to incorporate a variation of Rik's patch
in comment #23 that preserves existing VM page eviction
behavior by default, but allows the system administrator
to switch to the more aggressive page eviction strategy
through a new system tuning parameter (sysctl).  In order
to override the default manually, one can do the following:

    echo 1 > /proc/sys/vm/skip_mapped_pages

Alternatively, one can add the following line in /etc/sysctl.conf
to adopt the new strategy automatically upon reboot:

    vm.skip_mapped_pages = 1

The patch that implements this has just been committed to the
RHEL3 U3 patch pool this evening (in kernel version 2.4.21-18.EL).
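
For completeness, the same override can also be applied from a program; the trivial sketch below simply does what the echo command above does, with error checking (it assumes a kernel that already has the new sysctl):

************************************************************************
/* Equivalent of "echo 1 > /proc/sys/vm/skip_mapped_pages". */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/proc/sys/vm/skip_mapped_pages", O_WRONLY);
    if (fd < 0) {
        /* kernels older than 2.4.21-18.EL do not have this file */
        perror("open /proc/sys/vm/skip_mapped_pages");
        return 1;
    }
    if (write(fd, "1\n", 2) != 2) {
        perror("write");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
************************************************************************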


Comment 36 Bradley Thomas 2004-07-29 13:36:29 UTC
Do we have an update yet from Landmark on their testing of this patch?

Rik, do you have any further patches that you are working on that we 
would want to target for Update 4?

Comment 37 Rik van Riel 2004-07-29 14:19:48 UTC
Bradley, I've got no further patches queued at this moment, mostly for
the reason that I'd like to know for sure if the current ones are the
right direction for everyone before continuing further in this same
direction.

Comment 38 Bradley Thomas 2004-07-29 14:45:41 UTC
Details, details :).  Thanks Rik.  Hopefully we will have an update 
as to how the patches are working soon.

Comment 39 Mary Cole 2004-08-01 23:33:22 UTC
Rik,  Thanks for the work to get these patches into the release... I 
loaded the BETA (and put vm.skip_mapped_pages = 1) 
in /etc/sysctl.conf.

Unfortunately -- while this is an improvement over the behavior 
without ANY patches... it doesn't go far enough -- and the BIG ISSUE 
is that if we ever exceed physical memory and start swapping, 
recovery is slow and takes a LONG time... here are my notes from testing 
using our GeoProbe 3.1.1 application and a demo dataset.  If you're 
curious to replicate this in your shop, I would be delighted to 
provide you with the application, demo license and dataset.


On Aproloaner3 (IBM APRO - SIT preproduction hardware) which is 
configured with 12Gb of RAM
o	The memory hole is 1.4Gb
o	After loading 4GbAmp.vol, I have 
Memory Used: 	8.7Gb 	Total: 10.6
Swap Used: 	0 bytes	Total: 1.9Gb

o	I then load 4GbFreq.vol, and I see...
Memory Used:  10.6Gb	Total: 10.6
Swap Used: 	1.5 Gb	Total: 1.9Gb

-> no problems, until we try to actually visualize both volumes by 
creating a second probe... then it slows down considerably.  System 
takes >5 minutes to respond after selecting the 4GbAmp.vol for the 2nd 
probe.

o	Delete 4GbFreq.vol, (Detach, then Attach/Remove)
Memory Used:  6.4 Gb	Total: 10.6
Swap Used:      1.2 Gb	Total: 1.9Gb

-> performance poor when I try to access the 2nd probe (presumably it 
is using memory that is still "swapped out")... System takes over 
5 minutes to come back after trying to select the 2nd probe.

-> However, if we wait long enough... (30 minutes) we can move the 2nd 
probe around and bring that memory back from swap -- but most users 
would have rebooted their machine after the 1st 5-minute lapse.

o	After recovery...
Memory Used:  7.5Gb	Total: 10.6
Swap Used:     100 Mb	Total: 1.9Gb


Comment 40 Rik van Riel 2004-08-01 23:43:20 UTC
Is each probe in its own process, or are they all loaded into the same
process ?

The reason I'd like to know this is deciding a direction in which to
go with further improvements...

Comment 41 Mary Cole 2004-08-02 09:34:53 UTC
The volumes are stored in shared memory and accessed by multiple
lightweight processes (pthreads).  In the case tested (single graphics
window), the probes are all in the same process, but there are
multiple threads performing data loading, computation, etc.

Comment 44 Keith Fish 2004-08-10 16:13:22 UTC
Created attachment 102574 [details]
This testcase eats into swap when it should not need to.

The attachment is a "shar -V".	sh SB.sh will unpack it.
more SWAPBUG/runit.sh to see the description.

Comment 51 Larry Woodman 2004-08-20 19:05:47 UTC
The problem here is that the DMA zone for the second pgdat is exhausted
down below min, and all of the pages are obviously wired by the kernel
because there is practically nothing in the active or inactive page lists.

Something is leaking wired DMA memory!


   aa:0 ac:0 id:0 il:0 ic:0 fr:0
   aa:385856 ac:13414 id:75335 il:11368 ic:11316 fr:1826
   aa:0 ac:0 id:0 il:0 ic:0 fr:0
>>>aa:1058 ac:41 id:240 il:33 ic:54 fr:1065
   aa:571636 ac:21208 id:112698 il:17546 ic:16320 fr:1684
   aa:0 ac:0 id:0 il:0 ic:0 fr:0

I can't seem to reproduce this problem here at Red Hat.  Please install
the latest RHEL3-U3 kernel from here:

http://people.redhat.com/~lwoodman/.RHEL3/

1.) install the kernel and reboot your system.
2.) get an "AltSysrq M" right after boot.
3.) cat /proc/meminfo
4.) run the test that causes the problem.
5.) get another "AltSysrq M".
6.) cat /proc/meminfo.

Attach all outputs to this bug.


Thanks, Larry Woodman



Comment 52 Larry Woodman 2004-08-20 21:10:04 UTC
I see the problem here: while sizing memory and initializing the
paging thresholds, the DMA zone's min, low and high water marks are
absurdly high.  This is due to:

******************************************************************
static int zone_extrafree_max[MAX_NR_ZONES] __initdata = { 1024, 1024, 0, };
******************************************************************

This basically sets the DMA zone's low target to > 25% of the zone
size, which results in premature paging!

**************************************************************
Zone:DMA freepages:  1065 min:  1056 low: 1088 high:  1120
**************************************************************
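
A back-of-the-envelope check of those numbers (assuming the usual 16 MB, i.e. 4096-page, x86-64 DMA zone; the watermark values are taken from the AltSysrq-M output above):

************************************************************************
/* Rough arithmetic only, not kernel code. */
#include <stdio.h>

int main(void)
{
    const double zone_pages = 4096.0;                 /* assumed 16 MB DMA zone */
    const double min = 1056, low = 1088, high = 1120; /* from the sysrq dump    */

    printf("min  = %4.1f%% of the zone\n", 100.0 * min  / zone_pages); /* ~25.8% */
    printf("low  = %4.1f%% of the zone\n", 100.0 * low  / zone_pages); /* ~26.6% */
    printf("high = %4.1f%% of the zone\n", 100.0 * high / zone_pages); /* ~27.3% */
    return 0;
}
************************************************************************

With more than a quarter of a 16 MB zone held back as "extra free" pages, the allocator starts reclaiming long before the zone is genuinely under pressure.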


I'll work up a patch to fix this for all conditions and architectures.


Larry


Comment 53 Mary Cole 2004-08-23 14:36:48 UTC
Thanks Larry. Could you please add the kernel source RPM too? We need 
to install the nvidia drivers for our application to run, and that 
requires the kernel source.

Comment 54 Larry Woodman 2004-08-23 19:46:10 UTC
Mary, can you run a quick test for me?  Please add "numa=off" to the
commandline, reboot and "echo 1 10 15 > /proc/sys/vm/pagecache" then
rerun the program that you are experiencing trouble with and let me
know how it goes.  Please use the same kernel you have installed for
this test.  

Thanks, Larry Woodman



Comment 55 Mary Cole 2004-08-23 21:48:56 UTC
Larry, 

I would have to say (guardedly - I'd like to do more tests) that this 
is an improvement.

To confirm, I am using the RHWS 3.0 Update 3 Beta (2.4.21-17ELsmp) 
with "vm.skip_mapped_pages = 1" in /etc/sysctl.conf

I added "numa=off" to the boot command, and rebooted; then did "echo 
1 10 15 > /proc/sys/vm/pagecache"

Here are the results when I follow the test procedure (outlined 
above): 

On Aproloaner3 (IBM APRO - SIT preproduction hardware) which is 
configured with 12Gb of RAM
o	The memory hole is 1.4Gb
o	Before loading 4GbAmp.vol, I have 
Memory Used: 	316Mb 	Total: 10.6
Swap Used: 	0 bytes	Total: 1.9Gb

o	After loading 4GbAmp.vol, I have 
Memory Used: 	9.8Gb Total: 10.6 (!WORSE than 8.7 in previous test!)
Swap Used: 	0 bytes	Total: 1.9Gb

o	I then load 4GbFreq.vol, and I see...
Memory Used:  10.6Gb	Total: 10.6
Swap Used:     800MB	Total: 1.9Gb (!BETTER than 1.5Gb in previous!)

WHOO HOOO... I can use two probes to visualize the two volumes 
without a PRONOUNCED SLOWDOWN!!!!  (moving the probes seems 
qualitatively a bit slower than when we only have one dataset loaded, 
but the system doesn't take 5-minute breaks.)

THIS IS A DEFINITE IMPROVEMENT -- what are we losing here?

So... Shall we go for 10GB?.... 
o	I then load 2GbAmp.vol, and I see...
Memory Used:  10.6Gb	Total: 10.6
Swap Used:     1.9Gb	Total: 1.9Gb (!BETTER than 1.5Gb in previous!)

BAD IDEA... can you say "swapping fool"... I'll just send this rather 
than waiting for responsiveness to return (10 minutes and counting)

Thanks for the effort here!

Comment 56 Mary Cole 2004-08-24 14:15:45 UTC
To follow-up... I left it overnight, and this morning was able to 
remove the 2Gb volume from memory, leaving the two 4Gb volumes loaded.

Numbers are: 
Memory Used: 8.7Gb Total: 10.6
Swap Used:   1.4Gb Total: 1.9Gb

As I move the probe around the 4GbAmp.vol volume we have the same 
issues we had previously with EXTREME slowness in paging data back in 
to memory from swap.  The program ends up halting for 1-2 minutes 
each time we move the probe to a part of the volume that must be 
paged back in.  

Comment 57 Keith Fish 2004-09-02 16:43:15 UTC
I am assuming that the kernel is not giving high enough priority to
keeping process data pages resident.  Is there a /proc tunable, or a
kernel source tweak, that will cause the kernel to give much higher
priority to keeping process data pages (heap, shm) resident, versus
file system buffer pages?

Comment 58 Keith Fish 2004-09-09 13:22:42 UTC
I tried an experiment with (guessed values) DMA zone extrafree max
@255, and extrafree ratio @4097.

I can load a 6G volume on an 8G RAM system (EM64T).  I can
leave it all night on a relatively quiet system, and it is
immediately usable the next morning (almost no paging in
accessing the entire 6G shared-memory volume).

If I "cksum 3GbyteFile" and then try to access the entire 6G volume
I get expected paging as I 'slice' through the volume initially.

However, what I don't expect is that it takes 3 round-trips through
all the volume data (accessing every page ~6 times), before all the
pages stay resident.  Simplistically, I'd expect an LRU for keeping
pages resident -- it seems the kernel is doing something different
since pages recently accessed appear to have been swapped out in favor
of something that has not been accessed nearly as recently.

If I set vm.pagecache=1 10 15  then the initial load and access of the
6G volume is not quite as smooth, but it is still acceptable; however,
accessing the 6G volume after a 3G file is cksum'd behaves quite well.

Comment 59 Larry Woodman 2004-09-09 13:52:40 UTC
Keith, please try setting the pagecache to 1 15 20.  This will set the
limit at which anonymous pages are swapped out to a higher value
(20% vs 15%).

Larry


Comment 61 Keith Fish 2004-09-14 22:58:21 UTC
vm.pagecache=1 10 15 performed as well as or better than
vm.pagecache=1 15 20 for the initial load+slicing of the 6G volume.

Both vm.pagecache settings performed similarly (both were good) for
slicing after the "cksum 3GbyteFile" operation.

Comment 62 Larry Woodman 2004-09-20 17:52:32 UTC
I have been working on a patch that helps the system reclaim pagecache
memory more effectively when the pagecache is over pagecache.maxpercent.
What this patch does is reactivate anonymous inactive dirty pages of
memory when the active pagecache pages exceed pagecache.maxpercent.
This will further prevent the system from swapping when the majority
of memory is in the pagecache.

************************************************************************
@@ -292,7 +310,14 @@ int launder_page(zone_t * zone, int gfp_
        BUG_ON(!PageInactiveDirty(page));
        del_page_from_inactive_dirty_list(page);
-       add_page_to_inactive_laundry_list(page);
+
+       /* if pagecache is over max dont reclaim anonymous pages */
+       if (cache_ratio(zone) > cache_limits.max && page_anon(page) &&
+           free_min(zone) < 0) {
+               add_page_to_active_list(page, INITIAL_AGE);
+               return 0;
+       } else {
+               add_page_to_inactive_laundry_list(page);
+       }
        /* store the time we start IO */
        page->age = (jiffies/HZ)&255;
        /*
********************************************************************

Please try out the appropriate kernel and let me know how it works ASAP:

>>>http://people.redhat.com/~lwoodman/.RHEL3pagecachefix/

Thanks, Larry Woodman


Comment 63 Keith Fish 2004-09-21 21:16:01 UTC
Larry, 

Thanks.  This patch looks good to me -- it helps a lot with default
vm.pagecache settings (ia32e kernel, dual em64t + 8G ram system).

I get about the same performance for loading and then initially slicing
through the 6G volume with the default vm.pagecache="1 15 100" as with
the "1 5 10" setting.

So far, it seems stable slicing 2Kx2K in Z, 2Kx1K in X & Y, while also
moderately feeding highend graphics.  'top' shows no kscand/kswapd
activity, and "swap used" is about 1/3 of what it was previously.

I still have my version of an extra free zone tweak applied, as well
as vm.skip_mapped_pages=1.  Is there any reason to try this patch
without the extra free zone tweak?

Comment 64 Ernie Petrides 2004-09-24 09:46:02 UTC
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.11.EL).


Comment 65 Ernie Petrides 2004-10-01 11:11:46 UTC
An additional fix specific to x86_64 has just been committed to the
RHEL3 U4 patch pool this evening (in kernel version 2.4.21-21.EL).


Comment 66 Milan Kerslager 2004-10-08 04:36:36 UTC
I see a lot of comments about 2.4.21-20+ (2.4.21-21) as the testing
version for U4. Is there a chance to get and test it? The latest I found
is 21-20.8... (I even tried RHN).

Comment 67 John Flanagan 2004-12-20 20:55:18 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html