Bug 60079 - touching swap causes strange paging behavior afterwards
Status: CLOSED RAWHIDE
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.1
Hardware: i686 Linux
Priority: high  Severity: high
Assigned To: Brian Brock
Reported: 2002-02-19 14:57 EST by Jim Garlick
Modified: 2005-10-31 17:00 EST
CC List: 5 users

Last Closed: 2002-04-05 17:09:56 EST

Attachments
allocate memory and walk through it (1.03 KB, text/plain)
2002-02-19 14:59 EST, Jim Garlick
typescript of foo on 2GB machine with greedy swap reclaim (1.67 KB, text/plain)
2002-03-20 14:41 EST, Michael K. Johnson
It would help if I posted the right file (1.96 KB, patch)
2002-03-20 18:13 EST, Michael K. Johnson

Description Jim Garlick 2002-02-19 14:57:49 EST
Description of Problem:

> We have observed that once a node starts using swap, even though all memory   
> pressure has been removed, there is still swap-related activity going on.     
> I have attached foo.c, a program which simply allocates memory and touches    
> it. On a freshly rebooted node with 2G, I can run "foo 3", which allocates    
> 3*512MB of memory and memsets it, in a deterministic time, e.g. 4sec. If I    
> once run "foo 4", allocating 4*512MB and causing a little swap activity,      
> then subsequent runs of "foo 3" run in very nondeterministic times, from      
> 8sec to 20sec.  Watching /proc/meminfo is interesting while "foo 3" runs -    
> swap never seems to be reclaimed, and MemFree never goes back to a            
> reasonable value. Our users are reporting that their long-running             
> computationally intensive codes run substantially slower after swap has       
> been "touched" in this way, so there may be more going on than we have been   
> able to reproduce trivially.                       
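
As a convenience when reproducing this, the swap and free-memory counters can be
watched from a second terminal while the test program runs; the exact field names
in /proc/meminfo differ slightly between kernel versions, so a loose pattern works:

# run in a second terminal while foo is executing; refreshes every second
watch -n1 "grep -E 'MemFree|Cached|Swap' /proc/meminfo"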

Version-Release number of selected component (if applicable):

kernel-2.4.9-21

How Reproducible:

See attached foo.c

Additional Information:
	
This was reported in email to Brian Matsubara <bmm@redhat.com>,
Eric Nolen <nolen@redhat.com>, and Dion Gengler <dgengler@redhat.com>.
Comment 1 Jim Garlick 2002-02-19 14:59:07 EST
Created attachment 46080 [details]
allocate memory and walk through it
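The attachment itself is not reproduced in this record. As a rough sketch of what
such a test program typically looks like (the real foo.c in attachment 46080 may
differ in details such as the exact chunk size and error handling), the following
allocates N chunks of roughly 500MB, memsets each one, and keeps them all resident
until exit:

/* hypothetical reconstruction, not the attached foo.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (500UL * 1024 * 1024)   /* ~500MB per chunk, matching the "500M" output */

int main(int argc, char **argv)
{
    int i, n = (argc > 1) ? atoi(argv[1]) : 1;

    for (i = 0; i < n; i++) {
        char *p = malloc(CHUNK);
        if (p != NULL)
            memset(p, 0, CHUNK);      /* touch every page of the chunk */
        perror("500M");               /* prints "500M: Success" when nothing failed */
        /* chunks are intentionally never freed, so all of them stay resident */
    }
    return 0;
}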
Comment 2 Arjan van de Ven 2002-02-28 05:28:31 EST
Can you try the pensacola (beta release) kernel?
It is also 2.4.9 based, but with a differently tuned VM.
Comment 3 Jim Garlick 2002-02-28 09:12:30 EST
Can you elaborate on where to find this kernel, what is different about its VM
tuning, and why it is likely to fix my problem?
Comment 4 Arjan van de Ven 2002-02-28 09:22:54 EST
ftp://ftp.redhat.com/pub/redhat/linux/beta/pensacola/en/os/i386/RedHat/RPMS

What is different:
The "when to swap out" logic has been changed, and we expedited "dead swap"
detection (e.g. swap used by a program that has exited).
Basically, 2.4 kernels remove such dead swap "lazily", and it seems that is not
fast enough. We made the detection more aggressive.
Comment 5 Ben Woodard 2002-03-13 12:09:31 EST
They tried this with pensacola and it didn't work.
Comment 6 Alan Cox 2002-03-15 15:24:09 EST
Ok, there are some real horrors lurking in the questions we need to work out --
stuff from cache colouring on up.

Firstly: if you get a machine into this state where you see reduced throughput, do
you continue to see it if you turn swap off and then back on again?

Are these codes typically sweeping large contiguous blocks of memory?
Comment 7 Ben LaHaise 2002-03-15 15:27:57 EST
Also, is the problem as severe with the rmap based kernels?
Comment 8 Michael K. Johnson 2002-03-15 15:49:29 EST
(We can provide an rmap-based kernel for testing)
Comment 9 Ben Woodard 2002-03-15 17:16:37 EST
Yes, the problem goes away when you swapoff and then swapon, until the next time
you touch swap.
The problem doesn't seem to appear on my laptop, which only has 256MB of memory.
This tends to uphold Jim's belief that it has something to do with bigger memory. I
tried 2.4.7-10 as well as 2.4.9-31.
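
For reference, the swapoff/swapon cycle mentioned above amounts to the following;
it has to be run as root, and swapoff must first pull any still-swapped pages back
into RAM, so it can take a while on a busy node:

# as root, between jobs: drain swap back into RAM and start with a clean slate
swapoff -a && swapon -a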
Comment 10 Ben Woodard 2002-03-15 17:21:53 EST
I would say yes, the programs are using large blocks of memory. These
hydrodynamics programs often do allocate a large portion of
physical memory and use it all.  Users tend to adjust the memory footprint
of their programs close to the brink of swapping and occasionally slip over the
edge briefly. As for how they use the memory once they have it, that is currently
unknown. However, the problem does show up when you simply memset a previously
allocated large block of memory.

Please take a look at the attached file. According to the customer, this program
reproduces the problem on his machines.
Comment 11 Gavin Romig-Koch 2002-03-15 17:35:57 EST
johnsonm@redhat.com asks:
What are the parameters of the acceptable solution space?  Is a
2.4.9 kernel a requirement?  Can we ask them to test a 2.4.18
kernel to gather more data?
Comment 12 Michael K. Johnson 2002-03-15 17:38:55 EST
To be clear, the 2.4.18 test kernel request is relative to the rmap VM.
We've tried to put the rmap VM into 2.4.9 for testing, and it will be
easier to stabilize our 2.4.18 tree than to stabilize the rmap VM in 2.4.9;
we don't think it's really practical to shoehorn the rmap VM into 2.4.9.
Comment 13 Ben Woodard 2002-03-15 17:49:13 EST
I believe that they would be more than happy to move to a 2.4.18 based kernel.
The only reason that they have been sticking with the 2.4.9 kernel is that their
only means of support is Red Hat, so they felt the need to stick with a Red Hat
provided kernel. They have also been getting pressure from some of their other
vendors to use a 2.4.10+ kernel to better support ia64, a specialized high speed
interconnect, and a new network filesystem called Lustre that is being
developed for scientific applications.
Comment 14 Ben Woodard 2002-03-15 18:35:18 EST
SORRY!! I MISSPOKE.

I just talked to the people in charge of the computers affected by this problem
and I found out that I misspoke. They would like this problem solved in the
2.4.9 series of kernels rather than moving to 2.4.18. At some later date they
would like to move to the 2.4.18 series -- however, they cannot do that right
now. The problem is that this cluster of computers is going to go into
production in two weeks and they don't have time to completely test out a new
kernel.
Comment 15 Michael K. Johnson 2002-03-15 22:10:30 EST
Well, there's no way we would have a 2.4.18-based kernel ready for production
use in two weeks, since we haven't even had a single public beta of it yet.
So I can't do anything but agree that they won't be able to trust 2.4.18
in that time frame.

Are they willing to even test a 2.4.18-based kernel to see if it's any better
for them with their real-life workloads, or are they only willing to try
things aimed at instant relief?

Can we suggest workarounds for a 2.4.9-based kernel that involve modifications
to their codes?

I'm looking for more parameters for a workable short-term solution here.
Comment 16 Ben Woodard 2002-03-18 10:14:02 EST
I will be willing to try out a 2.4.18 kernel on their machines and with their
workload and verify if the problem exists.
Comment 17 Michael K. Johnson 2002-03-18 12:02:14 EST
2.4.18-based kernel with rmap VM provided to Ben Woodard for testing, to
see if the rmap VM does a better job of choosing what to swap when in this
case.
Comment 18 Michael K. Johnson 2002-03-18 12:53:42 EST
Alan wrote up a set of suggestions of possible changes to llnl's codes
to help the kernel do the right thing; I have provided this writeup to
Ben Woodard in email.
Comment 19 Ben Woodard 2002-03-18 17:34:16 EST
I personally verified the reported results. Here is the exact output from a
terminal:
[ben@edev25 Kernels]$ time ./a.out 3
500M: Success
500M: Success
500M: Success

real	0m4.364s
user	0m0.460s
sys	0m3.910s
[ben@edev25 Kernels]$ time ./a.out 3
500M: Success
500M: Success
500M: Success

real	0m4.337s
user	0m0.400s
sys	0m3.940s
[ben@edev25 Kernels]$ time ./a.out 3
500M: Success
500M: Success
500M: Success

real	0m4.344s
user	0m0.430s
sys	0m3.910s
[ben@edev25 Kernels]$ time ./a.out 3
500M: Success
500M: Success
500M: Success

real	0m4.353s
user	0m0.440s
sys	0m3.920s
[ben@edev25 Kernels]$ time ./a.out 4
500M: Success
500M: Success
500M: Success
500M: Success

real	0m23.572s
user	0m0.630s
sys	0m8.950s
[ben@edev25 Kernels]$ time ./a.out 3
500M: Success
500M: Success
500M: Success

real	0m11.463s
user	0m0.460s
sys	0m6.380s
[ben@edev25 Kernels]$ time ./a.out 3
500M: Success
500M: Success
500M: Success

real	0m9.247s
user	0m0.460s
sys	0m8.000s
[ben@edev25 Kernels]$ time ./a.out 3
500M: Success
500M: Success
500M: Success

real	0m13.500s
user	0m0.480s
sys	0m7.420s
[ben@edev25 Kernels]$ time ./a.out 3
500M: Success
500M: Success
500M: Success

real	0m12.272s
user	0m0.470s
sys	0m8.060s

Comment 20 Michael K. Johnson 2002-03-18 18:21:12 EST
I'd like to decompose this bug report a bit, to make sure that our
response clearly addresses the concerns.

> We have observed that once a node starts using swap, even though all memory   
> pressure has been removed, there is still swap-related activity going on.     
> I have attached foo.c, a program which simply allocates memory and touches    
> it. On a freshly rebooted node with 2G, I can run "foo 3", which allocates    
> 3*512MB of memory and memsets it, in a deterministic time, e.g. 4sec. If I    
> once run "foo 4", allocating 4*512MB and causing a little swap activity,      
> then subsequent runs of "foo 3" run in very nondeterministic times, from      
> 8sec to 20sec.

Is it the nondeterminism or the slow speed that is the primary complaint
here?

How close is this test program to the codes being run?  This test program
is just about pessimal for normal swap algorithms; a linear scan or several
linear scans in parallel are the worst case for LRU page aging.  The page
most likely to be evicted (as least recently used) is about to be used
because it's next in linear order.  Therefore, once even a few pages need
to swap, the program will run much, much slower, and especially in your case
where there are multiple linear scans in parallel, will not be
deterministic, either.

The suggestions that Alan provided, which have been sent to Ben Woodard in
email and should have been distributed to you, describe multiple ways to deal
with this "degenerate" case for LRU, and should help both with the speed
reduction and with the loss of determinism.
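
Alan's writeup itself is not attached to this bug, so the following is only a
generic illustration of one way an application can shield a hot working set from
this kind of eviction (it is not necessarily one of the suggestions he sent):
pinning the array with mlock(), subject to having enough RAM and the necessary
privileges:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Illustrative only: pin a buffer so the VM will not page it out. */
int pin_buffer(void *buf, size_t len)
{
    if (mlock(buf, len) != 0) {
        perror("mlock");
        return -1;
    }
    return 0;
}

int main(void)
{
    size_t len = 64UL * 1024 * 1024;      /* demo size: 64MB */
    void *buf = malloc(len);
    if (buf != NULL && pin_buffer(buf, len) == 0)
        printf("pinned %lu bytes\n", (unsigned long)len);
    return 0;
}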

> Watching /proc/meminfo is interesting while "foo 3" runs -    
> swap never seems to be reclaimed, and MemFree never goes back to a            
> reasonable value. Our users are reporting that their long-running             
> computationally intensive codes run substantially slower after swap has       
> been "touched" in this way, so there may be more going on than we have been   
> able to reproduce trivially.

This part of the report was why we originally thought that the pensacola
kernel might fix your problem -- we did fix a bug where swap cache was
not being reclaimed.  Are you still seeing this particular behavior when
testing with the pensacola kernel?  (I think that this bit ended up
something of a red herring for us.)

I hope this is useful.
Comment 21 Ben Woodard 2002-03-18 19:26:29 EST
I believe the problem is a little different than you seem to be portraying it.
The problem is not with the performance of one application. The problem is
that one application can leave the system almost permanently in a state in which
all subsequent applications' performance suffers.

The problem is not that sweeping memory is the worst case for LRU. That is well
understood and accepted. The problem is that once a few pages of swap have been
used, the kernel never seems to recover, and every subsequent application is
forced to use swap to one extent or another even if its memory footprint is much
smaller than the available physical memory.
Comment 22 Ben Woodard 2002-03-18 19:40:38 EST
> Is it the nondeterminism or the slow speed that is the primary complaint
> here?

The expectation is that the same workload will run with a constant performance 
every time if it starts on an idle node.  What is demonstrated by the test
case is that Linux will not recover from past swap activity even when the
node is idle.  You can "touch swap" and let the node sit idle overnight,
and the next time you run your app it will run with reduced performance.

The original expectation is set by running a program, on a system that has never
touched swap, in a way that itself never touches swap. Against that baseline, the
fact that Linux's performance drops in the neighborhood of 30% after the first
application touches swap is not seen as acceptable.

> How close is this test program to the codes being run?

The codes being run vary greatly. Our users will generally run with RSS
footprints just short of what they think will cause the system to swap.  They
may occasionally step over the edge, for example during some intermediate
calculation or collective operation, then return to long periods of running just
under.

In the presence of this bug, they are hosed from the moment they hit swap.
Worse, subsequent users scheduled on the node who may run with a more
conservative RSS are also hosed.  The only cure is a reboot or swapoff/
swapon between jobs.  In a batch environment this is not practical.
Comment 23 Ben LaHaise 2002-03-18 19:40:59 EST
FYI, this problem is probably fixed in newer kernels with rmap as rmap already
does non-lazy swap cache reclaim. (For example, the below is a series of runs
with 2.4.19-pre3-ac1.)  There is still variability, but that's part of life with
VM.


[bcrl@toolbox ~]$ time ./foo3 4
500M: Success[bcrl@toolbox ~]$ time ./foo3 4
500M: Success
500M: Success
500M: Success
500M: Success

real	0m25.648s
user	0m0.850s
sys	0m11.070s

500M: Success
500M: Success
500M: Success

real	0m31.580s
user	0m0.890s
sys	0m11.530s
[bcrl@toolbox ~]$ time ./foo3 4
500M: Success
500M: Success
500M: Success
500M: Success

real	0m32.036s
user	0m0.910s
sys	0m11.420s
[bcrl@toolbox ~]$ time ./foo3 4
500M: Success
500M: Success
500M: Success
500M: Success

real	0m24.979s
user	0m0.940s
sys	0m11.630s
Comment 24 Ben Woodard 2002-03-19 13:10:32 EST
How much memory did this computer have?
How many processors?

One thing that your test did not show is the true crux of the issue:
do you get different levels of performance with a process that fits totally
within real memory before and after swap has been touched?

Let me restate the problem as LLNL sees it:

They run a program that fits totally within real memory, and the time it takes to
run sets their expectation of how long that program will run in the future.

They now run a program which touches a minuscule number of pages of swap.

They re-run their first program (which should still fit totally within
real memory) and now it runs 30% slower. This performance degradation persists
until they reboot the machine or until they do a "swapoff/swapon".

Comment 25 Michael K. Johnson 2002-03-19 14:18:34 EST
That is, you want bcrl to do
./foo3 3
./foo3 4
./foo3 3
./foo3 3
./foo3 3
./foo3 3
and compare the times of all the "./foo3 3" runs, right?
Comment 26 Ben Woodard 2002-03-19 17:22:59 EST
I was just able to verify that the problem exists on 2.4.9-31 running in single
processor mode on their hardware. This is an important clue.

I was unable to reproduce the problem on a single processor laptop with 256MB of
memory. The two substantial differences are the number of processors and the
amount of memory. I believe that the fact that the problem still appears using a
uniprocessor kernel on the 2GB machine strongly suggests that the problem is
related to the large amount of memory in the system.

For reference the run times for this test were:
4.093
4.091
4.092
4.092
<touch swap>
10.269
10.027
10.127

Note that the runtime increases to roughly 250% of the original (about 2.5 times
slower). That is why they are so anxious to have this problem fixed.
Comment 27 Ben Woodard 2002-03-19 17:53:24 EST
That is correct, to reproduce the bug do the following on a machine with 2GB of RAM:
time ./foo3 3
time ./foo3 3
time ./foo3 3
./foo3 4
time ./foo3 3
time ./foo3 3
time ./foo3 3
time ./foo3 3

The thing to note is that the first three runs will be very close to each other. The
run after the "4" parameter will take substantially longer than any of the other
"3" runs; this is generally expected and accepted. However, the following "3" runs
are expected to take about the same amount of time as the first three runs. The
bug is that the final "3" runs take roughly 2-3 times as long as the original three
runs. The final runs do take approximately the same amount of time as each other;
they just take 2-3 times as long as the original three runs.
Comment 28 Ben Woodard 2002-03-20 12:52:24 EST
That new kernel vmlinuz-2.4.18-0.4smp doesn't work at all!!!!
"time ./foo 3" works fine but "time ./foo 4" fills up the console with:

VM: refill_inactive, wrong page on list.

It looks like we just flushed a bug in the new VM out into the open.

Comment 29 Ben Woodard 2002-03-20 14:09:37 EST
BTW, the 2.4.18-0.4smp kernel was also almost 1 second slower on the 1.5GB test
runs.

i.e. 2.4.9-31 ran "foo 3" in an average time of 4.092 seconds, whereas 2.4.18-0.4smp
ran "foo 3" in an average time of 4.968 seconds.
Comment 30 Michael K. Johnson 2002-03-20 14:41:11 EST
Created attachment 49244 [details]
typescript of foo on 2GB machine with greedy swap reclaim
Comment 31 Michael K. Johnson 2002-03-20 14:42:45 EST
Ben LaHaise has ported greedy swap reclaim to the 2.4.9-31 kernel,
and the typescript I just posted is from that kernel.  We'll provide
it for testing.
Comment 32 Michael K. Johnson 2002-03-20 18:13:18 EST
Created attachment 49304 [details]
It would help if I posted the right file
Comment 33 Michael K. Johnson 2002-03-20 18:14:54 EST
My first attachment was the wrong script; it was from the kernel without
the greedy swap reclaim.  <blush>  The "foo.script.3" script is the one
I should have posted in the first place, the one run with the kernel with
the greedy swap reclaim.  Sorry for the confusion, it was entirely my
fault.
Comment 34 Ben Woodard 2002-03-21 18:20:39 EST
I tested this on their machine and the patched kernel does seem to solve the
problem as advertised. I will submit this kernel to the customer's sysadmin staff
for stability testing.
Comment 35 Michael K. Johnson 2002-03-22 10:28:02 EST
Overnight, a kernel with this patch passed our stress-kernel testing
on one machine with 2GB of memory.  We'll be doing more testing today.
Comment 36 Michael K. Johnson 2002-03-25 18:02:15 EST
All of our additional stress testing, on a multitude of machines with
different memory and CPU configurations, passed.
Comment 37 Ben Woodard 2002-03-25 19:03:01 EST
The only thing further that the client requires is a test in your stress-kernel
script which checks for this problem, so that the problem doesn't reappear.
Comment 38 Michael K. Johnson 2002-03-27 11:47:35 EST
This is not something that can be tested in stress-kernel.
Comment 39 Michael K. Johnson 2002-03-27 11:57:13 EST
Assigned to Brian to make an automatable test out of foo.c
and add it to the release test scripts.
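As a sketch of what such an automated check might look like (the paths, threshold,
and helper names here are made up for illustration; the actual test that was
integrated is not shown in this bug):

#!/bin/sh
# Hypothetical regression check built around foo.c: time several "foo 3" runs,
# touch swap once with "foo 4", then fail if the later "foo 3" runs are more
# than 25% slower than the baseline.
FOO=./foo
TMP=/tmp/foo_time.$$

avg3() {
    total=0
    for i in 1 2 3; do
        /usr/bin/time -f %e -o $TMP $FOO 3 >/dev/null 2>&1
        total=$(echo "$total + $(cat $TMP)" | bc)
    done
    echo "scale=3; $total / 3" | bc
}

before=$(avg3)
$FOO 4 >/dev/null 2>&1          # touch swap once
after=$(avg3)
rm -f $TMP

limit=$(echo "$before * 1.25" | bc)
if [ "$(echo "$after > $limit" | bc)" -eq 1 ]; then
    echo "FAIL: foo 3 slowed from ${before}s to ${after}s after touching swap"
    exit 1
fi
echo "PASS: ${before}s before touching swap, ${after}s after"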
Comment 40 Brian Brock 2002-03-27 14:45:32 EST
working on it, I'll post when I'm convinced the automation is seamless.
Comment 41 Brian Brock 2002-04-01 10:18:57 EST
Tested on 2.4.9-31 and recent 2.4.18-0.12 from rawhide (rawhide yielded better
results and appeared to reclaim swap very aggressively).

Test is now integrated into the standard release tests, please let me know if
you'd like detailed results for specific kernels.
Comment 42 Michael K. Johnson 2002-04-01 11:29:30 EST
To clarify, 2.4.9-31 scores a "does not pass" and 2.4.18-0.12 scores
as "pass", and as this test has been integrated into our release
processes, I'm going to close this bug now.
Comment 43 Ben Woodard 2002-04-05 17:09:46 EST
Quick question: what is the "standard release test"?
Comment 44 Michael K. Johnson 2002-04-05 17:15:03 EST
Tests QA runs before passing a kernel.
(BTW, you don't have to reopen a bug to ask a question.)
