Bug 119351

Summary:

Getting OOM errors on an unconstrained system

Product:

Red Hat Enterprise Linux 3

Reporter:

Jim Richard <jrichard>

Component:

kernel

Assignee:

Dave Anderson <anderson>

Status:

CLOSED ERRATA

Severity:

high

Priority:

medium

Version:

3.0

CC:

lwoodman, petrides, riel, tao

Hardware:

i686

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2005-05-18 13:27:20 UTC

Attachments:

Description	Flags
LARRD Chart of memory usage during incidents described earlier	none
Memory Chart during reproduction of problem.	none
messages.log from 4/3/2004 gzipped	none
Gzipped messages.log from 4/4/2004	none
Script that puts pressure on active page utilization	none
Gzipped Messages.log With AltSysRq-m output	none
OOM with AltSysRq-M and Slabinfo dump	none

Description Jim Richard 2004-03-29 19:33:51 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4)
Gecko/20011019 Netscape6/6.2

Description of problem:
The kernel is intermittently killing DB2 processes on a machine that
appears to be unconstrained. Events can occur after the DB has been up
for 2 hours or 2 weeks. We receive the following in /var/log/messages:

DB up for one day at this point...
Mar 28 20:20:12 architect kernel: Out of Memory: Killed process 5698
(db2sysc).

Restart DB at ~ 20:40
Mar 28 22:40:15 architect kernel: Out of Memory: Killed process 21570
(db2sysc).

Mar 28 22:40:20 architect kernel: Out of Memory: Killed process 21431
(db2sysc).

Restart DB at ~ 23:00 and it's still running.

Version-Release number of selected component (if applicable):
2.4.21-9.0.1.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Start DB2
2. Run mixed workload, transaction/batch
3. Failure is intermittent (3 to date)


Actual Results:  We've had 3 failures. The first occurred after 12 days
of normal operation, the second after 5 days and the third after 2
hours. There were reboots between the first and second. Between the
second and third occurrences only 2 hours elapsed, with no
reboot.

The kernel sent a random db2 process a SIGKILL in an effort to relieve a
non-existent memory constraint. This causes db2 to terminate all
applications.

Expected Results:  Since memory did not appear to be constrained, I'd
expect things to continue to operate normally.

Additional info:

5-minute snapshots of the free command are taken on this system and
recorded in an rrd database. During the time in question, Free Memory was ~10% or
1.6 GB; Used-Cache was ~20% or 3.2 GB; Swap was at ~176 KB or 0% of 8 GB.

Comment 1 Jim Richard 2004-03-29 19:35:53 UTC

Created attachment 98943 [details]
LARRD Chart of memory usage during incidents described earlier

Chart documents memory utilization during OOM incidents.

Comment 3 Dave Anderson 2004-03-30 14:16:53 UTC

It's not so much a matter of total free memory, but most likely
a lowmem (<1GB) memory deficit that is the problem here.

This issue will need retesting when the U2 beta becomes available
for external sites (probably within a week?).  In the U2 kernel
the default page reclamation aggressiveness has been re-tuned, which
may address the problem in this case.  In the interim, please try the
following:

$ echo 30 > /proc/sys/vm/inactive_clean_percent

It is currently set to 5 percent by default, but will be set to 30 in
the U2 kernel.  (It can be set as high as 100, but we'd like to know
whether the new default alleviates your particular problem.) 

If that does not help, the U2 kernel will dump a complete
"Alt-sysrq-m" type output with each OOM kill.  Without that debug
data, it's impossible to know what VM circumstances precipitated
each OOM kill.

Comment 4 Jim Richard 2004-03-30 16:28:25 UTC

Is there any way for me to make my current kernel provide Alt-sysrq-m
output? I'm at the latest post-U1 errata for AS 3. Also, are there any
other diagnostics I can turn on that will assist in problem
resolution?

BTW, after the last crash, my Google research suggested that I
set /proc/sys/vm/inactive_clean_percent to 30, and I've already
done that. But thanks for the confirmation.

I was also wondering if the problem might be caused by the fact that
I've only got 8G of swap on a 16G system. I've not had much luck
in locating recommendations for swap on large-memory systems. I
originally configured the system this way due to a shortage of OS
disk space. I've resolved that issue, and if I need another 8G of
swap I can add it.

Comment 5 Dave Anderson 2004-03-30 16:41:19 UTC

Unfortunately there is no way to make your current kernel
provide Alt-sysrq-m data precisely at the time of the OOM
kill, which is what is required.  

You mention that you've updated the inactive_clean_percent
"after the last crash".  Does this mean that the 30% setting
has resulted in no further OOM kills?

As to the swap question, since your free data shows no problem
with swap utilization, that should not be an issue.

Comment 6 Jim Richard 2004-03-30 18:08:47 UTC

Dave,

Thanks for your response.

Can you let me know where I can pick up the U2 beta?

We have not experienced further OOM events since the update; but then
again, we haven't attempted to reproduce the error either.

About the swap space: I was concerned about this since I'm not
intimate with all the Linux VMM internals. Some systems are very good
about this kind of arrangement, and others I've worked with are not
so graceful.

When the first occurrence happened, we identified a 3rd party app that
was abusing memory, fixed the 3rd party app, and applied the last
security kernel, which had some VMM fixes. We ran in test for 2 weeks and
called it good.

The last 2 errors happened during final testing prior to release into
production, so we fell back to the server this new one is replacing.
We are in the process of doing a full review of memory utilization by
db2 in an attempt to better understand what is using what, before we
attempt to reproduce the error.

We will probably begin testing activities tomorrow morning. The plan
will include running with both 5 and 30 set
in /proc/sys/vm/inactive_clean_percent.

Thanks again for your help with this..

Jim

Comment 7 Dave Anderson 2004-03-30 18:33:41 UTC

> Can you let me know where I can pick up the U2 beta?

I will do that, it should be available in a few days...

Dave

Comment 8 Jim Richard 2004-03-31 01:21:42 UTC

Dave,

Fyi:

We had also opened a PR with IBM, in case db2 was doing something
untoward with memory. They have reviewed our configuration and the
DB2 crash diagnostics and feel they are clean. They did provide some
feedback regarding our db2 configuration but nothing significant
related to this discussion. They have closed the ticket, but will
reopen it if we identify something suspicious in db2's behavior. This
seems reasonable since the crash was caused externally (kill -9).

Thought you should know.

Jim

Comment 9 Jim Richard 2004-04-05 00:43:10 UTC

Created attachment 99100 [details]
Memory Chart during reproduction of problem.

Dave,

I've reproduced the problem using inactive_clean_percent set to 5 30 90. When
set to 60, page cleaning kicked in, in time to prevent the problem. I've
identified the source of the pressure on the active pages, and I've also identified
the source of pressure on the overall cache. The cache is polluted by our backup
process (we use Mondo Archive). The pressure on active memory was caused by a
script we implemented to alleviate the random number generator being run out of
entropy to the point where it never recovers (documented in BUG #s 117218 and
119526). The script randomly calls misc commands designed to generate I/O in
order to stir in entropy. When entropy becomes chronically exhausted the script
begins to run almost continuously. When this happens, active pages begin to
rise steadily until either OOM starts or the dead pages are cleaned from the
active pool.

I am attempting to recreate using inactive_clean_percent=100. Meanwhile I've
attached my messages.log from yesterday and today, the script used to generate
entropy, and the memory utilization chart from yesterday. 

Please let me know if I can provide any other information, or if you have
suggestions on other vm parameters that could use tweaking. We can live without
the GenEntropy script, but the backups must obviously continue. I'm also
concerned about other script-based daemons that could contribute to active page
utilization. For instance, we use BigBrother, which is mostly script based.
Granted, it only runs every 5 minutes, but over extended periods of time it could
present similar problems. We also have some application-specific scripts that
run. I'd like to find a way to prevent these tasks from polluting the Active
Memory pages, once their terminate normally.

Comment 10 Jim Richard 2004-04-05 00:46:34 UTC

Created attachment 99101 [details]
messages.log from 4/3/2004 gzipped

gzipped messages.log from 4/3/2004

Comment 11 Jim Richard 2004-04-05 00:47:47 UTC

Created attachment 99102 [details]
Gzipped messages.log from 4/4/2004

Gzipped messages.log from 20040404

Comment 12 Jim Richard 2004-04-05 00:49:22 UTC

Created attachment 99103 [details]
Script that puts pressure on active page utilization

Script designed to stir entropy, that puts pressure on active page utilization

Comment 13 Jim Richard 2004-04-05 00:52:02 UTC

Typo in last line of my last comment... Should read:

...I'd like to find a way to prevent these tasks from polluting the Active
Memory pages, once their children terminate normally.

Comment 14 Dave Anderson 2004-04-05 13:29:48 UTC

Setting inactive_clean_percent to 100 is the best course of
action for now.

However, if you can gather Alt-sysrq-m data when the system
is just about to start OOM-killing, there may be enough info
to help.

Comment 16 Dave Anderson 2004-04-05 15:34:49 UTC

Jim,

Re-reading your post, now I'm a bit confused:

> I've reproduced tha problem using inactive_clean_percent set
> to 5 30 90. When set to 60, page cleaning kicked in, in time to
> prevent the problem.

/proc/sys/vm/inactive_clean_percent is a single percentage value.
When you say "5 30 90", are you referring to /proc/sys/vm/pagecache?
Exactly what did you set to "60"?  The pagecache max percent?  And
if so, what was the inactive_clean_percent value at that time?

Comment 17 Jim Richard 2004-04-05 17:41:45 UTC

Dave,

Sorry I should have been clearer, I reproduced the problem multiple
times using different settings on inactive_clean_percent. So at this
point I've produced the problem on 5 occasions, each using a different
setting for inactive_clean_percent. BTW last night I re-created the
situation using inactive_clean_percent set to 100. I haven't tried any
other settings at this point. 

The funny thing about this whole thing is that the system is reporting
between 2 and 5 gigs of active memory when there is nowhere near this
much being used by any process. Typically there is less than 1 Gig in
use by processes and IPC shared memory that I can tell. (DB2 makes
extensive use of shared memory.)

The GenEntropy script runs bunches of commands that just terminate
normally, yet this appears to be what is driving the active memory
counter up over time. When the script is not running I see no rise in
active memory.

Unfortunately it will be very difficult to catch the Alt-sysrq-m
data since the timing of the event is not predictable. It can run for
hours at a specific level of utilization before the problem occurs.
For this, I think we'll have to wait for the new kernel. 

It seems to me the kernel is keeping pages marked active when the
process that owned them is long gone. If you review the chart I sent
last night, note the sudden drops in memory utilization at ~17:30 and
7:30 am. The drop at 17:30 happened while running
inactive_clean_percent=60. This drop happened without any external
events, no process terminations or OOM events. Something just woke up
and released almost 5Gigs of memory. I'm not sure I understand the
behavior or what drove the release. The release at 7:30 am happened in
the middle of multiple OOM events. The OOM events began at 5:00 and
continued until the release. A total of 139 processes were killed.

Are there any knobs we can turn to force more frequent release of
"Active" pages back to the inactive pool, since holding pages active
seems to be the problem? Is there something we can do with
/proc/sys/vm/pagecache? I've also seen references to /proc/sys/vm/freepages
(it doesn't appear in my /proc, though). Is this an option?

Thanks again for your help with this.

Jim

Comment 18 Dave Anderson 2004-04-05 18:17:20 UTC

I guess I don't understand what you mean exactly by "active memory",
and how you determine what it is?

The Alt-sysrq-m output shows exactly what the page counts are on
each list in each zone, and in particular, it gives the exact 
breakdown of the currently-active pages, be they either (1) pagecache
pages or (2) anonymous memory pages used by processes.

For example, here are the first few lines of an Alt-Sysrq-m output:

SysRq : Show Memory
Mem-info:
Zone:DMA freepages:  2902 min:     0 low:     0 high:     0
Zone:Normal freepages:176074 min:  1279 low:  4544 high:  6304
Zone:HighMem freepages:2118244 min:   255 low: 34304 high: 51456
Free pages:      2297220 (2118244 HighMem)
( Active: 11955/1050, inactive_laundry: 317, inactive_clean: 0, free:
2297220 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2902
  aa:0 ac:3926 id:1 il:0 ic:0 fr:176072
  aa:2483 ac:5548 id:1049 il:317 ic:0 fr:2118244
...

The lines above starting with "aa:" give the page counts per zone,
first the DMA zone, then the Normal zone, and last the Highmem zone.
The letters are shorthand for:

  aa: active anonymous memory pages
  ac: active pagecache pages
  id: inactive_dirty pages
  il: inactive_laundry pages
  ic: inactive_clean pages
  fr: free pages

The id, il, ic page lists contain combined anonymous/pagecache pages,
but the "active" page list is broken down into two sub-lists, the aa:
and ac: lists.
 
We're guessing that the pagecache is being flooded and not flushed
quickly enough to avoid OOM kills when a user process is attempting to
allocate a page.  Setting inactive_clean_percent to 100 is the most
important tuning knob to keep page reclamation going as aggressively
as possible.

Another thing you could try is tinkering with /proc/sys/vm/pagecache
values, specifically the third ("max") value, which is set to 100
(percent) by default.  If the percentage of active pages that are
being used by the pagecache goes above that max percentage value,
then only pagecache pages will be reclaimed, and anonymous memory
pages will be left alone.  Since its default value is 100, the
active page list is allowed to be totally consumed with pagecache
pages.  So, if you set it to a lower value, pagecache pages will be
selected for reclamation in preference to anonymous memory pages. 
It's not a hard limit, but it does influence page reclamation, and
will keep user process memory around longer. But it's impossible to
predict whether it will help.  With the U2 kernel, the Alt-sysrq-m
will show exactly the page count state that precipitated the OOM kill;
it's strictly a matter of numbers at that point in time.
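
The per-zone lines described above are easy to total mechanically. The following is a minimal Python sketch, not a Red Hat tool: the names parse_zone_counts and pagecache_share are hypothetical, and it assumes the six-field "aa: ... fr:" line format and the DMA, Normal, HighMem ordering exactly as shown in the sample output.

```python
import re

# Field order follows the legend in the comment above.
FIELDS = ("aa", "ac", "id", "il", "ic", "fr")
# Zone order as described above: DMA first, then Normal, then HighMem.
ZONES = ("DMA", "Normal", "HighMem")

# Matches one per-zone line; search() tolerates a syslog prefix before "aa:".
ZONE_LINE = re.compile(r"\s+".join(rf"{name}:(\d+)" for name in FIELDS))


def parse_zone_counts(sysrq_text):
    """Return {zone: {field: pages}} for the first dump found in the text."""
    rows = []
    for line in sysrq_text.splitlines():
        match = ZONE_LINE.search(line)
        if match:
            rows.append(dict(zip(FIELDS, map(int, match.groups()))))
    # zip() stops after three rows, so any later dumps are ignored.
    return dict(zip(ZONES, rows))


def pagecache_share(zone):
    """Percentage of a zone's active pages that are pagecache (ac) pages."""
    active = zone["aa"] + zone["ac"]
    return 100.0 * zone["ac"] / active if active else 0.0


SAMPLE = """\
  aa:0 ac:0 id:0 il:0 ic:0 fr:2902
  aa:0 ac:3926 id:1 il:0 ic:0 fr:176072
  aa:2483 ac:5548 id:1049 il:317 ic:0 fr:2118244
"""

zones = parse_zone_counts(SAMPLE)
for name, counts in zones.items():
    print(name, sum(counts.values()), f"{pagecache_share(counts):.1f}% pagecache")
```

Totalling the aa: through fr: counts per zone gives the number of pages cycling through reclamation in that zone, and the ac share of active pages is the figure that the /proc/sys/vm/pagecache "max" value is compared against.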

Comment 19 Jim Richard 2004-04-05 18:46:56 UTC

Dave,

Thanks for the response, I'll give the pagecache knob a try.

It looks like I can script the Alt-sysrq-m by running the following cmd:
  echo "m" > /proc/sysrq-trigger

When I did this it wrote the info to messages.log. 
Do you think once a minute would be frequent enough or do you think
more frequent recording is in order?

Let me know your thoughts and I'll set up another test.

Thanks again!

Jim

Comment 20 Dave Anderson 2004-04-05 19:35:23 UTC

It might be helpful, although unfortunately the problem with doing
what you propose is that the bash shell process running the "echo"
script may need memory, but probably won't be able to get any when the
system gets into the memory-starved state.  By the time it does run,
the OOM kill has happened, or the memory has been made available,
etc...

Comment 21 Jim Richard 2004-04-05 20:27:20 UTC

Created attachment 99121 [details]
Gzipped Messages.log With AltSysRq-m output

Dave,

I thought it'd be interesting, and ran it anyway. The extract of our
messages.log is attached. Also, since echo is a built-in, I don't think it'll
require memory, since any library code should already be in main storage... If
you think it'd be better I could run it under busybox or some other statically
linked shell, with all required functionality implemented in the shell.

Regardless I have AltSysRq-m data gotten 20 seconds before db2sysc was killed
by OOM. Another one 17 seconds before db2bp (db2 commandline) was killed by
OOM, and another 5 seconds before another db2bp got the ax... There's more in
there but these are the closest dumps of memory info to OOM events. 

Let me know if anything leaps out at you.

Thanks,

Jim

Comment 22 Dave Anderson 2004-04-05 21:11:23 UTC

Yes, something does leap out...

The page counts for the Normal zone show a remarkably small number
of pages being cycled through the pagecache/anonymous-memory
reclamation process, typically around 5000 pages.  (total the aa:
through fr: counts for the Normal zone in any of the sysrq-m outputs)

The Low/Normal zones in your system start with 896MB of memory, or
about 225,000 pages.  Subtract from that the kernel's text and data,
most notably the mem_map array taking 60 bytes per page of physical
memory (~65,000 pages), and the remaining amount of available Normal
memory pages would be roughly 160,000 pages.  These pages are made
available for the free page lists, but also for kernel memory
allocations that must come from low memory, such as for the kmalloc()
slab cache.  And that's where the problem is here -- an unusually
large number of pages (~149000 in each sysrq-m output) are consumed
by the slab cache.

So what we need now is a dump of /proc/slabinfo during the "problem
time" to see where it's all being allocated.
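
The lowmem arithmetic above can be checked with a few lines of Python. This is only a rough sketch: it assumes 4 KiB pages and exactly 16 GiB of RAM (the "16G system" mentioned in Comment 4), counts only the mem_map array (not the rest of the kernel's text and data), and the name lowmem_budget is hypothetical.

```python
PAGE_SIZE = 4096            # assumption: 4 KiB pages on i686
MIB = 1024 * 1024
GIB = 1024 * MIB


def lowmem_budget(ram_bytes=16 * GIB, lowmem_bytes=896 * MIB,
                  mem_map_entry_bytes=60):
    """Return (lowmem_pages, mem_map_pages, remaining_pages)."""
    lowmem_pages = lowmem_bytes // PAGE_SIZE
    ram_pages = ram_bytes // PAGE_SIZE
    # One mem_map entry per physical page; round up to whole pages.
    mem_map_pages = -(-(ram_pages * mem_map_entry_bytes) // PAGE_SIZE)
    return lowmem_pages, mem_map_pages, lowmem_pages - mem_map_pages


lowmem, mem_map, remaining = lowmem_budget()
slab_pages = 149_000        # slab usage quoted from the sysrq-m outputs
print(lowmem, mem_map, remaining, remaining - slab_pages)
```

Under these assumptions the mem_map array alone accounts for about 61,000 lowmem pages, which is in line with the ~65,000 quoted once kernel text and data are added, and a slab cache of ~149,000 pages leaves well under 20,000 pages for everything else in lowmem.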

Comment 24 Jim Richard 2004-04-06 01:05:49 UTC

Created attachment 99130 [details]
OOM with AltSysRq-M and Slabinfo dump

Dave,

Ok. Here's one for sendmail, with AltSysRq-m and slabinfo from 3 seconds prior
to OOM event. 

Let me know where we go from here.

Thanks

Jim
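
A capture loop of the kind used for this attachment (the sysrq "m" trigger from Comment 19 plus a copy of /proc/slabinfo) might look like the sketch below. It is an illustration in modern Python, not the script Jim actually ran: the function names, the output directory and the 5-second interval are all assumptions. Dave's caveat from Comment 20 still applies, since under memory starvation the capture process itself may stall.

```python
import time
from pathlib import Path


def snapshot(trigger=Path("/proc/sysrq-trigger"),
             slabinfo=Path("/proc/slabinfo"),
             out_dir=Path("/var/tmp/oom-snapshots")):
    """Request a Show Memory dump and save a timestamped copy of slabinfo."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Same effect as: echo m > /proc/sysrq-trigger
    # The dump itself goes to the kernel log, not to this process.
    trigger.write_text("m")
    dest = out_dir / f"slabinfo-{stamp}"
    dest.write_text(slabinfo.read_text())
    return dest


def run(interval_seconds=5, **kwargs):
    """Take snapshots forever; needs root to write the sysrq trigger."""
    while True:
        snapshot(**kwargs)
        time.sleep(interval_seconds)
```

Calling run() as root would take a snapshot every 5 seconds; a shorter interval improves the odds of catching the state just before an OOM kill, at the cost of a noisier messages log.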

Comment 25 Jim Richard 2004-04-06 01:25:45 UTC

Ernie,

Thanks, I've located it and will download it tonight. I should be able
to get it installed tomorrow. If the qla2xxx.conf module has been
included, I should be ready to test tomorrow afternoon. If not I
should have it ready tomorrow night. 

Thanks again, to all for your responsiveness in this matter, you're
making a happy customer here.

Jim

Comment 26 Dave Anderson 2004-04-06 19:16:59 UTC

As suspected, the problem here is the enormous size of the
pagecache, which can grow extremely large because of the
16GB of RAM in this system.  The 5,000,000+ buffer_head
structures -- which consume 130,000+ pages of lowmem slab
cache memory -- are associated with those pagecache pages
and filesystem metadata, which are all located in the
Highmem zone.

The state of the Highmem zone is fairly healthy, even at
the times when OOM kills occur.  The free page count ranges
from slightly below, to well above, the "low" watermark of
64000 pages.  When it does drop below the low value, they
are being replenished with no problem.  The combined number of
inactive_laundry and inactive_clean pages is staying
equal to the number of inactive_dirty pages, so the
"inactive_clean_percent" setting of 100 is doing its job.
So, the page reclamation process sees no need to be any more
aggressive in flushing Highmem pagecache pages to disk.

Have you considered the use of the hugemem kernel?  It
exists for situations like this to avoid lowmem exhaustion.
In the standard kernels, the 4GB virtual address space is
split between user and kernel virtual address spaces,
with the lower 3GB given to user space, and the upper
1GB used for kernel virtual address space.  Of this 1GB of
kernel virtual address space, 896MB is unity-mapped, and that
memory is used by lowmem (DMA/Normal) zones.  The hugemem
kernel splits the address space into two, with 4GB being
given to both user and kernel virtual address spaces.
This will increase the kernel's lowmem zone to ~4GB, and
therefore alleviate the type of lowmem exhaustion that
you are seeing.  You do pay for this split, however, because
a TLB flush will be done on every entry into the kernel.
So the hugemem kernel is only to be used for cases where
the extra lowmem requirement offsets the extra kernel
overhead.

As far as lowering the /proc/sys/vm/pagecache max value
from 100 down to a lower value, although the sysrq-m output
shows that the Highmem zone's active pages typically are
between 70-80% pagecache pages, the sysrq-m output shows
literally no swapping of anonymous memory going on at all.
So the inactivation of pages is already selecting only pagecache
pages as it is, so setting pagecache max won't accomplish anything.

There's little else to be done with the kernel as it is.
There are potential tests that could be done with
instrumented kernels, but that would require your being
willing to try some test kernels, and no guarantee that
the problem can be easily overcome.

Comment 27 Jim Richard 2004-04-06 21:07:44 UTC

Dave,

Thanks for the analysis. After seeing the numbers from last night's
test I've been coming to the same conclusions myself.

I believe I'll give the HugeMem kernel a shot.

I do have a couple of concerns here. The first is how much
performance degradation I should expect, and the second is the
JVM incompatibilities noted in the release notes. Are there any user
experience documents on either of these items
available for review?

Tomorrow we'll re-run the offending workload against HugeMem. Assuming
that goes well, we'll run some benchmarks to identify the performance
hit, then re-test our Java-based applications using SetArch to get
around the 3 G address space limitations of the JVM(s).

Thanks again for all your help with this. I'll update this report as
soon as I have any additional findings. 


Jim

Comment 30 Dave Anderson 2004-11-17 18:35:49 UTC

A patch that fixes try_to_reclaim_buffers() in the manner
suggested in Comment #29 has been queued for inclusion in
RHEL3-U5.  (i.e., it changes the 10% test to use
"nr_used_buffer_heads" instead of "nr_unused_buffer_heads")

Comment 31 Jim Richard 2004-11-17 19:22:57 UTC

I'm not sure what happened but comments 28 and 29 are missing here. 

I filed a comment some time ago indicating that switching to the
hugemem kernel, along with elimination of the script that was polluting
memory, worked around the problem.

But I'm glad to see that a fix is in the works.

Comment 32 Dave Anderson 2004-11-17 20:17:06 UTC

Jim -- sorry, #28 and #29 were Red Hat private comments.

Here's the part of try_to_reclaim_buffers() that is the problem:

   /*
    * Since removing buffer heads can be bad for performance, we
    * don't bother reclaiming any if the buffer heads take up less
    * than 10% of pageable low memory.
    */
   if (nr_unused_buffer_heads * sizeof(struct buffer_head) * 10 <
                                   freeable_lowmem() * PAGE_SIZE)
                   return 0;

So in your case, even though the buffer_head slab cache was using
over 50% of freeable low memory, the test above would forestall the
function from trying to reclaim them.  The nr_unused_buffer_heads
counter is not allowed to exceed 1600 buffer_heads on an x86 (~42
pages), so this function would always just return 0.

The fix changes it to test "nr_used_buffer_heads", which in your
test case would reflect the ~5,000,000 in-use buffer_heads.
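
To see why the original test could never fire, the comparison can be replayed with the figures from this thread. This is a Python model of the arithmetic only, not kernel code. The per-object size is an estimate derived from the numbers in Comment 26 (130,000 pages for 5,000,000 buffer_heads) rather than the real sizeof(struct buffer_head), the 160,000-page figure from Comment 22 stands in for freeable_lowmem(), and skips_reclaim is a hypothetical name.

```python
PAGE_SIZE = 4096                      # assumption: 4 KiB pages on x86

# Assumption: per-object size implied by the thread's own figures
# (130,000 pages for 5,000,000 buffer_heads), not the real sizeof().
BUFFER_HEAD_BYTES = 130_000 * PAGE_SIZE // 5_000_000

# Assumption: stand-in for freeable_lowmem(), using the rough
# 160,000-page estimate from Comment 22.
FREEABLE_LOWMEM_PAGES = 160_000


def skips_reclaim(nr_buffer_heads):
    """Mirror the quoted 10% test; True means 'return 0' (no reclaim)."""
    return (nr_buffer_heads * BUFFER_HEAD_BYTES * 10
            < FREEABLE_LOWMEM_PAGES * PAGE_SIZE)


# Original test: nr_unused_buffer_heads at its cap of 1600.
print(skips_reclaim(1_600))
# Fixed test: nr_used_buffer_heads as seen in this case.
print(skips_reclaim(5_000_000))
```

With the unused counter capped at 1600, the left-hand side stays orders of magnitude below the threshold, so reclaim is always skipped. With the in-use count the left-hand side exceeds the threshold and reclaim proceeds.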

Comment 33 Ernie Petrides 2004-11-17 20:38:16 UTC

The fix for this problem was committed to the RHEL3 U5 patch pool
Monday evening (in kernel version 2.4.21-25.1.EL).

Comment 34 Jim Richard 2004-11-24 01:56:29 UTC

That's wonderful, thanks again for all the help with this! I'll look
forward to U5.

Comment 35 Tim Powers 2005-05-18 13:27:20 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html