Bug 175173 - oom kill kicks in but shouldn't
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: x86_64 Linux
Priority: medium  Severity: high
Assigned To: Dave Jones
QA Contact: Brian Brock
NeedsRetesting
Depends On:
Blocks:
Reported: 2005-12-07 04:33 EST by jan p. springer
Modified: 2015-01-04 17:23 EST (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-02-13 11:43:24 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
/var/log/messages (84.82 KB, text/plain)
2005-12-07 04:46 EST, jan p. springer
This is the message log output. (103.77 KB, text/plain)
2006-02-09 22:46 EST, Pete Stieber

Description jan p. springer 2005-12-07 04:33:01 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20051201 Fedora/1.5-1.1.fc4.nr Firefox/1.5

Description of problem:
oom-kill kicks in but overcommit mode is 2 (/etc/sysctl.conf):

vm.overcommit_memory = 2
vm.overcommit_ratio = 80

(also verified via /proc/sys/vm/overcommit_*)
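The settings the reporter relies on can be double-checked from /proc. A minimal read-only sketch (Linux-specific; mode 2 means "never overcommit", with CommitLimit capped at swap plus overcommit_ratio percent of RAM — the `sysctl -w` lines showing how to apply it are left commented out since they need root):

```shell
# Read the current overcommit policy from /proc (read-only).
MODE=$(cat /proc/sys/vm/overcommit_memory)
RATIO=$(cat /proc/sys/vm/overcommit_ratio)
echo "overcommit_memory=$MODE overcommit_ratio=$RATIO"

# To enable strict accounting as described in the report (run as root):
#   sysctl -w vm.overcommit_memory=2
#   sysctl -w vm.overcommit_ratio=80

# Under mode 2, CommitLimit and Committed_AS in /proc/meminfo show the cap:
grep -E '^Commit' /proc/meminfo
```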

this happens when starting memory-intensive tasks like copying filesystem trees from one disk to another.

Version-Release number of selected component (if applicable):
2.6.14-1.1637_FC4smp

How reproducible:
Always

Steps to Reproduce:
1. start memory intensive task
2.
3.
  

Actual Results:  random processes/tasks/terminals get killed

Expected Results:  task finishes

Additional info:

my impression is that overcommit mode 2 doesn't work. i need that behaviour, since it's a server-like machine where i cannot have daemons killed at random
Comment 1 jan p. springer 2005-12-07 04:46:14 EST
Created attachment 121963 [details]
/var/log/messages

complete boot-oom-reboot cycle
Comment 2 Dave Jones 2005-12-07 14:46:06 EST
can you try the latest errata kernel ? There was a nasty memory leak there that
recently got fixed.
Comment 3 jan p. springer 2005-12-07 15:17:15 EST
which version would that be and from which repository can i receive it?
Comment 4 Dave Jones 2005-12-07 23:19:14 EST
The regular updates repository which should be enabled by default.

yum --enablerepo=updates-released update

Comment 5 jan p. springer 2005-12-08 14:07:01 EST
kernel is now at 2.6.14-1.1644_FC4smp
running "iozone -a -C -g3G" leads to the same behavior.

is there a known kernel version i could revert to?
Comment 6 Dave Jones 2005-12-10 01:59:57 EST
There's a possible fix in the kernel I'm building right now. In a few hours
2.6.14-1.1649 will appear at http://people.redhat.com/davej/kernels/Fedora/FC4

Let me know if it improves things any.
Comment 7 jan p. springer 2005-12-10 18:04:37 EST
only found 2.6.14-1.1650 at the above mentioned url.
unfortunately the problem still exists, but it seems to get triggered later than
before, i.e. i can run iozone for a longer time before processes are killed at
random.
Comment 8 Dave Jones 2005-12-12 17:09:12 EST
<re-reads oom report>
Ugh, you have 5 gig on a 32 bit machine ?

The problem is that every 4KB of memory needs a ~40 byte structure associated
with it, that has to live in low-memory. So your lowmem is consumed by *lots* of
these before you even run anything.

The OOM kill you're seeing is exactly because it can't get a page of memory from
low-mem (it's a specific allocation, perhaps for a driver that needs memory to
DMA to/from).

Given there's so little of it to start with, it's no surprise that you run out
quickly, and because it's explicitly trying to allocate DMA memory, it can't fall
back to a different zone where there's plenty of free memory.
Swapping is irrelevant here, as none of ZONE_DMA is swapped out, it's just
permanently pinned with pointers to the rest of your ram.
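Dave's arithmetic can be sketched as a back-of-envelope calculation (figures approximate; the size of struct page varies by kernel version and config, and comment 9 notes this particular machine is actually x86-64, where low memory is not the constraint):

```shell
# Roughly 40 bytes of struct page metadata per 4KB page of RAM,
# all pinned in low memory on a 32-bit highmem kernel.
RAM_BYTES=$((5 * 1024 * 1024 * 1024))      # 5 GB of RAM
PAGES=$((RAM_BYTES / 4096))                # number of 4KB pages
OVERHEAD_MB=$((PAGES * 40 / 1024 / 1024))  # ~40 bytes per page
echo "$PAGES pages -> ~$OVERHEAD_MB MB of pinned low memory"
```

So about 50 MB of the (roughly 896 MB) low-memory region is consumed before anything runs.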
Comment 9 Dave Jones 2005-12-12 17:17:42 EST
ignore that last comment, I overlooked that this was x86-64.
Larry, any ideas ?
Comment 10 Dave Jones 2005-12-12 17:46:07 EST
Hmm, this may get better when we move to a 2.6.15 based kernel, as that has the
GFP_DMA32 zone, which should take some of the pressure off of the smaller 16MB
dma zone.
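The zone layout Dave mentions can be inspected on a running kernel. A minimal check, assuming /proc/buddyinfo is available (on a 2.6.15+ x86_64 kernel this should list a DMA32 zone alongside DMA and Normal):

```shell
# Column 4 of /proc/buddyinfo is the zone name on each node.
ZONES=$(awk '{print $4}' /proc/buddyinfo)
echo "$ZONES"
```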
Comment 11 Dave Jones 2006-02-03 00:21:33 EST
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.
Comment 12 jan p. springer 2006-02-07 09:38:38 EST
updated to 2.6.15-1.1830_FC4smp but still seeing oom kills
Comment 13 Pete Stieber 2006-02-09 22:46:23 EST
Created attachment 124473 [details]
This is the message log output.
Comment 14 Pete Stieber 2006-02-09 22:50:02 EST
Sorry for the lack of a message in my last post. I had the same problem with 
2.6.15-1.1831_FC4smp with an x86_64.
Comment 15 Pete Stieber 2006-02-16 19:21:43 EST
I take back everything I said in this thread. I now believe my problem was
caused by having BASH_ENV set in .bash_profile and running a Bash shell script.
This caused an infinite loop and many unicode_start processes, which in turn
triggered the OOM killer.

See bugs 172059, 181809, and 179949 for details.

Sorry for the noise.
Pete
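The recursion Pete describes can be demonstrated safely with a depth guard (hypothetical demo, not his actual .bash_profile): every non-interactive bash sources the file named in BASH_ENV, so if that file itself ends up launching another bash, each new shell sources it again, forking without bound until memory runs out:

```shell
# Safe demonstration: the sourced file spawns another bash, but a depth
# counter stops the recursion at 3 instead of exhausting memory.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
DEPTH=$((DEPTH + 1))
echo "shell depth $DEPTH"
[ "$DEPTH" -lt 3 ] && bash -c true
EOF
OUT=$(DEPTH=0 BASH_ENV="$tmp" bash -c true)
printf '%s\n' "$OUT"
rm -f "$tmp"
```

Without the guard, the echo line would repeat until the OOM killer stepped in.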
Comment 16 Steve Schaeffer 2006-02-20 10:57:01 EST
I've been seeing the same problem since acquiring a dual-Xeon Dell Precision 
470 w/s with 4GB RAM (Yes, it IS sweet... thanks for asking :-) I'm currently 
running the 2.6.15-1.1831_FC4smp kernel, but I've seen the problem with every 
kernel (don't remember the earliest release number, but it was 2.6.14-
something). It seems to show up only when running memory-intensive tasks like 
backups (tar) or defragging a VMware VM.

I did NOT see the problem with the 2.6.14-something kernels on my old 700 MHz 
P3 with 512MB RAM.

More interestingly, I have several coworkers with the exact same h/w config 
but they run Gentoo and have not seen the problem even once with any kernel.
Comment 17 jan p. springer 2006-03-12 04:09:39 EST
checked against 2.6.15-1.1833_FC4smp and the problem still persists.
Comment 18 Steve Schaeffer 2006-03-13 15:23:05 EST
Here, too.

I'm certainly not up to speed on kernel innards, but I've been wondering if 
this is a timing issue with flushing memory to disk as "top" shows huge chunks 
(>2GB) being freed at times. Perhaps there is too little time between checks 
on free memory to allow the I/O to complete?
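One way to probe the flushing theory above while a large copy or backup runs is to watch the dirty/writeback counters and the writeback thresholds (Linux-specific, read-only; run it repeatedly during the workload):

```shell
# Pages dirtied but not yet written, and pages currently under writeback:
grep -E '^(Dirty|Writeback):' /proc/meminfo
# Percent-of-memory thresholds at which writeback kicks in:
cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio
```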
Comment 19 Steve Schaeffer 2006-04-13 17:25:32 EDT
Have not seen this problem even once since switching to FC5 when it was 
released. I'm currently using the 2.6.16-1.2080_FC5 kernel.
Comment 20 jan p. springer 2006-07-19 14:30:04 EDT
currently running kernel-smp-2.6.17-1.2142_FC4. have not seen the problem as of
kernel-smp-2.6.16-1.2111_FC4. i would consider this bug closed.
Comment 21 Dave Jones 2006-09-16 21:54:14 EDT
[This comment added as part of a mass-update to all open FC4 kernel bugs]

FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel.  As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
to FC5.

Please retest with Fedora Core 5.

Thank you.
Comment 22 Dave Jones 2006-10-16 13:53:27 EDT
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.
Comment 23 Larry Woodman 2006-10-16 14:12:22 EDT
Dave, there are 2 separate problems in this BZ:

--------------------------------------------------------------------------------
1.) the "tiny" x86 DMA zone is exhausted, probably because build_zonelists()
includes the DMA zone.  This causes __alloc_pages() to fall into the DMA zone
when the normal zone falls below its low watermark.  If those allocations are
for non-reclaimable slabcache entries, that memory cannot be reclaimed and we
end up in out_of_memory():

Node 0 DMA free:24kB min:28kB low:32kB high:40kB active:0kB inactive:0kB
present:15976kB

The way I fixed this in RHEL3 and RHEL4 was to change build_zonelists() so that
it does not include the 16MB/4096-page DMA zone.

--- linux-2.6.9/mm/page_alloc.c.orig
+++ linux-2.6.9/mm/page_alloc.c
@@ -1170,6 +1170,9 @@ static int __init build_zonelists_node(p
                zone = pgdat->node_zones + ZONE_NORMAL;
                if (zone->present_pages)
                        zonelist->zones[j++] = zone;
+#if defined(CONFIG_HIGHMEM64G) || defined(CONFIG_X86_64)
+               break;
+#endif
        case ZONE_DMA:
                zone = pgdat->node_zones + ZONE_DMA;
                if (zone->present_pages)
--------------------------------------------------------------------------------

2.) the x86_64 DMA32 zone is exhausted and can't be reclaimed because we ran out
of swap space:

Free swap: 0kB
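The watermark logic behind Larry's first point can be checked mechanically against the zone line quoted above: an allocation in a zone fails (and may trigger the OOM killer) once the zone's free memory drops below its min watermark. A parsing sketch, not an official tool:

```shell
# Extract the free and min watermark values from an OOM-report zone line.
LINE='Node 0 DMA free:24kB min:28kB low:32kB high:40kB active:0kB inactive:0kB'
FREE=$(printf '%s' "$LINE" | sed -n 's/.*free:\([0-9]*\)kB.*/\1/p')
MIN=$(printf '%s' "$LINE" | sed -n 's/.*min:\([0-9]*\)kB.*/\1/p')
if [ "$FREE" -lt "$MIN" ]; then
    echo "zone below min watermark: ${FREE}kB free < ${MIN}kB min"
fi
```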
Comment 24 jan p. springer 2007-01-30 15:32:06 EST
moved machine in question to fc5/x86_64 (kernel:2.6.18-1.2257.fc5); did not see
this problem; i suggest to close this bug with resolution to move to newer kernel.
