Bug 565995 - RHEL 5 kernel kills process with Out-of-memory condition when there is 170MB of cached pages
Summary: RHEL 5 kernel kills process with Out-of-memory condition when there is 170MB ...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
medium
high
Target Milestone: rc
: ---
Assignee: Larry Woodman
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-02-16 21:11 UTC by Mikuláš Patočka
Modified: 2010-02-19 21:06 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-02-17 19:35:13 UTC
Target Upstream Version:
Embargoed:


Attachments
oom killer report (19.38 KB, image/png)
2010-02-16 21:14 UTC, Mikuláš Patočka
top output during the crash (17.75 KB, image/png)
2010-02-16 21:16 UTC, Mikuláš Patočka
another out-of-memory crash (19.31 KB, image/png)
2010-02-16 21:17 UTC, Mikuláš Patočka
/proc/meminfo output (36.22 KB, image/png)
2010-02-16 21:22 UTC, Mikuláš Patočka
RHEL 5.3 out-of-memory crash (19.73 KB, image/png)
2010-02-16 21:24 UTC, Mikuláš Patočka

Description Mikuláš Patočka 2010-02-16 21:11:02 UTC
Hi

To reproduce the bug, create a Xen virtual machine with 512MB RAM and try to install RHEL5.5-beta into it (I also tried 5.3 and the bug is present there too). Don't create swap during installation.

During installation, the kernel kills Anaconda with an out-of-memory condition, although Anaconda uses only 176MB of memory, the total memory used by all processes is 277MB, and there are 177MB of cached pages.

The kernel shouldn't kill the process when there are so many cached pages; it should try to free the cache instead.

Creating swap or allocating 1GB of memory for the virtual machine avoids the problem, but the memory traces suggest that something is broken in the RHEL kernel OOM killer and that it could kill tasks spuriously, not only during installation but also during normal operation.

I'm attaching several screenshots taken when the OOM kill happened to show that the kernel kills processes while there is plenty of cached memory.

Comment 1 Mikuláš Patočka 2010-02-16 21:14:38 UTC
Created attachment 394641 [details]
oom killer report

OOM killer report; notice the "43589 pagecache pages" line.

Comment 2 Mikuláš Patočka 2010-02-16 21:16:50 UTC
Created attachment 394642 [details]
top output during the crash

"top" command run during the crash (the kernel dump for this crash is in the previous screenshot)

Comment 3 Mikuláš Patočka 2010-02-16 21:17:43 UTC
Created attachment 394643 [details]
another out-of-memory crash

Another installation attempt.

Comment 4 Mikuláš Patočka 2010-02-16 21:22:18 UTC
Created attachment 394645 [details]
/proc/meminfo output

Output of /proc/meminfo during the out-of-memory crash from the previous screenshot. I ran the command:
while true; do echo `cat /proc/meminfo`; sleep 1; done
on the available console to capture memory state in 1-second intervals.

Notice the "Cached: 177892kB" entry; it corresponds to "43589 pagecache pages" in the previous screenshot.
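The correspondence between the two figures can be checked with a little arithmetic. The sketch below assumes a 4 kB page size (typical on x86; `getconf PAGESIZE` reports the real value on a given machine), and `pages_to_kb` is a hypothetical helper name, not something from the report.

```shell
# Convert a page count to kB.
# Assumption: 4 kB pages unless a different size is passed as the second argument.
pages_to_kb() {
    pages=$1
    page_kb=${2:-4}
    echo $(( pages * page_kb ))
}

# "43589 pagecache pages" from the OOM killer report
pages_to_kb 43589
```

Under that assumption, 43589 pages come to 174356 kB, which is close to the 177892 kB shown as Cached. The two numbers come from different screenshots, so a small difference is not surprising.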

The kernel definitely must not kill processes when there is so much cached data.

Comment 5 Mikuláš Patočka 2010-02-16 21:24:23 UTC
Created attachment 394646 [details]
RHEL 5.3 out-of-memory crash

The bug exists even in RHEL 5.3. There are 37705 cached pages and the OOM killer triggers.

Comment 6 Zdenek Kabelac 2010-02-17 09:50:23 UTC
I've opened Fedora bug 553193 for the locale issue, but without any progress so far...

Also, it would probably be useful to see the memory layout of the processes just before the OOM killer starts to take action.

Maybe something like 'while : ; do ps aux >> /tmp/log ; sleep 1; done' could be added to anaconda during installation?

Or trigger a 'sysrq memory' dump at the right moment, if the timing could be determined.
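A minimal sketch of the SysRq memory dump suggested above, assuming a shell with root privileges is available on a spare console. Writing `m` to `/proc/sysrq-trigger` asks the kernel to print its memory information to the kernel log. The function name `sysrq_mem_dump` and the optional path argument are hypothetical conveniences, added so the function can be tried against an ordinary file.

```shell
# Request a kernel memory-info dump via magic SysRq 'm'.
# The trigger path is a parameter only for illustration and testing;
# the default is the real kernel interface and normally needs root.
sysrq_mem_dump() {
    trigger=${1:-/proc/sysrq-trigger}
    if [ ! -w "$trigger" ]; then
        echo "cannot write to $trigger (are you root?)" >&2
        return 1
    fi
    printf 'm' > "$trigger"
}

# Example use (not run here): trigger the dump, then read the kernel log.
#   sysrq_mem_dump && dmesg | tail -n 40
```

The dump lands in the kernel log rather than on standard output, so it has to be read back with `dmesg` or from the console.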

Comment 7 Mikuláš Patočka 2010-02-17 19:35:13 UTC
Actually, after further analysis, I realized that this is caused by the root tmpfs filesystem: its pages are in the pagecache and they cannot be discarded. So this is not kernel misbehaviour, and I am closing this as NOTABUG.
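The conclusion above implies that the Cached figure overstates how much memory can be freed when tmpfs is in use. As a rough sketch only: later kernels expose a `Shmem:` line in /proc/meminfo that covers tmpfs and shared-memory pages, and subtracting it from `Cached:` gives a less misleading estimate. The RHEL 5 kernel discussed here may not have that field, in which case the function prints "unknown". The name `reclaimable_cache_kb` is hypothetical.

```shell
# Estimate page cache that is not tmpfs/shared memory: Cached - Shmem, in kB.
# Assumption: the meminfo file has a "Shmem:" line (present on later kernels);
# without it the estimate cannot be made and "unknown" is printed.
reclaimable_cache_kb() {
    meminfo=${1:-/proc/meminfo}
    awk '
        $1 == "Cached:" { cached = $2 }
        $1 == "Shmem:"  { shmem = $2; have_shmem = 1 }
        END {
            if (!have_shmem) { print "unknown"; exit }
            print cached - shmem
        }
    ' "$meminfo"
}

# Example use (not run here):
#   reclaimable_cache_kb
```

Even the result of this subtraction is only an upper bound, since dirty or mapped cache pages are not immediately freeable either. On a system without the `Shmem:` field, the space used on tmpfs mounts as reported by `df` gives a comparable picture.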

