Bug 565995

Summary: RHEL 5 kernel kills process with Out-of-memory condition when there is 170MB of cached pages
Product: Red Hat Enterprise Linux 5
Reporter: Mikuláš Patočka <mpatocka>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: medium
Version: 5.5
CC: lwoodman, zkabelac
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-02-17 19:35:13 UTC
Attachments:
Description                      Flags
oom killer report                none
top output during the crash      none
another out-of-memory crash      none
/proc/meminfo output             none
RHEL 5.3 out-of-memory crash     none

Description Mikuláš Patočka 2010-02-16 21:11:02 UTC
Hi

To reproduce the bug, create a Xen virtual machine with 512MB RAM and try to install RHEL5.5-beta into it (I also tried 5.3, and the bug is present there too). Don't create swap during installation.

During installation, the kernel kills Anaconda with an out-of-memory condition, although Anaconda uses only 176MB of memory, the total memory used by all processes is 277MB, and there are 177MB of cached pages.

The kernel shouldn't kill the process when there are so many cached pages; it should try to free the cache instead.

Creating swap or allocating 1GB of memory for the virtual machine avoids the problem, but the memory traces suggest that something is broken in the RHEL kernel OOM killer and that it could kill tasks spuriously not only during installation but also during normal operation.

I'm attaching several screenshots taken when the OOM kill happened to show that the kernel kills processes while there is plenty of cached memory.

Comment 1 Mikuláš Patočka 2010-02-16 21:14:38 UTC
Created attachment 394641 [details]
oom killer report

OOM killer report; notice the "43589 pagecache pages" line.

Comment 2 Mikuláš Patočka 2010-02-16 21:16:50 UTC
Created attachment 394642 [details]
top output during the crash

"top" command run during the crash (the kernel dump for this crash is in the previous screenshot)

Comment 3 Mikuláš Patočka 2010-02-16 21:17:43 UTC
Created attachment 394643 [details]
another out-of-memory crash

Another installation attempt.

Comment 4 Mikuláš Patočka 2010-02-16 21:22:18 UTC
Created attachment 394645 [details]
/proc/meminfo output

Output of /proc/meminfo during the out-of-memory crash from the previous screenshot. I ran the command:
while true; do echo `cat /proc/meminfo`; sleep 1; done
on the available console to capture memory state in 1-second intervals.

Notice the "Cached: 177892kB" entry; it corresponds to the "43589 pagecache pages" in the previous screenshot.

The kernel definitely must not kill processes when there is so much cached data.
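
The correspondence between the two figures above can be checked with a little arithmetic. The sketch below assumes 4 kB pages (an assumption; `getconf PAGESIZE` reports the actual value on a given system), and `pages_to_kb` is a hypothetical helper name used only for illustration.

```shell
# Convert a page count to kB, assuming 4 kB pages (assumption, see above).
pages_to_kb() {
    echo $(( $1 * 4 ))
}

pages_to_kb 43589   # pagecache pages from the OOM killer report in comment 1
pages_to_kb 37705   # cached pages from the RHEL 5.3 report in comment 5
```

Under that assumption, 43589 pages come to 174356 kB, which is close to, though not identical to, the 177892 kB reported as Cached in /proc/meminfo.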

Comment 5 Mikuláš Patočka 2010-02-16 21:24:23 UTC
Created attachment 394646 [details]
RHEL 5.3 out-of-memory crash

The bug exists even in RHEL 5.3. There are 37705 cached pages and the OOM killer triggers.

Comment 6 Zdenek Kabelac 2010-02-17 09:50:23 UTC
I've opened Fedora bug 553193 for the locale issue, but without any progress so far...

It would also be nice to see the memory layout of the processes just before the OOM killer starts to act.

Maybe something like 'while : ; do ps aux >> /tmp/log ; sleep 1; done' could be added to anaconda during installation?

Or trigger a 'sysrq  memory' dump at the right moment, if the timing can be determined.
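
The logging idea above might be sketched as follows. This is only an illustration: `snapshot` and the log path are hypothetical, and writing to /tmp uses RAM itself if /tmp sits on a RAM-backed filesystem. Writing `m` to /proc/sysrq-trigger as root is the usual way to request the SysRq memory report, provided the kernel has SysRq support.

```shell
# Append one timestamped snapshot of memory state and the process list to a log file.
# $1: path of the log file (hypothetical; prefer a location that is not RAM-backed)
snapshot() {
    {
        date
        cat /proc/meminfo
        ps aux
        echo "----"
    } >> "$1"
}

# Take one snapshot per second until interrupted:
#   while : ; do snapshot /tmp/log; sleep 1; done
#
# Request the kernel's memory report (SysRq "m") as root; it goes to the kernel log:
#   echo m > /proc/sysrq-trigger
```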

Comment 7 Mikuláš Patočka 2010-02-17 19:35:13 UTC
Actually, after further analysis, I realized that this is caused by the root tmpfs filesystem: its pages are in the pagecache and are not discardable. So it is not kernel misbehaviour, and I am closing this as NOTABUG.
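
The reasoning in this last comment can be made concrete with a rough estimate. tmpfs pages are counted in the "Cached" figure of /proc/meminfo, but they can only be moved to swap, not dropped, so with no swap configured they stay in RAM. That fits the earlier observation that creating swap avoids the problem. The sketch below is an approximation under those assumptions; `discardable_cache_kb` is a hypothetical helper, and the tmpfs usage figure in the example is invented for illustration, not taken from the attachments.

```shell
# Rough upper bound on the page cache the kernel could discard without swap, in kB.
# $1: "Cached" value from /proc/meminfo, in kB
#     (e.g. awk '/^Cached:/ {print $2}' /proc/meminfo)
# $2: space used on tmpfs mounts, in kB (e.g. the "Used" column of `df -k`)
discardable_cache_kb() {
    echo $(( $1 - $2 ))
}

# 177892 is the Cached value from comment 4; 150000 is a made-up tmpfs
# usage figure, used here only to show the calculation.
discardable_cache_kb 177892 150000
```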