Bug 71699

Summary: malloc() fails: OOM crashes SMP kernel when swap enabled
Product: [Retired] Red Hat Linux
Reporter: josip
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 7.3
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:39:50 UTC

Attachments: Test program: uses all successfully malloc'd memory

Description josip 2002-08-16 21:00:45 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.79 [en] (X11; U; Linux 2.4.18-5 i686)

Description of problem:
Our SMP machines completely lock up with high probability when a program tries
to allocate more memory than the system is willing to give AND swap is enabled.

The first problem is that malloc() fails to set errno properly when no more
memory is available.  As warned by the malloc() man page, instead of catching
the error, the offending process is killed by the kernel's "Out of memory"
killer:
Aug 16 15:35:13 server kernel: Out of Memory: Killed process 4777 (ffm).

The second, much more troubling problem is that, with high probability, our SMP
machines with swap enabled can crash (completely lock up, not respond to magic
SysRq, leave nothing in the logs, etc.) when a process runs into this
condition.  Only a hardware reset can restore the machine to life.  If the test
is done after "swapoff -a", the kernel generally kills the process before the
machine crashes, but with swap enabled crashes are very probable.  Setting the
MALLOC_CHECK_ environment variable does NOT help catch malloc() problems of
this type: the OOM killer still kills the process (w/swap off), or the SMP
kernel crashes with high probability (w/swap on).

FYI, this problem crashed all 16 of our dual PIII/500 machines w/440BX chipset
and 512MB RAM, as well as 12 out of 16 dual PIII/800 machines w/ServerWorks LE
chipset and 1024MB RAM.  In all cases, the total available swap space (a pair of
partitions on IDE drives) was larger than twice the physical RAM.

My recommendation: FIX malloc() to conform to the Unix98 standard.  The
optimistic memory allocation strategy is a disaster: the OOM killer cannot be a
substitute for properly returning error codes from malloc().


Version-Release number of selected component (if applicable):
Red Hat 7.3
Linux kernel 2.4.18-5smp
glibc-2.2.5-37


How reproducible:
Always

Steps to Reproduce:
1. Boot an SMP i686 machine with swap disabled (kernel 2.4.18-5smp,
glibc-2.2.5-37)
2. Run a program that tries to allocate and touch as much memory as it can
3. Do 'swapon -a', run again... and probably crash the system

Actual Results:  With swap off, malloc() fails to set errno and keeps returning
non-NULL pointers until the process is killed by the kernel's OOM killer.  With
swap enabled, the OOM killer apparently does not help, and our SMP machines
crash with high probability (75-100%) when the OOM state is reached.

Expected Results:  The process should have been given a proper errno value per
the Unix98 standard so that it could terminate itself and avoid the risk of
crashing the system.

More importantly, no machine should crash when a simple program requests too
much memory.

Additional info:

Comment 1 josip 2002-08-16 21:14:15 UTC
Created attachment 71225 [details]
Test program: uses all successfully malloc'd memory

Comment 2 josip 2002-08-16 22:05:49 UTC
Some of our users' programs do not use malloc() but can still trigger crashes
on our SMP machines with the same symptoms.  While fixing malloc() would
prevent some crashes, other crashes may still originate in the interaction of
swapping and the OOM killer under "Out of Memory" conditions on SMP machines.

Comment 3 Jakub Jelinek 2002-08-19 07:49:15 UTC
This has nothing to do with malloc nor glibc; it is about overcommitting memory
and kernel OOM handling. Reassigning.

Comment 4 Arjan van de Ven 2002-08-19 08:37:53 UTC
The kernel has a run-time configurable overcommit strategy:

sysctl vm.overcommit_memory=2

will set a semi-strict overcommit strategy, which seems to work pretty well in
practice.
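For completeness, a generic way to make this setting survive a reboot (standard sysctl usage, not part of the original exchange):

```
# /etc/sysctl.conf fragment, applied at boot via "sysctl -p"
vm.overcommit_memory = 2
```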

Comment 5 josip 2002-08-20 17:13:18 UTC
/proc/sys/vm/overcommit_memory=2 does not help: our SMP machines with swap
enabled still crash.

Supposedly, the default value overcommit_memory=0 should result in strict
malloc() checking, but this clearly is not working right.  I have yet to see
malloc() actually return ENOMEM on any Linux host with kernel 2.4.18-{5,5smp}. 
Failure to properly detect the memory situation is a normal-priority bug; but
crashes of SMP hosts with swap enabled when OOM is encountered are a
high-priority item.

Note that OOM-related crashes occur on SMP hosts with swap enabled; if we
disable swap on SMP hosts OR run the same test program on uniprocessor hosts
with swap enabled, the OOM killer usually kills the test program before the
machine crashes.

We use default values in /proc/sys/vm:

bdflush:30	500	0	0	500	3000	60	20	0
kswapd:512	32	8
max-readahead:127
max_map_count:65536
min-readahead:3
overcommit_memory:0
page-cluster:3
pagetable_cache:25	50

My current thinking is that frequent SMP+swapon+OOM crashes are due to a kernel
bug which should be fixed at high priority.  The failure by malloc() to detect
OOM condition in conformance with Unix standards is a normal priority item that
should also be fixed.  FYI, malloc() operates properly on Suns running Solaris.

Reliance on the OOM killer instead of reliable OOM detection at the point of
memory allocation is very troubling.  This is an intrinsically unreliable
design.


Comment 6 Alan Cox 2002-08-20 17:26:50 UTC
0 is not strict overcommit.

At the moment I am unable to duplicate your problem report. On all my test sets
the kernel correctly refuses to go out of memory.

With 

 echo "2" >/proc/sys/vm/overcommit_memory

I see

malloc() returned error: Cannot allocate memory
Allocated 307 MB

  echo "3" >/proc/sys/vm/overcommit_memory

I see

malloc() returned error: Cannot allocate memory
Allocated 245 Mb


I have also so far been unable to duplicate a hang on SMP boxes. With the
default policy I do see out of memory kills as I would expect.


As regards policy, my personal view agrees with yours: the default policy
should be "2". That's something that may change in future Linux releases.

How much memory does your test system have?


Comment 7 josip 2002-08-20 19:04:25 UTC
Curious.  I did more testing using test code "m", and found that malloc()
returns correct error codes on our Pentium IV 1.7GHz machines with 1 GB PC800
RDRAM and two 1 GB swap partitions on /dev/hda{2,3}.

However, on our uniprocessor Pentium II 400 MHz machines with 384 MB PC100 SDRAM
and two 384 MB swap partitions on /dev/hda{2,3} and the default VM settings, the
test program always gets killed by the kernel's OOM killer, never by detecting
any malloc() errors:

  [root@n027 root]# /usr/local/sbin/m
  Terminated

After "m" is killed, /var/log/messages shows:

  Aug 20 14:18:19 n027 kernel: Out of Memory: Killed process 9488 (m).

All our uniprocessor machines run the same kernel and glibc.

To make the story more interesting, our SMP servers with SCSI disks can
generally run "m" without crashing (OOM killer acts in time), while our SMP
compute nodes with IDE disks frequently crash when OOM state is reached.  No
error codes are returned by malloc() on any SMP machine I've tried, and the OOM
killer is activated again:

  [root@fs1 root]# /usr/local/sbin/m
  Terminated

and the /var/log/messages shows:

  Aug 20 13:59:51 fs1 kernel: Out of Memory: Killed process 19927 (m).

My suspicion is that this problem may be timing related, because:

400 MHz single CPU nodes (Intel 440BX chipset) fail to return malloc() errors. 
They do not crash since OOM killer terminates the test program, but fast swap
drives on ATA-133 interfaces do not help fix the incorrect malloc() behavior.

500 MHz dual CPU nodes (Intel 440BX) fail to return malloc() errors and always
crash if swapping is enabled on /dev/hda{2,3} using UDMA-33 interface.  Crash is
usually avoided if swapping is disabled on the UDMA-33 drive.  No crashes are
seen with swapping enabled only on /dev/sd{a,b,c}2 or on /dev/hde3 using ATA-133
interface.  In all tests, malloc() never returns error codes.

800 MHz dual CPU nodes (ServerWorks LE) fail to return malloc() errors and crash
with about 75% probability if swapping is enabled on /dev/hda{2,3} with UDMA-33
interface.  Interestingly enough, 4 out of 16 machines survived "m".

1.7 GHz single CPU nodes (Intel 850) return malloc() codes correctly.

In other words: malloc() misbehaves if the CPU speed is under 1 GHz or so, and
crashes result from OOM plus swapping on SMP nodes if the disk interface is
UDMA-33.

All machines run Red Hat kernel 2.4.18-{5,5smp} and have identical binaries
installed (except the servers, which have extra capabilities).  We've got 68
machines of the above types, and the problem is not limited to a particular
machine or two, so I do not suspect a hardware defect.  Most likely, this is a
genuine Linux kernel problem.



Comment 8 josip 2002-08-21 16:36:58 UTC
Additional experiments with overcommit management facility on our dual Pentium
III 500 MHz nodes with 440BX chipset, 512 MB PC100 SDRAM, and two swap
partitions (512MB each) on UDMA/33 drive:

(1) When swap is ON, neither overcommit_memory=2 nor overcommit_memory=3 fixes
the problem.  The test program can get almost all physical RAM, but then,
instead of continuing to allocate pages from swap space, the machine completely
locks up.  No error indications are returned by malloc(), but unlike the
default value overcommit_memory=0, where the machine simply dies, values 2 or 3
give some hints about where things went wrong.  The system console reports a
traceback starting from page_launder_zone.  Apparently handle_mm_fault is
invoked while executing the test program, the system tries to allocate pages
via try_to_free_pages, gets into page_launder, and fails.  Another run (with
fewer kernel modules loaded) produced slightly different results, with kswapd
failing and tracebacks complaining about "EIP page_over_rsslimit", then listing
refill_inactive_zone in kswapd, and eventually complaining about "Unable to
handle kernel NULL pointer dereference".  In both cases, it appears that the
trouble involves getting a page into swap.

Suggestion: Look for timing problems within/near page_launder or refill_inactive
on SMP machines where the CPU and the disk are not very fast.


(2) When swap is OFF, overcommit_memory=3 reduces the total address space commit
to zero, which crashes the system:
    [root@n033 vm]# swapoff -a
    [root@n033 vm]# echo "3" >/proc/sys/vm/overcommit_memory 
    [root@n033 vm]# sync
    bash: fork: Cannot allocate memory
    bash: xmalloc: subst.c:258: cannot allocate 5 bytes (0 bytes allocated)
    rlogin: connection closed.

Suggestion: Change behavior of overcommit_memory=3 so that when swap is off, the
active constraint becomes physical RAM, thus avoiding system lockups.


(3) When swap is OFF, overcommit_memory=2 reduces the total address space commit
to about half physical RAM, but malloc() returns correct error codes.

Suggestion: Swap is not relevant when off.  Use the physical RAM limit in
vm_enough_memory() when swap is off.



Comment 9 Alan Cox 2002-08-21 16:46:39 UTC
If the machine is hanging while swapping, then it's not the VM overcommit
handling that's the problem. Something else is messed up, and the VM overcommit
is just a red herring.

vm overcommit 3 cannot be changed as you suggest. What other modules are you
using (an lsmod would be useful here)? I've seen those kinds of hangs with
people using openafs, for example.


Comment 10 josip 2002-08-21 21:21:02 UTC
Agreed, the source of crashes is probably swapping, so we've turned swap off for
now.  If overcommit_memory=3 handling cannot be changed, perhaps it should be
disabled when swap is off.  Finally, user programs should get better feedback
from the kernel on memory status (how much can I get, can I use what I got, can
I catch OOM exceptions if I can't, etc.).

Regarding loaded modules, we run two different configurations, and the one
without clan1k and lanevi is more reliable.  These modules operate our Giganet
cLAN network, which remains unused during the test but may perform path
discovery or something similar on its own.  Removing these modules and using
the "paranoid" overcommit_memory=3 (with swap enabled) helps somewhat
(typically, the dual P3/800 nodes stay up, but 14 to 15 out of 16 dual P3/500
nodes crash).  Since we start the test program over the network, it may be that
even NIC activity during swapping is a problem.

My current hypothesis: Crashes happen and swapping fails when there is competing
device activity, e.g. network.



Modules typically present on our dual P3/500 nodes:
[root@n033 root]# lsmod
Module                  Size  Used by    Tainted: P  
w83781d                18592   0  (unused)
i2c-proc                8224   0  [w83781d]
i2c-isa                 1892   0  (unused)
i2c-piix4               4996   0  (unused)
i2c-core               19360   0  [w83781d i2c-proc i2c-isa i2c-piix4]
autofs                 12612   0  (autoclean) (unused)
nfs                    91196   5  (autoclean)
lockd                  58880   1  (autoclean) [nfs]
sunrpc                 84180   1  (autoclean) [nfs lockd]
tulip                  43840   1 
lanevi                 23036   1 
clan1k                 26736   3  [lanevi]
usb-uhci               25604   0  (unused)
usbcore                75904   1  [usb-uhci]
ext3                   70944   2 
jbd                    53728   2  [ext3]



Comment 11 josip 2002-09-12 14:24:30 UTC
This problem is still present in Red Hat kernel 2.4.18-10smp.

Comment 12 Bugzilla owner 2004-09-30 15:39:50 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/