Bug 681586 - Out of vmalloc space
Summary: Out of vmalloc space
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: i386
OS: Unspecified
high
high
Target Milestone: rc
: ---
Assignee: Larry Woodman
QA Contact: Eryu Guan
URL:
Whiteboard:
Depends On:
Blocks: 684940
TreeView+ depends on / blocked
 
Reported: 2011-03-02 16:31 UTC by Joseph Kachuck
Modified: 2018-11-14 14:10 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-21 10:11:15 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
oomalloc-fix.patch (1.38 KB, application/octet-stream)
2011-03-02 16:31 UTC, Joseph Kachuck
no flags Details
simple reproducer (692 bytes, text/plain)
2011-05-16 11:26 UTC, Eryu Guan
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1065 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update 2011-07-21 09:21:37 UTC

Description Joseph Kachuck 2011-03-02 16:31:59 UTC
Created attachment 481902 [details]
oomalloc-fix.patch

SHAILAJA GOLIKERI <golikeri.com> -
---Problem Description---
With RHEL 5 Update 5 - Error "allocation failed: out of vmalloc space" seen as processes are spun
up - if a 32 bit OS is running on a 64 bit hardware.

if 64 bit OS is running on 64 bit hardare, error is not seen but VM usage is seen to increase
continuously.

This behaviour is not seen with RHEL 5 Update 4

Contact Information = Shailaja Golikeri/golikeri.com

---uname output---
(32 bit)#Linux cmtest-cmsvr 2.6.18-194.el5PAE #1 SMP Tue Mar 16 22:00:21 EDT 2010 i686 i686 i386
GNU/Linux and (64 bit) #Linux cmtest-cmsvr 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010
x86_64 x86_64 x86_64 GNU/Linux

Machine Type = model name : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz

---Debugger---
A debugger is not configured

---Steps to Reproduce---
We have reproduced this in system testing with Rational Change Management server. I expect it could
be reproduce by spawning an increasing number of processes - process list should keep growing.

---Not Yet Classified Component Data---
*Additional Instructions for Shailaja Golikeri/golikeri.com:
-Post a private note with access information to the machine that the bug is occuring on.
=Comment: #3=================================================
SHAILAJA GOLIKERI <golikeri.com> -
Below is the questions I am trying to answer (I sent this to Kamalesh via email) - adding them here
for other folks working on this:

Hi Kamalesh,
I submitted a defect in bugzilla against RHEL 5.5 (VM usage increasing
steadily). Apparently, the testers are seeing the same increase on RHEL
5.4.

Our server process is a purely user space code. However, for each server
process started the VM usage increases and when around 15 instances are
active we see "allocation failed: out of vmalloc space" errors. . This
error is seen on a 32 bit RHEL system. On a 64 bit we don't see an error
but the usage is seen to increase.

So I have a couple of questions:

1. What would make a user process consume VMalloc space?
2.Is a 32 bit RHEL system not a server class machine with a limit of number
of simultaneously active processes?


The backend process is a C/C++ process.


Thanks much

Shaila Golikeri



=Comment: #4=================================================
Mel Gorman <MELGOR.com> -

Is the userspace process interacting with any binary driver - for example some sort of network
accelerator or a special filesystem? If so, this is likely the problem. Based on the report, it
feels like someone or something is leaking buffers allocated with vmalloc(). If this is not the
case, there must be a bug with a driver in the distro somewhere.

I'd suggest they write a systemtap script to dump out stack traces of all callers to vmalloc and
vfree to discover the source of the leak.

=Comment: #6=================================================
Mel Gorman <MELGOR.com> -
(In reply to comment #5)
> Can you provide more details, preferably a sample script to run to get this
> information?
>

I don't have a sample script for specifically this type of problem but it's straight-forward. Start
with something like http://sourceware.org/systemtap/examples/locks/bkl.stp and probe vmalloc and
vfree. Record if possible what buffers are being returned by vmalloc and match them with vfree.
Print out the stack traces of buffers that have not been freed yet and count unique stack traces.
The highest unique stack trace for a buffer that has not been freed is your memory leak.
=Comment: #10=================================================
Dang En Ren <rende.com> -
Hi Mel,

I've read some systemtap document, but still have no idea how to capture the allocated size by
kernel vmalloc(), can you show me some examples? (Original stp scripts are attached).

As to # of files "ccrpc" process opened, I queried that while VmallocUsed is increasing, not many,
typically about a dozen, e.g.

lr-x------ 1 user181 users 64 Jul 8 18:35 0 -> pipe:[621018]
l-wx------ 1 user181 users 64 Jul 8 18:35 1 -> pipe:[621019]
lrwx------ 1 user181 users 64 Jul 8 18:35 10 -> socket:[621207]
lrwx------ 1 user181 users 64 Jul 8 18:35 11 -> socket:[621290]
lrwx------ 1 user181 users 64 Jul 8 18:35 12 -> socket:[621291]
lrwx------ 1 user181 users 64 Jul 8 18:35 13 -> socket:[622907]
l-wx------ 1 user181 users 64 Jul 8 18:35 2 -> pipe:[621019]
lrwx------ 1 user181 users 64 Jul 8 18:35 3 -> socket:[621029]
lrwx------ 1 user181 users 64 Jul 8 18:35 4 -> socket:[621031]
lrwx------ 1 user181 users 64 Jul 8 18:35 5 ->
lrwx------ 1 user181 users 64 Jul 8 18:35 6 -> socket:[621049]
lrwx------ 1 user181 users 64 Jul 8 18:35 7 -> socket:[622682]
lrwx------ 1 user181 users 64 Jul 8 18:35 8 -> socket:[621071]
lrwx------ 1 user181 users 64 Jul 8 18:35 9 -> socket:[621081]

Please let me know if you need a live environment to investigate further, thanks!

=Comment: #11=================================================
Mel Gorman <MELGOR.com> -
(In reply to comment #10)
> Hi Mel,
>
> I've read some systemtap document, but still have no idea how to capture the
> allocated size by kernel vmalloc(), can you show me some examples? (Original
> stp scripts are attached).
>

probe kernel.function("vmalloc") {
printf("vmalloc: size = %d\n", $size)
}


> As to # of files "ccrpc" process opened, I queried that while VmallocUsed is
> increasing, not many, typically about a dozen, e.g.
>
> lr-x------ 1 user181 users 64 Jul 8 18:35 0 -> pipe:[621018]
> l-wx------ 1 user181 users 64 Jul 8 18:35 1 -> pipe:[621019]
> lrwx------ 1 user181 users 64 Jul 8 18:35 10 -> socket:[621207]
> lrwx------ 1 user181 users 64 Jul 8 18:35 11 -> socket:[621290]
> lrwx------ 1 user181 users 64 Jul 8 18:35 12 -> socket:[621291]
> lrwx------ 1 user181 users 64 Jul 8 18:35 13 -> socket:[622907]
> l-wx------ 1 user181 users 64 Jul 8 18:35 2 -> pipe:[621019]
> lrwx------ 1 user181 users 64 Jul 8 18:35 3 -> socket:[621029]
> lrwx------ 1 user181 users 64 Jul 8 18:35 4 -> socket:[621031]
> lrwx------ 1 user181 users 64 Jul 8 18:35 5 ->
> /opt/ibm/RationalSDLC/common/CM/logs/server1/ccrpc/ccrpc_user181_d100708_t183510_p17292_.log
> lrwx------ 1 user181 users 64 Jul 8 18:35 6 -> socket:[621049]
> lrwx------ 1 user181 users 64 Jul 8 18:35 7 -> socket:[622682]
> lrwx------ 1 user181 users 64 Jul 8 18:35 8 -> socket:[621071]
> lrwx------ 1 user181 users 64 Jul 8 18:35 9 -> socket:[621081]
>

Ok, this is not enough files to be triggering vmalloc on fork. Modify your script to report the
sizes and lets see what crops up in terms of usage.
=Comment: #12=================================================
Dang En Ren <rende.com> -

Output from my RHEL5.4 32bit VM: (standard kernel: 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:54 EDT
2009 i686 i686 i386 GNU/Linux)

================================================================================
vmalloc: size=131072
vmalloc: size=131072
vmalloc: size=4194304
vmalloc: size=131072
vmalloc: size=131072
vmalloc: size=4194304
Thu Jul 8 14:22:15 2010 - vmalloc total: 15, vfree total: 0, vmalloc: 6, vfree: 0
vmalloc: size=131072
vmalloc: size=131072
vmalloc: size=4194304
Thu Jul 8 14:22:20 2010 - vmalloc total: 18, vfree total: 0, vmalloc: 3, vfree: 0
vmalloc: size=131072
vmalloc: size=131072
vmalloc: size=4194304
Thu Jul 8 14:22:25 2010 - vmalloc total: 21, vfree total: 0, vmalloc: 3, vfree: 0
....
================================================================================

Please be noted that not every time we can see increase of VmallocUsed, in other words, sometimes
this value won't change as we increase more "ccrpc" processes, but other times it will change, at
this constant pace, till buffer run out.

It's observed on both RHEL5.4 and 5.5, (32bit). Meanwhile, on 64bit RHEL5.4 and 5.5, everytime the
VmallocUsed will increase, but since the upper limited is much higher, our test can still go on.






=Comment: #13=================================================
Mel Gorman <MELGOR.com> -
> vmalloc: size=131072
> vmalloc: size=131072
> vmalloc: size=4194304
> vmalloc: size=131072
> vmalloc: size=131072
> vmalloc: size=4194304
> Thu Jul 8 14:22:15 2010 - vmalloc total: 15, vfree total: 0, vmalloc: 6,

And the sources of these? Your earlier logs included stack traces. I want to know specifically who
or what is allocating 4M but the 128K allocations alone are disturbing enough.
=Comment: #14=================================================
Dang En Ren <rende.com> -
OK.
On RHEL5.4 32bit:
======================================================
Fri Jul 9 01:13:47 2010 - vmalloc total: 0, vfree total: 0, vmalloc: 0, vfree: 0
vmalloc: size=131072
0xc04691fd : vmalloc+0x1/0x10 [kernel]
0xc048c0ad : expand_files+0x9c/0x2d3 [kernel]
0xc0421c15 : dup_fd+0x142/0x2d4 [kernel]
0xc0421de4 : copy_files+0x3d/0x54 [kernel]
0xc0422649 : copy_process+0x4e4/0x1231 [kernel]
0xc04235e4 : do_fork+0x41/0x168 [kernel]
0xc040318b : sys_clone+0x28/0x2d [kernel]
0xc0404f17 : sys_sigreturn+0x1f8/0xf2d [kernel]
backtrace is:
vmalloc: size=131072
0xc04691fd : vmalloc+0x1/0x10 [kernel]
0xc048c0b6 : expand_files+0xa5/0x2d3 [kernel]
0xc0421c15 : dup_fd+0x142/0x2d4 [kernel]
0xc0421de4 : copy_files+0x3d/0x54 [kernel]
0xc0422649 : copy_process+0x4e4/0x1231 [kernel]
0xc04235e4 : do_fork+0x41/0x168 [kernel]
0xc040318b : sys_clone+0x28/0x2d [kernel]
0xc0404f17 : sys_sigreturn+0x1f8/0xf2d [kernel]
backtrace is:
vmalloc: size=4194304
0xc04691fd : vmalloc+0x1/0x10 [kernel]
0xc048c10a : expand_files+0xf9/0x2d3 [kernel]
0xc0421c15 : dup_fd+0x142/0x2d4 [kernel]
0xc0421de4 : copy_files+0x3d/0x54 [kernel]
0xc0422649 : copy_process+0x4e4/0x1231 [kernel]
0xc04235e4 : do_fork+0x41/0x168 [kernel]
0xc040318b : sys_clone+0x28/0x2d [kernel]
0xc0404f17 : sys_sigreturn+0x1f8/0xf2d [kernel]
backtrace is:
Fri Jul 9 01:13:52 2010 - vmalloc total: 3, vfree total: 0, vmalloc: 3, vfree: 0
vmalloc: size=131072
0xc04691fd : vmalloc+0x1/0x10 [kernel]
0xc048c0ad : expand_files+0x9c/0x2d3 [kernel]
0xc0421c15 : dup_fd+0x142/0x2d4 [kernel]
0xc0421de4 : copy_files+0x3d/0x54 [kernel]
0xc0422649 : copy_process+0x4e4/0x1231 [kernel]
0xc04235e4 : do_fork+0x41/0x168 [kernel]
0xc040318b : sys_clone+0x28/0x2d [kernel]
0xc0404f17 : sys_sigreturn+0x1f8/0xf2d [kernel]
backtrace is:
vmalloc: size=131072
0xc04691fd : vmalloc+0x1/0x10 [kernel]
0xc048c0b6 : expand_files+0xa5/0x2d3 [kernel]
0xc0421c15 : dup_fd+0x142/0x2d4 [kernel]
0xc0421de4 : copy_files+0x3d/0x54 [kernel]
0xc0422649 : copy_process+0x4e4/0x1231 [kernel]
0xc04235e4 : do_fork+0x41/0x168 [kernel]
0xc040318b : sys_clone+0x28/0x2d [kernel]
0xc0404f17 : sys_sigreturn+0x1f8/0xf2d [kernel]
backtrace is:
vmalloc: size=4194304
0xc04691fd : vmalloc+0x1/0x10 [kernel]
0xc048c10a : expand_files+0xf9/0x2d3 [kernel]
0xc0421c15 : dup_fd+0x142/0x2d4 [kernel]
0xc0421de4 : copy_files+0x3d/0x54 [kernel]
0xc0422649 : copy_process+0x4e4/0x1231 [kernel]
0xc04235e4 : do_fork+0x41/0x168 [kernel]
0xc040318b : sys_clone+0x28/0x2d [kernel]
0xc0404f17 : sys_sigreturn+0x1f8/0xf2d [kernel]
backtrace is:
======================================================

On RHEL5.5 64bit:
======================================================
vmalloc: size=8192
0xffffffff800d2426 : vmalloc+0x0/0x14 [kernel]
0xffffffff80030ca7 : expand_files+0x124/0x2c7 [kernel]
0xffffffff80022584 : dup_fd+0x133/0x281 [kernel]
0xffffffff800498a9 : copy_files+0x47/0x63 [kernel]
0xffffffff8001f521 : copy_process+0x590/0x15eb [kernel]
0xffffffff8003151c : do_fork+0x69/0x1c1 [kernel]
0xffffffff800628dd : ia32_ptregs_common+0x25/0x4c [kernel]
backtrace is:
vmalloc: size=8192
0xffffffff800d2426 : vmalloc+0x0/0x14 [kernel]
0xffffffff80030ca7 : expand_files+0x124/0x2c7 [kernel]
0xffffffff80022584 : dup_fd+0x133/0x281 [kernel]
0xffffffff800498a9 : copy_files+0x47/0x63 [kernel]
0xffffffff8001f521 : copy_process+0x590/0x15eb [kernel]
0xffffffff8003151c : do_fork+0x69/0x1c1 [kernel]
0xffffffff800628dd : ia32_ptregs_common+0x25/0x4c [kernel]
backtrace is:
Fri Jul 9 01:21:11 2010 - vmalloc total: 2, vfree total: 0, vmalloc: 2, vfree: 0
Fri Jul 9 01:21:16 2010 - vmalloc total: 2, vfree total: 0, vmalloc: 0, vfree: 0
vmalloc: size=8192
0xffffffff800d2426 : vmalloc+0x0/0x14 [kernel]
0xffffffff80030ca7 : expand_files+0x124/0x2c7 [kernel]
0xffffffff80022584 : dup_fd+0x133/0x281 [kernel]
0xffffffff800498a9 : copy_files+0x47/0x63 [kernel]
0xffffffff8001f521 : copy_process+0x590/0x15eb [kernel]
0xffffffff8003151c : do_fork+0x69/0x1c1 [kernel]
0xffffffff800628dd : ia32_ptregs_common+0x25/0x4c [kernel]
backtrace is:
vmalloc: size=8192
0xffffffff800d2426 : vmalloc+0x0/0x14 [kernel]
0xffffffff80030ca7 : expand_files+0x124/0x2c7 [kernel]
0xffffffff80022584 : dup_fd+0x133/0x281 [kernel]
0xffffffff800498a9 : copy_files+0x47/0x63 [kernel]
0xffffffff8001f521 : copy_process+0x590/0x15eb [kernel]
0xffffffff8003151c : do_fork+0x69/0x1c1 [kernel]
0xffffffff800628dd : ia32_ptregs_common+0x25/0x4c [kernel]
backtrace is:
======================================================
=Comment: #15=================================================
Mel Gorman <MELGOR.com> -
(In reply to comment #14)
> OK.
> On RHEL5.4 32bit:
> ======================================================
> Fri Jul 9 01:13:47 2010 - vmalloc total: 0, vfree total: 0, vmalloc: 0,
> vfree: 0
> vmalloc: size=131072
> 0xc04691fd : vmalloc+0x1/0x10 [kernel]
> 0xc048c0ad : expand_files+0x9c/0x2d3 [kernel]
> 0xc0421c15 : dup_fd+0x142/0x2d4 [kernel]
> 0xc0421de4 : copy_files+0x3d/0x54 [kernel]
> 0xc0422649 : copy_process+0x4e4/0x1231 [kernel]
> 0xc04235e4 : do_fork+0x41/0x168 [kernel]
> 0xc040318b : sys_clone+0x28/0x2d [kernel]
> 0xc0404f17 : sys_sigreturn+0x1f8/0xf2d [kernel]
> backtrace is:
> vmalloc: size=131072
> 0xc04691fd : vmalloc+0x1/0x10 [kernel]
> 0xc048c0b6 : expand_files+0xa5/0x2d3 [kernel]
> 0xc0421c15 : dup_fd+0x142/0x2d4 [kernel]
> 0xc0421de4 : copy_files+0x3d/0x54 [kernel]
> 0xc0422649 : copy_process+0x4e4/0x1231 [kernel]
> 0xc04235e4 : do_fork+0x41/0x168 [kernel]
> 0xc040318b : sys_clone+0x28/0x2d [kernel]
> 0xc0404f17 : sys_sigreturn+0x1f8/0xf2d [kernel]
> backtrace is:
> vmalloc: size=4194304
> 0xc04691fd : vmalloc+0x1/0x10 [kernel]
> 0xc048c10a : expand_files+0xf9/0x2d3 [kernel]
> 0xc0421c15 : dup_fd+0x142/0x2d4 [kernel]
> 0xc0421de4 : copy_files+0x3d/0x54 [kernel]
> 0xc0422649 : copy_process+0x4e4/0x1231 [kernel]
> 0xc04235e4 : do_fork+0x41/0x168 [kernel]
> 0xc040318b : sys_clone+0x28/0x2d [kernel]
> 0xc0404f17 : sys_sigreturn+0x1f8/0xf2d [kernel]

Excellent, so it is files that are causing the problem during fork. However, you said that there are
very few files open but maybe the PID you checked was wrong. Update the script to include the PID
and name of the process calling vmalloc() and see what how many files it has open. However, it's
also possible something else has artifically pushed this up. You could also create a systemtap
script that probes open and reports when the number of open files are over some high threshold and
see can you identify which of the application processes are causing problems.
=Comment: #23=================================================
Dang En Ren <rende.com> -
Yes, this is still an issue. (Please refer to my comment #19 a few months ago.)

comment #24
===============================================================
I see two issues

1. Why do we allocate 4M from vmalloc space when we have only 306 files open?
2. Why are we running out of vmalloc space

2 depends on 1. But 2 has been solved in mainline, where we try to kmalloc
prior to using vmalloc.

The mainline commit id a892e2d7dcdfa6c76e60c50a8c7385c65587a2a6 which should
probably address this issue.




=Comment: #33=================================================
Dang En Ren <rende.com> -
Hi, Balbir, I built the new RHEL5u5 kernel with the patch in your last comment and it does work.


I tried for RHEL5u5 x86_64, on a x3650 box and a VMWare virtual images, same result..

Since the patch is arch-independent and has been adopted in mainline kernel, I assume it's safe and
works for 32bit OS as well.

I tend to close this one as "fixed" and have a document for our own product. Shaila and others,
please let me know your thoughts.

Thanks to all the folks here for driving this issue to an end :)

Comment 4 RHEL Program Management 2011-03-11 15:19:19 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Jarod Wilson 2011-05-13 22:19:36 UTC
Patch(es) available in kernel-2.6.18-261.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 11 Eryu Guan 2011-05-16 11:25:44 UTC
Reproduced on -238 kernel, with simple reproducer attached which open 2048 files and fork children increasingly. 

On -238 kernel I got the following from dmesg
allocation failed: out of vmalloc space - use vmalloc=<size> to increase size.

On -261 kernel, the reproducer can exit normally.

You should modify the open file limit first
[root@localhost ~]# sysctl -a | grep fs.file-max
fs.file-max = 294829
[root@localhost ~]# ulimit -n 294829
[root@localhost 681586-out-of-vmalloc-space]# uname -a
Linux localhost.localdomain 2.6.18-261.el5 #1 SMP Thu May 12 16:46:44 EDT 2011 i686 i686 i386 GNU/Linux
[root@localhost 681586-out-of-vmalloc-space]# ./reproducer 15
[root@localhost 681586-out-of-vmalloc-space]# dmesg  <======== No error msg here

Comment 12 Eryu Guan 2011-05-16 11:26:38 UTC
Created attachment 499110 [details]
simple reproducer

Comment 13 IBM Bug Proxy 2011-05-20 13:20:32 UTC
------- Comment From lnx1138.ibm.com 2011-05-20 09:11 EDT-------
According to Red Hat, this bug this has been verified.

Comment 14 errata-xmlrpc 2011-07-21 10:11:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html


Note You need to log in before you can comment on or make changes to this bug.