Bug 851508

Summary: Bogus (excessive) /proc/pid/status VmSize, VmRSS values (& ps output) for threaded processes
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.6
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: paul manser <pmanser>
Assignee: Red Hat Kernel Manager <kernel-mgr>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: chrisp, mvaliyav
Last Closed: 2012-10-04 09:44:58 UTC
Type: Bug
Attachments:
  pgm: malloc 2gb, create 'n' threads, do nothing, quit after 100s
  corrected test case - uncommented comment in first version

Description paul manser 2012-08-24 11:49:46 UTC
Created attachment 606839 [details]
pgm: malloc 2gb, create 'n' threads, do nothing, quit after 100s

Description of problem:
Multi-threaded programs such as the attached example show incorrect/excessive
VmSize and VmRSS values in the /proc/pid/status file and in ps command
output. We have large applications that show the problem, but the small
attached example also reproduces it.
Across multiple machines with the same CPU architecture (but varying RAM/disk
sizes), the problem consistently occurs on some machines and never on others.

Version-Release number of selected component (if applicable):

Example machine:
lc-cam-512 ~ > uname -a
Linux lc-cam-512 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

The issue has not been seen on the few RHEL 5.8 machines we have available.

How reproducible:

Compile and run the attached example program, and monitor the associated
/proc/pid/status file and/or the ps output.
The program allocates 2 GB of memory and starts 'n' (first argument) threads;
it runs for 100 s, with both the threads and the master thread sleeping and
waking up once per second, doing no other processing.

Compile with:
gcc thread_experience.c -D_REENTRANT -lpthread
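
The attachment itself is not inlined in this report. As a rough illustration
only, a minimal test case along the lines described above could look like the
sketch below (the real thread_experience.c may differ in details such as
timing and error handling):

/* Illustrative sketch: allocate ~2 GB, start argv[1] threads that just
 * sleep, and exit after roughly 100 seconds. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *idle_thread(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < 100; i++)     /* wake up once per second, do nothing */
        sleep(1);
    return NULL;
}

int main(int argc, char **argv)
{
    int i, nthreads = (argc > 1) ? atoi(argv[1]) : 5;
    pthread_t *tids;
    char *buf = malloc(2UL * 1024 * 1024 * 1024);   /* ~2 GB allocation */

    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    tids = calloc(nthreads, sizeof(*tids));
    for (i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, idle_thread, NULL);

    for (i = 0; i < 100; i++)     /* master also sleeps for ~100 s */
        sleep(1);

    for (i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    free(buf);
    return 0;
}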

Steps to Reproduce:
1. a.out 5
2. Periodically inspect /proc/pid/status during execution.
  
Actual results:

ps -o pid,ppid,pcpu,rss,size,vsize,cputime,lstart -Lp 8205
  PID  PPID %CPU   RSS    SZ    VSZ     TIME                  STARTED
 8205  8203  0.0   608 23068996 23074668 00:00:00 Fri Aug 24 12:43:49 2012
 8205  8203  0.0   608 23068996 23074668 00:00:00 Fri Aug 24 12:43:49 2012
 8205  8203  0.0   608 23068996 23074668 00:00:00 Fri Aug 24 12:43:49 2012
 8205  8203  0.0   608 23068996 23074668 00:00:00 Fri Aug 24 12:43:49 2012
 8205  8203  0.0   608 23068996 23074668 00:00:00 Fri Aug 24 12:43:49 2012
 8205  8203  0.0   608 23068996 23074668 00:00:00 Fri Aug 24 12:43:49 2012

Name:   a.out
State:  S (sleeping)
SleepAVG:       78%
Tgid:   823
Pid:    823
PPid:   821
TracerPid:      0
Uid:    26235   26235   26235   26235
Gid:    20      20      20      20
FDSize: 256
Groups: 20 1165 51256
VmPeak: 23074668 kB
VmSize: 23074668 kB
VmLck:         0 kB
VmHWM:       604 kB
VmRSS:       604 kB
VmData: 23068900 kB
VmStk:        96 kB
VmExe:         4 kB
VmLib:      1536 kB
VmPTE:        64 kB
StaBrk: 12670000 kB
Brk:    12691000 kB
StaStk: 7fffe24f7160 kB
Threads:        6

Expected results:

ps -o pid,ppid,pcpu,rss,size,vsize,cputime,lstart -Lp 1233
  PID  PPID %CPU   RSS    SZ    VSZ     TIME                  STARTED
 1233   840  0.0   596 2107708 2113380 00:00:00 Fri Aug 24 12:42:10 2012
 1233   840  0.0   596 2107708 2113380 00:00:00 Fri Aug 24 12:42:10 2012
 1233   840  0.0   596 2107708 2113380 00:00:00 Fri Aug 24 12:42:10 2012
 1233   840  0.0   596 2107708 2113380 00:00:00 Fri Aug 24 12:42:10 2012
 1233   840  0.0   596 2107708 2113380 00:00:00 Fri Aug 24 12:42:10 2012
 1233   840  0.0   596 2107708 2113380 00:00:00 Fri Aug 24 12:42:10 2012


Name:   a.out
State:  S (sleeping)
SleepAVG:       98%
Tgid:   28425
Pid:    28425
PPid:   28374
TracerPid:      0
Uid:    26235   26235   26235   26235
Gid:    20      20      20      20
FDSize: 64
Groups: 20 1165 51256
VmPeak:  2113380 kB
VmSize:  2113380 kB
VmLck:         0 kB
VmHWM:       596 kB
VmRSS:       596 kB
VmData:  2107620 kB
VmStk:        88 kB
VmExe:         4 kB
VmLib:      1536 kB
VmPTE:        68 kB
StaBrk: 04e81000 kB
Brk:    04ea2000 kB
StaStk: 7fff6c923a30 kB
Threads:        6



Additional info:

Comment 1 paul manser 2012-08-24 13:14:08 UTC
Created attachment 606858 [details]
corrected test case - uncommented comment in first version

Comment 2 paul manser 2012-08-24 14:51:57 UTC
I have now identified some additional details. It seems the VmSize and VmRSS
values (and the ps output) are in fact correct, and that the same binary is
sometimes showing an additional 4 GB (4194304 kB) of virtual address space
per thread, which makes the total about 2 + 4*5 = 22 GB in the above example.

The additional space is allocated only on some machines, and then only if the
job is executed by the LSF (from Platform/IBM) job distribution system, in
which case the job has a process tree like the following:

res,12350 -d /tools/lsf/Current/conf -m xl-cbga-03 /home/pmanser/.lsbatch/1345818946.181820
  |-1345818946.1818,12377 /home/pmanser/.lsbatch/1345818946.181820
  |   `-a.out,12379 5
  |       |-{a.out},12380
  |       |-{a.out},12381
  |       |-{a.out},12382
  |       |-{a.out},12383
  |       `-{a.out},12384
  `-{res},12378

If the program is run directly on the machine from a standard Linux shell,
I see no issue.

So somehow the executing LSF program is modifying the behaviour of the
threads.

Inspecting the dynamically linked libraries shows the same libraries whether
the job is executed directly or via LSF, e.g.:
lc-cam-513 spice_leakage > grep lib64 /proc/12914/maps
3752600000-375261c000 r-xp 00000000 08:01 24674657                       /lib64/ld-2.5.so
375281b000-375281c000 r--p 0001b000 08:01 24674657                       /lib64/ld-2.5.so
375281c000-375281d000 rw-p 0001c000 08:01 24674657                       /lib64/ld-2.5.so
3752e00000-3752f4e000 r-xp 00000000 08:01 24674668                       /lib64/libc-2.5.so
3752f4e000-375314e000 ---p 0014e000 08:01 24674668                       /lib64/libc-2.5.so
375314e000-3753152000 r--p 0014e000 08:01 24674668                       /lib64/libc-2.5.so
3753152000-3753153000 rw-p 00152000 08:01 24674668                       /lib64/libc-2.5.so
3753a00000-3753a16000 r-xp 00000000 08:01 24674670                       /lib64/libpthread-2.5.so
3753a16000-3753c15000 ---p 00016000 08:01 24674670                       /lib64/libpthread-2.5.so
3753c15000-3753c16000 r--p 00015000 08:01 24674670                       /lib64/libpthread-2.5.so
3753c16000-3753c17000 rw-p 00016000 08:01 24674670                       /lib64/libpthread-2.5.so
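
For reference, the extra virtual space should show up as large anonymous
mappings rather than as library mappings; one way to spot them (the pid below
is illustrative) is:

# large anonymous regions, roughly one ~4 GB reservation per thread,
# should appear near the bottom of the sorted output
pmap -x 12379 | sort -n -k2 | tail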


Thanks for any ideas on the likely cause.

Regards, Paul

Comment 3 paul manser 2012-09-10 10:49:04 UTC
Hi,
   We believe we have identified the cause.

If the process stack size limit is set to a finite value by the executing
shell, rather than to unlimited, then an additional virtual memory
reservation of (stack size limit) x (thread count) is made. If, however, the
stack size limit is set to unlimited, then no additional virtual memory is
reserved per thread.
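
One way to compare the two cases from a shell (the limit value and pid
handling below are illustrative, not taken from the original report):

# in one shell: a finite soft stack limit -> VmSize grows by roughly
# (limit x number of threads); 4194304 kB matches what we see under LSF
ulimit -S -s 4194304
./a.out 5 &
grep VmSize /proc/$!/status

# in a fresh shell: an unlimited stack limit -> threads fall back to the
# glibc default stack size, so VmSize stays close to the 2 GB allocation
ulimit -S -s unlimited
./a.out 5 &
grep VmSize /proc/$!/status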

Regards, Paul

Comment 4 Minto Joseph 2012-10-04 09:44:58 UTC
I am able to reproduce the issue by raising the stack size limit using ulimit:

# ulimit -s 1000240
# ./a.out 5
[..]

# ps -o pid,ppid,pcpu,rss,size,vsize,cputime,lstart -Lp 31913
  PID  PPID %CPU   RSS    SZ    VSZ     TIME                  STARTED
31913 31779  0.0   584 7098644 7104312 00:00:00 Wed Oct  3 11:49:57 2012
31913 31779  0.0   584 7098644 7104312 00:00:00 Wed Oct  3 11:49:57 2012
31913 31779  0.0   584 7098644 7104312 00:00:00 Wed Oct  3 11:49:57 2012
31913 31779  0.0   584 7098644 7104312 00:00:00 Wed Oct  3 11:49:57 2012
31913 31779  0.0   584 7098644 7104312 00:00:00 Wed Oct  3 11:49:57 2012
31913 31779  0.0   584 7098644 7104312 00:00:00 Wed Oct  3 11:49:57 2012


This is expected behaviour: roughly 2 GB from malloc() plus 5 threads x
~1000240 kB of reserved thread stack comes to about 7.1 GB, which matches the
VSZ shown above.

From 'man pthread_create':

On Linux/x86-32, the default stack size for a new thread is 2 megabytes.  Under the NPTL threading implementation, if the RLIMIT_STACK soft resource  limit  at  the  time  the  program  started has any value other than "unlimited", then it determines the default stack size of new threads.  Using pthread_attr_setstacksize(3), the stack size attribute can be explicitly set in the attr argument used to create a thread, in order to obtain a stack size other than the default.
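
As an aside (an illustrative sketch, not part of the original report), a
program that wants a fixed per-thread stack regardless of the shell's
RLIMIT_STACK can set it explicitly via pthread_attr_setstacksize():

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void *worker(void *arg) { (void)arg; return NULL; }

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    pthread_attr_init(&attr);
    /* request a 1 MiB stack (must be >= PTHREAD_STACK_MIN) instead of
       inheriting RLIMIT_STACK as the default thread stack size */
    pthread_attr_setstacksize(&attr, 1024 * 1024);
    rc = pthread_create(&tid, &attr, worker, NULL);
    if (rc != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(rc));
    else
        pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}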

Closing the bugzilla.

Comment 5 Chris Pugmire 2015-07-13 23:49:35 UTC
This bug/issue is incorrectly diagnosed. I spent many hours with a similar problem before discovering that the true cause has nothing to do with setstacksize(). I'm posting this additional info here in case others hit the same problem.

The problem is that malloc() may create additional arenas for each new thread based on the first thread's malloc usage; the result can be that each thread gets an address space similar to the main program's, e.g. if your main thread is using 100 MB, then each thread will get an address space of another 100 MB.

The impact of this on the system is hard to determine, but it does seem to cause performance issues in some situations.  

It appears that doing the following at the start of your program can stop this from occurring:

    /* -7 and -8 are the M_ARENA_TEST / M_ARENA_MAX mallopt() parameters;
       numeric values are used because older glibc headers may not name them */
    rv = mallopt(-7, 1);  // M_ARENA_TEST
    rv = mallopt(-8, 1);  // M_ARENA_MAX
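
A self-contained sketch of the same idea is below; placing the calls before
any threads are created is what matters. On glibc builds that support it,
setting the MALLOC_ARENA_MAX environment variable to 1 before starting the
process should be an equivalent knob, though that is an assumption worth
verifying against your glibc version.

#include <malloc.h>
#include <stdio.h>

int main(void)
{
    /* limit glibc malloc to a single arena; call before creating any
       threads so no per-thread arenas are set up */
    int ok1 = mallopt(-7, 1);   /* M_ARENA_TEST */
    int ok2 = mallopt(-8, 1);   /* M_ARENA_MAX */
    if (!ok1 || !ok2)
        fprintf(stderr, "mallopt arena parameters not accepted by this glibc\n");

    /* ... create threads and do the real work here ... */
    return 0;
}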

The real issue is described here by people who know much more than me:

https://sourceware.org/bugzilla/show_bug.cgi?id=11261