Bug 137927

Summary:    Process memory usage incorrect in top.
Product:    Red Hat Enterprise Linux 3
Component:  kernel
Version:    3.0
Hardware:   i386
OS:         Linux
Reporter:   Jason Smith <smithj4>
Assignee:   Rik van Riel <riel>
Status:     CLOSED ERRATA
Severity:   medium
Priority:   medium
CC:         anderson, coughlan, george_robinson, jorton, kzak, petrides, rodrigo, v, villapla, wirth
Doc Type:   Bug Fix
Last Closed: 2004-12-20 20:56:53 UTC

Description Jason Smith 2004-11-02 21:34:57 UTC
Description of problem:
The top utility reports incorrect memory-usage values for several
long-running processes.

Version-Release number of selected component (if applicable):
procps-2.0.17-10

How reproducible:
The reported memory usage grows the longer the processes have been running.

Actual results:
 16:32:50  up 14 days,  3:54, 23 users,  load average: 0.33, 0.17, 0.06
133 processes: 132 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   10.4%    0.6%    5.2%   0.2%     0.0%    0.2%  182.8%
           cpu00    0.6%    0.0%    0.3%   0.0%     0.0%    0.0%   99.0%
           cpu01    9.9%    0.6%    4.9%   0.3%     0.0%    0.3%   83.8%
Mem:   509876k av,  495460k used,   14416k free,       0k shrd,   15048k buff
                    339952k actv,   62768k in_d,    6932k in_c
Swap: 1044216k av,  398044k used,  646172k free                  106360k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
24717 smithj4   15   0 1591M 1.6G  3720 S     4.6 319.4 248:32   1 gkrellm
 5148 smithj4   15   0  905M 894M 14876 S     0.0 179.7 109:55   0 mozilla-bin
 6232 smithj4   15   0  805M 785M  5460 S     0.3 157.7  99:32   1 gnome-terminal
 4625 root      15   0  870M 582M  7512 S     3.7 117.0 647:44   0 X
 4726 smithj4   15   0  467M 463M  5184 S     0.3 93.0  35:02   0 gnome-panel
 4472 root      15   0 98968  96M   540 S     0.0 19.4   0:01   0 crond


Expected results:
The ps command shows that gkrellm, for example, is using around 17 MB of
memory instead of the nearly 1.6 GB that top shows:

# ps aux | grep gkrellm
smithj4  24717  2.3  0.9 17080 4924 ?        S    Oct26 248:50 gkrellm


Additional info:
Compiled the attachment shown here:
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=106011&action=view

and ran the executable:

# ./pagesize
getpagesize()=4096, PAGE_SHIFT: 12, pgshift: 2
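
(For reference -- this is not the attachment above, just a rough Python
equivalent of the same sanity check: with a 4096-byte page, PAGE_SHIFT is 12,
and the pages-to-kilobytes shift procps uses is 12 - 10 = 2.)

python -c "
import os, math
page = os.sysconf('SC_PAGE_SIZE')            # 4096 on this hardware
page_shift = int(round(math.log(page, 2)))   # 12
print('getpagesize()=%d, PAGE_SHIFT: %d, pgshift: %d'
      % (page, page_shift, page_shift - 10)) # pgshift 2 converts pages -> KB
"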

Comment 1 Joe Orton 2004-11-03 16:05:26 UTC
1591M is 319.4% of *physical* RAM, 509876k - I expect this is desired
behaviour?
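
(For reference, a quick check of that arithmetic -- 1591M expressed in kB
against the 509876k of physical memory shown above:)

python -c "print(1591 * 1024 / 509876.0)"   # ~3.19, i.e. roughly the 319.4% top reports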

Comment 2 Karel Zak 2004-11-03 16:35:26 UTC
Joe, the problem is not %MEM -- that is probably computed correctly.
The problem is SIZE and RSS: "ps" reports them correctly, while "top"
shows strange values.

Comment 3 Karel Zak 2004-11-04 08:07:06 UTC
I think the problem has been resolved (thanks to Jason Smith).

Jason is running an unstable kernel (see:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121434#c198), and
this kernel produces wrong /proc/<pid>/statm data.

Dumps:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 6232 smithj4   15   0  913M 894M  5300 S     0.4 179.6 101:07   1 gnome-terminal


# cat /proc/6232/statm
233769 228990 1325 187 694 228109 2697
       ^^^^^^

After conversion from pages to KB it's: 
python -c "print 228990 << 2"
915960

So about 915 MB, which is almost the same as the "top" output.

# cat /proc/6232/stat
6232 (gnome-terminal) S 1 4726 4726 0 -1 4194304 278393 3980530 51621
4394127 560022 46748 42882 17739 15 0 0 0 849277 58626048 5720
                                                          ^^^^
4294967295 134512640 134802800 3221203328 3221202796 45691177 0 0 4096
66800 3222490981 0 0 17 0 0 0 560022 46748 42882 17739

In the "stat" file is probably right value:
python -c "print 5720 << 2"
22880

So about 22 MB -- and this "stat" file is what the "ps" utilities use.

You can cross-check everything with:

# cat /proc/6232/status
Name:   gnome-terminal
State:  S (sleeping)
Tgid:   6232
Pid:    6232
PPid:   1
TracerPid:      0
Uid:    1829    1829    1829    1829
Gid:    31016   31016   31016   31016
FDSize: 256
Groups: 31016
VmSize:    57252 kB
VmLck:         0 kB
VmRSS:     22884 kB
           ^^^^^^
VmData:    36472 kB
VmStk:       148 kB
VmExe:       284 kB
VmLib:     14048 kB
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 00000000800104f0
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000

... so the problem is the reporter's unstable kernel.



Comment 5 Ernie Petrides 2004-11-04 20:19:19 UTC
If this is deemed a kernel problem, please reassign to Dave Anderson.


Comment 6 Tom Coughlan 2004-11-05 13:04:58 UTC
We tested 2.4.21-23.ELsmp (RHEL 3 U4 beta) and found that the RSS
values reported by top and by ps aux do not agree.

Karel Zak determined that /proc/<pid>/stat and /proc/<pid>/statm do
not agree on this value.  He wrote a handy script to test this, and
tried various kernels.  The results:

It passes on:

2.4.9-e.49smp   (people.redhat.com)
2.4.21-11.ELsmp (porkchop)
2.4.26 #1 SMP   (Debian)
2.6.8-1.521     (my FC box)
2.4.20-31.9smp  (Red Hat Linux release 9 (Shrike))

It fails on 2.4.21-23.ELsmp.

The test is available at:

http://people.redhat.com/kzak/procps/proc-mem-test.py

usage:   ps -A -opid= | ./proc-mem-test.py
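
(The script itself is only linked above. As a rough, hypothetical sketch --
not Karel's actual script -- the check amounts to comparing the rss field of
/proc/<pid>/stat with the resident field of /proc/<pid>/statm, both counted
in pages, for each PID read from stdin:)

#!/usr/bin/env python
# Hypothetical sketch of the check, not the actual proc-mem-test.py.
# Usage (matching the above):  ps -A -opid= | ./sketch.py
import sys

for line in sys.stdin:
    pid = line.strip()
    if not pid:
        continue
    try:
        stat = open('/proc/%s/stat' % pid).read()
        statm = open('/proc/%s/statm' % pid).read()
    except IOError:
        continue                                         # process exited meanwhile
    # comm may contain spaces, so parse the fields after the closing ')'
    stat_rss = int(stat.rsplit(')', 1)[1].split()[21])   # field 24: rss, in pages
    statm_rss = int(statm.split()[1])                     # field 2: resident, in pages
    status = 'OK' if abs(stat_rss - statm_rss) <= 16 else 'FAILED'   # tolerance is arbitrary here
    print('%-6s %-7s stat=%d statm=%d' % (pid, status, stat_rss, statm_rss))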

Comment 7 Ville Herva 2004-11-12 11:30:13 UTC
*** Bug 136630 has been marked as a duplicate of this bug. ***

Comment 8 Helmut Wirth 2004-12-03 19:13:32 UTC
We see this behavior on kernel-smp-2.4.21-25.EL with a proprietary
monitoring application, too. The numbers in "ps vx" are right, but top
claims an RSS of several gigabytes after some hours.

Comment 9 John Caruso 2004-12-03 20:27:20 UTC
Just another "me too."  Look at this top listing:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 1363 user1     15   0  122G 121G  4924 S     0.2 1556.6 318:10   3 prog
 1369 user1     15   0 46.0G  45G  4876 S     0.4 586.8  19:05   3 prog
 1366 user1     15   0 45.7G  45G  4828 S     0.0 582.2  20:55   1 prog
 1360 user1     15   0 21.2G  21G  4912 S     0.0 271.0   9:40   0 prog

And it continues like that.  I'm sure it goes without saying that 
these processes were nowhere near this size.  This is on the current 
RHEL3 beta kernel (2.4.21-25.ELsmp).

Here's a more specific example...top currently gives the following 
output for process 8866:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 8866 user1     20   0 1246M 1.2G  5488 S     2.3 15.5   1:04   3 prog

But ps shows the following:

   user1   8866  8.5  2.8 253844 232456 ?     S    11:51   0:51  prog

And summing up the sizes from /proc/8866/maps gives 259936256 bytes, 
or 253844K, which exactly agrees with the ps total.  Also, top's 
displayed size for this process grew from 561M to 1246M during the 
time it took me to type in this comment--although the ps 
and /proc/8866/maps values haven't budged.
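
(For reference, summing the ranges in /proc/<pid>/maps can be done along
these lines; 8866 is the PID above:)

python -c "
total = 0
for line in open('/proc/8866/maps'):
    start, end = line.split()[0].split('-')         # e.g. 08048000-080ec000
    total += int(end, 16) - int(start, 16)
print('%d bytes = %d KB' % (total, total // 1024))  # 259936256 / 253844 here
"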

If it helps, here's the output for stat and statm for this same 
process:

# cat /proc/8866/stat
8866 (prog) S 1 8866 8866 0 -1 256 28884 423 2238 704 298 48 1 1 20 0 
0 0 7303209 259936256 58206 4294967295 134512640 135480512 3221217376 
3221212292 3076385297 0 4096 528384 16395 3222736608 0 0 17 3 0 0 
5267 1120 3632 916

# cat /proc/8866/statm
317160 317150 1372 272 263316 53562 785


Comment 10 John Caruso 2004-12-03 21:02:58 UTC
BTW, Karel's test script fails on our kernel (2.4.21-25.ELsmp).  89 
processes show FAILED and just 27 show OK.  And one of the failed 
processes was in fact the python interpreter for the script itself, 
which obviously hadn't been running for very bloody long. :-)


Comment 11 Ernie Petrides 2004-12-04 00:40:25 UTC
A fix for this problem was committed to the RHEL3 U4 patch
pool Wednesday evening (in kernel version 2.4.21-27.EL).


Comment 12 Ernie Petrides 2004-12-04 00:52:27 UTC
*** Bug 138101 has been marked as a duplicate of this bug. ***

Comment 13 Helmut Wirth 2004-12-08 11:39:47 UTC
Yes, 2.4.21-27.ELsmp seems to fix the issue completely;
proc-mem-test.py shows all processes OK and, in summary, "test PASS".

Comment 14 Ernie Petrides 2004-12-10 02:56:46 UTC
The fix for this problem has also been committed to the RHEL3 U5
patch pool this evening (in kernel version 2.4.21-27.3.EL).


Comment 15 John Flanagan 2004-12-20 20:56:53 UTC
An erratum has been issued which should help the problem
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html


Comment 16 rodrigo 2005-05-08 00:27:23 UTC
Hi, we have Linux rac5 2.4.21-27.EL #1 SMP Wed Dec 1 21:54:21 EST 2004 ia64
ia64 ia64 GNU/Linux.

We have the same problem.

RHEL 3 Update 4.

Comment 17 Ernie Petrides 2005-05-09 20:41:34 UTC
In response to comment #16, the problem originally reported in this bug
was *fixed* in U4 (in 2.4.21-27.EL).  If you are still seeing a problem,
please open a new bug report with an exact description of the problem and a
way to reproduce it.  Thanks in advance.


Comment 18 Ernie Petrides 2005-05-09 20:42:52 UTC
Never mind, I just noticed that you already opened bug 157171.  Thanks.