Bug 170078

Summary: Top shows high CPU utilization but processes don't add up to it
Product: Red Hat Enterprise Linux 4
Component: procps
Version: 4.0
Hardware: i386
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Reporter: Dan Eisner <deisner>
Assignee: Tomas Smetana <tsmetana>
QA Contact: Brian Brock <bbrock>
CC: marcmo
Doc Type: Bug Fix
Last Closed: 2007-11-28 13:12:39 UTC
Attachments:
- sample of the problem (attachment 119697)
- Screenshot of TradeManager on 2.6.9-11 (attachment 120175)
- Screenshot of TradeManager on 2.6.9-22 (attachment 120176)

Description Dan Eisner 2005-10-07 01:00:20 UTC

Description of problem:
Top shows total CPU utilization in the 30-60% range, but the processes it lists use less than 10% in total. We are running a server that uses significant CPU, so the "total" is probably right, but the processes responsible don't show up on the list (it is sorted by CPU usage). Screenshot below.

Version-Release number of selected component (if applicable):
procps-3.2.3-7EL

How reproducible:
Always

Steps to Reproduce:
1. Run top

Actual Results:  Top doesn't show the CPU-bound processes as using CPU.

Expected Results:  The application should be at the top of the list, and the per-process CPU usage should add up to the totals displayed.

Additional info:

Comment 1 Dan Eisner 2005-10-07 01:04:19 UTC
Created attachment 119697 [details]
sample of the problem

Comment 2 Karel Zak 2005-10-12 13:11:53 UTC
Please post the output of "uname -a". Thanks.

Note that older RHEL4 kernels do not report the combined utime/stime of all of
a process's threads in /proc/<pid>/stat. This means that the per-process %CPU
usage doesn't count threads :-( This problem has been fixed in RHEL4 kernels
>= 2.6.9-22.EL.
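
The per-thread accounting described above can be checked by hand. The following is a minimal sketch, not the procps implementation: it assumes the kernel exposes per-thread stat files under /proc/<pid>/task/<tid>/stat and that utime and stime are fields 14 and 15 of each stat line. The function names are made up for illustration.

```python
from pathlib import Path


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc stat line."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces,
    # so split on the last ')' instead of splitting the whole line.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]) + int(rest[12])


def process_ticks_with_threads(pid: int) -> int:
    """Sum utime + stime over every thread listed under /proc/<pid>/task.

    Assumption: the task directory exists on the kernel being inspected.
    """
    total = 0
    for stat_file in Path(f"/proc/{pid}/task").glob("*/stat"):
        try:
            total += cpu_ticks(stat_file.read_text())
        except (OSError, IndexError, ValueError):
            continue  # the thread exited between listing and reading
    return total
```

If the total from process_ticks_with_threads() grows much faster than the utime/stime in /proc/<pid>/stat itself, the kernel is probably not folding thread time into the process totals.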

Comment 3 Dan Eisner 2005-10-12 18:55:03 UTC
It looks like I'm using an earlier kernel than that. I'll try to upgrade and see
if that fixes it. Thanks.

uname -a:
Linux korea-eye2.walleyetrading.net 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386 GNU/Linux

Comment 4 Dan Eisner 2005-10-17 18:39:52 UTC
Nope -- turns out that wasn't the problem, because I upgraded the kernel and
kernel-utils, and I'm still seeing the same thing.

[deisner@korea-eye2 ~]$ uname -a
Linux korea-eye2.walleyetrading.net 2.6.9-11.ELsmp #1 SMP Wed Jun 8 17:54:20 CDT 2005 i686 i686 i386 GNU/Linux


Here is some additional data:

[deisner@korea-eye2 ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.60GHz
stepping        : 3
cpu MHz         : 3592.515
cache size      : 2048 KB
physical id     : 0
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl est tm2 cid xtpr
bogomips        : 7094.27

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.60GHz
stepping        : 3
cpu MHz         : 3592.515
cache size      : 2048 KB
physical id     : 3
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl est tm2 cid xtpr
bogomips        : 7176.19

[deisner@korea-eye2 ~]$ cat /proc/meminfo
MemTotal:      3115132 kB
MemFree:       2090640 kB
Buffers:        233036 kB
Cached:         474944 kB
SwapCached:          0 kB
Active:         600444 kB
Inactive:       251552 kB
HighTotal:     2227968 kB
HighFree:      1603456 kB
LowTotal:       887164 kB
LowFree:        487184 kB
SwapTotal:     2097144 kB
SwapFree:      2097144 kB
Dirty:              88 kB
Writeback:           0 kB
Mapped:         196380 kB
Slab:           160428 kB
Committed_AS:   626156 kB
PageTables:       2332 kB
VmallocTotal:   106488 kB
VmallocUsed:      3152 kB
VmallocChunk:   102872 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

[deisner@korea-eye2 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux ES release 4 (Nahant)

[deisner@korea-eye2 ~]$ cat /proc/pci
PCI devices found:
  Bus  0, device   0, function  0:
    Class 0600: PCI device 8086:3590 (rev 9).
  Bus  0, device   2, function  0:
    Class 0604: PCI device 8086:3595 (rev 9).
      IRQ 121.
      Master Capable.  No bursts.  Min Gnt=7.
  Bus  0, device   4, function  0:
    Class 0604: PCI device 8086:3597 (rev 9).
      IRQ 121.
      Master Capable.  No bursts.  Min Gnt=7.
  Bus  0, device   5, function  0:
    Class 0604: PCI device 8086:3598 (rev 9).
      IRQ 121.
      Master Capable.  No bursts.  Min Gnt=7.
  Bus  0, device   6, function  0:
    Class 0604: PCI device 8086:3599 (rev 9).
      IRQ 121.
      Master Capable.  No bursts.  Min Gnt=7.
  Bus  0, device  29, function  0:
    Class 0c03: PCI device 8086:24d2 (rev 2).
      IRQ 121.
      I/O at 0xbce0 [0xbcff].
  Bus  0, device  29, function  1:
    Class 0c03: PCI device 8086:24d4 (rev 2).
      IRQ 137.
      I/O at 0xbcc0 [0xbcdf].
  Bus  0, device  29, function  2:
    Class 0c03: PCI device 8086:24d7 (rev 2).
      IRQ 129.
      I/O at 0xbca0 [0xbcbf].
  Bus  0, device  29, function  7:
    Class 0c03: PCI device 8086:24dd (rev 2).
      IRQ 161.
      Non-prefetchable 32 bit memory at 0xdff00000 [0xdff003ff].
  Bus  0, device  30, function  0:
    Class 0604: PCI device 8086:244e (rev 194).
      Master Capable.  No bursts.  Min Gnt=11.
  Bus  0, device  31, function  0:
    Class 0601: PCI device 8086:24d0 (rev 2).
  Bus  0, device  31, function  1:
    Class 0101: PCI device 8086:24db (rev 2).
      I/O at 0xfc00 [0xfc0f].
      Non-prefetchable 32 bit memory at 0xbffff000 [0xbffff3ff].
  Bus  1, device   0, function  0:
    Class 0604: PCI device 8086:0330 (rev 6).
      Master Capable.  No bursts.  Min Gnt=7.
  Bus  1, device   0, function  2:
    Class 0604: PCI device 8086:0332 (rev 6).
      Master Capable.  No bursts.  Min Gnt=7.
  Bus  2, device  14, function  0:
    Class 0104: PCI device 1028:0013 (rev 6).
      IRQ 209.
      Master Capable.  Latency=64.  Min Gnt=128.
      Prefetchable 32 bit memory at 0xd80f0000 [0xd80fffff].
      Non-prefetchable 32 bit memory at 0xdfde0000 [0xdfdfffff].
  Bus  5, device   0, function  0:
    Class 0604: PCI device 8086:0329 (rev 9).
      Master Capable.  No bursts.  Min Gnt=7.
  Bus  5, device   0, function  2:
    Class 0604: PCI device 8086:032a (rev 9).
      Master Capable.  No bursts.  Min Gnt=7.
  Bus  6, device   7, function  0:
    Class 0200: PCI device 8086:1076 (rev 5).
      IRQ 217.
      Master Capable.  Latency=32.  Min Gnt=255.
      Non-prefetchable 32 bit memory at 0xdfae0000 [0xdfafffff].
      I/O at 0xecc0 [0xecff].
  Bus  7, device   8, function  0:
    Class 0200: PCI device 8086:1076 (rev 5).
      IRQ 225.
      Master Capable.  Latency=32.  Min Gnt=255.
      Non-prefetchable 32 bit memory at 0xdf8e0000 [0xdf8fffff].
      I/O at 0xdcc0 [0xdcff].
  Bus  9, device   5, function  0:
    Class ff00: PCI device 1028:0011 (rev 0).
      IRQ 145.
      Master Capable.  Latency=32.
      Prefetchable 32 bit memory at 0xd7fff000 [0xd7ffffff].
      I/O at 0xccf8 [0xccff].
      I/O at 0xcce8 [0xccef].
  Bus  9, device   5, function  1:
    Class ff00: PCI device 1028:0012 (rev 0).
      IRQ 153.
      Master Capable.  Latency=32.
      Non-prefetchable 32 bit memory at 0xdf5ff000 [0xdf5fffff].
      I/O at 0xcc80 [0xccbf].
      Prefetchable 32 bit memory at 0xd7f00000 [0xd7f7ffff].
  Bus  9, device   5, function  2:
    Class ff00: PCI device 1028:0014 (rev 0).
      Master Capable.  Latency=32.
  Bus  9, device   6, function  0:
    Class 0101: PCI device 1095:0680 (rev 2).
      IRQ 161.
      Master Capable.  Latency=32.
      I/O at 0xccf0 [0xccf7].
      I/O at 0xcce4 [0xcce7].
      I/O at 0xccd8 [0xccdf].
      I/O at 0xccd0 [0xccd3].
      I/O at 0xcc70 [0xcc7f].
      Non-prefetchable 32 bit memory at 0xdf5fec00 [0xdf5fecff].
  Bus  9, device  13, function  0:
    Class 0300: PCI device 1002:5159 (rev 0).
      IRQ 129.
      Master Capable.  Latency=32.  Min Gnt=8.
      Prefetchable 32 bit memory at 0xc8000000 [0xcfffffff].
      I/O at 0xc800 [0xc8ff].
      Non-prefetchable 32 bit memory at 0xdf5e0000 [0xdf5effff].


Comment 5 Marc Mondragon 2005-10-19 21:25:40 UTC
Not sure if my problem is truly related, but it seems to be pretty close. I've
recently upgraded some of my RHEL 4 machines from 2.6.9-11 to 2.6.9-22 and I'm
seeing some odd behaviour. In top, the per-process utilization seems to be much
higher on -22 than on -11. I've attached some screenshots to show the
difference; the process in question is TradeManager.

Comment 6 Marc Mondragon 2005-10-19 21:34:35 UTC
Created attachment 120175 [details]
Screenshot of TradeManager on 2.6.9-11

This shows Trade Manager running at 0.1 CPU % Util on 2.6.9-11

Comment 7 Marc Mondragon 2005-10-19 21:36:32 UTC
Created attachment 120176 [details]
Screenshot of TradeManager on 2.6.9-22

This shows TradeManager on a 2.6.9-22 machine ... much higher CPU Util.

Comment 8 Karel Zak 2005-10-20 12:33:46 UTC
To Dan (comment #4):
   You have to update to >= 2.6.9-22.EL (the RHEL4-U2 kernel). Your update to
2.6.9-11 is not enough.

To Marc (comment #5):
   I think it is correct that "per process utilization seems to be much
higher on -22". The old kernel doesn't include threads in the per-process CPU
utilization sum. I assume that your TradeManager is a multi-threaded process.


Comment 9 Marc Mondragon 2005-10-20 13:43:30 UTC
Karel,

You are correct, TradeManager is a multi-threaded app, and we were aware of the
threads not being summed correctly. The issue is that during the day, while the
(stock) market is open, the CPU % util goes to 99.9% under -22 while it hovers
around 10% on -11. The machine doesn't act like it has a process consuming
99.9% of the CPU, but we are concerned that the information shown in top is
incorrect and might mask an actual problem when one occurs. This app is
sensitive to load, so we monitor metrics like this pretty closely. Is there an
alternative method?

Marc Mondragon

Comment 10 Karel Zak 2005-10-20 14:22:03 UTC
Marc, you can try using %CPU from the "ps" util output (but I expect almost the
same results as from "top").

To be honest, I'm not sure that monitoring per-process %CPU usage is a good
idea. I think "load average" or global %CPU is more comprehensive and useful.

Comment 11 Marc Mondragon 2005-10-20 14:58:34 UTC
I agree that the per-process %CPU util isn't the best metric, and we don't
monitor it in that fashion. We watch load average, and the load average went up
after we installed the -22 kernel. When we investigated, the 99.9% CPU util
jumped out at us. Is it fair to say that on -22 the load average from
top/uptime and the %CPU from ps auwwx are more accurate than the per-process
%CPU from top? How about under -11?


Comment 12 Marc Mondragon 2005-10-20 14:59:41 UTC
One more thing -- when you say global %CPU what do you mean exactly?

Comment 13 Karel Zak 2005-10-20 15:15:04 UTC
It seems you have a problem with the U2 kernel if your load average went up
after the update... I think you should file a separate bug report against the
kernel component.

By the way, you don't need top/uptime; you can check the values directly with
"cat /proc/loadavg".
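
For monitoring scripts, /proc/loadavg is easy to parse directly. A minimal sketch, assuming the usual five-field layout (1-, 5- and 15-minute averages, running/total tasks, last PID); parse_loadavg is a hypothetical helper name, not part of any tool mentioned in this report.

```python
def parse_loadavg(text: str) -> dict:
    """Parse the contents of /proc/loadavg into named values.

    Assumed layout: "<1min> <5min> <15min> <running>/<total> <last_pid>".
    """
    one, five, fifteen, tasks, last_pid = text.split()
    running, total = tasks.split("/")
    return {
        "load1": float(one),
        "load5": float(five),
        "load15": float(fifteen),
        "running": int(running),
        "total": int(total),
        "last_pid": int(last_pid),
    }


if __name__ == "__main__":
    # Only works on systems that provide /proc/loadavg.
    with open("/proc/loadavg") as handle:
        print(parse_loadavg(handle.read()))
```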

"global %CPU"... see top output :-)

top - 17:19:35 up 4 days, 20:09,  3 users,  load average: 0.13, 0.23, 0.20
Tasks: 126 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.7% us,  0.3% sy,  0.1% ni, 97.6% id,  0.3% wa,  0.0% hi,  0.0% si
^^^^^^^
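
The global figures on the Cpu(s) line can also be derived without top. This is a rough sketch, assuming the aggregate "cpu" line of /proc/stat carries at least seven counters in the order user, nice, system, idle, iowait, irq, softirq; the function names and the sample counters in the usage note are illustrative only.

```python
# Names follow top's labels, listed in /proc/stat column order (assumed).
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si")


def parse_cpu_line(line: str) -> list:
    """Extract the first seven counters from the aggregate 'cpu' line."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in parts[1:8]]


def cpu_percentages(before: list, after: list) -> dict:
    """Percentage of time spent in each state between two samples."""
    deltas = [new - old for old, new in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in zip(FIELDS, deltas)}
```

Taking two samples a few seconds apart and passing them to cpu_percentages() gives the machine-wide split; 100 minus the "id" value is the overall busy percentage, independent of how top attributes time to individual processes.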

Comment 14 Marc Mondragon 2005-10-20 18:40:28 UTC
OK that is what I thought but I wasn't completely sure.  Final question: any
thoughts on why the per process CPU number is so far off?  When I look at my
machine with ps the CPU util is around 40%.  In top it shows 99%.
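
One possible contributor to the gap, not confirmed anywhere in this report: the procps man pages describe ps %CPU as CPU time divided by the time the process has been running, while top reports the share of CPU time since its last screen update. The sketch below only illustrates that arithmetic; the function names and all numbers are hypothetical.

```python
def lifetime_cpu_percent(cpu_seconds: float, elapsed_seconds: float) -> float:
    """ps-style figure: total CPU time over the whole life of the process."""
    return 100.0 * cpu_seconds / elapsed_seconds


def interval_cpu_percent(cpu_before: float, cpu_after: float,
                         interval_seconds: float) -> float:
    """top-style figure: CPU time consumed during one refresh interval."""
    return 100.0 * (cpu_after - cpu_before) / interval_seconds


if __name__ == "__main__":
    # Hypothetical process: up for 10 hours, 4 hours of CPU time in total,
    # but busy for 2.97 s of the most recent 3 s refresh interval.
    print(lifetime_cpu_percent(14400.0, 36000.0))
    print(interval_cpu_percent(14397.03, 14400.0, 3.0))
```

With these made-up numbers the lifetime average is 40% while the last interval is about 99%, so both tools could be reporting correctly on a process that idles overnight and is busy during market hours.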

Comment 15 Karel Zak 2005-10-24 13:25:04 UTC
No idea -- bug or reality :-) We have to wait for more exact reports.

Comment 17 Albert Cahalan 2007-05-28 04:18:41 UTC
Transient processes will do this to you. Top looks, the transient process is
born and dies, and top looks again. Top never even sees the transient process.
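
The effect can be illustrated with two per-process snapshots. This is a simplified sketch, not how top is implemented: it assumes each snapshot maps PID to cumulative CPU ticks, ignores PID reuse, and uses made-up numbers. unattributed_ticks is a hypothetical helper.

```python
def unattributed_ticks(before: dict, after: dict, busy_ticks: int) -> int:
    """Busy ticks in the interval that no sampled process accounts for.

    before/after: {pid: cumulative CPU ticks} from two consecutive samples.
    busy_ticks:   machine-wide non-idle ticks over the same interval.
    """
    seen = 0
    for pid, ticks in after.items():
        # A process that is new in 'after' contributes all of its ticks.
        seen += ticks - before.get(pid, 0)
    # Processes that started and exited between the two samples appear in
    # neither snapshot, so their CPU time only shows up in busy_ticks.
    return busy_ticks - seen


if __name__ == "__main__":
    # One long-running process used 10 ticks; the machine was busy for 150.
    print(unattributed_ticks({100: 500}, {100: 510}, 150))
```

A large unattributed remainder from sample to sample is consistent with short-lived processes doing the work between refreshes.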

There is a top-like tool called "atop" that solves this. Unfortunately it must
run as root because it uses the more-invasive BSD accounting mechanism rather
than the /proc filesystem.


Comment 18 Tomas Smetana 2007-11-28 13:12:39 UTC
Based on the last comment and no activity in the past: closing.