Bug 458324

Summary: bogus %CPU values from top
Product: Fedora
Reporter: Jonathan Kamens <jik>
Component: procps
Assignee: Daniel Novotny <dnovotny>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: medium
Version: 10
CC: darrellpf, poelstra, vassili.gorshkov
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-11-23 09:59:04 UTC

Description Jonathan Kamens 2008-08-07 16:06:35 UTC
With everything current from rawhide, including kernel 2.6.27-0.226.rc1.git5.fc10.i686 and procps 3.2.7-20.fc9.i386, here's what I see when I run "top":

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 3769 jik       20   0 25976 8888 7532 S 143.2  0.4   1126:42 multiload-apple   
 3549 jik       20   0 46032 2044 1640 S 90.9  0.1 187:11.98 gvfs-fuse-daemo    
11045 jik       20   0  2560 1100  828 R  1.3  0.1   0:00.10 top                

Note the bogus %CPU values on the top two lines.

Comment 1 Tomas Smetana 2008-08-07 17:21:22 UTC
The values over 100 % are OK -- they mean the process consumes more than one CPU (core).  If you don't have a multiprocessor/multicore machine, then something is wrong.

How many (logical) CPUs does your system have?

Comment 2 Jonathan Kamens 2008-08-07 17:45:30 UTC
I have an HT CPU which pretends to have two CPUs.  "cat /proc/cpuinfo" gives me two of the blocks shown below.

I don't quite get how a single process can use more than one CPU at a time, but leaving that aside for the moment: since I have only two CPUs (or CPU-like things, since it's really only one CPU with HT), it seems to me that the percentages shouldn't be able to exceed 200%, and yet I regularly see processes listed at much higher percentages than that.

vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 1
cpu MHz         : 3016.881
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pebs bts pni monitor ds_cpl cid xtpr
bogomips        : 6033.76
clflush size    : 64
power management:

Comment 3 Tomas Smetana 2008-08-08 06:37:25 UTC
Please start top, press '1' and post the summary data that should contain the overall numbers for each logical CPU.

One process can have multiple threads and each of them can run on a different CPU.  That's how the usage goes over 100 %.

Comment 4 Jonathan Kamens 2008-08-08 19:09:26 UTC
top - 15:03:27 up  1:13,  1 user,  load average: 0.01, 0.03, 0.00
Tasks: 140 total,   1 running, 137 sleeping,   2 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1990292k total,   553932k used,  1436360k free,    50660k buffers
Swap:  2104496k total,        0k used,  2104496k free,   330116k cached

Perhaps there's no bug here, but there's certainly a change in behavior, though, because previously I rarely if ever saw top report a process with more than 100% CPU, but now I see it all the time.

Still, I find it hard to believe that the explanation is multiple threads.  Some other anomalies:

When I hit H to display all threads, the next time top updates the display there are a bunch of processes showing %CPU as 9999.9.  This seems to happen again when I switch back.

Watching the top output on an ongoing basis, I just saw rsyslogd report 9999.9% CPU, and imapd report over 2000% CPU for a number of updates in a row.  The former is clearly bogus.  I suppose the latter is possible, but it still seems quite odd.

I wonder if perhaps there's a bug in the code for amalgamating usage across all the threads in a process?

Comment 5 Tomas Smetana 2008-08-13 10:09:03 UTC
(In reply to comment #4)

> Perhaps there's no bug here, but there's certainly a change in behavior,
> though, because previously I rarely if ever saw top report a process with more
> than 100% CPU, but now I see it all the time.

I can't explain the behaviour you observe (I haven't reproduced it myself) by any change in the procps code, which hasn't changed much in months.  This may be caused by a kernel change, and I can't do much about that.  All in all, top only reads data from /proc, so it might be interesting to look at the raw values in /proc/<num>/stat.
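For reference, a minimal sketch of pulling the raw utime/stime jiffy counters out of a /proc/<pid>/stat line (field positions are from proc(5); the sample line below is invented for illustration, not taken from the reporter's machine):

```python
# Extract utime (field 14) and stime (field 15) from a /proc/<pid>/stat line.
# The comm field (2) may contain spaces, so split after the closing ')'.
def utime_stime(stat_line):
    rest = stat_line.rsplit(')', 1)[1].split()
    # rest[0] is field 3 (state), so utime/stime land at rest[11]/rest[12]
    return int(rest[11]), int(rest[12])

# A made-up sample line:
sample = ("3769 (multiload-apple) S 1 3769 3769 0 -1 4202496 "
          "1234 0 0 0 4321 987 0 0 20 0 1 0 500 26599424 2222 "
          "4294967295 0 0 0 0 0 0 0 0 0 0 0 0 17 1 0 0")
print(utime_stime(sample))
```

Sampling these two counters twice and comparing the deltas against the interval length would show whether the bogus percentages come from the kernel's counters or from top's arithmetic.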

Comment 6 darrell pfeifer 2008-08-26 02:11:36 UTC
I have this behaviour too.

model name	: Intel(R) Core(TM)2 Duo CPU     T9300  @ 2.50GHz

I agree that the numbers are clearly bogus.  I can understand values up to 200% on a dual-core machine, but sometimes the values are well beyond that.

No matter how many times I run

ps -eo pcpu,comm | sort -n

I can't get any pcpu value to exceed even 100% while top is concurrently showing values past 100% for the same processes. I believe that ps is also reading the /proc values.

I don't recall this behaviour happening in 2.6.26... it only seems to have started with 2.6.27.

Comment 7 Tomas Smetana 2008-08-26 06:05:32 UTC
(In reply to comment #6)

> ps -eo pcpu,comm | sort -n
> 
> I can't get any pcpu value to exceed even 100% while top is concurrently
> showing values past 100% for the same processes. I believe that ps is also
> reading the /proc values.

It's true that top and ps both read the values from /proc but there is a difference: ps computes the %CPU as a CPU time divided by the time the process is running while top divides the CPU time by the time since the last screen update.  You must be very lucky to get comparably high values from both.

I will try to look into the kernel changes between 2.6.26 and 2.6.27 that could affect the /proc values.
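The two formulas described above can be illustrated with made-up jiffy counts (HZ and the sample numbers below are assumptions for the sketch, not measured values):

```python
HZ = 100  # jiffies per second; typical, but kernel-configuration dependent

def pcpu_ps(total_cpu_jiffies, seconds_since_start):
    # ps: total CPU time divided by the whole lifetime of the process
    return 100.0 * (total_cpu_jiffies / HZ) / seconds_since_start

def pcpu_top(delta_cpu_jiffies, seconds_since_last_update):
    # top: CPU time consumed since the previous screen refresh
    return 100.0 * (delta_cpu_jiffies / HZ) / seconds_since_last_update

# A process alive for 1000 s that used 300 s of CPU in total,
# but burned 270 jiffies during the last 3 s refresh interval:
print(round(pcpu_ps(30000, 1000), 1))   # 30.0
print(round(pcpu_top(270, 3), 1))       # 90.0
```

This is why a briefly busy process can show a high %CPU in top while ps reports a modest lifetime average for the very same /proc counters.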

Comment 8 darrell pfeifer 2008-08-26 15:44:08 UTC
The routine that fills out the proc values is

do_task_stat in fs/proc/array.c

There were changes from 2.6.26 to 2.6.27-rc4 for namespaces and for using seq_printf for buffering.  Neither of those appears to be a problem (seq_printf and sprintf share the same formatting routine).

I'll wander through the code a bit more but I don't see any obvious problems.

Comment 9 darrell pfeifer 2008-08-28 21:34:05 UTC
The values are back to being correct/reasonable in

2.6.27-0.284.rc4.git6.fc10.i686.PAE

Comment 10 Daniel Novotny 2008-10-20 14:15:35 UTC
Jonathan, did the new kernel fix this for you, or does the problem still occur?

Comment 11 Jonathan Kamens 2008-10-20 15:01:47 UTC
I am still seeing %CPU values from top higher than 100% for single-threaded applications and much higher than 200% for both single-threaded and multi-threaded applications (on a two-processor system), so no, this doesn't appear to be fixed.

I'm running kernel 2.6.27.2-23.rc1.fc10.i686 and procps-3.2.7-21.fc10.i386.

Comment 12 Bug Zapper 2008-11-26 02:42:00 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 13 Vassili Gorshkov 2009-04-21 16:09:56 UTC
I am also seeing a general discrepancy between the values in /proc/pid/stat and /proc/stat, with the user and system CPU values being understated in the latter.  The same discrepancy shows in top, with the sum of CPU usage across all processes exceeding the overall CPU utilization.  The problem shows for both individual CPUs and the total CPU for the box.

To reproduce:

Run an application with noticeable CPU consumption, note its PID (e.g. 1020), and sample the values in /proc/stat and /proc/1020/stat:

cat /proc/1020/stat /proc/stat | head -2 > t1
sleep 10
cat /proc/1020/stat /proc/stat | head -2 >> t1

Compute the jiffies delta for user and sys cpu for both the process and the host.  I am seeing the former being 1.5 to 2 times larger.
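The delta comparison described above can be sketched as follows (the jiffy counts are invented to mimic the reported 1.5x discrepancy; on a healthy kernel the host delta should be at least as large as any single process's delta, since the host counters include every process):

```python
def deltas(sample1, sample2):
    """Each sample is (proc_utime, proc_stime, host_user, host_sys) in jiffies."""
    proc_delta = (sample2[0] + sample2[1]) - (sample1[0] + sample1[1])
    host_delta = (sample2[2] + sample2[3]) - (sample1[2] + sample1[3])
    return proc_delta, host_delta

# Invented samples taken 10 s apart:
t1 = (4000, 1000, 90000, 20000)
t2 = (5200, 1300, 90800, 20200)
p, h = deltas(t1, t2)
print(p, h)        # 1500 1000 -- the process delta exceeds the host delta,
                   # which reproduces the inconsistency being reported
```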

I am running kernel 2.6.27.19-170.2.35.fc10.x86_64 #1 SMP Mon Feb 23 13:00:23 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

Comment 14 Bug Zapper 2009-11-18 08:15:06 UTC
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 15 Jonathan Kamens 2009-11-22 12:14:07 UTC
I don't think I'm seeing this anymore in rawhide.

Comment 16 Daniel Novotny 2009-11-23 09:59:04 UTC
OK, closing.  If you encounter this again, you can reopen the bug.