Bug 797246

Summary: Top cannot add
Product: Fedora
Component: procps
Version: 17
Status: CLOSED WONTFIX
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: jeff <jcmj>
Assignee: Jaromír Cápík <jcapik>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: fedora, jcapik, jonathan, kzak, mluscon, ovasik
Doc Type: Bug Fix
Last Closed: 2013-08-01 17:09:56 UTC

Description jeff 2012-02-24 16:51:00 UTC
Description of problem: top does not display correct idle time


Version-Release number of selected component (if applicable):
Fedora 16 3.2.7-1 kernel

How reproducible:
easy

Steps to Reproduce:
1. run top
2. notice idle time as compared to running processes
Actual results:
top
top - 10:42:15 up 21:09,  6 users,  load average: 0.24, 0.25, 0.52
Tasks: 306 total,   1 running, 305 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 47.5%id,  0.0%wa, 52.5%hi,  0.0%si,  0.0%st
Mem:   2052684k total,  1912736k used,   139948k free,     3780k buffers
Swap:  2326524k total,  1329092k used,   997432k free,   112848k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                        
 2900 jeff      20   0 1899m 899m  16m S  3.3 44.9   2:16.46 firefox                                                                                        
 2391 jeff      20   0 1226m 9948 3564 S  1.3  0.5   3:12.41 knotify4                                                                                       
 4126 jeff      20   0 1267m 8200 3828 S  1.0  0.4   0:34.20 knotify4                                                                                       
 2507 jeff      20   0 11364  812  608 S  0.3  0.0   0:01.33 ksysguardd                                                                                     
 2527 root      20   0  534m 2828 1476 S  0.3  0.1   0:00.14 /usr/sbin/httpd                                                                                
 3931 jeff      20   0  650m  23m  12m S  0.3  1.2   0:04.51 konsole                                                                                        
 4125 jeff      20   0 1559m  58m  15m S  0.3  2.9   1:27.76 plasma-desktop                                                                                 
    1 root      20   0 39796 2804 1420 S  0.0  0.1   0:01.21 systemd       


Expected results: Idle time should be 93.2, not 47.5


Additional info:

Comment 1 Jaromír Cápík 2012-03-20 18:33:47 UTC
Hello Jeff.

I believe the value is correct.

47.5%id
52.5%hi

As you can see, the rest is "consumed" by hardware interrupts. This might be caused by a broken device on the bus or by your network card during some network issue (or DoS attack).
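In your sample the Cpu(s) line still sums to 100: 0.0 (us) + 0.0 (sy) + 0.0 (ni) + 47.5 (id) + 0.0 (wa) + 52.5 (hi) + 0.0 (si) + 0.0 (st) = 100.0, so nothing is unaccounted for there; the missing idle time is simply booked under hi.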
Please, check your hardware and let me know.

Thank you.

Regards,
Jaromir.

Comment 2 jeff 2012-03-24 14:27:49 UTC
OK, so how would I tell which piece of hardware is generating the interrupts?  Network traffic at the time was nil.

Thanks

If you add up the %CPU column you get 6.8; subtracting that from 100 leaves you with 93.2 idle time.
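
For reference, one rough way to total that column from a single batch snapshot (a sketch only; it assumes the default procps field layout, where %CPU is the 9th column and the summary header takes 7 lines, which depends on the top configuration):

  top -b -n 1 | awk 'NR > 7 { sum += $9 } END { printf "total %%CPU: %.1f\n", sum }'

Note that the per-process figures and the Cpu(s) summary line are sampled over slightly different intervals, so the two won't match exactly.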

Comment 3 Jaromír Cápík 2012-03-27 15:05:22 UTC
Could you please attach the output of the following command?

( cat /proc/interrupts ; sleep 5 ; cat /proc/interrupts ) > interrupts-diff.txt

It's going to take 5 seconds to finish.
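
A rough sketch (assuming a single-CPU guest, i.e. one count column in /proc/interrupts) of how to see which counters actually moved between the two snapshots in interrupts-diff.txt:

  awk '/CPU0/ { half++; next }
       half == 1 { first[$1] = $2 }
       half == 2 && ($2 - first[$1]) != 0 { printf "%-6s +%d\n", $1, $2 - first[$1] }' interrupts-diff.txt

With more CPUs the per-CPU columns would have to be summed first.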

... but you're right, that discrepancy with 6.8% lost somewhere looks really suspicious ...

May I also know something about your hardware? Could you please describe your system briefly?

Thanks in advance.

Regards,
Jaromir.

Comment 4 jeff 2012-03-28 12:31:11 UTC
Jaromir,

Cool, I like learning something new, and it was so easy.

          CPU0       
  0:        151   IO-APIC-edge      timer
  1:       1143   IO-APIC-edge      i8042
  8:          1   IO-APIC-edge      rtc0
  9:          0   IO-APIC-fasteoi   acpi
 12:        221   IO-APIC-edge      i8042
 14:      50479   IO-APIC-edge      ata_piix
 15:          0   IO-APIC-edge      ata_piix
 19:      28461   IO-APIC-fasteoi   virtio0
 20:       6156   IO-APIC-fasteoi   vboxguest
 23:        193   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb2
 40:     272896   PCI-MSI-edge      ahci
 41:        205   PCI-MSI-edge      snd_hda_intel
NMI:          0   Non-maskable interrupts
LOC:    6254622   Local timer interrupts
SPU:          0   Spurious interrupts
PMI:          0   Performance monitoring interrupts
IWI:          0   IRQ work interrupts
RTR:          0   APIC ICR read retries
RES:          0   Rescheduling interrupts
CAL:          0   Function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
THR:          0   Threshold APIC interrupts
MCE:          0   Machine check exceptions
MCP:        172   Machine check polls
ERR:          0
MIS:          0
           CPU0       
  0:        151   IO-APIC-edge      timer
  1:       1143   IO-APIC-edge      i8042
  8:          1   IO-APIC-edge      rtc0
  9:          0   IO-APIC-fasteoi   acpi
 12:        221   IO-APIC-edge      i8042
 14:      50485   IO-APIC-edge      ata_piix
 15:          0   IO-APIC-edge      ata_piix
 19:      28467   IO-APIC-fasteoi   virtio0
 20:       6157   IO-APIC-fasteoi   vboxguest
 23:        193   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb2
 40:     272960   PCI-MSI-edge      ahci
 41:        205   PCI-MSI-edge      snd_hda_intel
NMI:          0   Non-maskable interrupts
LOC:    6255551   Local timer interrupts
SPU:          0   Spurious interrupts
PMI:          0   Performance monitoring interrupts
IWI:          0   IRQ work interrupts
RTR:          0   APIC ICR read retries
RES:          0   Rescheduling interrupts
CAL:          0   Function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
THR:          0   Threshold APIC interrupts
MCE:          0   Machine check exceptions
MCP:        172   Machine check polls
ERR:          0
MIS:          0


Running on a 4-core AMD; the host OS is OpenIndiana, the guest is Fedora 16 (kernel 3.3) in VirtualBox.

Thanks
Jeff

Comment 5 Jaromír Cápík 2012-03-28 17:22:05 UTC
Hello Jeff.

That's strange.
I don't see any interrupt peaks. Was the issue present during the test?

If so ... could you please also do a similar thing with /proc/stat?

  ( cat /proc/stat ; sleep 1; cat /proc/stat ) > proc-stat-diff.txt

... or simply without redirection if you want to paste it as a comment:

  cat /proc/stat ; sleep 1; cat /proc/stat


I've never noticed a similar issue in VBox, but we experienced something similar with C-states, when Nehalem CPUs went into a sleep state to save power and in that state didn't even produce idle ticks. That caused the final sum to be less than 100%. Maybe a similar behaviour is achievable with some VBox build configuration. But let's wait for the /proc/stat results.

Thanks,
Jaromir.

Comment 6 jeff 2012-03-29 13:04:32 UTC
Hey Jaromir,

Yes, iotop showed kswapd going nuts, no idea why, but that was a few kernels ago.

cat proc*txt
cpu  82446 3931 123793 12673720 138313 377677 12130 0 0 0
cpu0 82446 3931 123793 12673720 138313 377677 12130 0 0 0
intr 20378602 151 1143 0 0 0 0 0 0 1 0 0 0 221 0 136743 0 0 0 0 166245 15579 0 0 193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 865980 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 38898088
btime 1332886151
processes 136803
procs_running 10
procs_blocked 0
softirq 13383560 0 8998095 2 225621 757810 0 827 0 73521 3327684
cpu  82447 3931 123796 12673789 138318 377692 12130 0 0 0
cpu0 82447 3931 123796 12673789 138318 377692 12130 0 0 0
intr 20378909 151 1143 0 0 0 0 0 0 1 0 0 0 221 0 136745 0 0 0 0 166250 15579 0 0 193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 866031 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 38898697
btime 1332886151
processes 136805
procs_running 1
procs_blocked 0
softirq 13384163 0 8998241 2 225626 757841 0 827 0 73521 3328105


Will check how I have sleep set in VBox; I believe I turned that off for another issue a while back with the OpenIndiana host.

Comment 7 Jaromír Cápík 2012-03-29 15:07:58 UTC
Hello Jeff.

The values you sent correspond to the following result:

1%us, 6% system, 0%ni, 69%id, 5%wa, 15%hi, 0%si

Even if the interval was longer than 1 second, the final sum is only 96 (lower than 100) ... the values usually represent 100 Hz ticks, therefore they can be read as percentages. So at least 4 ticks got lost somewhere. That might point to instability of the clock or to a completely different issue. Anyway, the number of hardware interrupts is still too high. It's usually around 2% on my workstations and around 5% on my virtual machines.
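
For illustration, a minimal sketch of this kind of evaluation, assuming a 1-second interval, a 100 Hz tick rate and a single CPU, so that the raw tick deltas of the aggregate "cpu" line in /proc/stat can be read directly as percentages (field order: user, nice, system, idle, iowait, irq, softirq, steal):

  awk '$1 == "cpu" { n++; for (i = 2; i <= 9; i++) v[n, i] = $i }
       END {
           split("us ni sy id wa hi si st", lbl, " ")
           for (i = 2; i <= 9; i++) {
               d = v[2, i] - v[1, i]; sum += d
               printf "%d%%%s ", d, lbl[i - 1]
           }
           printf " (sum = %d)\n", sum
       }' proc-stat-diff.txt

A sum below 100 is exactly the "lost ticks" case described above; on SMP or with a different HZ value the deltas would need to be scaled accordingly.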

I don't know if you checked the top output while collecting the results ... do my results correspond to what top displayed?

If so, then the problem probably lies in one of the following:
1.) guest kernel
2.) VBox (maybe platform dependent issue)
3.) host OS.

Please, let me know.

Regards,
Jaromir.

Comment 8 jeff 2012-03-29 22:45:53 UTC
Hey Jaromir,

Ran it while top was running, and user time went briefly to 0.

Before

cat proc-stat-diff.txt
cpu  110523 4706 169950 15806004 153347 484663 15454 0 0 0
cpu0 110523 4706 169950 15806004 153347 484663 15454 0 0 0
intr 26614425 151 1490 0 0 0 0 0 0 1 0 0 0 2693 0 170847 0 0 0 0 232917 37695 0 0 193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1060901 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 51965086
btime 1332886151
processes 185477
procs_running 4
procs_blocked 0
softirq 17125833 0 11672536 2 318708 921992 0 1162 0 113399 4098034
cpu  110526 4706 169954 15806077 153355 484669 15454 0 0 0
cpu0 110526 4706 169954 15806077 153355 484669 15454 0 0 0
intr 26614841 151 1491 0 0 0 0 0 0 1 0 0 0 2693 0 170847 0 0 0 0 232917 37695 0 0 193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1060984 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 51965904
btime 1332886151
processes 185479
procs_running 3
procs_blocked 1
softirq 17126080 0 11672693 2 318708 922052 0 1163 0 113400 4098062


During

cpu  110566 4706 170041 15806998 154107 484791 15469 0 0 0
cpu0 110566 4706 170041 15806998 154107 484791 15469 0 0 0
intr 26625068 151 1514 0 0 0 0 0 0 1 0 0 0 2693 0 170869 0 0 0 0 232941 37697 0 0 193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1064439 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 51985427
btime 1332886151
processes 185521
procs_running 9
procs_blocked 0
softirq 17134332 0 11676026 2 318742 924666 0 1176 0 113424 4100296
cpu  110567 4706 170046 15807041 154143 484797 15470 0 0 0
cpu0 110567 4706 170046 15807041 154143 484797 15470 0 0 0
intr 26625737 151 1515 0 0 0 0 0 0 1 0 0 0 2693 0 170869 0 0 0 0 232946 37697 0 0 193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1064788 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 51986301
btime 1332886151
processes 185525
procs_running 1
procs_blocked 2
softirq 17134792 0 11676189 2 318747 924890 0 1177 0 113426 4100361

Comment 9 Jaromír Cápík 2012-03-30 10:23:09 UTC
Hello Jeff.

The evaluation is a dynamic process that can return completely different values with each cycle, and therefore it's very difficult to compare the values returned by top with the values sampled by the provided shell command unless they're stable for a longer period of time.
Anyway, according to my evaluation, the values returned by the kernel correspond to the following ...

before
3%us, 0%ni, 4%sy, 73%id, 8%wa, 6%hi, 0%si

during
1%us, 0%ni, 5%sy, 43%id, 36%wa, 6%hi, 1%si

As you can see, there was an io_wait peak and that affected the idle counter quite significantly. Please note that the virtual machine may distort the kernel counters quite considerably. Moreover, the evaluation logic is quite simple/clear and has been tested/discussed many times, so it's very unlikely that there's something wrong with it. All counter-related issues I remember were caused by the kernel, because it does the hardest part of the job when it comes to stats.

At the moment I believe that the top command behaves correctly according to what the kernel exports into the /proc/stat file.

If you believe that the values in /proc/stat are incorrect, then we can change the component to kernel (but I believe this has a lot to do with the virtual machine and its resource control).

Please, let me know.

Regards,
Jaromir.

Comment 10 Jaromír Cápík 2012-03-30 10:47:08 UTC
Maybe I should also mention that the counters represent time units "consumed" by the processes, the idle state, etc., rather than work units, and that alone causes major distortions on hardware with frequency scaling. If the host OS / virtualization technology gives the guest OS a different amount of resources each second, the guest OS is not aware of that; the same amount of work then produces a different number of time units in the stats, so a constant load looks like it is fluctuating. We've already requested new statistical counters for work units in the kernel, but this hasn't been implemented yet.

Comment 11 jeff 2012-03-30 21:37:19 UTC
Jaromir,


Thanks for the info, will keep my eye on it and see if I can catch anything else. Looking forward to being able to gather more detail when the new counters are implemented.

Comment 12 Jaromír Cápík 2012-04-03 13:02:30 UTC
Hello Jeff.

These new counters are kind of a long-term effort. The feature has been requested, but it usually takes a while to get it into the upstream.
The question is whether you'd like me to switch the component to kernel or wait until you collect more information.

Regards,
Jaromir.

Comment 13 Fedora End Of Life 2013-01-16 22:48:15 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 14 Fedora End Of Life 2013-07-04 05:52:48 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 15 Fedora End Of Life 2013-08-01 17:10:02 UTC
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.