Description of problem: I am running a long simulation (32 procs for several days) and noticed that after a couple of days the process no longer shows in top (run without options). I can see from the output file of the simulation that the process runs as efficiently as it did in the beginning, though. If I run top -u eloranta (my username), I see the following:

top - 07:38:02 up 14 days, 12:07, 3 users, load average: 31.62, 31.61, 31.58
Tasks: 266 total, 1 running, 265 sleeping, 0 stopped, 0 zombie
Cpu(s): 97.4%us, 0.0%sy, 0.0%ni, 1.9%id, 0.0%wa, 0.7%hi, 0.0%si, 0.0%st
Mem: 82488412k total, 71671784k used, 10816628k free, 314936k buffers
Swap: 84574204k total, 0k used, 84574204k free, 66254464k cached

  PID USER      PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 5346 eloranta  20  0 19640 1452 1008 R  0.3  0.0  0:00.18 top
 2438 eloranta  20  0  123m 2172 1076 S  0.0  0.0  0:00.31 sshd
 2439 eloranta  20  0  121m 2652 1432 S  0.0  0.0  0:00.07 tcsh
 5303 eloranta  20  0  123m 2320 1116 S  0.0  0.0  0:00.10 sshd
 5304 eloranta  20  0  121m 2644 1424 S  0.0  0.0  0:00.09 tcsh
63867 eloranta  20  0 3571m  73m 1568 S  0.0  0.1 5130948h classical

Notice how top thinks that the process is not using any CPU, but in reality it is nicely getting all 32 CPUs and going as it was in the beginning (I monitor wall-clock time per iteration constantly). Also, ps uaxw on the process shows that the consumed CPU time is not changing:

eloranta 63867 3689759 0.0 2186252 75648 ? Rl May09 21525737:37 ./classical

So it appears that there is some kind of maximum for the reported consumed CPU time? This is quite an annoying bug, as it gave me chills that my long simulation had crashed...
Hello Jussi. May I ask you to collect some debug data? Once top starts showing the 0% load, please enter the following command (replacing the <pid> placeholder with the PID of the "classical" process). Then wait approximately 10 seconds and stop it by pressing CTRL+C:

while : ; do cat /proc/<pid>/stat; sleep 1; done > process-stat.txt

And please don't stop the "classical" process once you have the result; I might need more information after analysing it. Attach the process-stat.txt here, please. Thank you. Regards, Jaromir.
Maybe one more note. It would be nice to have a similar file recording the process state "shortly" before it falls to 0% ... it doesn't have to be shortly before it happens ... let's say one day before? Just don't do it immediately after the process starts. So, if possible, provide me with two files: one recording the stats before the issue appears and one after. Or, if you have enough disk space, you can start the recording shortly after starting the "classical" process and stop it when the issue appears. The recording takes nearly 200 bytes per second (about 17 MB/day). I believe your drive can handle that. Please let me know. Regards, Jaromir.
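For anyone wanting to sanity-check such a recording offline: the cumulative user and system CPU time live in fields 14 (utime) and 15 (stime) of each /proc/<pid>/stat line, counted in clock ticks, per proc(5). A minimal parsing sketch; the parse_cputime helper and the sample line are hypothetical, not part of the requested procedure:

```python
def parse_cputime(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line.

    Field 2 (comm) is wrapped in parentheses and may itself contain spaces,
    so split on the *last* ')' before tokenizing the remaining fields.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime is field 14, stime is field 15,
    # i.e. indices 11 and 12 relative to the state field.
    return int(rest[11]), int(rest[12])

# Hypothetical sample line resembling the reporter's process:
sample = ("63867 (classical) R 1 63867 63867 0 -1 4202496 "
          "12345 0 0 0 987654321 12345 0 0 20 0 33 0 100 0 0")
utime, stime = parse_cputime(sample)
print(utime, stime)  # prints 987654321 12345
```

If these two counters stop increasing between samples while the process is demonstrably busy, the accounting (not the process) is at fault.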
Created attachment 584982 [details] requested stat file for the currently running program
Hi Jussi. And that's it. The counters are not changing. So, if you're sure the process takes the whole CPU, then this is very likely a kernel issue. Would you like me to change the component to kernel? Thanks and have a nice day. Regards, Jaromir.
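The gigantic reported times (5130948h in top, 21525737:37 in ps) together with the frozen counters look like an overflow in cumulative CPU-time accounting rather than a dead process. As a back-of-the-envelope illustration only (the kernel's actual internal representation is not established in this report), a hypothetical 32-bit tick counter at the common 100 Hz USER_HZ wraps after about 497 days of CPU time, which 32 busy CPUs accumulate in roughly 15 wall-clock days — the same order of magnitude as the ~2 weeks of uptime shown above:

```python
USER_HZ = 100  # common clock-tick rate exported to userspace (assumption)
CPUS = 32      # the simulation keeps 32 CPUs busy

wrap_seconds = 2**32 / USER_HZ           # CPU-seconds until a 32-bit tick counter wraps
wrap_days_single = wrap_seconds / 86400  # ~497 days for one busy CPU
wrap_days_all = wrap_days_single / CPUS  # ~15.5 wall-clock days with 32 busy CPUs

print(round(wrap_days_single, 1), round(wrap_days_all, 1))  # prints 497.1 15.5
```

This is only meant to show that a counter-width problem is consistent with the observed timescale; confirming the real cause needs the kernel folks.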
Yes, it is running fine, as evidenced by its output. It outputs data every iteration and prints the wall-clock time per iteration. The load average also stays at 32 (the process uses 32 CPUs). Yes, change this to kernel; it would be nice to get this sorted out.
Ok, thank you. Changing to kernel then.
What kernel version was this with? Do the recent 3.4 or 3.5 updates fix it?
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version. Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.

If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug; a fresh bug referencing this bug in a comment is sufficient.)
With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.