Description of problem: I am running a long simulation (32 procs for several days) and noticed that after a couple of days the process no longer shows in top (run without options). I can see from the output file of the simulation that the process runs as efficiently as it did in the beginning, though. If I run top -u eloranta (my username), I see the following:

top - 07:38:02 up 14 days, 12:07, 3 users, load average: 31.62, 31.61, 31.58
Tasks: 266 total, 1 running, 265 sleeping, 0 stopped, 0 zombie
Cpu(s): 97.4%us, 0.0%sy, 0.0%ni, 1.9%id, 0.0%wa, 0.7%hi, 0.0%si, 0.0%st
Mem: 82488412k total, 71671784k used, 10816628k free, 314936k buffers
Swap: 84574204k total, 0k used, 84574204k free, 66254464k cached

  PID USER      PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 5346 eloranta  20  0 19640 1452 1008 R  0.3  0.0  0:00.18 top
 2438 eloranta  20  0  123m 2172 1076 S  0.0  0.0  0:00.31 sshd
 2439 eloranta  20  0  121m 2652 1432 S  0.0  0.0  0:00.07 tcsh
 5303 eloranta  20  0  123m 2320 1116 S  0.0  0.0  0:00.10 sshd
 5304 eloranta  20  0  121m 2644 1424 S  0.0  0.0  0:00.09 tcsh
63867 eloranta  20  0 3571m  73m 1568 S  0.0  0.1 5130948h classical

Notice how top thinks that the process is not using any CPU, but in reality it is nicely getting all 32 CPUs and going as it was in the beginning (I monitor wall-clock time per iteration constantly). Also, ps uaxw on the process shows that the consumed CPU time is not changing:

eloranta 63867 3689759 0.0 2186252 75648 ? Rl May09 21525737:37 ./classical

So it appears that there is some kind of maximum for the reported consumed CPU time? This is quite an annoying bug, as it gave me chills that my long simulation had crashed...
Hello Jussi. May I ask you to collect some debug data? Once top starts showing the 0% load, please enter the following command (replacing the <pid> placeholder with the PID of the "classical" process). Then wait approximately 10 seconds and stop it by pressing CTRL+C:

while : ; do cat /proc/<pid>/stat; sleep 1; done > process-stat.txt

And please don't stop the "classical" process once you have the result; I might need more information after analysing it. Attach the process-stat.txt here, please. Thank you. Regards, Jaromir.
Maybe one more note. It would be nice to have a similar file recording the process state "shortly" before it falls to 0% ... it doesn't have to be shortly before it happens ... let's say one day before? Just don't do it immediately after the process starts. So, if possible, provide me with two files: one recording the stats before the issue appears and one after. Or, if you have enough disk space, you can start the recording shortly after starting the "classical" process and stop it when the issue appears. The recording takes nearly 200 bytes per second (about 17 MB/day). I believe your drive can handle that. Please let me know. Regards, Jaromir.
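For anyone wanting to sanity-check such a recording offline: the cumulative user and system CPU time live in fields 14 (utime) and 15 (stime) of each /proc/<pid>/stat line, counted in clock ticks, per proc(5). A minimal parsing sketch; the parse_cputime helper and the sample line are hypothetical, not part of the requested procedure:

```python
def parse_cputime(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line.

    Field 2 (comm) is wrapped in parentheses and may itself contain spaces,
    so split on the *last* ')' before tokenizing the remaining fields.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime is field 14, stime is field 15,
    # i.e. indices 11 and 12 relative to the state field.
    return int(rest[11]), int(rest[12])

# Hypothetical sample line resembling the reporter's process:
sample = ("63867 (classical) R 1 63867 63867 0 -1 4202496 "
          "12345 0 0 0 987654321 12345 0 0 20 0 33 0 100 0 0")
utime, stime = parse_cputime(sample)
print(utime, stime)  # prints 987654321 12345
```

If these two counters stop increasing between samples while the process is demonstrably busy, the accounting (not the process) is at fault.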
Created attachment 584982 [details] requested stat file for the currently running program
Hi Jussi. And that's it. The counters are not changing. So, if you're sure the process takes the whole CPU, then this is very likely a kernel issue. Would you like me to change the component to kernel? Thanks and have a nice day. Regards, Jaromir.
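The gigantic reported times (5130948h in top, 21525737:37 in ps) together with the frozen counters look like an overflow in cumulative CPU-time accounting rather than a dead process. As a back-of-the-envelope illustration only (the kernel's actual internal representation is not established in this report), a hypothetical 32-bit tick counter at the common 100 Hz USER_HZ wraps after about 497 days of CPU time, which 32 busy CPUs accumulate in roughly 15 wall-clock days — the same order of magnitude as the ~2 weeks of uptime shown above:

```python
USER_HZ = 100  # common clock-tick rate exported to userspace (assumption)
CPUS = 32      # the simulation keeps 32 CPUs busy

wrap_seconds = 2**32 / USER_HZ           # CPU-seconds until a 32-bit tick counter wraps
wrap_days_single = wrap_seconds / 86400  # ~497 days for one busy CPU
wrap_days_all = wrap_days_single / CPUS  # ~15.5 wall-clock days with 32 busy CPUs

print(round(wrap_days_single, 1), round(wrap_days_all, 1))  # prints 497.1 15.5
```

This is only meant to show that a counter-width problem is consistent with the observed timescale; confirming the real cause needs the kernel folks.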
Yes, it is running fine, as evidenced by its output. It outputs data every iteration and prints the wall-clock time per iteration. The load average also stays at 32 (the process uses 32 CPUs). Yes, change this to kernel; it would be nice to get this sorted out.
Ok, thank you. Changing to kernel then.
What kernel version was this with? Do the recent 3.4 or 3.5 updates fix it?
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version. Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.

If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug; a fresh bug referencing this bug in a comment is sufficient.)
With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.