+++ This bug was initially created as a clone of Bug #167800 +++ Escalated to Bugzilla from IssueTracker -- Additional comment from tao on 2005-09-08 08:30 EST -- From User-Agent: XML-RPC Customer's description: Today, a long-running process (over 10,000 minutes) terminated. I run top all the time in cumulative time mode, and the init process displayed many hours of cumulative time (like over 50,000 hours, I imagine, but I do not remember exactly. When the long-running process exited, this tipped the total time for init into the days range, but now it displays the time as " -2226d init" The machine has 4 processors and has been up over two months, so the top of the top output is like this: 07:47:32 up 70 days, 1:42, 3 users, load average: 4.33, 4.33, 4.23 153 processes: 148 sleeping, 5 running, 0 zombie, 0 stopped CPU states: cpu user nice system irq softirq iowait idle total 0.7% 98.7% 0.3% 0.0% 0.0% 0.0% 0.0% cpu00 2.5% 96.5% 0.6% 0.0% 0.0% 0.0% 0.1% cpu01 0.5% 99.3% 0.1% 0.0% 0.0% 0.0% 0.0% cpu02 0.0% 99.8% 0.1% 0.0% 0.0% 0.0% 0.0% cpu03 0.0% 99.4% 0.5% 0.0% 0.0% 0.0% 0.0% Mem: 4100844k av, 4066792k used, 34052k free, 0k shrd, 130960k buff 3138880k actv, 609996k in_d, 76464k in_c Swap: 8193076k av, 204k used, 8192872k free 3499268k cached PID PPID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM CTIME COMMAND 20702 20697 boinc 39 19 15092 14M 1192 R N 24.8 0.3 68:04 setiathome_4.02 20705 20700 boinc 39 19 46236 45M 4024 R N 24.8 1.1 68:21 hadsm3um_4.13_i 20703 20697 boinc 39 19 15064 14M 1172 R N 24.6 0.3 68:05 setiathome_4.02 20704 20701 boinc 39 19 46380 45M 4016 R N 24.5 1.1 68:13 hadsm3um_4.13_i 6199 21684 root 15 0 35796 26M 6912 S 0.7 0.6 11:03 X 20407 1 jdbeyer 15 0 14920 14M 7916 S 0.1 0.3 0:24 gnome-terminal 1 0 root 15 0 496 496 440 S 0.0 0.0 -2226d init My machine runs RHEL 3 ES and is fully up to date. In particular, $ rpm -qf /usr/bin/top procps-2.0.17-13.7 This event sent from IssueTracker by jnansi issue 78854 -- Additional comment from tao on 2005-09-08 08:30 EST -- From User-Agent: XML-RPC I'm not sure how this could have occured. It seems like he would have had to have had the process running for years in order to overflow 31 bits. I've looked over the code, and see where the top code seems to convert the time values from ticks as an unsigned long long to an unsigned int (in seconds) and then to a signed int. This seems like a bad idea, so I've rolled a patch to make top use an unsigned long long type for all process time values. I've done some cursory testing on a fairly idle box that was newly installed and it seems to work, but it probably needs to be tested on a RHEL3 machine with some fairly significant uptime to make sure the conversion routine still works as expected. Ultimately though, this may have been due to a problem with how the kernel is reporting these values in the first place. Issue escalated to Support Engineering Group by: jlayton. jlayton assigned to issue for Production Support. Category set to: Applications Internal Status set to 'Waiting on SEG' Status set to: Waiting on Tech This event sent from IssueTracker by jnansi issue 78854 -- Additional comment from tao on 2005-09-08 08:30 EST -- From User-Agent: XML-RPC Asked him to send the contents of /proc/1/stat if the box is still in this state, so setting to waiting on support for now. Internal Status set to 'Waiting on Support' This event sent from IssueTracker by jnansi issue 78854 -- Additional comment from tao on 2005-09-08 08:30 EST -- From User-Agent: XML-RPC File uploaded: init.stat This event sent from IssueTracker by jnansi issue 78854 it_file 45558 -- Additional comment from tao on 2005-09-08 08:30 EST -- From User-Agent: XML-RPC Attaching the output from "cat /proc/1/stat". Customer's box is (fortunately) still in this state. The cumulative user time value here seems to be -1943179762. The the stat file seems to output this value as: kernel_timeval_to_clock_t(&task->cutime) kernel_timeval is a struct with sec and usec as unsigned ints, and the clock_t seems to boil down to a long int, so it looks like that's overflowing. If we want to collect any data from this guy's box, we'll probably want to get it soon before he reboots. Severity set to: Normal Priority set to: 3 This event sent from IssueTracker by jnansi issue 78854 -- Additional comment from tao on 2005-09-08 08:31 EST -- From User-Agent: XML-RPC File uploaded: linux-2.4.21-procstat-luint.patch This event sent from IssueTracker by jnansi issue 78854 it_file 45559 -- Additional comment from tao on 2005-09-08 08:31 EST -- From User-Agent: XML-RPC Ok, attaching proposed patch to fix this. kernel_timeval_to_clock_t returns an unsigned long. The sprintf routine in array.c that displays this file, however, was displaying this as a signed long. This patch changes the format for the cutime and cstime from %ld to %lu, which should prevent the overflow from occuring. This event sent from IssueTracker by jnansi issue 78854 -- Additional comment from jnansi on 2005-09-08 08:34 EST -- Created an attachment (id=118590) output of cat /proc/1/stat -- Additional comment from jnansi on 2005-09-08 08:34 EST -- Created an attachment (id=118591) Proposed patch -- Additional comment from staubach on 2005-10-07 13:21 EST -- Created an attachment (id=119715) Proposed patch -- Additional comment from staubach on 2005-10-07 13:26 EST -- The problem appears to be pretty much as described. The kernel accumulates the user and system times in the cutime and cstime elements of the task_struct. The proc file system prints these elements after converting them to "clock" ticks. These ticks are currently defined to be 100 per second. The problem was that although the original values are unsigned long entities, the converted values were being printed as signed long entities. Interestingly, the other "clock" tick entities were being printed as unsigned long entities already.
Created attachment 119724 [details] Proposed patch
The situation and changes for RHEL-4 are pretty much the same as those for RHEL-3, with one small exception. In addition to the fields, cutime and cstime, being printed as signed long integers instead of unsigned long integers, the field, it_real_value, was also being printed as a signed long integer even though it is an unsigned long integer before being converted to "clock" ticks.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html