Bug 170146 - CRM648268: kernel reporting init process cutime as very large negative value
Summary: CRM648268: kernel reporting init process cutime as very large negative value
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
: ---
Assignee: Peter Staubach
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 168429
TreeView+ depends on / blocked
 
Reported: 2005-10-07 18:58 UTC by Peter Staubach
Modified: 2007-11-30 22:07 UTC (History)
2 users (show)

Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-07 20:19:28 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Proposed patch (540 bytes, patch)
2005-10-07 19:27 UTC, Peter Staubach
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:808 0 normal SHIPPED_LIVE Important: kernel security update 2005-10-27 04:00:00 UTC
Red Hat Product Errata RHSA-2006:0132 0 qe-ready SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 3 2006-03-09 16:31:00 UTC

Description Peter Staubach 2005-10-07 18:58:22 UTC
+++ This bug was initially created as a clone of Bug #167800 +++

Escalated to Bugzilla from IssueTracker

-- Additional comment from tao on 2005-09-08 08:30 EST --
From User-Agent: XML-RPC

Customer's description:



Today, a long-running process (over 10,000 minutes) terminated. I run top all
the time in cumulative time mode, and the init process displayed many hours of
cumulative time (like over 50,000 hours, I imagine, but I do not remember
exactly. When the long-running process exited, this tipped the total time for
init into the days range, but now it displays the time as " -2226d init" 

 

The machine has 4 processors and has been up over two months, so the top of the
top output is like this: 

 

 07:47:32  up 70 days,  1:42,  3 users,  load average: 4.33, 4.33, 4.23 

153 processes: 148 sleeping, 5 running, 0 zombie, 0 stopped 

CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle 

           total    0.7%   98.7%    0.3%   0.0%     0.0%    0.0%    0.0% 

           cpu00    2.5%   96.5%    0.6%   0.0%     0.0%    0.0%    0.1% 

           cpu01    0.5%   99.3%    0.1%   0.0%     0.0%    0.0%    0.0% 

           cpu02    0.0%   99.8%    0.1%   0.0%     0.0%    0.0%    0.0% 

           cpu03    0.0%   99.4%    0.5%   0.0%     0.0%    0.0%    0.0% 

Mem:  4100844k av, 4066792k used,   34052k free,       0k shrd,  130960k buff 

                   3138880k actv,  609996k in_d,   76464k in_c 

Swap: 8193076k av,     204k used, 8192872k free                 3499268k cached 

                                                                               
               

  PID  PPID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  CTIME COMMAND 

20702 20697 boinc     39  19 15092  14M  1192 R N  24.8  0.3  68:04 setiathome_4.02 

20705 20700 boinc     39  19 46236  45M  4024 R N  24.8  1.1  68:21 hadsm3um_4.13_i 

20703 20697 boinc     39  19 15064  14M  1172 R N  24.6  0.3  68:05 setiathome_4.02 

20704 20701 boinc     39  19 46380  45M  4016 R N  24.5  1.1  68:13 hadsm3um_4.13_i 

 6199 21684 root      15   0 35796  26M  6912 S     0.7  0.6  11:03 X 

20407     1 jdbeyer   15   0 14920  14M  7916 S     0.1  0.3   0:24 gnome-terminal 

    1     0 root      15   0   496  496   440 S     0.0  0.0 -2226d init 

 

My machine runs RHEL 3 ES and is fully up to date. In particular, 

 

$ rpm -qf /usr/bin/top 

procps-2.0.17-13.7
This event sent from IssueTracker by jnansi
 issue 78854

-- Additional comment from tao on 2005-09-08 08:30 EST --
From User-Agent: XML-RPC

I'm not sure how this could have occured. It seems like he would have had
to have had the process running for years in order to overflow 31 bits. 



I've looked over the code, and see where the top code seems to convert the
time values from ticks as an unsigned long long to an unsigned int (in
seconds) and then to a signed int. This seems like a bad idea, so I've
rolled a patch to make top use an unsigned long long type for all process
time values. I've done some cursory testing on a fairly idle box that was
newly installed and it seems to work, but it probably needs to be tested on
a RHEL3 machine with some fairly significant uptime to make sure the
conversion routine still works as expected.



Ultimately though, this may have been due to a problem with how the kernel
is reporting these values in the first place.




Issue escalated to Support Engineering Group by: jlayton.
jlayton assigned to issue for Production Support.

Category set to: Applications

Internal Status set to 'Waiting on SEG'
Status set to: Waiting on Tech

This event sent from IssueTracker by jnansi
 issue 78854

-- Additional comment from tao on 2005-09-08 08:30 EST --
From User-Agent: XML-RPC

Asked him to send the contents of /proc/1/stat if the box is still in this
state, so setting to waiting on support for now.




Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by jnansi
 issue 78854

-- Additional comment from tao on 2005-09-08 08:30 EST --
From User-Agent: XML-RPC

File uploaded: init.stat

This event sent from IssueTracker by jnansi
 issue 78854
it_file 45558

-- Additional comment from tao on 2005-09-08 08:30 EST --
From User-Agent: XML-RPC

Attaching the output from "cat /proc/1/stat". Customer's box is
(fortunately) still in this state.



The cumulative user time value here seems to be -1943179762.



The the stat file seems to output this value as:



kernel_timeval_to_clock_t(&task->cutime)



kernel_timeval is a struct with sec and usec as unsigned ints, and the
clock_t seems to boil down to a long int, so it looks like that's
overflowing.



If we want to collect any data from this guy's box, we'll probably want
to get it soon before he reboots.



Severity set to: Normal
Priority set to: 3

This event sent from IssueTracker by jnansi
 issue 78854

-- Additional comment from tao on 2005-09-08 08:31 EST --
From User-Agent: XML-RPC

File uploaded: linux-2.4.21-procstat-luint.patch

This event sent from IssueTracker by jnansi
 issue 78854
it_file 45559

-- Additional comment from tao on 2005-09-08 08:31 EST --
From User-Agent: XML-RPC

Ok, attaching proposed patch to fix this.



kernel_timeval_to_clock_t returns an unsigned long. The sprintf routine in
array.c that displays this file, however, was displaying this as a signed
long. This patch changes the format for the cutime and cstime from %ld to
%lu, which should prevent the overflow from occuring.






This event sent from IssueTracker by jnansi
 issue 78854

-- Additional comment from jnansi on 2005-09-08 08:34 EST --
Created an attachment (id=118590)
output of cat /proc/1/stat


-- Additional comment from jnansi on 2005-09-08 08:34 EST --
Created an attachment (id=118591)
Proposed patch


-- Additional comment from staubach on 2005-10-07 13:21 EST --
Created an attachment (id=119715)
Proposed patch


-- Additional comment from staubach on 2005-10-07 13:26 EST --
The problem appears to be pretty much as described.  The kernel accumulates
the user and system times in the cutime and cstime elements of the task_struct.

The proc file system prints these elements after converting them to "clock"
ticks.  These ticks are currently defined to be 100 per second.  The problem
was that although the original values are unsigned long entities, the converted
values were being printed as signed long entities.

Interestingly, the other "clock" tick entities were being printed as unsigned
long entities already.

Comment 1 Peter Staubach 2005-10-07 19:27:36 UTC
Created attachment 119724 [details]
Proposed patch

Comment 2 Peter Staubach 2005-10-07 19:28:31 UTC
The situation and changes for RHEL-4 are pretty much the same as those for
RHEL-3, with one small exception.  In addition to the fields, cutime and
cstime, being printed as signed long integers instead of unsigned long
integers, the field, it_real_value, was also being printed as a signed
long integer even though it is an unsigned long integer before being
converted to "clock" ticks.

Comment 6 Red Hat Bugzilla 2006-03-07 20:19:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html



Note You need to log in before you can comment on or make changes to this bug.