Description of problem:

This is the description of the problem from a customer; after verifying it I couldn't find anything wrong.

--

The ETIME field is not read from /proc but rather computed as the difference between the current time and the process start time:

    /* elapsed wall clock time, [[dd-]hh:]mm:ss format (not same as "time") */
    static int pr_etime(char *restrict const outbuf, const proc_t *restrict const pp){
        unsigned long t;
        unsigned dd,hh,mm,ss;
        char *cp = outbuf;
        t = seconds_since_boot - (unsigned long)(pp->start_time / Hertz);
        ss = t%60;  t /= 60;
        mm = t%60;  t /= 60;
        hh = t%24;  t /= 24;
        dd = t;
        cp += ( dd         ? snprintf(cp, COLWID, "%u-", dd)   : 0 );
        cp += ( (dd || hh) ? snprintf(cp, COLWID, "%02u:", hh) : 0 );
        cp += snprintf(cp, COLWID, "%02u:%02u", mm, ss);
        return (int)(cp-outbuf);
    }

If t = -1 (i.e. the subtraction wraps around), it prints exactly the value the customer sees. I copied the sources above into a standalone C program, set t = -1, and the output is:

    {alexs 10:17:55} etime
    49710-06:28:15

How can we obtain t = -1? The value 'seconds_since_boot' is global and obtained once, after parsing the arguments (ps/global.c):

    seconds_since_boot = uptime(0,0);

After that, 'ps' starts reading the process info from the /proc directory and parsing/formatting/printing it. I can see a possibility of the following race:

(a) ps starts running and sets seconds_since_boot;
(b) just after that, a new process is created. As a result, it has start_time > seconds_since_boot;
(c) when we eventually obtain and print the info for that process, seconds_since_boot - (unsigned long)(pp->start_time / Hertz) is "negative" and underflows.

Obviously, for this to occur we need a heavily loaded system (so that there are many processes, which makes 'ps' run for a longer time) and many new processes created per second.

An obvious fix would be to add an explicit test to pr_etime() so that if we see a "negative" value, we replace it by 0.
E.g.:

    t = seconds_since_boot - (unsigned long)(pp->start_time / Hertz);
    if (t < 0)
        t = 0;

(Note that since t is unsigned long, this literal test can never be true; the real check has to compare seconds_since_boot against the start time before subtracting, or use a signed value.)

The problem is not RH-specific and is even worked around in some tools, e.g.
http://archive.groundworkopensource.com/groundwork-opensource/tags/GROUNDWORK_MONITOR_5_1_0/monitor-core/nagios/plugins/check_procs.c

    if (hyphcnt > 0) {
        sscanf(etime, "%d-%d:%d:%d", &days, &hours, &minutes, &seconds);
        /* linux 2.6.5/2.6.6 reporting some processes with infinite
         * elapsed times for some reason */
        if (days == 49710) {
            return 0;
        }
    }

--

Version-Release number of selected component (if applicable):
procps-3.2.3-8.6

How reproducible:
Not easily, not often. Please see the comment above explaining how it is suspected to be reproduced: "Obviously, for this to occur we need a heavily loaded system (so that there are many processes - this will make 'ps' run for a longer time) and many new processes per second."

Steps to Reproduce:
1. Unknown
2. ps -eo uname,pid,ppid,stime,etime,cmd

Actual results:
% ps -eo uname,pid,ppid,stime,etime,cmd
ETIME shows: 49710-06:28:15

Expected results:
Correct ETIME.

Additional info:
Proposed patch attached.
Created attachment 295142 [details] procps-etime.patch
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 336033 [details] a new patch
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-0950.html