Description of problem:
gettimeofday() goes backwards randomly.

Version-Release number of selected component (if applicable):

How reproducible:
Compile the C program below and run it.

    #include <stdlib.h>
    #include <stdio.h>
    #include <errno.h>
    #include <time.h>
    #include <sys/time.h>   /* declares gettimeofday() and struct timeval */

    int main(int argc, char *argv[])
    {
        int i, x, rs;
        long long secs, elapse_time, start, end;
        struct timeval t1;

        printf("starting program\n");
        x = 0;
        for (i = 0; i < 10000; i++) {
            /* Take two back-to-back readings, converted to microseconds. */
            rs = gettimeofday(&t1, 0);
            secs = t1.tv_sec;
            start = secs * 1000000 + t1.tv_usec;
            rs = gettimeofday(&t1, 0);
            secs = t1.tv_sec;
            end = secs * 1000000 + t1.tv_usec;

            /* Print a progress dot every 1000 iterations. */
            x++;
            if (x >= 1000) { x = 0; printf("."); fflush(NULL); }

            elapse_time = end - start;
            if (end < start) {
                /* %lld matches the long long arguments. */
                printf("\nerror:(%i)end time=%lld, start time=%20lld : time diff=%20lld\n",
                       i, end, start, elapse_time);
            }
        }
        exit(0);
    }

Steps to Reproduce:
1. Compile program
2. Run

Actual results:
(Time diff and frequency of errors vary on each run)

error:(402)end time=1149011014759468, start time= 1149011014850382 : time diff= -90914
.
error:(1051)end time=1149011014760467, start time= 1149011014851331 : time diff= -90864
error:(1717)end time=1149011014761454, start time= 1149011014852399 : time diff= -90945
..
error:(3067)end time=1149011014763405, start time= 1149011014854292 : time diff= -90887
error:(3739)end time=1149011014764465, start time= 1149011014855398 : time diff= -90933
.
error:(4401)end time=1149011014765430, start time= 1149011014856382 : time diff= -90952
.
error:(5068)end time=1149011014766406, start time= 1149011014857346 : time diff= -90940
error:(5743)end time=1149011014767411, start time= 1149011014858375 : time diff= -90964
.
error:(6409)end time=1149011014768464, start time= 1149011014859377 : time diff= -90913
..
error:(8436)end time=1149011014771445, start time= 1149011014862320 : time diff= -90875

Expected results:
Time diff should never go negative. On a working box, we can run it hundreds of times with no errors:

starting program
..........
Additional info:
RHAS 4 Update 3 x86_64
Kernel 2.6.9-34.ELlargesmp

This problem only occurs on the IBM x460 server, and only when it is "merged" with another partition via an external cable. (Each half of the pair of x460 servers contains 4 dual-core CPUs and 64GB of RAM.)

Main server: IBM 8872-6RU
Second half: IBM 8874-2RU

I rebooted, bypassed the partition merge, and then ran the above program repeatedly on just one half without errors. I'm guessing there is a timing issue between the two halves of the server.

I've searched bugzilla for this bug, and there have been several reports of similar problems in the last year or so. The problem we are experiencing is very hardware specific, however. It does not seem to be related to the kernel version. The x86 version of the kernel (2.6.9-34.ELhugemem) does not have this problem.
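For anyone who wants a summary rather than one line per error, here is a minimal sketch of the same check reworked to count the backward steps and record the largest one. The names tv_to_usec, backstep_stats, scan_backsteps and sample_clock are hypothetical helpers made up for this note; they are not part of the original reproducer.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Convert a struct timeval to microseconds since the epoch. */
static long long tv_to_usec(const struct timeval *tv)
{
    return (long long)tv->tv_sec * 1000000LL + tv->tv_usec;
}

/* Hypothetical summary type: how often the clock stepped backwards
 * and the most negative delta seen (0 if it never went backwards). */
struct backstep_stats {
    long count;
    long long worst_usec;
};

/* Walk consecutive samples (in microseconds) and count every place
 * where a later sample is smaller than the one before it. */
static struct backstep_stats scan_backsteps(const long long *samples, size_t n)
{
    struct backstep_stats st = {0, 0};
    size_t i;

    assert(samples != NULL || n == 0);
    for (i = 1; i < n; i++) {
        long long delta = samples[i] - samples[i - 1];
        if (delta < 0) {
            st.count++;
            if (delta < st.worst_usec)
                st.worst_usec = delta;
        }
    }
    return st;
}

/* Fill buf with n back-to-back gettimeofday() readings and summarise them. */
static struct backstep_stats sample_clock(long long *buf, size_t n)
{
    struct timeval tv;
    size_t i;

    for (i = 0; i < n; i++) {
        gettimeofday(&tv, NULL);
        buf[i] = tv_to_usec(&tv);
    }
    return scan_backsteps(buf, n);
}
```

Calling sample_clock() with a buffer of, say, 20000 entries from a small main() and printing count and worst_usec should give zero for both on a box where the clock behaves.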
This is a known issue involving use of the HPET time source, and it is already fixed in the U4 kernel.
Tested with pre-beta kernel-largesmp-2.6.9-37, and the problem is indeed fixed. Rebooted with the "nohpet" kernel option and the problem is now gone on kernel-largesmp-2.6.9-34 as well. Thank you for the quick attention to this matter.
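When comparing test runs it can help to confirm that the "nohpet" option actually took effect for the running kernel. The following is a rough sketch that looks for the option in /proc/cmdline; cmdline_has_option and booted_with_nohpet are hypothetical helper names, and the sketch assumes the whole command line fits in a 4096-byte buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if opt appears as a whole whitespace-separated word in
 * cmdline, 0 otherwise. Substring matches do not count. */
static int cmdline_has_option(const char *cmdline, const char *opt)
{
    const char *p = cmdline;
    size_t optlen;

    assert(cmdline != NULL && opt != NULL);
    optlen = strlen(opt);
    if (optlen == 0)
        return 0;

    while (*p != '\0') {
        size_t len;

        p += strspn(p, " \t\n");      /* skip separators */
        len = strcspn(p, " \t\n");    /* length of the next word */
        if (len == optlen && strncmp(p, opt, len) == 0)
            return 1;
        p += len;
    }
    return 0;
}

/* Read /proc/cmdline and report whether "nohpet" was passed at boot.
 * Returns 1 or 0, or -1 if the file could not be read. */
static int booted_with_nohpet(void)
{
    char buf[4096];   /* assumed large enough for the command line */
    FILE *f = fopen("/proc/cmdline", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return cmdline_has_option(buf, "nohpet");
}
```

The whole-word comparison is deliberate, so that an unrelated option that merely contains the string "nohpet" is not mistaken for it.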
Committed in stream U4. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0575.html