Bug 193838 - gettimeofday goes backwards on IBM x460 merged servers
Summary: gettimeofday goes backwards on IBM x460 merged servers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Brian Maly
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 181409
 
Reported: 2006-06-02 01:19 UTC by TDS HCM
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHSA-2006-0575
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-10 23:26:22 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2006:0575 | 0 | normal | SHIPPED_LIVE | Important: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 4 | 2006-08-10 04:00:00 UTC

Description TDS HCM 2006-06-02 01:19:55 UTC
Description of problem:
gettimeofday() goes backwards randomly

Version-Release number of selected component (if applicable):


How reproducible:
Compile the C program below and run it.

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>
#include <sys/time.h>   /* gettimeofday(), struct timeval */

int  main(int argc, char *argv[])
{
        int i,x;
        long long secs,elapse_time,start,end;
        struct timeval t1;

        printf("starting program\n");

        x=0;
        for(i=0;i<10000;i++) {
                /* take two back-to-back readings, converted to microseconds */
                gettimeofday(&t1,0);
                secs = t1.tv_sec;
                start = secs * 1000000 + t1.tv_usec;

                gettimeofday(&t1,0);
                secs = t1.tv_sec;
                end = secs * 1000000 + t1.tv_usec;
                x++;
                /* print a progress dot every 1000 iterations */
                if(x >= 1000) { x=0; printf("."); fflush(NULL);}

                elapse_time   = end-start;
                if(end<start) {
                        /* the second reading is earlier than the first */
                        printf("\nerror:(%i)end time=%lld, start time=%20lld : time diff=%20lld\n",i,end,start,elapse_time);
                }
        }
        exit(0);
}


Steps to Reproduce:
1. Compile program
2. Run
  
Actual results:
(Time diff and frequency of errors vary on each run)

error:(402)end time=1149011014759468, start time=    1149011014850382 : time diff=              -90914
.
error:(1051)end time=1149011014760467, start time=    1149011014851331 : time diff=              -90864

error:(1717)end time=1149011014761454, start time=    1149011014852399 : time diff=              -90945
..
error:(3067)end time=1149011014763405, start time=    1149011014854292 : time diff=              -90887

error:(3739)end time=1149011014764465, start time=    1149011014855398 : time diff=              -90933
.
error:(4401)end time=1149011014765430, start time=    1149011014856382 : time diff=              -90952
.
error:(5068)end time=1149011014766406, start time=    1149011014857346 : time diff=              -90940

error:(5743)end time=1149011014767411, start time=    1149011014858375 : time diff=              -90964
.
error:(6409)end time=1149011014768464, start time=    1149011014859377 : time diff=              -90913
..
error:(8436)end time=1149011014771445, start time=    1149011014862320 : time diff=              -90875


Expected results:
Time diff should never go negative. On a working box, we can run it hundreds of
times with no errors:

starting program
..........
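
A variation on the test, offered only as a rough sketch and not part of the original report: instead of comparing one pair of readings at a time, it scans a whole series of readings and reports how many went backwards and by how much. The names now_usec and scan_backward_steps are made up for illustration.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: current wall-clock time in microseconds. */
long long now_usec(void)
{
        struct timeval tv;

        gettimeofday(&tv, 0);
        return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

/*
 * Hypothetical helper: count how many samples are earlier than the
 * sample just before them. If worst is not NULL, it receives the
 * largest backward jump in microseconds (0 if none was seen).
 */
size_t scan_backward_steps(const long long *t, size_t n, long long *worst)
{
        size_t i, count = 0;
        long long max_back = 0;

        for (i = 1; i < n; i++) {
                if (t[i] < t[i - 1]) {
                        long long back = t[i - 1] - t[i];

                        count++;
                        if (back > max_back)
                                max_back = back;
                }
        }
        if (worst)
                *worst = max_back;
        return count;
}
```

Filling an array from now_usec() in a loop and passing it to scan_backward_steps should give 0 on a working box. Fed the first failing pair from the actual results (1149011014850382 followed by 1149011014759468), it counts one backward step of 90914 microseconds.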



Additional info:

RHAS 4 Update 3 x86_64
Kernel 2.6.9-34.ELlargesmp

This problem only occurs on the IBM x460 server, and only when it is "merged"
with another partition via an external cable. (Each half of the pair of x460
servers contains 4 dual-core CPUs and 64GB of RAM.)

Main server: IBM 8872-6RU
Second half: IBM 8874-2RU

I rebooted, bypassed the partition merge, and then ran the above program
repeatedly on just one half without errors.

I'm guessing there is a timing issue between the two halves of the server.

I've searched bugzilla for this bug, and there have been several reports of
similar problems in the last year or so. The problem we are experiencing is very
hardware specific, however. It does not seem to be related to the kernel version.

The x86 version of the kernel (2.6.9-34.ELhugemem) does not have this problem.

Comment 1 Kurtis Rader 2006-06-02 21:18:51 UTC
Known issue involving use of the HPET time source. It is already fixed in the
U4 kernel.




Comment 2 TDS HCM 2006-06-02 21:36:41 UTC
Tested with pre-beta kernel-largesmp-2.6.9-37, and the problem is indeed fixed.

Rebooted with the "nohpet" kernel option and the problem is now gone on
kernel-largesmp-2.6.9-34 as well.
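
To confirm after a reboot that the running kernel really was started with the option, something along these lines could be used. This is only a sketch: the function names cmdline_has_option and booted_with are invented for illustration, and it assumes /proc/cmdline is readable on the box.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if opt appears as a whole
 * space-separated word in the kernel command line, else 0.
 */
int cmdline_has_option(const char *cmdline, const char *opt)
{
        size_t len = strlen(opt);
        const char *p = cmdline;

        if (len == 0)
                return 0;
        while ((p = strstr(p, opt)) != NULL) {
                int starts = (p == cmdline || p[-1] == ' ');
                int ends = (p[len] == '\0' || p[len] == ' ' || p[len] == '\n');

                if (starts && ends)
                        return 1;
                p++;    /* keep searching past this partial match */
        }
        return 0;
}

/*
 * Hypothetical helper: 1 if the running kernel was booted with opt,
 * 0 if not, -1 if /proc/cmdline could not be read.
 */
int booted_with(const char *opt)
{
        char buf[4096];
        FILE *f = fopen("/proc/cmdline", "r");

        if (!f)
                return -1;
        if (!fgets(buf, sizeof buf, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return cmdline_has_option(buf, opt);
}
```

Calling booted_with("nohpet") before running the gettimeofday test makes it clear whether a clean run is due to the workaround or to the kernel itself.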

Thank you for the quick attention to this matter.

Comment 3 Jason Baron 2006-06-06 18:48:22 UTC
committed in stream U4. A test kernel with this patch is available from
http://people.redhat.com/~jbaron/rhel4/


Comment 7 Red Hat Bugzilla 2006-08-10 23:26:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html


