Bug 128428

Summary:	Opteron gettimeofday granularity problem
Product:	Red Hat Enterprise Linux 3	Reporter:	Walt Kopy <walter.kopy>
Component:	kernel	Assignee:	Jim Paradis <jparadis>
Status:	CLOSED ERRATA	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	3.0	CC:	bfox, cww, dkl, eparis, hooverma, jbaron, jlayton, john.genego, k.georgiou, ltroan, peterm, petrides, riel, skhader, tao, tkincaid, wilmer, xc_support
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	RHSA-2005-663	Doc Type:	Bug Fix
Last Closed:	2005-09-28 14:23:44 UTC	Type:	---
Bug Blocks:	156320

Description Walt Kopy 2004-07-22 19:51:24 UTC

Description of problem:
With Red Hat AS x86 on Opteron using the 2.4.21-9 kernel, the
gettimeofday function demonstrated microsecond resolution. However, with
the 2.4.21-15 kernel in Red Hat Update 2, the resolution of gettimeofday
appears to be coarser. It is now equal to one Linux clock tick,
0.01 sec.

This problem causes MPI benchmarks like Pallas and b_eff to produce
inaccurate results. Since these benchmarks are important tools used
to evaluate Linux cluster performance, this problem is considered to
be serious.


Version-Release number of selected component (if applicable):
kernel-2.4.21-15.EL


How reproducible:
Every time


Steps to Reproduce:
1. Compile and run the following program, which demonstrates the
problem:

#include <sys/time.h>
#include <stdio.h>
#include <string.h>

#define LIMIT 20000             /* histogram covers 0..LIMIT-1 microseconds */

int
main(int argc, char* argv[])
{
  struct timeval start, prev, recent;
  long diff;
  int underflows = 0, overflows = 0;
  int counts[LIMIT];            /* counts[d] = intervals that took d usec */

  memset(counts, 0, sizeof(counts));

  gettimeofday(&start, NULL);
  prev = start;
  while (1) {
    gettimeofday(&recent, NULL);
    /* time since the previous call */
    diff = ((recent.tv_sec  - prev.tv_sec)*1000000 +
            (recent.tv_usec - prev.tv_usec));
    if (diff < 0)
      underflows += 1;
    else if (diff >= LIMIT)     /* counts[LIMIT] would be out of bounds */
      overflows += 1;
    else
      counts[diff] += 1;
    /* time since the start of the run */
    diff = ((recent.tv_sec  - start.tv_sec)*1000000 +
            (recent.tv_usec - start.tv_usec));
    if (diff > 5000000)         /* 5 seconds */
      break;
    prev = recent;
  }
  if (underflows > 0)
    printf("%9d intervals took less than %6d microseconds\n",
           underflows, 0);
  if (overflows > 0)
    printf("%9d intervals took more than %6d microseconds\n",
           overflows, LIMIT - 1);
  for (diff = 0; diff < LIMIT; diff += 1)
    if (counts[diff] > 0)
      printf("%9d intervals took %9s %6ld microseconds\n",
             counts[diff], "", diff);
  return 0;
}



  
Actual results:
[root@crs1 root]# uname -a
Linux crs1 2.4.21-15.ELsmp #1 SMP Thu Apr 22 00:09:01 EDT 2004 x86_64 x86_64 x86_64 GNU/Linux

[root@crs1 root]# ./dll
138948453 intervals took                0 microseconds
      318 intervals took            10000 microseconds
      182 intervals took            10001 microseconds

[root@crs1 root]#  ./dll
138909560 intervals took                0 microseconds
      501 intervals took            10000 microseconds


Expected results:
On a system with microsecond resolution, like 2.4.21-9, you see output
like this:

  5283049 intervals took                0 microseconds
  4931788 intervals took                1 microseconds
     3150 intervals took                2 microseconds
      367 intervals took                3 microseconds
      363 intervals took                4 microseconds
      666 intervals took                5 microseconds
     1220 intervals took                6 microseconds
     2507 intervals took                7 microseconds
     1618 intervals took                8 microseconds
      582 intervals took                9 microseconds
      132 intervals took               10 microseconds
      152 intervals took               11 microseconds
      152 intervals took               12 microseconds
       31 intervals took               13 microseconds
       13 intervals took               14 microseconds
        9 intervals took               15 microseconds
       12 intervals took               16 microseconds
        4 intervals took               17 microseconds
        9 intervals took               18 microseconds
        6 intervals took               19 microseconds
       10 intervals took               20 microseconds
        5 intervals took               21 microseconds
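
For a single-number summary instead of a histogram, the smallest positive step between consecutive gettimeofday readings can be tracked directly. The following is only a minimal sketch. The helper names usec_between and smallest_step_usec are made up for this illustration and are not part of the reproducer above. On a kernel with the problem described here one would expect a result near 10000, and on a kernel with microsecond resolution a result near 1, but that depends on the machine.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Microseconds elapsed from *earlier to *later
   (negative if the clock appeared to go backwards). */
long usec_between(const struct timeval *earlier, const struct timeval *later)
{
    assert(earlier != NULL && later != NULL);
    return (long)(later->tv_sec - earlier->tv_sec) * 1000000L
         + (long)(later->tv_usec - earlier->tv_usec);
}

/* Call gettimeofday() up to max_samples times and return the smallest
   positive step seen between consecutive readings, or -1 if the clock
   never advanced during the run. */
long smallest_step_usec(long max_samples)
{
    struct timeval prev, now;
    long best = -1;
    long i;

    gettimeofday(&prev, NULL);
    for (i = 0; i < max_samples; i++) {
        long step;

        gettimeofday(&now, NULL);
        step = usec_between(&prev, &now);
        if (step > 0 && (best < 0 || step < best))
            best = step;
        prev = now;
    }
    return best;
}
```

Calling smallest_step_usec with a few million samples from main and printing the result should be enough to see at least one clock step, even at 0.01 sec granularity.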



Comment 2 Jim Paradis 2004-08-24 18:38:35 UTC

By default, systems without an HPET timer fall back to using the PIT
timer (which has .01 s resolution).  Although the TSC timer is
available for finer resolution, we disabled it by default due to
another problem.  

A workaround for the problem you're seeing right now is to enable the
TSC timer by specifying the "tsc" parameter at boot time.  Meanwhile
I'll revisit the TSC timer issue.

Comment 6 Jim Tong 2004-12-28 08:56:39 UTC

Can someone at Red Hat elaborate on what the TSC timer issue is?
I have a 4-way Opteron system. Is it safe to use tsc or cyclone as
the kernel parameter at boot time?

Comment 7 Arjan van de Ven 2004-12-30 14:15:26 UTC

cyclone is only appropriate for the IBM x440 line of machines.

TSC will work generally, but it may be unreliable if the clocks drift
between the CPUs (which is more likely the more CPUs you have;
temperature differences alone can cause this over longer periods).
HPET is the most reliable method, and in theory all modern systems have one.

Comment 8 Jim Tong 2005-01-04 00:23:01 UTC

So when Jim Paradis wrote "we disabled it by default due to another 
problem", was that entirely the clock skew problem?  Our system doesn't 
even detect an HPET; does that mean it doesn't have one?  Do I have to 
ask someone at AMD about that?

Comment 9 Jim Paradis 2005-01-04 17:03:53 UTC

It wasn't just clock skew; there was a synchronization problem such
that clock updates (e.g. via ntpdate) would occasionally update the
wrong half of the doubleword and you'd see the year jump to something
like 586562 (See Bug 114869).

Comment 10 Jeff Layton 2005-01-19 19:08:27 UTC

So, is this TSC problem an issue on UP AMD64 systems or on the EM64T,
or is it a problem only with NUMA machines? If it's not a problem on
these arches then perhaps we should consider making TSC the default on
them?

Comment 15 Syed Khader Vali 2005-05-11 07:36:07 UTC

So, is there a fix for this problem yet? It has not been fixed as of
RHEL 3 AS Update 4.

Comment 16 Mark Hoover 2005-05-24 14:10:00 UTC

We are encountering a situation where the lack of precision is causing an 
issue with some Oracle timestamps.  The database was exported from Oracle on 
Solaris into Oracle on Linux.

> 5/18/2005 5:28:17.358617 PM
> 5/18/2005 5:28:17.408617 PM

As you can see, the timestamps all end in "8617"
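
The constant suffix is consistent with the 0.01 s tick discussed earlier in this report. If the clock only advances in whole ticks of 10000 microseconds, the offset within a tick never changes, so the low four digits of the microsecond field stay the same. The sketch below works through the two timestamps above. TICK_USEC and the helper names are assumptions made for this illustration.

```c
#include <assert.h>

/* Assumed tick length: 0.01 s, the PIT resolution quoted in this report. */
#define TICK_USEC 10000L

/* Offset of a microsecond value within its tick. */
long offset_within_tick(long usec)
{
    assert(usec >= 0);
    return usec % TICK_USEC;
}

/* Whole ticks separating two microsecond values. */
long ticks_apart(long earlier_usec, long later_usec)
{
    return (later_usec - earlier_usec) / TICK_USEC;
}
```

For the microsecond fields 358617 and 408617, offset_within_tick returns 8617 in both cases and ticks_apart returns 5, i.e. the two timestamps are exactly five ticks apart.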

Comment 21 Jim Paradis 2005-06-15 21:38:16 UTC

I am currently investigating the feasibility of backporting the ACPI
power-management timer code from RHEL4 so as to make another free-running timer
available for timekeeping.  I have already backported this to another version of
the 2.4 kernel, so this should not be terribly difficult.

Comment 28 Ernie Petrides 2005-07-20 07:48:48 UTC

A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-33.EL).

Comment 29 Jim Paradis 2005-07-20 14:29:22 UTC

Note that to take advantage of the fix just committed one has to boot with the
"pmtmr" boot command line option.  This should be noted in the errata documentation.

Comment 30 Jatin Nansi 2005-07-21 10:56:21 UTC

Should this information be noted in the release notes for U6?

Comment 32 Ernie Petrides 2005-07-22 01:42:07 UTC

To kbaxley, not "clock=pmtmr", just "pmtmr".

Comment 46 Ernie Petrides 2005-09-12 18:31:50 UTC

Hi, Bastien.  Please show the output of /proc/cmdline on the boot-up
that shows the failure, and please also indicate whether the reproducer
program in the initial comment of this bug report also fails.  Thanks.

Comment 48 Bastien Nocera 2005-09-26 13:22:48 UTC

Re comment #46, the customer wasn't using the pmtmr option as mentioned in
comment #29.
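
To rule this situation out quickly, the contents of /proc/cmdline can be checked for the option as a whole word, so that "clock=pmtmr" is not mistaken for "pmtmr" (see comment #32). This is only a rough sketch. The function names are hypothetical, and it assumes options are separated by spaces, tabs, or newlines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if option appears in cmdline as a whole
   whitespace-delimited token, 0 otherwise. */
int cmdline_has_option(const char *cmdline, const char *option)
{
    const char *p = cmdline;
    size_t len;

    assert(cmdline != NULL && option != NULL);
    len = strlen(option);
    if (len == 0)
        return 0;
    while ((p = strstr(p, option)) != NULL) {
        char before = (p == cmdline) ? ' ' : p[-1];
        char after = p[len];
        int starts = (before == ' ' || before == '\t' || before == '\n');
        int ends = (after == '\0' || after == ' ' ||
                    after == '\t' || after == '\n');

        if (starts && ends)
            return 1;
        p += 1;                 /* keep searching past this partial match */
    }
    return 0;
}

/* Check the running kernel's command line.
   Returns 1 or 0, or -1 if /proc/cmdline cannot be read. */
int running_kernel_has_option(const char *option)
{
    char buf[4096];
    FILE *f = fopen("/proc/cmdline", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return cmdline_has_option(buf, option);
}
```

A call such as running_kernel_has_option("pmtmr") from main would then tell whether the machine was booted with the option from comment #29.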

Comment 49 Red Hat Bugzilla 2005-09-28 14:23:45 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html

Comment 53 Peter Martuccelli 2006-10-20 21:06:52 UTC

*** Bug 210889 has been marked as a duplicate of this bug. ***

Comment 54 Ernie Petrides 2006-10-20 23:09:02 UTC

Undoing dup, because this bug was fixed in U6 and bug 210889 was entered on U8.