Bug 452185

Summary: gettimeofday() jumping into the future
Product: Red Hat Enterprise Linux 5 Reporter: Need Real Name <eaversa>
Component: kernelAssignee: Prarit Bhargava <prarit>
Status: CLOSED DUPLICATE QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: low    
Version: 5.1CC: dzickus, jcm, jfeeney
Target Milestone: rc   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-07-28 09:42:41 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
Attached is a c program which reproduces the issue (tooj.c) none

Description Need Real Name 2008-06-19 18:15:22 EDT
Description of problem:

We are seeing that calls to gettimeofday() will sometimes jump about 4,398
seconds into the future, while successive calls will return a "normal" time.
 
Note the following captured via strace:
gettimeofday({1213898402, 539681}, NULL) = 0
select(1024, [4 5 6], NULL, NULL, {31, 498915}) = 0 (Timeout)
gettimeofday({1213902832, 86557}, NULL) = 0
gettimeofday({1213898434, 40076}, NULL) = 0
getppid()                               = 1
gettimeofday({1213898434, 40135}, NULL) = 0
 
The second call to gettimeofday above returns a value of 1213902832, while the
subsequent call returns a value of 1213898434.
 
We first started noticing this on a server on which we installed RHEL 5.1.
 
This appears to be a known issue in the Linux Kernel Mailing list, refer to this
thread:  
http://lkml.org/lkml/2008/3/30/123
 
I cannot determine if this is fixed in a RedHat product or bug release. This has
a high impact on our production systems as it can cause timers to not return in
a reasonable period of time.
 

Version-Release number of selected component (if applicable):

We are running RHEL 5.1, with the following kernel:
 
Linux ritma01903 2.6.18-53.el5PAE #1 SMP Wed Oct 10 16:48:18 EDT 2007 i686 i686
i386 GNU/Linux 
How reproducible:
Happens randomly

Steps to Reproduce:
1. See details in this thread http://lkml.org/lkml/2008/3/30/123
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 1 Need Real Name 2008-06-20 00:04:40 EDT
Created attachment 309895 [details]
Attached is a c program which reproduces the issue (tooj.c)

tooj.c (attached as plain text file) will reproduce the issue (tooj is Time out
of Joint).

compile: gcc -o tooj tooj.c

Run tooj:
nohup ./tooj 1 2 d > tooj.out 2>&1 &

Wait for tooj to exit.

Output looks like this:

15746 tries, current time 1213920125.618630, last time 1213920124.617435
15746 tries, current time 1213920125.618672, last time 1213920125.618630

15747 tries, current time 1213920126.619806, last time 1213920125.618672
15747 tries, current time 1213920126.619808, last time 1213920126.619806

15748 tries, current time 1213920127.620980, last time 1213920126.619808
15748 tries, current time 1213920127.620981, last time 1213920127.620980

15749 tries, current time 1213920128.622184, last time 1213920127.620981
15749 tries, current time 1213920128.622186, last time 1213920128.622184

15750 tries, current time 1213920129.623360, last time 1213920128.622186
15750 tries, current time 1213920129.623362, last time 1213920129.623360

15751 tries, current time 1213920130.625604, last time 1213920129.623362
15751 tries, current time 1213920130.625606, last time 1213920130.625604

15752 tries, current time 1213920131.627833, last time 1213920130.625606
15752 tries, current time 1213920131.627834, last time 1213920131.627833

15753 tries, current time 1213924530.674522, last time 1213920131.627834
time out of joint! 15753 tries, current time 1213920132.628019, last time
1213924530.674522
Comment 2 Prarit Bhargava 2008-06-23 09:00:53 EDT
Hi eaversa,

Could you send me details about the hardware you're using?

What is the vendor and model, and what type of processors are in the box?

Thanks,

P.
Comment 3 Need Real Name 2008-06-23 09:34:58 EDT
Server is a Dell 1950, has two physical cores, each physical CPU is quad core:
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
stepping        : 6
cpu MHz         : 3159.003
cache size      : 6144 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
                                          pat pse36 clflush dts acpi mmx fxsr
sse sse2 ss ht tm pbe nx lm constant_tsc pni                                   
       monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 6317.53
Comment 4 Need Real Name 2008-06-23 09:36:39 EDT
Not sure from your comment whether you read the email thread from lkml so I am
posting it again here:

http://lkml.org/lkml/2008/3/30/123

This post below seems to match what we are observing:

 We've been seeing some strange behaviour on some of our applications
> recently. I've tracked this down to gettimeofday() returning spurious
> values occasionally.
>
> Specifically, gettimeofday() will suddenly, for a single call, return
> a value about 4398 seconds (~1 hour 13 minutes) in the future. The
> following call goes back to a normal value.

I have also seen this.

> This seems to be occurring when the clock source goes slightly
> backwards for a single call. In
> kernel/time/timekeeping.c:__get_nsec_offset(), we have this:
> cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
>
> So a small decrease in time here will (this is all unsigned
> arithmetic) give us a very large cycle_delta. cyc2ns() then multiplies
> this by some value, then right shifts by 22. The resulting value (in
> nanoseconds) is approximately 4398 seconds; this gets added on to the
> xtime value, giving us our jump into the future. The next call to
> gettimeofday() returns to normal as we don't have this huge nanosecond
> offset.
Comment 5 Prarit Bhargava 2008-06-23 10:00:11 EDT
(In reply to comment #4)
> Not sure from your comment whether you read the email thread from lkml so I am
> posting it again here:
> 
> http://lkml.org/lkml/2008/3/30/123
>

Yup, I read it -- I was just looking for HW so I can reproduce and test the fix ;)

P.
Comment 6 Need Real Name 2008-06-23 10:26:51 EDT
So I take it that the code is not patched in RHEL 5.2 either?
Comment 7 Prarit Bhargava 2008-06-23 10:41:29 EDT
Nope, the above patch is NOT in RHEL 5.2.

P.
Comment 9 Prarit Bhargava 2008-06-23 10:44:58 EDT
eaversa,

There have been a few patches submitted, but not yet committed, to the RHEL5
tree that might also be responsible for your problem.

Can I put together a patch (or if you prefer I can build a kernel) for you to test?

P.
Comment 10 Need Real Name 2008-06-23 11:18:38 EDT
Prarit,

Not quite clear on what you mean. You don't mean patches not yet committed which
might be responsible for the problem- you mean patches which might resolve a
problem?

I would be happy run a kernel to test this fix out, however in the interests of
running a controlled test, I would prefer if you build the kernel. In order to
understand of what it is that I am taking delivery I would like to see the patches.

-Eric
Comment 12 Need Real Name 2008-06-24 08:58:09 EDT
Any update on the kernel that I am to run?
Comment 13 Prarit Bhargava 2008-06-24 09:04:48 EDT
Eric, 

Please be patient.  I'm testing with your test code on a Dell box ATM and will
send you patches and kernel in the next few days.

Thanks,

P.
Comment 14 Prarit Bhargava 2008-06-24 09:12:45 EDT
Eric,

It seems the last commit to the RHEL5 tree may have fixed your problem.  Could
you please test with 

http://people.redhat.com/dzickus/el5/94.el5/

and let me know the result?

Thanks,

P.
Comment 15 Need Real Name 2008-06-24 09:42:16 EDT
Thanks, I will try it out.
Comment 16 Prarit Bhargava 2008-06-26 08:27:20 EDT
Eric,

Have you had a chance to try out the new kernel?

No rush,

P.
Comment 17 Need Real Name 2008-07-01 09:33:03 EDT
We have been trying it out. Seems to be better. I want to get back to you after
we have done more testing.

-Eric
Comment 18 Prarit Bhargava 2008-07-01 09:43:36 EDT
Hi Eric,

Sure no problem -- mind if I leave this in NEEDINFO until you get back to us? 
That will make my bug tracking a bit easier :)

I'm almost 100% sure (given the bug description) that the patch I mentioned
fixes your problem.

Please be aware, however, there are some other known issues in RHEL5.  Most
notably an annoying 'off by 1 usec' error.  Patches to fix some of these issues
have been submitted to the source tree but have not yet been committed.

P.
Comment 19 Prarit Bhargava 2008-07-28 08:59:10 EDT
Hi Eric -- have you had a chance to test this?

P.
Comment 20 Need Real Name 2008-07-28 09:36:54 EDT
We have moved to 2.6.18-92.1.6.el5PAE which appears to contain a TSC fix as per
RedHat release notes. This appears to resolve the problem. As far as I am
concerned you can close this as fixed.

Regards,
Eric
Comment 21 Prarit Bhargava 2008-07-28 09:42:41 EDT

*** This bug has been marked as a duplicate of 443435 ***