Bug 452185
| Summary: | gettimeofday() jumping into the future | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Need Real Name <eaversa> | ||||
| Component: | kernel | Assignee: | Prarit Bhargava <prarit> | ||||
| Status: | CLOSED DUPLICATE | QA Contact: | Martin Jenner <mjenner> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | low | ||||||
| Version: | 5.1 | CC: | dzickus, jcm, jfeeney | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | i686 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2008-07-28 13:42:41 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
Created attachment 309895 [details]
Attached is a c program which reproduces the issue (tooj.c)
tooj.c (attached as plain text file) will reproduce the issue (tooj is Time out
of Joint).
compile: gcc -o tooj tooj.c
Run tooj:
nohup ./tooj 1 2 d > tooj.out 2>&1 &
Wait for tooj to exit.
Output looks like this:
15746 tries, current time 1213920125.618630, last time 1213920124.617435
15746 tries, current time 1213920125.618672, last time 1213920125.618630
15747 tries, current time 1213920126.619806, last time 1213920125.618672
15747 tries, current time 1213920126.619808, last time 1213920126.619806
15748 tries, current time 1213920127.620980, last time 1213920126.619808
15748 tries, current time 1213920127.620981, last time 1213920127.620980
15749 tries, current time 1213920128.622184, last time 1213920127.620981
15749 tries, current time 1213920128.622186, last time 1213920128.622184
15750 tries, current time 1213920129.623360, last time 1213920128.622186
15750 tries, current time 1213920129.623362, last time 1213920129.623360
15751 tries, current time 1213920130.625604, last time 1213920129.623362
15751 tries, current time 1213920130.625606, last time 1213920130.625604
15752 tries, current time 1213920131.627833, last time 1213920130.625606
15752 tries, current time 1213920131.627834, last time 1213920131.627833
15753 tries, current time 1213924530.674522, last time 1213920131.627834
time out of joint! 15753 tries, current time 1213920132.628019, last time
1213924530.674522
Hi eaversa, Could you send me details about the hardware you're using? What is the vendor and model, and what type of processors are in the box? Thanks, P. Server is a Dell 1950, has two physical cores, each physical CPU is quad core:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
stepping : 6
cpu MHz : 3159.003
cache size : 6144 KB
physical id : 1
siblings : 4
core id : 3
cpu cores : 4
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr
sse sse2 ss ht tm pbe nx lm constant_tsc pni
monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 6317.53
Not sure from your comment whether you read the email thread from lkml so I am posting it again here: http://lkml.org/lkml/2008/3/30/123 This post below seems to match what we are observing: We've been seeing some strange behaviour on some of our applications > recently. I've tracked this down to gettimeofday() returning spurious > values occasionally. > > Specifically, gettimeofday() will suddenly, for a single call, return > a value about 4398 seconds (~1 hour 13 minutes) in the future. The > following call goes back to a normal value. I have also seen this. > This seems to be occurring when the clock source goes slightly > backwards for a single call. In > kernel/time/timekeeping.c:__get_nsec_offset(), we have this: > cycle_delta = (cycle_now - clock->cycle_last) & clock->mask; > > So a small decrease in time here will (this is all unsigned > arithmetic) give us a very large cycle_delta. cyc2ns() then multiplies > this by some value, then right shifts by 22. The resulting value (in > nanoseconds) is approximately 4398 seconds; this gets added on to the > xtime value, giving us our jump into the future. The next call to > gettimeofday() returns to normal as we don't have this huge nanosecond > offset. (In reply to comment #4) > Not sure from your comment whether you read the email thread from lkml so I am > posting it again here: > > http://lkml.org/lkml/2008/3/30/123 > Yup, I read it -- I was just looking for HW so I can reproduce and test the fix ;) P. So I take it that the code is not patched in RHEL 5.2 either? Nope, the above patch is NOT in RHEL 5.2. P. eaversa, There have been a few patches submitted, but not yet committed, to the RHEL5 tree that might also be responsible for your problem. Can I put together a patch (or if you prefer I can build a kernel) for you to test? P. Prarit, Not quite clear on what you mean. You don't mean patches not yet committed which might be responsible for the problem- you mean patches which might resolve a problem? I would be happy run a kernel to test this fix out, however in the interests of running a controlled test, I would prefer if you build the kernel. In order to understand of what it is that I am taking delivery I would like to see the patches. -Eric Any update on the kernel that I am to run? Eric, Please be patient. I'm testing with your test code on a Dell box ATM and will send you patches and kernel in the next few days. Thanks, P. Eric, It seems the last commit to the RHEL5 tree may have fixed your problem. Could you please test with http://people.redhat.com/dzickus/el5/94.el5/ and let me know the result? Thanks, P. Thanks, I will try it out. Eric, Have you had a chance to try out the new kernel? No rush, P. We have been trying it out. Seems to be better. I want to get back to you after we have done more testing. -Eric Hi Eric, Sure no problem -- mind if I leave this in NEEDINFO until you get back to us? That will make my bug tracking a bit easier :) I'm almost 100% sure (given the bug description) that the patch I mentioned fixes your problem. Please be aware, however, there are some other known issues in RHEL5. Most notably an annoying 'off by 1 usec' error. Patches to fix some of these issues have been submitted to the source tree but have not yet been committed. P. Hi Eric -- have you had a chance to test this? P. We have moved to 2.6.18-92.1.6.el5PAE which appears to contain a TSC fix as per RedHat release notes. This appears to resolve the problem. As far as I am concerned you can close this as fixed. Regards, Eric *** This bug has been marked as a duplicate of 443435 *** |
Description of problem: We are seeing that calls to gettimeofday() will sometimes jump about 4,398 seconds into the future, while successive calls will return a "normal" time. Note the following captured via strace: gettimeofday({1213898402, 539681}, NULL) = 0 select(1024, [4 5 6], NULL, NULL, {31, 498915}) = 0 (Timeout) gettimeofday({1213902832, 86557}, NULL) = 0 gettimeofday({1213898434, 40076}, NULL) = 0 getppid() = 1 gettimeofday({1213898434, 40135}, NULL) = 0 The second call to gettimeofday above returns a value of 1213902832, while the subsequent call returns a value of 1213898434. We first started noticing this on a server on which we installed RHEL 5.1. This appears to be a known issue in the Linux Kernel Mailing list, refer to this thread: http://lkml.org/lkml/2008/3/30/123 I cannot determine if this is fixed in a RedHat product or bug release. This has a high impact on our production systems as it can cause timers to not return in a reasonable period of time. Version-Release number of selected component (if applicable): We are running RHEL 5.1, with the following kernel: Linux ritma01903 2.6.18-53.el5PAE #1 SMP Wed Oct 10 16:48:18 EDT 2007 i686 i686 i386 GNU/Linux How reproducible: Happens randomly Steps to Reproduce: 1. See details in this thread http://lkml.org/lkml/2008/3/30/123 2. 3. Actual results: Expected results: Additional info: