Hide Forgot
Description of problem: We are seeing that calls to gettimeofday() will sometimes jump about 4,398 seconds into the future, while successive calls will return a "normal" time. Note the following captured via strace: gettimeofday({1213898402, 539681}, NULL) = 0 select(1024, [4 5 6], NULL, NULL, {31, 498915}) = 0 (Timeout) gettimeofday({1213902832, 86557}, NULL) = 0 gettimeofday({1213898434, 40076}, NULL) = 0 getppid() = 1 gettimeofday({1213898434, 40135}, NULL) = 0 The second call to gettimeofday above returns a value of 1213902832, while the subsequent call returns a value of 1213898434. We first started noticing this on a server on which we installed RHEL 5.1. This appears to be a known issue in the Linux Kernel Mailing list, refer to this thread: http://lkml.org/lkml/2008/3/30/123 I cannot determine if this is fixed in a RedHat product or bug release. This has a high impact on our production systems as it can cause timers to not return in a reasonable period of time. Version-Release number of selected component (if applicable): We are running RHEL 5.1, with the following kernel: Linux ritma01903 2.6.18-53.el5PAE #1 SMP Wed Oct 10 16:48:18 EDT 2007 i686 i686 i386 GNU/Linux How reproducible: Happens randomly Steps to Reproduce: 1. See details in this thread http://lkml.org/lkml/2008/3/30/123 2. 3. Actual results: Expected results: Additional info:
Created attachment 309895 [details] Attached is a c program which reproduces the issue (tooj.c) tooj.c (attached as plain text file) will reproduce the issue (tooj is Time out of Joint). compile: gcc -o tooj tooj.c Run tooj: nohup ./tooj 1 2 d > tooj.out 2>&1 & Wait for tooj to exit. Output looks like this: 15746 tries, current time 1213920125.618630, last time 1213920124.617435 15746 tries, current time 1213920125.618672, last time 1213920125.618630 15747 tries, current time 1213920126.619806, last time 1213920125.618672 15747 tries, current time 1213920126.619808, last time 1213920126.619806 15748 tries, current time 1213920127.620980, last time 1213920126.619808 15748 tries, current time 1213920127.620981, last time 1213920127.620980 15749 tries, current time 1213920128.622184, last time 1213920127.620981 15749 tries, current time 1213920128.622186, last time 1213920128.622184 15750 tries, current time 1213920129.623360, last time 1213920128.622186 15750 tries, current time 1213920129.623362, last time 1213920129.623360 15751 tries, current time 1213920130.625604, last time 1213920129.623362 15751 tries, current time 1213920130.625606, last time 1213920130.625604 15752 tries, current time 1213920131.627833, last time 1213920130.625606 15752 tries, current time 1213920131.627834, last time 1213920131.627833 15753 tries, current time 1213924530.674522, last time 1213920131.627834 time out of joint! 15753 tries, current time 1213920132.628019, last time 1213924530.674522
Hi eaversa, Could you send me details about the hardware you're using? What is the vendor and model, and what type of processors are in the box? Thanks, P.
Server is a Dell 1950, has two physical cores, each physical CPU is quad core: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU X5460 @ 3.16GHz stepping : 6 cpu MHz : 3159.003 cache size : 6144 KB physical id : 1 siblings : 4 core id : 3 cpu cores : 4 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm bogomips : 6317.53
Not sure from your comment whether you read the email thread from lkml so I am posting it again here: http://lkml.org/lkml/2008/3/30/123 This post below seems to match what we are observing: We've been seeing some strange behaviour on some of our applications > recently. I've tracked this down to gettimeofday() returning spurious > values occasionally. > > Specifically, gettimeofday() will suddenly, for a single call, return > a value about 4398 seconds (~1 hour 13 minutes) in the future. The > following call goes back to a normal value. I have also seen this. > This seems to be occurring when the clock source goes slightly > backwards for a single call. In > kernel/time/timekeeping.c:__get_nsec_offset(), we have this: > cycle_delta = (cycle_now - clock->cycle_last) & clock->mask; > > So a small decrease in time here will (this is all unsigned > arithmetic) give us a very large cycle_delta. cyc2ns() then multiplies > this by some value, then right shifts by 22. The resulting value (in > nanoseconds) is approximately 4398 seconds; this gets added on to the > xtime value, giving us our jump into the future. The next call to > gettimeofday() returns to normal as we don't have this huge nanosecond > offset.
(In reply to comment #4) > Not sure from your comment whether you read the email thread from lkml so I am > posting it again here: > > http://lkml.org/lkml/2008/3/30/123 > Yup, I read it -- I was just looking for HW so I can reproduce and test the fix ;) P.
So I take it that the code is not patched in RHEL 5.2 either?
Nope, the above patch is NOT in RHEL 5.2. P.
eaversa, There have been a few patches submitted, but not yet committed, to the RHEL5 tree that might also be responsible for your problem. Can I put together a patch (or if you prefer I can build a kernel) for you to test? P.
Prarit, Not quite clear on what you mean. You don't mean patches not yet committed which might be responsible for the problem- you mean patches which might resolve a problem? I would be happy run a kernel to test this fix out, however in the interests of running a controlled test, I would prefer if you build the kernel. In order to understand of what it is that I am taking delivery I would like to see the patches. -Eric
Any update on the kernel that I am to run?
Eric, Please be patient. I'm testing with your test code on a Dell box ATM and will send you patches and kernel in the next few days. Thanks, P.
Eric, It seems the last commit to the RHEL5 tree may have fixed your problem. Could you please test with http://people.redhat.com/dzickus/el5/94.el5/ and let me know the result? Thanks, P.
Thanks, I will try it out.
Eric, Have you had a chance to try out the new kernel? No rush, P.
We have been trying it out. Seems to be better. I want to get back to you after we have done more testing. -Eric
Hi Eric, Sure no problem -- mind if I leave this in NEEDINFO until you get back to us? That will make my bug tracking a bit easier :) I'm almost 100% sure (given the bug description) that the patch I mentioned fixes your problem. Please be aware, however, there are some other known issues in RHEL5. Most notably an annoying 'off by 1 usec' error. Patches to fix some of these issues have been submitted to the source tree but have not yet been committed. P.
Hi Eric -- have you had a chance to test this? P.
We have moved to 2.6.18-92.1.6.el5PAE which appears to contain a TSC fix as per RedHat release notes. This appears to resolve the problem. As far as I am concerned you can close this as fixed. Regards, Eric
*** This bug has been marked as a duplicate of 443435 ***