Bug 262481

Summary: something deeply wrong with gettimeofday
Product: [Fedora] Fedora
Reporter: Bill Nottingham <notting>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: medium
Version: rawhide
CC: adam, drepper.fsp, fedora, jakub, rvokal, tglx, valdis.kletnieks, zing
Hardware: x86_64
OS: All
Doc Type: Bug Fix
Last Closed: 2007-09-10 19:06:59 EDT
Bug Blocks: 235703

Description Bill Nottingham 2007-08-29 01:12:20 EDT
Description of problem:

After 5 minutes of uptime, the system clock goes screwy.

When running a simple 'while /bin/true ; date ; uptime ; sleep 2 ; done' loop:


Wed Aug 29 01:01:14 EDT 2007
 01:01:14 up 4 min,  4 users,  load average: 0.13, 0.40, 0.21
Wed Aug 29 01:01:16 EDT 2007
 01:01:16 up 4 min,  4 users,  load average: 0.13, 0.40, 0.21
Wed Aug 29 02:09:36 EDT 2007
 01:01:18 up 5 min,  4 users,  load average: 0.12, 0.39, 0.21
Wed Aug 29 02:09:38 EDT 2007
 01:01:20 up 5 min,  4 users,  load average: 0.12, 0.39, 0.21
Wed Aug 29 02:09:40 EDT 2007
 01:01:22 up 5 min,  4 users,  load average: 0.19, 0.40, 0.22

At the 5 minute mark, the time reported by date jumps forward by a little over
an hour (1 h 8 min 18 s), *even though the kernel's own idea of the time, as
shown by uptime, appears to be unchanged*.
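
A rough way to pin down when and by how much the two clocks part ways, using only the timestamps pasted above. This is a sketch, not a diagnostic tool: the helper names (`seconds`, `find_jumps`) and the one-second tolerance are made up for illustration, and it assumes all samples fall on the same day.

```python
from datetime import datetime

# (time printed by date, time printed by uptime) pairs copied from the loop output
SAMPLES = [
    ("01:01:14", "01:01:14"),
    ("01:01:16", "01:01:16"),
    ("02:09:36", "01:01:18"),
    ("02:09:38", "01:01:20"),
    ("02:09:40", "01:01:22"),
]


def seconds(hms):
    """Convert an HH:MM:SS string to seconds since midnight."""
    t = datetime.strptime(hms, "%H:%M:%S")
    return t.hour * 3600 + t.minute * 60 + t.second


def find_jumps(samples, tolerance=1):
    """Return (sample index, change in offset) wherever the offset between
    the two clocks changes by more than `tolerance` seconds."""
    jumps = []
    prev_offset = None
    for i, (wall, ref) in enumerate(samples):
        offset = seconds(wall) - seconds(ref)
        if prev_offset is not None and abs(offset - prev_offset) > tolerance:
            jumps.append((i, offset - prev_offset))
        prev_offset = offset
    return jumps


print(find_jumps(SAMPLES))  # [(2, 4098)]
```

For the samples above, the two clocks agree for the first two readings and are 4098 seconds apart from the third reading on.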

If you then try to reset the time (with 'date -s "01:04"'):

Wed Aug 29 02:11:06 EDT 2007
 01:02:48 up 6 min,  4 users,  load average: 0.99, 0.60, 0.30
Wed Aug 29 02:12:19 EDT 2007
 01:04:01 up 6 min,  4 users,  load average: 0.99, 0.60, 0.30
Wed Aug 29 02:12:21 EDT 2007
 01:04:03 up 6 min,  4 users,  load average: 0.99, 0.61, 0.30

the kernel's date (as seen by uptime) is correctly set, but the date as returned
by /bin/date remains wrong (although it changes by the same amount.)

Reproduced with multiple kernels; since it appears to be an issue reading the
time (and going by the changelogs), pushing to glibc.

Version-Release number of selected component (if applicable):

2.6.90-13
Comment 1 Bill Nottingham 2007-08-29 09:51:26 EDT
2.6.90-11 appears to be behaving better in short testing.
Comment 2 Jakub Jelinek 2007-08-29 10:01:11 EDT
Likely a kernel bug then.  The change in 2.6.90-12/13 is just that for
gettimeofday it calls gettimeofday@@LINUX_2.6 in kernel's VDSO if available,
previously it would always call the vsyscall 0xffffffffff600000ul.
gettimeofday in glibc is just a wrapper around either of those, only when
that vsyscall or vdso call returns >= -4095UL (== error value), it instead
returns -1 and sets errno to -retval.
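
To make the return-value convention described here concrete, a small model of it follows. It is a sketch of the rule as stated in this comment only, not glibc source: the function name `wrap_raw_return` and the use of a plain integer for the 64-bit register value are assumptions for illustration.

```python
MAX_ERRNO = 4095
WORD = 1 << 64  # values are treated as unsigned 64-bit, as on x86_64


def wrap_raw_return(raw):
    """Model of the wrapper rule: a raw return value >= -4095UL is an error.

    Returns (result, errno). On error the result is -1 and errno is the
    negated raw value; otherwise the raw value is passed through and errno
    is reported as 0 here (a real wrapper would leave errno untouched).
    """
    raw %= WORD                   # normalise to unsigned 64-bit
    if raw >= WORD - MAX_ERRNO:   # same test as "raw >= -4095UL"
        return -1, WORD - raw     # errno = -retval
    return raw, 0


print(wrap_raw_return(0))          # (0, 0)
print(wrap_raw_return(WORD - 14))  # (-1, 14)
```

Note that only the top 4095 values of the range are treated as errors, so `WORD - 4096` is passed through unchanged.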
Comment 3 Chuck Ebbert 2007-08-29 11:34:38 EDT
(In reply to comment #1)
> 2.6.90-11 appears to be behaving better in short testing.

Does booting the kernel with "vdso=0" fix things with the newer glibc?
Comment 4 Jakub Jelinek 2007-08-29 13:14:24 EDT
*** Bug 264301 has been marked as a duplicate of this bug. ***
Comment 5 Valdis Kletnieks 2007-08-29 13:32:42 EDT
Yes, vdso=0 works as a workaround.
Comment 6 Valdis Kletnieks 2007-08-29 13:34:47 EDT
Forgot to add - if it's a kernel bug, it's probably not confined to Fedora
kernels - I got bit on both 2.6.22-rc6-mm1 and 2.6.23-rc3-mm1.  I haven't
checked a vanilla Linus kernel yet.
Comment 7 Chuck Ebbert 2007-08-29 13:56:41 EDT
Almost certainly caused by the x86_64 vdso patch -- either that or the new glibc
vdso support for x86_64 is broken.
Comment 8 Ulrich Drepper 2007-08-29 15:02:00 EDT
(In reply to comment #7)
> either that or the new glibc vdso support for x86_64 is broken.

All glibc does is jump to the provided address.  Not much room to make mistakes.
Comment 9 Dave Jones 2007-08-29 15:11:46 EDT
I wonder if this started happening when we added the 64bit tickless patches
(which afaik are in -mm, which explains why Valdis saw it there).

Valdis, can you check if it reproduces on Linus' tree?

thanks.
Comment 10 Thomas Gleixner 2007-08-29 15:26:17 EDT
The tickless patches are not changing the VDSO stuff.

Thanks,
   tglx
Comment 11 Valdis Kletnieks 2007-08-29 15:34:39 EDT
23-rc3-mm1 doesn't have the x86_64 tickless code, Andrew dropped it for the nonce.

I'll replicate against a Linus -rc3/-rc4/-git later tonight and see what shakes out.
Comment 12 Valdis Kletnieks 2007-08-30 10:13:34 EDT
I took a Linus 2.6.22 tarball, applied 2.6.23-rc3 to it, built it - and the
problem is there too.  So whatever the issue is, it's in mainline kernels as
well as the -mm and Fedora kernels.

I found this quote from comment #2 interesting:

> gettimeofday in glibc is just a wrapper around either of those, only when
> that vsyscall or vdso call returns >= -4095UL (== error value), it instead
> returns -1 and sets errno to -retval.

mostly because the time offset is just about 4095/4096 seconds....
Comment 13 Chuck Ebbert 2007-08-31 13:30:24 EDT
Has anyone tried to see if this problem happens with different clock sources?
Comment 14 Chuck Ebbert 2007-09-07 15:08:19 EDT
How to change your clocksource:

Look at /sys/devices/system/clocksource/clocksource0/current_clocksource
and available_clocksource. Pick something different from available_clocksource
and add a kernel boot parameter using that:

    clocksource=<whatever>
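
The steps above could be scripted roughly as follows. This is only a sketch: the two sysfs file names come from this comment, while the helper names (`alternative_clocksources`, `boot_parameters`) are made up for illustration, and it assumes available_clocksource is a whitespace-separated list on a single line.

```python
from pathlib import Path

SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def alternative_clocksources(current, available):
    """Return the names in `available` that differ from `current`."""
    cur = current.strip()
    return [name for name in available.split() if name != cur]


def boot_parameters(current, available):
    """Build one candidate kernel boot parameter per alternative clocksource."""
    return ["clocksource=" + name
            for name in alternative_clocksources(current, available)]


if __name__ == "__main__":
    cur_file = SYSFS_DIR / "current_clocksource"
    avail_file = SYSFS_DIR / "available_clocksource"
    if cur_file.exists() and avail_file.exists():
        for param in boot_parameters(cur_file.read_text(), avail_file.read_text()):
            print(param)
    else:
        print("clocksource sysfs files not found on this system")
```

It only prints the candidate boot parameters; adding one to the kernel command line is still a manual step.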

Comment 15 Thorsten Leemhuis 2007-09-07 16:27:50 EDT
booted with "clocksource=acpi_pm" and problem vanished; seems the CPU enters C3
again now as well (according to powertop it did not do that with recent rawhide
kernels)

Default clocksource was "hpet" beforehand; I can try "jiffies" or "tsc" as well
if that is any help.
Comment 16 Chuck Ebbert 2007-09-07 16:46:54 EDT
What clocksource was it using originally? hpet?

Trying jiffies or tsc could be interesting, but will probably be flaky on SMP
and/or with cpufreq.
Comment 17 Thorsten Leemhuis 2007-09-07 16:57:15 EDT
(In reply to comment #16)
> What clocksource was it using originally? hpet?

yes, hpet.

> Trying jiffies or tsc could be interesting, but will probably be flaky on SMP
> and/or with cpufreq.

That's what I assumed.

Are there any hpet specific options I could try to narrow down the problem further?
Comment 18 Valdis Kletnieks 2007-09-08 20:21:15 EDT
Confirming - hpet clocksource causes the clock warp, but acpi_pm clocksource
works.  I wasn't brave enough to test jiffies or tsc, I'm on an x86_64 SMP. ;)
Comment 19 Valdis Kletnieks 2007-09-08 23:33:42 EDT
Over on the lkml thread on this subject, Andi Kleen pointed out that since the
vdso runs in ring 3, only the hpet and tsc clocksources are available to it -
which probably means that when you're using acpi_pm, it's forced into a
different codepath that avoids whatever bug we're seeing....
Comment 20 Valdis Kletnieks 2007-09-10 15:13:56 EDT
Andi Kleen posted a patch - Chuck Ebbert finally tracked this down.

http://lkml.org/lkml/2007/9/9/8 has the patch.

Congrats, Chuck! :)
Comment 21 Chuck Ebbert 2007-09-10 17:28:31 EDT
In kernel-2.6.23-0.171.rc5.git1

Comment 22 Chuck Ebbert 2007-09-10 19:06:59 EDT
Private build of kernel 0.171 works here where previous kernel failed, closing
as fixed.