Bug 262481

Summary: something deeply wrong with gettimeofday
Product: [Fedora] Fedora
Reporter: Bill Nottingham <notting>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: medium
Version: rawhide
CC: adam, drepper, fedora, jakub, rvokal, tglx, valdis.kletnieks, zing
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: All   
Doc Type: Bug Fix
Last Closed: 2007-09-10 23:06:59 UTC
Bug Blocks: 235703    

Description Bill Nottingham 2007-08-29 05:12:20 UTC
Description of problem:

After 5 minutes of uptime, the system clock goes screwy.

When running a simple 'while /bin/true ; do date ; uptime ; sleep 2 ; done' loop:


Wed Aug 29 01:01:14 EDT 2007
 01:01:14 up 4 min,  4 users,  load average: 0.13, 0.40, 0.21
Wed Aug 29 01:01:16 EDT 2007
 01:01:16 up 4 min,  4 users,  load average: 0.13, 0.40, 0.21
Wed Aug 29 02:09:36 EDT 2007
 01:01:18 up 5 min,  4 users,  load average: 0.12, 0.39, 0.21
Wed Aug 29 02:09:38 EDT 2007
 01:01:20 up 5 min,  4 users,  load average: 0.12, 0.39, 0.21
Wed Aug 29 02:09:40 EDT 2007
 01:01:22 up 5 min,  4 users,  load average: 0.19, 0.40, 0.22

At the 5 minute mark, the time reported by date jumps forward by a little over
an hour, *even though the kernel's own view of the time (as shown by uptime)
appears to be unchanged*.

If you then try to reset the time (with 'date -s "01:04"'):

Wed Aug 29 02:11:06 EDT 2007
 01:02:48 up 6 min,  4 users,  load average: 0.99, 0.60, 0.30
Wed Aug 29 02:12:19 EDT 2007
 01:04:01 up 6 min,  4 users,  load average: 0.99, 0.60, 0.30
Wed Aug 29 02:12:21 EDT 2007
 01:04:03 up 6 min,  4 users,  load average: 0.99, 0.61, 0.30

the kernel's date (as seen by uptime) is correctly set, but the date as returned
by /bin/date remains wrong (although it changes by the same amount.)

Reproduced with multiple kernels; since it appears to be an issue reading the
time (and seeing the changelogs), pushing to glibc.

Version-Release number of selected component (if applicable):

2.6.90-13

Comment 1 Bill Nottingham 2007-08-29 13:51:26 UTC
2.6.90-11 appears to be behaving better in short testing.

Comment 2 Jakub Jelinek 2007-08-29 14:01:11 UTC
Likely a kernel bug then.  The change in 2.6.90-12/13 is just that for
gettimeofday it calls gettimeofday@@LINUX_2.6 in kernel's VDSO if available,
previously it would always call the vsyscall 0xffffffffff600000ul.
gettimeofday in glibc is just a wrapper around either of those, only when
that vsyscall or vdso call returns >= -4095UL (== error value), it instead
returns -1 and sets errno to -retval.
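
The error convention described in this comment can be modelled in a few lines. This is only a sketch of the check as worded above, not glibc source; the names wrap_result, MASK64 and MAX_ERRNO are made up for illustration.

```python
# Model of the ">= -4095UL" error check described above (not glibc code).
MASK64 = (1 << 64) - 1   # an unsigned long is 64 bits wide on x86_64
MAX_ERRNO = 4095


def wrap_result(raw):
    """Map a raw return value to (return value, errno or None)."""
    raw &= MASK64                        # view the value as unsigned long
    if raw >= (-MAX_ERRNO) & MASK64:     # same test as ">= -4095UL"
        return -1, (-raw) & MASK64       # errno is set to -retval
    return raw, None                     # anything else passes through


print(wrap_result(0))     # a successful call
print(wrap_result(-14))   # the call returned -14 (EFAULT on Linux)
```

Only the top 4095 values of the unsigned range are treated as errors; everything below that is handed back to the caller unchanged.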

Comment 3 Chuck Ebbert 2007-08-29 15:34:38 UTC
(In reply to comment #1)
> 2.6.90-11 appears to be behaving better in short testing.

Does booting the kernel with "vdso=0" fix things with the newer glibc?


Comment 4 Jakub Jelinek 2007-08-29 17:14:24 UTC
*** Bug 264301 has been marked as a duplicate of this bug. ***

Comment 5 Valdis Kletnieks 2007-08-29 17:32:42 UTC
Yes, vdso=0 works as a workaround.
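
To double-check that a machine really booted with the workaround, the kernel command line can be inspected. A minimal sketch follows; the helper names and the sample command line are hypothetical, and on a live system the text would come from /proc/cmdline.

```python
def boot_params(cmdline):
    """Split a kernel command line into a {name: value} dict."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value   # flags without "=" get an empty value
    return params


def vdso_disabled(cmdline):
    """True if the command line carries the vdso=0 workaround."""
    return boot_params(cmdline).get("vdso") == "0"


# Hypothetical command line, for illustration only.
sample = "ro root=/dev/sda1 vdso=0"
print(vdso_disabled(sample))
```

The same boot_params helper also shows whether a clocksource= override is in effect.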

Comment 6 Valdis Kletnieks 2007-08-29 17:34:47 UTC
Forgot to add - if it's a kernel bug, it's probably not confined to Fedora
kernels - I got bit on both 2.6.22-rc6-mm1 and 2.6.23-rc3-mm1.  I haven't
checked a vanilla Linus kernel yet.

Comment 7 Chuck Ebbert 2007-08-29 17:56:41 UTC
Almost certainly caused by the x86_64 vdso patch -- either that or the new glibc
vdso support for x86_64 is broken.


Comment 8 Ulrich Drepper 2007-08-29 19:02:00 UTC
(In reply to comment #7)
> either that or the new glibc vdso support for x86_64 is broken.

All glibc does is jump to the provided address.  Not much room to make mistakes.


Comment 9 Dave Jones 2007-08-29 19:11:46 UTC
I wonder if this started happening when we added the 64bit tickless patches
(which afaik are in -mm, which explains why Valdis saw it there).

Valdis, can you check if it reproduces on Linus' tree ?

thanks.

Comment 10 Thomas Gleixner 2007-08-29 19:26:17 UTC
The tickless patches are not changing the VDSO stuff.

Thanks,
   tglx


Comment 11 Valdis Kletnieks 2007-08-29 19:34:39 UTC
23-rc3-mm1 doesn't have the x86_64 tickless code, Andrew dropped it for the nonce.

I'll replicate against a Linus -rc3/-rc4/-git later tonight and see what shakes out.

Comment 12 Valdis Kletnieks 2007-08-30 14:13:34 UTC
I took a Linus 2.6.22 tarball, applied 2.6.23-rc3 to it, built it - and the
problem is there too.  So whatever the issue is, it's in mainline kernels as
well as the -mm and Fedora kernels.

I found this quote from #2 interesting:

> gettimeofday in glibc is just a wrapper around either of those, only when
> that vsyscall or vdso call returns >= -4095UL (== error value), it instead
> returns -1 and sets errno to -retval.

mostly because the time offset is just about 4095/4096 seconds....
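
The size of the offset can be checked against the output pasted in the description. The sketch below takes the time shown by date and the time shown by uptime for a few of those samples and subtracts them; the names SAMPLES and offset_seconds are illustrative only.

```python
from datetime import datetime

# (time printed by date, time printed by uptime), copied from the description.
SAMPLES = [
    ("01:01:16", "01:01:16"),   # last sample before the jump
    ("02:09:36", "01:01:18"),
    ("02:09:40", "01:01:22"),
    ("02:11:06", "01:02:48"),   # second listing, first sample
    ("02:12:19", "01:04:01"),   # second listing, after the uptime clock moved
]


def offset_seconds(date_clock, uptime_clock):
    """Seconds by which the date output runs ahead of the uptime output."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(date_clock, fmt) - datetime.strptime(uptime_clock, fmt)
    return int(delta.total_seconds())


offsets = [offset_seconds(d, u) for d, u in SAMPLES]
print(offsets)   # [0, 4098, 4098, 4098, 4098]
```

Both commands print whole seconds, so the 4098 figure is only good to about a second either way. It stays constant across the 'date -s' reset and sits within a couple of seconds of 4096.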

Comment 13 Chuck Ebbert 2007-08-31 17:30:24 UTC
Has anyone tried to see if this problem happens with different clock sources?

Comment 14 Chuck Ebbert 2007-09-07 19:08:19 UTC
How to change your clocksource:

Look at /sys/devices/system/clocksource/clocksource0/current_clocksource
and available_clocksource. Pick something different from available_clocksource
and add a kernel boot parameter using that:

    clocksource=<whatever>
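
The lookup described above can be scripted. This is a rough sketch that reads the two sysfs files named in this comment and prints a boot parameter for each alternative; the helper names are made up, and the sysfs read is skipped when the files are not present.

```python
import os

SYSFS_DIR = "/sys/devices/system/clocksource/clocksource0"


def alternatives(current_text, available_text):
    """Return (current, other available clocksources) from the file bodies."""
    current = current_text.strip()
    others = [name for name in available_text.split() if name != current]
    return current, others


def boot_parameter(name):
    """Format the kernel boot parameter that selects clocksource `name`."""
    return "clocksource=" + name


def read_file(name):
    with open(os.path.join(SYSFS_DIR, name)) as f:
        return f.read()


# Only touch sysfs when the files exist on this machine.
if os.path.exists(os.path.join(SYSFS_DIR, "current_clocksource")):
    current, others = alternatives(read_file("current_clocksource"),
                                   read_file("available_clocksource"))
    print("current:", current)
    for name in others:
        print("try:", boot_parameter(name))
```

On a machine that defaults to hpet, one would expect this to suggest entries such as clocksource=acpi_pm, depending on what available_clocksource lists.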



Comment 15 Thorsten Leemhuis 2007-09-07 20:27:50 UTC
booted with "clocksource=acpi_pm" and problem vanished; seems the CPU enters C3
again now as well (according to powertop it did not do that with recent rawhide
kernels)

Default clocksource was "hpet" beforehand; I can try "jiffies" or "tsc" as well
if that is any help.

Comment 16 Chuck Ebbert 2007-09-07 20:46:54 UTC
What clocksource was it using originally? hpet?

Trying jiffies or tsc could be interesting, but will probably be flaky on SMP
and/or with cpufreq.


Comment 17 Thorsten Leemhuis 2007-09-07 20:57:15 UTC
(In reply to comment #16)
> What clocksource was it using originally? hpet?

yes, hpet.

> Trying jiffies or tsc could be interesting, but will probably be flaky on SMP
> and/or with cpufreq.

That's what I assumed.

Are there any hpet specific options I could try to narrow down the problem further?

Comment 18 Valdis Kletnieks 2007-09-09 00:21:15 UTC
Confirming - hpet clocksource causes the clock warp, but acpi_pm clocksource
works.  I wasn't brave enough to test jiffies or tsc, I'm on an x86_64 SMP. ;)

Comment 19 Valdis Kletnieks 2007-09-09 03:33:42 UTC
Over on the lkml thread on this subject, Andi Kleen pointed out that since the
vdso runs in ring 3, only the hpet and tsc clocksources are available - which
probably means that when you're using acpi_pm, it's forced into a different
codepath that avoids whatever bug we're seeing....
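
The reasoning in this comment boils down to a simple lookup. The sketch below only restates it: the set of vdso-readable clocksources comes from Andi Kleen's remark as relayed here, and the function name and path labels are invented for illustration, not taken from the kernel.

```python
# Clocksources the vdso can read from ring 3, per the remark relayed above.
VDSO_READABLE = {"hpet", "tsc"}


def time_path(clocksource):
    """Guess which code path serves gettimeofday for a given clocksource."""
    if clocksource in VDSO_READABLE:
        return "vdso fast path"
    return "fallback path"   # assumption: anything else leaves the vdso path


for name in ("hpet", "acpi_pm"):
    print(name, "->", time_path(name))
```

This matches the reports in the thread: hpet shows the warp, while acpi_pm does not.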

Comment 20 Valdis Kletnieks 2007-09-10 19:13:56 UTC
Andi Kleen posted a patch - Chuck Ebbert finally tracked this down.

http://lkml.org/lkml/2007/9/9/8 has the patch.

Congrats, Chuck! :)

Comment 21 Chuck Ebbert 2007-09-10 21:28:31 UTC
In kernel-2.6.23-0.171.rc5.git1



Comment 22 Chuck Ebbert 2007-09-10 23:06:59 UTC
Private build of kernel 0.171 works here where previous kernel failed, closing
as fixed.