229938 – ntpd resetting time about +0.2 seconds every 45 minutes

Bug 229938 - ntpd resetting time about +0.2 seconds every 45 minutes

Summary: ntpd resetting time about +0.2 seconds every 45 minutes

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2007-02-24 18:24 UTC by Paul Dickson
Modified:	2008-05-07 01:13 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-07 01:13:33 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
The log entries in /var/log/messages AND lspci output (5.41 KB, text/plain) 2007-02-24 18:24 UTC, Paul Dickson	no flags	Details
dmesg (25.14 KB, text/plain) 2007-04-04 02:48 UTC, Ask Bjørn Hansen	no flags	Details
View All

Description Paul Dickson 2007-02-24 18:24:59 UTC

Description of problem:
For this new kernel, ntpd is resetting time every 45 minues on the order of +0.2
seconds.

Feb 24 06:09:56 white ntpd[2261]: time reset +0.186409 s

Version-Release number of selected component (if applicable):
kernel-2.6.20-1.2940.fc7

How reproducible:
Always

Steps to Reproduce:
1. Look at /var/log/messages
2.
3.
  
Actual results:
About every 45 minutes, there is a new ntpd log message because it needs to
advance the clock about +0.2 seconds

Expected results:
The clock should be accurate enough to avoid resetting the clock.

Additional info:

Comment 1 Paul Dickson 2007-02-24 18:24:59 UTC

Created attachment 148745 [details]
The log entries in /var/log/messages AND lspci output

Comment 2 Chuck Ebbert 2007-02-24 19:20:42 UTC

Was it accurate with older kernels?

If so, which ones?

Comment 3 Paul Dickson 2007-02-25 04:41:07 UTC

I've switched to kernel-2.6.20-1.2942.fc7.  While I'm still getting positive
resets on the order of +0.15 seconds, the period between the resets is getting
longer and longer today (~90, ~120, ~180).

Feb 24 14:48:02 white ntpd[2070]: time reset +0.142277 s
Feb 24 16:15:48 white ntpd[2070]: time reset +0.151547 s
Feb 24 18:20:50 white ntpd[2070]: time reset +0.158067 s
Feb 24 21:15:43 white ntpd[2070]: time reset +0.144398 s

My previous kernel before 2940 was 2.6.20-1.2932.fc7 (which I booted on Feb 17).
There were no time resets for that kernel.

It's possible that ntpd is recalibrating the kernel as of 2940.  Decreasing time
resets and now increasing periods seem to reflect that.

Comment 4 john stultz 2007-02-27 19:09:30 UTC

Hmm. The ~.15s adjustments are expected, as .15 is the slew boundary (w/ default
config, if you get a drift larger then .15s it will use settimeofday rather then
slew the clock). However the frequency at which your seeing these time resets is
quite unexpected.

What is the drift value on your machine? (check /etc/ntp/drift  or
/var/lib/ntp/ntp.drift).

Comment 5 Paul Dickson 2007-03-01 06:03:53 UTC

It's not always resetted in the same directions.  Times also seem random.

/var/log/messages:Feb 25 22:44:57 white ntpd[2070]: time reset -0.157131 s
/var/log/messages:Feb 26 01:03:02 white ntpd[2070]: time reset -0.147778 s
/var/log/messages:Feb 26 04:18:52 white ntpd[2070]: time reset -0.149802 s
/var/log/messages:Feb 27 15:08:18 white ntpd[2070]: time reset +0.154075 s
/var/log/messages:Feb 27 20:41:58 white ntpd[2070]: time reset +0.150290 s
/var/log/messages:Feb 28 04:27:39 white ntpd[2070]: time reset -0.168500 s
/var/log/messages:Feb 28 06:44:25 white ntpd[2070]: time reset -0.143982 s
/var/log/messages:Feb 28 09:58:56 white ntpd[2070]: time reset -0.134518 s
/var/log/messages:Feb 28 15:31:46 white ntpd[2070]: time reset -0.147648 s

[root@white 22:56:40 ~]# more /etc/ntp/drift
/etc/ntp/drift: No such file or directory
[root@white 22:58:33 ~]# more /var/lib/ntp/ntp.drift
/var/lib/ntp/ntp.drift: No such file or directory
[root@white 22:58:47 ~]# more /var/lib/ntp/drift
164.271

Comment 6 Ask Bjørn Hansen 2007-04-04 01:11:41 UTC

FWIW, since upgrading to 2.6.20 (the latest x86_64-xen FC6 version) I see dramatic instability in the 
timekeeping too.

(a couple of FreeBSD boxes on the same network with the same time servers are within a few ms of the 
correct time).

[root@app1 ~]# ntpq -c pe
     remote           refid      st t when poll reach   delay   offset  jitter
===============================================================
===============
xclock-a.develoo 204.123.2.72     2 u  577 1024  377    7.200  -190.34 597.043
+clock-b.develoo 164.67.62.194    2 u  315 1024  377    0.001  243.504 434.286
-gw.bur.sol      10.0.201.1       3 u  328 1024  377    0.001  169.540 372.136
xgw-a.la.sol     216.52.237.152   3 u  601 1024  377    0.001  582.896 768.883
xgw-b.la.sol     216.52.237.151   3 u  602 1024  377    0.001  586.233 768.665
x10.0.201.116    .PPS.            1 u  662 1024  377    0.001  466.949 183.192
+tick.ucla.edu   .PSC.            1 u  339 1024  377    0.001  269.785 460.487
*tick.usnogps.na .USNO.           1 u  838 1024  377    0.001  318.837 115.632

I saw the following come by on lkml, it might be related...  I haven't had time to try the patch.

	From: 	  tglx
	Subject: 	Re: [PATCH] Avoid time_offset overflows
	Date: 	March 25, 2007 4:07:52 AM PDT
	To: 	  akpm
	Cc: 	  zippel, johnstul.com, mingo, linux-kernel.org
	Reply-To: 	  tglx

On Sun, 2007-03-25 at 01:09 -0800, Andrew Morton wrote:
On Sat, 24 Mar 2007 06:20:45 +0100 (CET) Roman Zippel <zippel> wrote:

Hi,

On Fri, 23 Mar 2007, john stultz wrote:

@@ -314,8 +314,8 @@ #endif
 		    freq_adj += time_freq;
 		    freq_adj = min(freq_adj, (s64)MAXFREQ_NSEC);
 		    time_freq = max(freq_adj, (s64)-MAXFREQ_NSEC);
-		    time_offset = (time_offset / NTP_INTERVAL_FREQ)
-		    			<< SHIFT_UPDATE;
+		    do_div(time_offset, NTP_INTERVAL_FREQ);
+		    time_offset <<= SHIFT_UPDATE;
 		} /* STA_PLL */
 	    } /* txc->modes & ADJ_OFFSET */
 	    if (txc->modes & ADJ_TICK)

This is wrong, time_offset is signed and do_div is unsigned.
In general I planned to do the same change, but the do_div API could use a 
little cleanup to provide some clear function for signed/unsigned divide 
(hopefully with a better name than div_long_long_rem_signed or 
do_div_llr).


Can we do a minimal thing for 2.6.21, worry about API beautification later?

Here you go. It's ugly, but it should do the trick for now.

	tglx

Index: linux-2.6/kernel/time/ntp.c
===============================================================
====
--- linux-2.6.orig/kernel/time/ntp.c
+++ linux-2.6/kernel/time/ntp.c
@@ -196,7 +196,7 @@ void __attribute__ ((weak)) notify_arch_
  */
 int do_adjtimex(struct timex *txc)
 {
-	long mtemp, save_adjust;
+	long mtemp, save_adjust, rem;
 	s64 freq_adj, temp64;
 	int result;

@@ -314,7 +314,9 @@ int do_adjtimex(struct timex *txc)
 		    freq_adj += time_freq;
 		    freq_adj = min(freq_adj, (s64)MAXFREQ_NSEC);
 		    time_freq = max(freq_adj, (s64)-MAXFREQ_NSEC);
-		    do_div(time_offset, NTP_INTERVAL_FREQ);
+		    time_offset = div_long_long_rem_signed(time_offset,
+							   NTP_INTERVAL_FREQ,
+							   &rem);
 		    time_offset <<= SHIFT_UPDATE;
 		} /* STA_PLL */
 	    } /* txc->modes & ADJ_OFFSET */

Comment 7 john stultz 2007-04-04 02:29:00 UTC

Can you attach dmesg output?

Comment 8 Ask Bjørn Hansen 2007-04-04 02:49:00 UTC

Created attachment 151640 [details]
dmesg

I don't see anything in dmesg that looks related, but attached anyway.	I'll
try recompiling the kernel with the patch from above to check if it makes a
difference...

Comment 9 john stultz 2007-04-04 03:54:15 UTC

Hmm.. What was the last known working kernel version?

Also can you reproduce the issue without running xen?

Comment 10 Ask Bjørn Hansen 2007-04-04 19:18:09 UTC

It wasn't great in 2.6.19-1.2911.6.5.fc6xen either, but much better:

Linux db2.la.sol 2.6.19-1.2911.6.5.fc6xen #1 SMP Sun Mar 4 16:23:59 EST 2007 x86_64 x86_64 
x86_64 GNU/Linux
     remote           refid      st t when poll reach   delay   offset  jitter
===============================================================
===============
+clock-a.develoo 207.200.81.113   2 u  224  256  377    0.001  -91.769 189.461
 clock-b.develoo 10.0.201.116     2 u  218  512  377    0.001  -69.820 173.134
+gw.bur.sol      10.0.201.1       3 u   41  512  377    0.001  -89.930 170.287
+gw-a.la.sol     216.52.237.153   3 u   38  512  377    0.001  -81.617 161.867
+gw-b.la.sol     216.52.237.152   3 u  219  512  377    0.001  -89.302 158.438
*10.0.201.116    .PPS.            1 u  222  512  377    0.001  -85.575 160.003


Without xen in the latest kernel it's close to sane:

Linux app3.la.sol 2.6.20-1.2933.fc6 #1 SMP Mon Mar 19 11:00:19 EDT 2007 x86_64 x86_64 x86_64 
GNU/Linux
     remote           refid      st t when poll reach   delay   offset  jitter
===============================================================
===============
+clock-a.develoo 207.200.81.113   2 u   35   64  377    2.127   66.149  19.534
 clock-b.develoo 10.0.201.116     2 u   30   64  377    2.391   66.327   5.292
-gw.bur.sol      10.0.201.1       3 u   33   64  377   13.863   48.056   1.971
+gw-a.la.sol     216.52.237.153   3 u   34   64  377    0.764   65.999   1.241
-gw-b.la.sol     216.52.237.152   3 u   34   64  377    0.735   63.250   1.432
*10.0.201.116    .PPS.            1 u   25   64  377   13.197   66.117   7.241
 

And with Xen and the patch above it seems Pretty Good (not sure if that makes any sense)...

Linux app2.la.sol 2.6.20-1.2933solfo1xen #1 SMP Tue Apr 3 23:59:03 PDT 2007 x86_64 x86_64 
x86_64 GNU/Linux
     remote           refid      st t when poll reach   delay   offset  jitter
===============================================================
===============
+clock-a.develoo 207.200.81.113   2 u   16  256  377    0.001    2.134 462.272
+clock-b.develoo 10.0.201.116     2 u    4  256  377    0.001    8.422 470.036
+gw.bur.sol      10.0.201.1       3 u    3  256  377    0.001   -0.360 484.335
+gw-a.la.sol     216.52.237.153   3 u    5  256  377    0.001    7.211 467.613
+gw-b.la.sol     216.52.237.152   3 u   56  256  377    0.001   -9.638 446.548
*10.0.201.116    .PPS.            1 u  195  256  377    0.001   34.328 489.772

Comment 11 Paul Dickson 2007-06-27 23:57:32 UTC

FYI - The time resets that started this report got further and further apart. 
My current 5 log files (/var/log/message*) don't have them and I think they
stopped within a month of them starting.

Comment 12 Bug Zapper 2008-04-03 19:14:41 UTC

Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 13 Bug Zapper 2008-05-07 01:13:31 UTC

This bug has been in NEEDINFO for more than 30 days since feedback was
first requested. As a result we are closing it.

If you can reproduce this bug in the future against a maintained Fedora
version please feel free to reopen it against that version.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

Note You need to log in before you can comment on or make changes to this bug.