863533 – problems with very short polling intervals

Bug 863533 - problems with very short polling intervals

Summary: problems with very short polling intervals

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	chrony
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Miroslav Lichvar
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	904515 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2012-10-05 18:15 UTC by Endre "Hrebicek" Balint-Nagy
Modified:	2014-09-24 01:29 UTC (History)
CC List:	2 users (show)
Fixed In Version:	chrony-1.30-0.1.pre1.fc21
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-06-09 13:49:50 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
when chronyd is cheated by the kernel (150.47 KB, image/png) 2013-03-08 05:26 UTC, Endre "Hrebicek" Balint-Nagy	no flags	Details
The control graph (smokeping from 192.168.1.3 to 192.168.1.5 (45.50 KB, image/png) 2013-03-08 05:31 UTC, Endre "Hrebicek" Balint-Nagy	no flags	Details
View All

Description Endre "Hrebicek" Balint-Nagy 2012-10-05 18:15:20 UTC

Description of problem:
Especially while communicating with peers (almost indifferent if chrony or ntpd)
the estimate of RTT is awfully skewed. (upto 512 seconds instead of 250-450 usec measured by smokeping with NTP sized packets.)
Version-Release number of selected component (if applicable):

chrony-1.27-0.4.pre1.el6.x86_64, but I dont think it is different with other distro versions. The issue is out of scope for RHEL.

How reproducible:
Configure peering with one or more peers on the same LAN.
(My test network is an RHEL6.3 destop Fujitsu-Siemens Esprimo P,
an i386 ACER aspire notebook running debian/wheezy, a Raspberry Pi (armv6) running raspbian - also debian/wheezy, chrony-1.27-pre1 on all boxes) 

It is almost allways badly skewed on my LAN.
Steps to Reproduce:
1. Configure NTP peers on the same LAN.
2. Observe the simptom.
3.
  
Actual results:


Expected results:


Additional info:
For now it is a low key issue but can make a bad opinion on chrony. With this issue open I can not propose chrony for public NTP service.

Comment 1 Miroslav Lichvar 2012-10-08 11:37:15 UTC

Can you please attach the chrony's measurements log and a "tcpdump port 123" output when this happened?

Comment 2 Endre "Hrebicek" Balint-Nagy 2012-11-02 12:20:06 UTC

It looks like the wrong estimates come in when we are not synchronizing to the server/peer because of its higher startum, so I have to ignore them when processing the log files. Maybe once we have the wrong RTT estimate, we will ignore the server/peer in the future if a maxdelay option is in effect. Will investigate this particular case. Otherwise the estimate is irrelevant, but misleading, but in fact it is not a real bug.

Comment 3 Endre "Hrebicek" Balint-Nagy 2013-01-26 14:10:24 UTC

Ahoj Mirku!

Reviewed the issue again, the cause is that in some cases, suggested by the NTP rfc, we throw away the packet, and logging synthetic data (worst estimate) even when the packet contains useful information. Please never log synthetic data in primary data columns! (Up to you to open a column for those figures of course. I think the NTP standard does not applies to the log files, has only responsibility over on-wire data and application semantic but the logged data is not covered - thus should not misguide me. The RTT value is still fine when the offset is unusable.)

Comment 4 Endre "Hrebicek" Balint-Nagy 2013-03-08 05:26:29 UTC

Created attachment 706917 [details]
when chronyd is cheated by the kernel

Sometimes chronyd gets misleaded by the kernel. I checked with strace, the timestamps are taken from SO_TIMESTAMP icmp messages.
Observe the 8ms offset from 7am to 11am, then the gap caused by a reboot.
(Blue line is the RTT.) A 8ms offset without a sustained asymmetric. I checked on the smokeping graph on the other side.

Comment 5 Endre "Hrebicek" Balint-Nagy 2013-03-08 05:31:09 UTC

Created attachment 706918 [details]
The control graph (smokeping from 192.168.1.3 to 192.168.1.5

The control smokeping graph not showing irregularities from 7am-11am.

Comment 6 Endre "Hrebicek" Balint-Nagy 2013-03-08 05:32:16 UTC

According to the evidence in the last 2 png files readdressing to the kernel.

Comment 7 Endre "Hrebicek" Balint-Nagy 2013-03-08 05:46:32 UTC

The offending host ran 3.8.1-fc18.x86_64 kernel and chronyd was misleaded by 8+ msec on a 100Mbit Ethernet link, where normal rtt is between 400 and 500 microsecs. Making a better than 1ms clock synchronization - which is my first goal - is spoiled by this issue.
Will make some reproducer soon.
(Stepping the clock by a dozen millisecs will do, I hope.)

I observed similar issues with 3.0.0-el7 upto the current fedora kernels.
(The pre 3.0.0 kernels do the tsc calibration poorly for my uses, so no observations.)
[bne@bandi chrony]$ uname -r; ifconfig em1
3.8.1-201.fc18.x86_64
em1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.5  netmask 255.255.255.0  broadcast 192.168.1.255

Comment 8 Endre "Hrebicek" Balint-Nagy 2013-03-08 05:53:14 UTC

Taking the QA role on myself as I am deeply enough involved already and have wasted a lot of time and sweat on this. (I think it is unusual a bit, but my job title is kernel QE associate, so I am on my own field when messing in.)

Comment 9 Endre "Hrebicek" Balint-Nagy 2013-03-08 06:10:12 UTC

The first PNG file is the network latency seen by chonyd running on 192.168.1.5 (Fujitsu-Siemens ESPRIMO-P, Fedora18), the second one is the smokepings view from 192.168.1.3, NTP sized (90 byte on wire) packets running debian squezee on an acer aspire notebook. The link was 100Mbit/sec Ethernet. 
bne@acer:~$ cat /etc/debian_version ; uname -r
7.0
3.7.4-acer
 .
Chronyd was using SO_TIMESTAMP socket option to request timing information from the kernel.

Further details and evidence and a reproducer follows soon.

Ondra

Comment 10 Endre "Hrebicek" Balint-Nagy 2013-03-08 06:13:43 UTC

read Comment 4 as ... ceased by a reboot.

Comment 11 Dave Jones 2013-03-08 20:37:28 UTC

you're going to get a faster resolution of this by reporting it directly upstream to the networking maintainers at netdev.org

Comment 12 Endre "Hrebicek" Balint-Nagy 2013-03-14 19:05:07 UTC

I see no evidence pointing towards a kernel issue, so moving back to chrony.

Comment 13 Endre "Hrebicek" Balint-Nagy 2013-03-14 19:09:16 UTC

Use of minpoll 0 and maxdelay on peers seem triggering the issue.
Removing both remove the issue coming up.
minpoll 3 is quite enough, I will check with maxdelay + minpoll 3.

Comment 14 Endre "Hrebicek" Balint-Nagy 2013-03-14 19:16:39 UTC

Undoing in the end the relation to the kernel permanently.

Comment 15 Miroslav Lichvar 2013-04-30 16:59:51 UTC

Do you still see the problem? If it happens only when minpoll/maxpoll is set to 0, it could be an instability in the clock discipline.

Is it reproducible when minpoll/maxpoll is set to at least 2? Is it reproducible when the servers specified in the config file have the noselect flag?

Please attach the chrony's tracking and measurements logs when that happens.

Comment 16 Endre "Hrebicek" Balint-Nagy 2013-05-03 03:35:57 UTC

with minpoll 0 maxpoll 2 peers it works, will go down to 0 1 then 0 0.

Comment 17 Endre "Hrebicek" Balint-Nagy 2013-05-04 06:48:14 UTC

I have to refine my answer. The maxpoll 2 works sometimes and even then short-living, but the minpoll 3 is the safe version.

Comment 18 Miroslav Lichvar 2013-05-06 10:06:13 UTC

Thanks for the data. Can you please try also adding noselect to the servers in the config and maxpoll 0? This will effectively disable the clock discipline (the tracking output should show 0.0.0.0) and should confirm the problem is there.

Comment 19 Endre "Hrebicek" Balint-Nagy 2013-05-07 09:59:41 UTC

Yes, comment#7 in  https://bugzilla.redhat.com/show_bug.cgi?id=904515#c7 relates to this bug.

Comment 20 Miroslav Lichvar 2013-06-21 11:38:44 UTC

*** Bug 904515 has been marked as a duplicate of this bug. ***

Comment 21 Fedora End Of Life 2013-12-21 09:03:26 UTC

This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 22 Miroslav Lichvar 2014-05-20 15:25:25 UTC

The offset control in the Linux driver was rewritten in the upstream git and it now allows stable clock control even with very short update intervals. This will be in chrony-1.30.

Note You need to log in before you can comment on or make changes to this bug.