Red Hat Bugzilla – Bug 863533
problems with very short polling intervals
Last modified: 2014-09-23 21:29:57 EDT
Description of problem:
Especially while communicating with peers (almost indifferent if chrony or ntpd)
the estimate of RTT is awfully skewed. (upto 512 seconds instead of 250-450 usec measured by smokeping with NTP sized packets.)
Version-Release number of selected component (if applicable):
chrony-1.27-0.4.pre1.el6.x86_64, but I dont think it is different with other distro versions. The issue is out of scope for RHEL.
Configure peering with one or more peers on the same LAN.
(My test network is an RHEL6.3 destop Fujitsu-Siemens Esprimo P,
an i386 ACER aspire notebook running debian/wheezy, a Raspberry Pi (armv6) running raspbian - also debian/wheezy, chrony-1.27-pre1 on all boxes)
It is almost allways badly skewed on my LAN.
Steps to Reproduce:
1. Configure NTP peers on the same LAN.
2. Observe the simptom.
For now it is a low key issue but can make a bad opinion on chrony. With this issue open I can not propose chrony for public NTP service.
Can you please attach the chrony's measurements log and a "tcpdump port 123" output when this happened?
It looks like the wrong estimates come in when we are not synchronizing to the server/peer because of its higher startum, so I have to ignore them when processing the log files. Maybe once we have the wrong RTT estimate, we will ignore the server/peer in the future if a maxdelay option is in effect. Will investigate this particular case. Otherwise the estimate is irrelevant, but misleading, but in fact it is not a real bug.
Reviewed the issue again, the cause is that in some cases, suggested by the NTP rfc, we throw away the packet, and logging synthetic data (worst estimate) even when the packet contains useful information. Please never log synthetic data in primary data columns! (Up to you to open a column for those figures of course. I think the NTP standard does not applies to the log files, has only responsibility over on-wire data and application semantic but the logged data is not covered - thus should not misguide me. The RTT value is still fine when the offset is unusable.)
Created attachment 706917 [details]
when chronyd is cheated by the kernel
Sometimes chronyd gets misleaded by the kernel. I checked with strace, the timestamps are taken from SO_TIMESTAMP icmp messages.
Observe the 8ms offset from 7am to 11am, then the gap caused by a reboot.
(Blue line is the RTT.) A 8ms offset without a sustained asymmetric. I checked on the smokeping graph on the other side.
Created attachment 706918 [details]
The control graph (smokeping from 192.168.1.3 to 192.168.1.5
The control smokeping graph not showing irregularities from 7am-11am.
According to the evidence in the last 2 png files readdressing to the kernel.
The offending host ran 3.8.1-fc18.x86_64 kernel and chronyd was misleaded by 8+ msec on a 100Mbit Ethernet link, where normal rtt is between 400 and 500 microsecs. Making a better than 1ms clock synchronization - which is my first goal - is spoiled by this issue.
Will make some reproducer soon.
(Stepping the clock by a dozen millisecs will do, I hope.)
I observed similar issues with 3.0.0-el7 upto the current fedora kernels.
(The pre 3.0.0 kernels do the tsc calibration poorly for my uses, so no observations.)
[bne@bandi chrony]$ uname -r; ifconfig em1
em1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.5 netmask 255.255.255.0 broadcast 192.168.1.255
Taking the QA role on myself as I am deeply enough involved already and have wasted a lot of time and sweat on this. (I think it is unusual a bit, but my job title is kernel QE associate, so I am on my own field when messing in.)
The first PNG file is the network latency seen by chonyd running on 192.168.1.5 (Fujitsu-Siemens ESPRIMO-P, Fedora18), the second one is the smokepings view from 192.168.1.3, NTP sized (90 byte on wire) packets running debian squezee on an acer aspire notebook. The link was 100Mbit/sec Ethernet.
bne@acer:~$ cat /etc/debian_version ; uname -r
Chronyd was using SO_TIMESTAMP socket option to request timing information from the kernel.
Further details and evidence and a reproducer follows soon.
read Comment 4 as ... ceased by a reboot.
you're going to get a faster resolution of this by reporting it directly upstream to the networking maintainers at firstname.lastname@example.org
I see no evidence pointing towards a kernel issue, so moving back to chrony.
Use of minpoll 0 and maxdelay on peers seem triggering the issue.
Removing both remove the issue coming up.
minpoll 3 is quite enough, I will check with maxdelay + minpoll 3.
Undoing in the end the relation to the kernel permanently.
Do you still see the problem? If it happens only when minpoll/maxpoll is set to 0, it could be an instability in the clock discipline.
Is it reproducible when minpoll/maxpoll is set to at least 2? Is it reproducible when the servers specified in the config file have the noselect flag?
Please attach the chrony's tracking and measurements logs when that happens.
with minpoll 0 maxpoll 2 peers it works, will go down to 0 1 then 0 0.
I have to refine my answer. The maxpoll 2 works sometimes and even then short-living, but the minpoll 3 is the safe version.
Thanks for the data. Can you please try also adding noselect to the servers in the config and maxpoll 0? This will effectively disable the clock discipline (the tracking output should show 0.0.0.0) and should confirm the problem is there.
Yes, comment#7 in https://bugzilla.redhat.com/show_bug.cgi?id=904515#c7 relates to this bug.
*** Bug 904515 has been marked as a duplicate of this bug. ***
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '18'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 18's end of life.
Thank you for reporting this issue and we are sorry that we may not be
able to fix it before Fedora 18 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged change the 'version' to a later Fedora
version prior to Fedora 18's end of life.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
The offset control in the Linux driver was rewritten in the upstream git and it now allows stable clock control even with very short update intervals. This will be in chrony-1.30.