Description of problem: The NTP protocol leaves open the option to select an NTP server whose stratum equals our own, which would widen the set of available NTP references. Version-Release number of selected component (if applicable): new feature request, not yet tied to any milestone.
Please forget the original subject; it probably does not apply to chrony. The real problem is that when a peer restarts (probably only with the -r option), its packets always fail test 3, because they carry a zero origin timestamp. I'll attach the packet dump.
Created attachment 687788: tcpdump of a failed handshake
The 192.168.1.5 host runs Fedora 18 (chrony-1.27-git1ca844); the other runs Debian, also with chrony-1.27-git1ca844 (my build). The condition triggering this is still unclear. Most interesting is that 192.168.1.5 is now synchronizing from 192.168.1.3 (a Raspberry Pi, configured similarly, Raspbian with chrony), while 192.168.1.4 is problematic.
In a quick test with two peers and the -r option I didn't see it. Do you have steps to reproduce it?
My Raspberry Pi has dropped out of testing ATM; the SDHC card is over its write budget. The Acer laptop and the Fujitsu-Siemens Esprimo are still available, and they systematically never synchronize to each other. As I see, they were lacking the stratumweight 0 directive, so I added it. I will report on the behavior later; the missing stratumweight 0 may be the cause.
BTW, stratumweight 0 is the functional equivalent of cohort, as I understand the docs.
As I understand the problem, it mostly arises when one of the peers restarts (yes, I restart chronyd every now and then). The restarted peer has no better idea than to send a zero reference timestamp, which makes the other end not respond to it, and we are trapped. BTW, the reference implementation does not have this issue when talking to another reference implementation, while with chronyd it (maybe only sometimes) does. (Currently my Pi runs the reference implementation while I compile chrony again. The old copy died with the SDHC card.) This morning I am leaving Brno and will come back on the 7th. Cheers till the 7th!
Hi Mirek! I am back in Brno. When I remove the /var/{state,lib}/chrony/192.168.1.*.dat files, the peers recognize each other normally; otherwise we get stuck on the zero originate timestamp issue. I use the maxdelay option on every selectable server/peer directive and also the dumponexit directive. Ondras.
I see a possible solution: whenever we see a zero origin timestamp from a peer, denoting a peer restart, we have to drop the last one we sent, if it exists, so that the normal peer setup can follow.
Maybe adding an "and the port is UDP 123" clause to the previous sentence would leave room for a parallel ntpdate -u -d run without disturbing the peer.
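The proposal in the two comments above could be sketched roughly as follows. This is only an illustration in Python, not chrony's code (chrony is written in C); PeerState, handle_origin and the exact form of the checks are hypothetical names and assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

NTP_PORT = 123  # standard NTP UDP port


@dataclass
class PeerState:
    # Transmit timestamp of the last packet we sent to this peer (None if none).
    last_tx: Optional[int] = None


def handle_origin(state: PeerState, origin: int, src_port: int) -> bool:
    """Return True if the packet's origin timestamp matches what we last sent.

    Assumption: a zero origin timestamp means the peer has restarted and
    has not yet seen any packet from us.
    """
    if origin == 0:
        # Proposed behaviour: forget our stale timestamp, but only for
        # packets coming from the NTP port, so that a parallel
        # "ntpdate -u -d" run (unprivileged source port) changes nothing.
        if src_port == NTP_PORT and state.last_tx is not None:
            state.last_tx = None
        # A zero origin can never produce a valid sample by itself.
        return False
    return state.last_tx is not None and origin == state.last_tx
```

With this sketch, a zero-origin packet from port 123 clears the remembered timestamp, so the next exchange starts from a clean state instead of repeatedly comparing against the stale value.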
Hmm, I changed minpoll 0 maxpoll 0 to minpoll 3 maxpoll 3 and now I am getting reasonable RTTs. Perhaps the formula calculating the RTT (delta) is not suitable for poll values below 3?!
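For reference, the textbook NTP round-trip delay and offset computed from the four timestamps of one exchange look like this. It is a minimal Python sketch of the standard formula only; chronyd's real computation may include further corrections, and the function name is hypothetical.

```python
def ntp_delay_offset(t1, t2, t3, t4):
    """Standard NTP delay (delta) and offset (theta).

    t1: our transmit time      (local clock)
    t2: peer's receive time    (peer clock)
    t3: peer's transmit time   (peer clock)
    t4: our receive time       (local clock)
    """
    # Total elapsed time minus the time the peer spent processing.
    delay = (t4 - t1) - (t3 - t2)
    # Average of the two one-way clock differences.
    offset = ((t2 - t1) + (t3 - t4)) / 2
    return delay, offset
```

For example, with timestamps in milliseconds, ntp_delay_offset(0, 10, 12, 20) gives a delay of 18 and an offset of 1.0.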
Later I changed the assignment inst->skew = WORST_CASE_FREQ_BOUND; to inst->skew = -WORST_CASE_FREQ_BOUND; at line 486 of sourcestats.c. The picture is promising: the underestimated RTTs have mostly vanished. See: http://thinkpink.usersys.redhat.com/cgi-bin/chrony-rrd.cgi and click on the 10.11.160.238 link. The turquoise line is the smallest RTT seen. It was reset twice, where the pale red and blue lines fork off. The blue line is the RTT, the green is the clock offset, and the dark green is its moving average. The pale lines are the min/max of the offset since the last reset. I do not see issues with the -2000 PPM initial drift instead of 2000 PPM, but in the beginning it will overestimate the RTT rather than underestimate it. (I know this is a hack, and it solves the issue only partially.) The peering code surely needs some rework, as chronyd ATM is not close to the reference implementation w.r.t. peering, while it would be desirable to do it better in chrony. Endre/Ondra
It would be wise to limit the accepted restarts to, say, one per hour, as it is trivial to spoof such packets and thus DoS the peer.
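Such a limit could look roughly like the following. This is a hedged Python sketch, not part of chrony; RestartLimiter and its parameters are hypothetical, and the one-hour interval is simply the value suggested above.

```python
RESTART_MIN_INTERVAL = 3600.0  # seconds; "one per hour" as suggested above


class RestartLimiter:
    """Accept at most one peer-restart event per peer per interval."""

    def __init__(self, min_interval=RESTART_MIN_INTERVAL):
        self.min_interval = min_interval
        self._last_accepted = {}  # peer address -> time of last accepted restart

    def allow(self, peer, now):
        """Return True if a restart from `peer` may be honoured at time `now`."""
        last = self._last_accepted.get(peer)
        if last is not None and now - last < self.min_interval:
            # Too soon after the previous restart: possibly a spoofed packet.
            return False
        self._last_accepted[peer] = now
        return True
```

In real use `now` would come from a monotonic clock such as time.monotonic(); it is passed in explicitly here so the behaviour is easy to check.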
Endre, can you please summarise your findings? Is the problem that chrony refuses to sync to a peer when the peer restarts? Does it happen only when the peer uses the -r option? Does it also require the maxdelay option?
Hi Mirek! I almost always use maxdelay; it has not worked without it for a long time. I restart chronyd every now and then, with the -r switch. I had to increase minpoll to 3 and to remove the .dat file for every configured peer before issuing chronyd -r. The issue arises when one of the peers insists on the last origin timestamp. As I see it, removing the .dat file before the new start solves the issue (it is a bit of a mystery how, because the peer still has the now stale origin timestamp), but that is only a workaround. If you have a few test scenarios to run, do not hesitate to list them! (I have an i386 chronyd (the Debian one), an x86_64 one on F18, and an RPi with Debian and ntpd, which I intend to replace with chrony soon. Additionally I have an ntpd peer in Hungary, also on Debian. The hack of removing the .dat file before chronyd -r made all of them work together seamlessly.) A lone peer probably usually survives if maxdelay is not in use, but I am a bit unsure, as I intentionally ceased the peering test for RHEL7. I can resume it, of course, if it helps your investigation. (Anyway, I have the perverse intention of torturing F18 a bit in Beaker in the near future.) Cheers Ondra
After inspecting the code I found out that when the peer has received a packet with poll 0, the transmit timeout is set incorrectly, which results in one peer sending packets and the other being silent. I guess this is the problem you are hitting, and it's not related to the -r option. A similar problem can occur when the remote poll interval is shorter than the local minimum polling interval. The peer will keep rescheduling its transmit timeout, but never actually send the packet. I'll see if I can fix both issues.
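To illustrate the kind of interaction described here (and only that; this is not the upstream fix, and chrony itself is C), the Python sketch below derives a transmit delay from the local and remote poll values. NTP poll values are base-2 logarithms of seconds. The rule of following the shorter of the two intervals, and the name transmit_delay, are assumptions made for the example.

```python
def transmit_delay(local_poll, remote_poll, minpoll, maxpoll):
    """Seconds until the next transmission in symmetric mode.

    Poll values are log2 of the interval in seconds. The peer's poll is
    followed when it is shorter, but the result is clamped to the local
    [minpoll, maxpoll] range, so a remote poll of 0 or one below the
    local minimum still yields a sane, positive delay.
    """
    poll = min(local_poll, remote_poll)
    poll = max(minpoll, min(poll, maxpoll))
    return float(2 ** poll)
```

For instance, transmit_delay(3, 0, 3, 3) returns 8.0: a peer advertising poll 0 cannot push the local side below its configured minpoll 3.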
This should now be fixed in upstream git and will be in chrony-1.28.
chrony-1.28-0.1.pre1.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/chrony-1.28-0.1.pre1.fc19
chrony-1.28-0.1.pre1.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.
chrony-1.28-1.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/chrony-1.28-1.fc18
chrony-1.28-1.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.