903635 – peer mode issue

Summary: peer mode issue

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	chrony
Sub Component:
Version:	18
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Miroslav Lichvar
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-01-24 13:07 UTC by Endre "Hrebicek" Balint-Nagy
Modified:	2014-09-24 01:29 UTC
CC List:	2 users
Fixed In Version:	chrony-1.28-1.fc18
Clone Of:
Environment:
Last Closed:	2013-06-05 11:31:31 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
tcpdump of a failed handshake (6.65 KB, application/octet-stream) 2013-01-26 01:36 UTC, Endre "Hrebicek" Balint-Nagy

Description Endre "Hrebicek" Balint-Nagy 2013-01-24 13:07:51 UTC

Description of problem:
The NTP protocol allows selecting an NTP server whose stratum is equal to our own, which would widen the set of available NTP references.

Version-Release number of selected component (if applicable):
New feature request, not yet tied to any milestone.

Comment 1 Endre "Hrebicek" Balint-Nagy 2013-01-26 01:35:24 UTC

Please forget the original subject; it probably does not apply to chrony.
The real problem is that when a peer restarts (probably only with the -r option), the packets from that peer always fail test 3, because they carry a zero origin timestamp. I'll attach the packet dump.
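
For readers unfamiliar with the numbering: in RFC 1305, test 1 rejects duplicate packets, test 2 rejects packets whose origin timestamp does not match our last transmit timestamp, and test 3 rejects packets whose origin or receive timestamp is zero. The Python sketch below is a simplified model of those checks, assuming chrony's "test 3" follows the same numbering. It is not chrony's code, and the `Packet` type, field names, and function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    originate: int  # peer's copy of our last transmit timestamp (0 if none)
    receive: int    # when the peer received our last packet (0 if none)
    transmit: int   # when the peer sent this packet


def sanity_tests(pkt, last_sent, last_peer_transmit):
    """Return (test1, test2, test3) in the RFC 1305 sense (hypothetical helper)."""
    test1 = pkt.transmit != last_peer_transmit         # not a duplicate
    test2 = pkt.originate == last_sent                 # answers our last packet
    test3 = pkt.originate != 0 and pkt.receive != 0    # peer has heard from us
    return test1, test2, test3


# A freshly restarted peer has not stored any of our timestamps yet.
after_restart = Packet(originate=0, receive=0, transmit=1000)
print(sanity_tests(after_restart, last_sent=900, last_peer_transmit=800))
# prints (True, False, False)
```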

Comment 2 Endre "Hrebicek" Balint-Nagy 2013-01-26 01:36:18 UTC

Created attachment 687788
tcpdump of a failed handshake

Comment 3 Endre "Hrebicek" Balint-Nagy 2013-01-26 03:32:45 UTC

The 192.168.1.5 host runs Fedora 18 (chrony-1.27-git1ca844); the other runs Debian, also with chrony-1.27-git1ca844 (my build).
The condition triggering this is still unclear. Most interestingly, 192.168.1.5 is now synchronizing from 192.168.1.3 (a Raspberry Pi, configured similarly, running Raspbian with chrony), while 192.168.1.4 is problematic.

Comment 4 Miroslav Lichvar 2013-02-01 14:15:55 UTC

In a quick test with two peers and -r option I didn't see it. Do you have some steps to reproduce it?

Comment 5 Endre "Hrebicek" Balint-Nagy 2013-02-01 15:04:34 UTC

My Raspberry Pi has dropped out of testing for now; its SDHC card has exceeded its write budget.
The Acer laptop and the Fujitsu-Siemens Esprimo are still available, and they consistently fail to synchronize to each other.
I see that they were lacking the stratumweight 0 directive, which I have now added. I will report on the behavior later; the missing stratumweight 0 may be the cause.

Comment 6 Endre "Hrebicek" Balint-Nagy 2013-02-01 15:08:50 UTC

BTW, stratumweight 0 is the functional equivalent of cohort, as I understand the docs.

Comment 7 Endre "Hrebicek" Balint-Nagy 2013-02-02 05:06:24 UTC

As I understand the problem, it mostly arises when one of the peers restarts (yes, I restart chronyd every now and then). The restarted peer has no better idea than to send a zero reference timestamp, which makes the other end not respond to it, and we are trapped. BTW, the reference implementation does not have this issue when talking to another reference implementation, but (maybe only sometimes) it does when talking to chronyd. (Currently my Pi runs the reference implementation while I compile chrony again. The old copy died with the SDHC card.)
I am leaving Brno this morning and will be back on the 7th.

Cheers till 7th!

Comment 8 Endre "Hrebicek" Balint-Nagy 2013-02-07 02:29:41 UTC

Ahoj Mirku! I am back in Brno.
When I remove the /var/{state,lib}/chrony/192.168.1.*.dat files, the peers recognize each other normally; otherwise we get stuck on the zero originate timestamp issue.
I use the maxdelay option on every selectable server/peer directive, and I also use the dumponexit directive.
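
The file removal described above can be sketched in Python. This is a hypothetical helper, not part of chrony. It assumes the dump files sit in a single directory and are named after the source address with a .dat suffix, as the paths above suggest, and it is meant to run only while chronyd is stopped.

```python
from pathlib import Path


def remove_peer_dumps(dump_dir, pattern="192.168.1.*.dat"):
    """Delete dump files matching `pattern` in `dump_dir`.

    Returns the names of the removed files, sorted. Both the helper and
    the default pattern are illustrative assumptions.
    """
    removed = []
    for path in sorted(Path(dump_dir).glob(pattern)):
        path.unlink()
        removed.append(path.name)
    return removed
```

For example, `remove_peer_dumps("/var/lib/chrony")` would be called before starting chronyd -r.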

Ondras.

Comment 9 Endre "Hrebicek" Balint-Nagy 2013-02-07 02:39:29 UTC

I see a possible solution: whenever we see a zero origin timestamp from a peer, which denotes a peer restart, we should drop the last one we sent, if it exists, so that the normal peer setup can follow.
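
A minimal Python sketch of this proposal, as a toy model rather than chrony's implementation. The class and attribute names are hypothetical. A zero origin timestamp is treated as a restart signal, so the stored transmit timestamp is forgotten and the exchange starts over.

```python
class PeerState:
    """Toy per-peer state: only our last transmit timestamp is tracked."""

    def __init__(self):
        self.last_sent = None  # our last transmit timestamp, if any

    def on_send(self, transmit_ts):
        self.last_sent = transmit_ts

    def on_receive(self, origin_ts):
        """Classify an incoming packet as 'restart', 'ok', or 'bogus'."""
        if origin_ts == 0:
            # The peer has no record of us: forget our side and start over.
            self.last_sent = None
            return "restart"
        if origin_ts == self.last_sent:
            return "ok"
        return "bogus"
```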

Comment 10 Endre "Hrebicek" Balint-Nagy 2013-02-07 02:44:01 UTC

Maybe adding an "and the port is UDP 123" clause to the previous sentence would leave room for a parallel ntpdate -u -d run without disturbing the peer.

Comment 11 Endre "Hrebicek" Balint-Nagy 2013-02-07 13:50:22 UTC

Hmm, I changed minpoll 0 maxpoll 0 to minpoll 3 maxpoll 3 and now I am getting reasonable RTTs. Perhaps the formula calculating the RTT (delta) is not suitable for poll values below 3?!
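
For context, NTP poll values are base-2 exponents of the polling interval in seconds, so minpoll 0 maxpoll 0 polls once per second and minpoll 3 maxpoll 3 polls every 8 seconds. A short Python illustration (the function name is hypothetical):

```python
def poll_interval_seconds(poll):
    """Polling interval in seconds for an NTP poll exponent."""
    return 2.0 ** poll


for poll in (0, 3, 6):
    print(poll, poll_interval_seconds(poll))
```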

Comment 12 Endre "Hrebicek" Balint-Nagy 2013-02-08 19:19:51 UTC

Later I changed the assignment
    inst->skew = WORST_CASE_FREQ_BOUND;
to
    inst->skew = -WORST_CASE_FREQ_BOUND;
at line 486 of sourcestats.c.
The picture is promising: the underestimated RTTs have mostly vanished.
See: http://thinkpink.usersys.redhat.com/cgi-bin/chrony-rrd.cgi
click on 10.11.160.238 link.
The turquoise line is the smallest RTT seen. It was reset twice, at the points where the pale red and blue lines fork off. The blue line is the RTT, the green line is the clock offset, and the dark green line is its moving average. The pale lines are the minimum and maximum of the offset since the last reset.

I do not see issues with the -2000 ppm initial drift instead of 2000 ppm, but in the beginning it will overestimate the RTT rather than underestimate it. (I know this is a hack, and it solves the issue only partially.)

The peering code surely needs some rework, as chronyd is currently not close to the reference implementation with respect to peering, although it would be desirable for chrony to do it better.

Endre/Ondra

Comment 14 Endre "Hrebicek" Balint-Nagy 2013-02-09 09:54:33 UTC

It would be wise to limit the restarts to, say, one per hour, as it is trivial to spoof such packets and thus DoS the peer.
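
One way to sketch such a limit in Python, assuming a per-peer minimum interval of one hour between accepted restarts. The class and parameter names are hypothetical, and this is not taken from chrony.

```python
class RestartLimiter:
    """Accept at most one restart per peer per `min_interval` seconds."""

    def __init__(self, min_interval=3600.0):
        self.min_interval = min_interval
        self.last_accepted = {}  # peer address -> time of last accepted restart

    def allow(self, peer, now):
        """Return True if a restart from `peer` at time `now` is accepted."""
        last = self.last_accepted.get(peer)
        if last is not None and now - last < self.min_interval:
            return False
        self.last_accepted[peer] = now
        return True
```

With this scheme a spoofed packet can still use up the hourly allowance; the limit only bounds how often the peer state is reset.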

Comment 15 Miroslav Lichvar 2013-02-11 14:41:18 UTC

Endre,

can you please summarise your findings?

Is the problem that chrony refuses to sync to a peer when the peer restarts? Does it happen only when the peer uses the -r option? Does it need to use also the maxdelay option?

Comment 16 Endre "Hrebicek" Balint-Nagy 2013-02-11 20:55:02 UTC

Ahoj Mirku!
I almost always use maxdelay; for a long time it did not work without it.
I restart chronyd every now and then, with the -r switch.
I had to increase minpoll to 3 and remove the .dat file for every configured peer before issuing chronyd -r. The issue arises when one of the peers insists on the last origin timestamp.
As far as I can see, removing the .dat file before the new start solves the issue (how is a bit of a mystery, because the peer still has the now stale origin timestamp), but that is only a workaround.
If you have a few test scenarios to run, do not hesitate to list them!
(I have an i386 chronyd (the Debian one), an x86_64 one on F18, and an RPi with Debian and ntpd, which I intend to replace with chrony soon. Additionally I have an ntpd peer in Hungary, also on Debian. The hack of removing the .dat files before chronyd -r made all of them work together seamlessly.)
A lone peer probably usually survives if maxdelay is not in use, but I am a bit unsure, as I intentionally stopped the peering test for RHEL7. I can resume it, of course, if it helps your investigation. (Anyway, I have the perverse intention of torturing F18 a bit in Beaker in the near future.)
Cheers Ondra

Comment 17 Miroslav Lichvar 2013-05-29 15:44:39 UTC

After inspecting the code I found that when a peer has received a packet with poll 0, the transmit timeout is set incorrectly, which results in one peer sending packets and the other staying silent. I guess this is the problem you are hitting, and it's not related to the -r option.

A similar problem can occur when the remote poll interval is shorter than the local minimum polling interval. The peer will periodically reschedule its transmit timeout, but never actually send a packet.
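
The second failure mode can be illustrated with a toy simulation in Python. This is not chrony's scheduling code. It assumes, for illustration only, that every received packet pushes the transmit timeout to the arrival time plus the local interval. If the remote peer sends more often than that, the timeout never expires. The function and parameter names are hypothetical.

```python
def count_transmissions(remote_interval, local_interval, duration):
    """Count how often our transmit timeout fires within `duration` seconds.

    Toy model: each packet from the remote peer reschedules our timeout
    to its arrival time + local_interval.
    """
    sent = 0
    timeout = local_interval    # first scheduled transmission
    next_rx = remote_interval   # first packet from the remote peer
    while True:
        now = min(timeout, next_rx)
        if now > duration:
            break
        if next_rx <= timeout:
            # A packet arrives first (or simultaneously): reschedule.
            timeout = now + local_interval
            next_rx = now + remote_interval
        else:
            # The timeout expired before another packet arrived: transmit.
            sent += 1
            timeout = now + local_interval
    return sent


# Remote polls every 1 s, local minimum is 8 s: we never get to transmit.
print(count_transmissions(remote_interval=1, local_interval=8, duration=1000))
```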

I'll see if I can fix both issues.

Comment 18 Miroslav Lichvar 2013-06-05 11:31:31 UTC

This should now be fixed in upstream git and will be in chrony-1.28.

Comment 19 Fedora Update System 2013-06-21 16:31:14 UTC

chrony-1.28-0.1.pre1.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/chrony-1.28-0.1.pre1.fc19

Comment 20 Fedora Update System 2013-07-02 00:33:29 UTC

chrony-1.28-0.1.pre1.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 21 Fedora Update System 2013-07-18 15:12:32 UTC

chrony-1.28-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/chrony-1.28-1.fc18

Comment 22 Fedora Update System 2013-08-06 00:22:30 UTC

chrony-1.28-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.