Bug 1361382

Summary:	ntp-wait hangs after boot for a long time, unless ntpd is restarted
Product:	[Fedora] Fedora	Reporter:	Ian Wienand <iwienand>
Component:	ntp	Assignee:	Miroslav Lichvar <mlichvar>
Status:	CLOSED NOTABUG	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	24	CC:	mlichvar
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-08-01 10:02:05 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ian Wienand 2016-07-29 02:16:32 UTC

I have a script which used a ntp-wait call to make sure the time was in sync.  It took well over 10 minutes for it to work.

When the host boots, ntpd puts into the logs

---
Jul 29 01:58:53 iwienand-f24-test systemd[1]: Starting Network Time Service...
Jul 29 01:58:53 iwienand-f24-test ntpd[391]: ntpd 4.2.6p5 Thu Jun  2 15:26:58 UTC 2016 (1)
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: proto: precision = 0.168 usec
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: 0.0.0.0 c01d 0d kern kernel time sync enabled
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: ntp_io: estimated max descriptors: 1024, initial socket boundary: 16
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: Listen and drop on 1 v6wildcard :: UDP 123
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: Listen normally on 2 lo 127.0.0.1 UDP 123
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: Listen normally on 3 lo ::1 UDP 123
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: Listening on routing socket on fd #20 for interface updates
Jul 29 01:58:53 iwienand-f24-test systemd[1]: Started Network Time Service.
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: Deferring DNS for 0.centos.pool.ntp.org 1
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: Deferring DNS for 1.centos.pool.ntp.org 1
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: Deferring DNS for 2.centos.pool.ntp.org 1
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: 0.0.0.0 c016 06 restart
Jul 29 01:58:53 iwienand-f24-test ntpd[404]: signal_no_reset: signal 17 had flags 4000000
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: 0.0.0.0 c012 02 freq_set kernel 0.000 PPM
Jul 29 01:58:53 iwienand-f24-test ntpd[399]: 0.0.0.0 c011 01 freq_not_set
Jul 29 01:58:55 iwienand-f24-test ntpd_intres[404]: host name not found: 0.centos.pool.ntp.org
Jul 29 01:58:55 iwienand-f24-test ntpd_intres[404]: host name not found: 1.centos.pool.ntp.org
Jul 29 01:58:55 iwienand-f24-test ntpd_intres[404]: host name not found: 2.centos.pool.ntp.org
Jul 29 01:58:57 iwienand-f24-test ntpd[399]: Listen normally on 4 ens3 172.99.106.189 UDP 123
Jul 29 01:58:57 iwienand-f24-test ntpd[399]: Listen normally on 5 ens3 fe80::f816:3eff:fe41:cf53 UDP 123
Jul 29 01:58:57 iwienand-f24-test ntpd[399]: new interface(s) found: waking up resolver
Jul 29 01:58:59 iwienand-f24-test ntpd_intres[404]: DNS 0.centos.pool.ntp.org -> 96.244.96.19
Jul 29 01:58:59 iwienand-f24-test ntpd_intres[404]: DNS 1.centos.pool.ntp.org -> 162.213.2.253
Jul 29 01:58:59 iwienand-f24-test ntpd_intres[404]: DNS 2.centos.pool.ntp.org -> 202.118.1.81
Jul 29 01:59:00 iwienand-f24-test ntpd[399]: 0.0.0.0 c61c 0c clock_step +0.610964 s
Jul 29 01:59:00 iwienand-f24-test ntpd[399]: 0.0.0.0 c614 04 freq_mode
Jul 29 01:59:01 iwienand-f24-test ntpd[399]: 0.0.0.0 c618 08 no_sys_peer
---

I instrumented ntp-wait with its existing print calls, and you get it looping around doing

---
bash-4.3# /usr/bin/time ntp-wait
associd=0 status=c618 leap_alarm, sync_ntp, 1 event, no_sys_peer,status <c618>, leap <leap_alarm>, sync <sync_ntp>
associd=0 status=c618 leap_alarm, sync_ntp, 1 event, no_sys_peer,status <c618>, leap <leap_alarm>, sync <sync_ntp>
associd=0 status=c618 leap_alarm, sync_ntp, 1 event, no_sys_peer,status <c618>, leap <leap_alarm>, sync <sync_ntp>
associd=0 status=c618 leap_alarm, sync_ntp, 1 event, no_sys_peer,status <c618>, leap <leap_alarm>, sync <sync_ntp>
...
---

As far as I can tell, we have peers that are connected

---
bash-4.3# ntpq -pcrv
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*eqo3xkmpo4hgof7 216.218.192.202  2 u   63   64    7   43.756   -0.875   0.107
+cheezum.mattnor 216.218.192.202  2 u   62   64    7    4.187    0.096   0.044
+209.208.79.69   130.207.244.240  2 u   64   64    7   31.411    1.296   0.104
associd=0 status=c618 leap_alarm, sync_ntp, 1 event, no_sys_peer,
version="ntpd 4.2.6p5 Thu Jun  2 15:26:58 UTC 2016 (1)",
processor="x86_64", system="Linux/4.6.4-301.fc24.x86_64", leap=11,
stratum=3, precision=-22, rootdelay=44.229, rootdisp=1958.823,
refid=45.33.43.25,
reftime=db453411.22e5bcb3  Fri, Jul 29 2016  1:46:57.136,
clock=db453450.254b8127  Fri, Jul 29 2016  1:48:00.145, peer=43299, tc=6,
mintc=3, offset=0.000, frequency=0.000, sys_jitter=1.315,
clk_jitter=0.033, clk_wander=0.000
---

Clearly it's something about the "no_sys_peer" in the status return, but it's not exactly clear what that means.

If you restart ntpd, it immediately becomes good again

---
Jul 29 02:02:19 iwienand-f24-test systemd[1]: Stopping Network Time Service...
Jul 29 02:02:19 iwienand-f24-test systemd[1]: Stopped Network Time Service.
Jul 29 02:02:19 iwienand-f24-test systemd[1]: Starting Network Time Service...
Jul 29 02:02:19 iwienand-f24-test systemd[1]: Started Network Time Service.
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: proto: precision = 0.158 usec
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: 0.0.0.0 c01d 0d kern kernel time sync enabled
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: ntp_io: estimated max descriptors: 1024, initial socket boundary: 16
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: Listen and drop on 1 v6wildcard :: UDP 123
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: Listen normally on 2 lo 127.0.0.1 UDP 123
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: Listen normally on 3 ens3 172.99.106.189 UDP 123
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: Listen normally on 4 lo ::1 UDP 123
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: Listen normally on 5 ens3 fe80::f816:3eff:fe41:cf53 UDP 123
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: Listening on routing socket on fd #22 for interface updates
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: 0.0.0.0 c016 06 restart
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: 0.0.0.0 c012 02 freq_set kernel 0.000 PPM
Jul 29 02:02:19 iwienand-f24-test ntpd[829]: 0.0.0.0 c011 01 freq_not_set
Jul 29 02:02:20 iwienand-f24-test ntpd[829]: 0.0.0.0 c614 04 freq_mode
---

at this point, ntp-wait returns instantly

---
bash-4.3# /usr/bin/time ntp-wait -v
Waiting for ntpd to synchronize...  associd=0 status=0614 leap_none, sync_ntp, 1 event, freq_mode,status <0614>, leap <leap_none>, sync <sync_ntp>
OK!
0.00user 0.01system 0:00.02elapsed 90%CPU (0avgtext+0avgdata 5416maxresident)k
128inputs+0outputs (0major+893minor)pagefaults 0swaps
---

(it is looking for "leap_none" in the output)

So I guess it has something to do with the network bring-up after the daemon start and the way it is connecting.  I'm not sure what times out after about 10-15 minutes to make it start working, but something does.

Comment 1 Miroslav Lichvar 2016-07-29 09:05:00 UTC

The problem is in the leap_alarm flag, i.e. ntpd is reporting it's not synchronized. The reason for that is the initial clock step, after which ntpd enters a state in which it measures the frequency of the clock and this takes about 900 seconds (it can be configured with the tos stepout command). This prevents ntpd from constantly stepping when the frequency has changed so much that the loop is not able to lock.

You can prevent the initial step by enabling the ntpdate service, so the offset is close to zero when ntpd starts. Or you can switch to chrony, which doesn't suffer from these problems.

Comment 2 Ian Wienand 2016-07-31 23:28:55 UTC

Forgive me for reopening ... I just can't quite correlate your explanation to why after a restart of the ntpd service, ntp-wait (i.e. really just thin wrapper around ntpq -pcrv) returns instantly?

The difference to me is that the first time ntpd starts up, the network is not yet ready (host not found stuff) and then it detects the interfaces and starts, while after the restart the network is up right away.  Does the first invocation get stuck in this loop because the network isn't there, but restarting, it never needs to go into it?

Or ... maybe something like the first invocation of ntpd has brought the clock closer to reality, and so the second time it starts up it can just jump to the new time without stepping?

Some context; some CI jobs recently replaced "ntpdate" calls based on the premise that ntpdate was deprecated [1].  This has lead to very long timeouts for some centos jobs as they make ntp-wait calls.  I could just put a "service ntpd restart" in the script before the ntp-wait call, but that seems wrong...

[1] https://review.openstack.org/#/c/299677

Comment 3 Miroslav Lichvar 2016-08-01 07:03:22 UTC

This is not related to network connection. This is about the initial offset and the state of the ntpd's loop. When the initial offset is larger than 0.128s, ntpd will step the clock and then it will wait for at least 900 seconds (in default configuration) before it reports it's in the synchronized state.

If the initial offset is smaller than 0.128s, it will not step the clock and will go straight to the synchronized state. One way to make sure the initial offset is small is to run ntpdate before ntpd. In Fedora/RHEL/CentOS you can enable the ntpdate service. You shouldn't need to restart ntpd, that's generally a bad idea.

Does this help?

Comment 4 Ian Wienand 2016-08-01 09:14:22 UTC

Ahh, I think I see what happens -- on the first start, the sync is more than 0.128s so it steps the clock and puts it into the wait, which holds up ntp-wait

When I kill ntpd & restart it, the clock is now actually basically in sync so the 900s synchronize doesn't kick in?

Comment 5 Miroslav Lichvar 2016-08-01 09:20:38 UTC

Yes. By restarting ntpd you make it forget it has stepped the clock. In this case it helps, but in some other cases (e.g. wrong value in the driftfile) it would not.

Comment 6 Ian Wienand 2016-08-01 10:02:05 UTC

Thanks, that makes sense.

I guess the confusion comes from running the ntpdate service, which everything says is deprecated but after more research I find doesn't actually call ntpdate.

Probably our CI testing is just very vulnerable to this -- since we boot fresh vm's constantly and never reuse anything, we seem likely to be out more than 0.128s each time.  Again i guess the answer is something like "short-lived CI vm's are more like dial-up boxes and chronyd is a better choice than ntpd which is for long-lived servers".

Comment 7 Miroslav Lichvar 2016-08-01 10:33:31 UTC

Yes, ntp upstream has deprecated ntpdate a long time ago, but it's still included in their latest releases. As a replacement they suggest to use sntp or "ntpd -q". On Fedora you can use the sntp service.

As a long-term solution it's probably best to switch to chrony. It's the default NTP client in Fedora/RHEL and it can synchronize the clock much faster than ntpd. In scripts you can use "chronyc waitsync" if you need to wait until the clock is synchronized.