Bug 1361382
| Summary: | ntp-wait hangs after boot for a long time, unless ntpd is restarted | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Ian Wienand <iwienand> |
| Component: | ntp | Assignee: | Miroslav Lichvar <mlichvar> |
| Status: | CLOSED NOTABUG | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 24 | CC: | mlichvar |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-01 10:02:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ian Wienand
2016-07-29 02:16:32 UTC
The problem is in the leap_alarm flag, i.e. ntpd is reporting that it is not synchronized. The reason is the initial clock step, after which ntpd enters a state in which it measures the frequency of the clock; this takes about 900 seconds (it can be configured with the tos stepout command). This prevents ntpd from constantly stepping when the frequency has changed so much that the loop is not able to lock. You can prevent the initial step by enabling the ntpdate service, so that the offset is close to zero when ntpd starts. Or you can switch to chrony, which doesn't suffer from these problems.

Forgive me for reopening ... I just can't quite correlate your explanation with why, after a restart of the ntpd service, ntp-wait (really just a thin wrapper around ntpq -pcrv) returns instantly.

The difference to me is that the first time ntpd starts up, the network is not yet ready (host not found stuff); then it detects the interfaces and starts, while after the restart the network is up right away. Does the first invocation get stuck in this loop because the network isn't there, whereas after a restart it never needs to go into it? Or ... maybe the first invocation of ntpd has brought the clock closer to reality, so the second time it starts up it can just jump to the new time without stepping?

Some context: some CI jobs recently replaced "ntpdate" calls on the premise that ntpdate was deprecated [1]. This has led to very long timeouts for some CentOS jobs, as they make ntp-wait calls. I could just put a "service ntpd restart" in the script before the ntp-wait call, but that seems wrong...

[1] https://review.openstack.org/#/c/299677

This is not related to the network connection. This is about the initial offset and the state of ntpd's loop. When the initial offset is larger than 0.128s, ntpd will step the clock and then wait for at least 900 seconds (in the default configuration) before it reports that it is in the synchronized state. If the initial offset is smaller than 0.128s, it will not step the clock and will go straight to the synchronized state.

One way to make sure the initial offset is small is to run ntpdate before ntpd. In Fedora/RHEL/CentOS you can enable the ntpdate service. You shouldn't need to restart ntpd; that's generally a bad idea.

Does this help?

Ahh, I think I see what happens -- on the first start, the offset is more than 0.128s, so ntpd steps the clock and goes into the wait, which holds up ntp-wait. When I kill ntpd and restart it, the clock is now basically in sync, so the 900s wait doesn't kick in?

Yes. By restarting ntpd you make it forget that it has stepped the clock. In this case it helps, but in some other cases (e.g. a wrong value in the driftfile) it would not.

Thanks, that makes sense. I guess the confusion comes from running the ntpdate service, which everything says is deprecated but which, after more research, I find doesn't actually call ntpdate.

Probably our CI testing is just very vulnerable to this -- since we constantly boot fresh VMs and never reuse anything, we seem likely to be out by more than 0.128s each time. Again, I guess the answer is something like "short-lived CI VMs are more like dial-up boxes, and chronyd is a better choice than ntpd, which is for long-lived servers".

Yes, ntp upstream deprecated ntpdate a long time ago, but it's still included in their latest releases. As a replacement they suggest using sntp or "ntpd -q". On Fedora you can use the sntp service.

As a long-term solution it's probably best to switch to chrony. It's the default NTP client in Fedora/RHEL and it can synchronize the clock much faster than ntpd. In scripts you can use "chronyc waitsync" if you need to wait until the clock is synchronized.
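
The behavior discussed in this thread can be summarized as a small toy model. This is only a sketch of the explanation given above, not ntpd's actual code: the function name `min_wait_before_sync` and the constant names are hypothetical, the 0.128 s threshold and the 900 s default are the values quoted in the comments, and the treatment of an offset of exactly 0.128 s is an assumption, since the thread does not cover that case.

```python
# Toy model of the startup behavior described in this thread.
# Not taken from the ntpd source; all names here are hypothetical.

STEP_THRESHOLD_S = 0.128  # offset above which ntpd steps the clock (per the thread)
STEPOUT_S = 900           # default wait after a step (per the thread, "tos stepout")


def min_wait_before_sync(initial_offset_s, stepout_s=STEPOUT_S):
    """Return the minimum number of seconds before ntpd reports it is synchronized.

    Assumption: an offset of exactly STEP_THRESHOLD_S is treated as "no step";
    the thread only describes the larger-than and smaller-than cases.
    """
    if abs(initial_offset_s) > STEP_THRESHOLD_S:
        # ntpd steps the clock, then measures the clock frequency before
        # it leaves the unsynchronized (leap_alarm) state.
        return stepout_s
    # Small offset: no step, ntpd goes straight to the synchronized state.
    return 0


if __name__ == "__main__":
    # Fresh CI VM whose clock is 2 s off: ntpd steps, so ntp-wait blocks.
    print(min_wait_before_sync(2.0))
    # Restarted ntpd (or ntpd started after ntpdate): the offset is already small.
    print(min_wait_before_sync(0.005))
```

In this model, restarting ntpd after the step corresponds to calling the function again with a small offset, which is why ntp-wait then returns at once. It does not capture the other cases mentioned above, such as a wrong value in the driftfile.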