Description of problem:
After a laptop sleeps to RAM, chrony stops tracking its upstream servers.

Version-Release number of selected component (if applicable):
chrony.x86_64 1.30-1.fc20 @updates

How reproducible:
always

Steps to Reproduce:
1. Sleep the laptop.
2. Resume from sleep.
3. Run "chronyc sources".
4. Notice all the servers are marked with "10y" since the last update and null values in all the other columns.

[wolfgang@ancho ~]$ chronyc sources
210 Number of sources = 4
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? time1.example.com              0   8     0   10y     +0ns[   +0ns] +/-    0ns
^? time2.example.com              0   8     0   10y     +0ns[   +0ns] +/-    0ns
^? time3.example.com              0   8     0   10y     +0ns[   +0ns] +/-    0ns
^? time4.example.com              0   8     0   10y     +0ns[   +0ns] +/-    0ns

Actual results:
chronyd is wedged

Expected results:
chronyd is not wedged

Additional info:
"systemctl restart chronyd" fixes things.
Since 1.30, chronyd resets all source statistics when a forward time jump is detected (e.g. on suspend/resume), but it should make a new measurement within at most two poll intervals. With the NetworkManager dispatcher script used in the Fedora package, it should switch the sources from offline to online mode and make new measurements quickly after resume.

How long did you wait before getting that "chronyc sources" output after resuming from suspend? Are you using the default chrony configuration? It might help to see chronyd's messages in syslog around that suspend.

There was a similar report upstream [1], but this seems to be something different, as Poll in your sources output is 8 and not the default minpoll of 6.

[1] http://comments.gmane.org/gmane.comp.time.chrony.user/995
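For reference, the dispatcher hook boils down to something like the following sketch (simplified; the exact path and contents of the Fedora script may differ). NetworkManager passes the interface name as $1 and the action as $2:

#!/bin/sh
# Simplified sketch of a NetworkManager dispatcher hook for chrony
# (not the exact Fedora script).
case "$2" in
    up)
        # A link came up: tell chronyd to resume polling its sources.
        /usr/bin/chronyc online > /dev/null 2>&1
        ;;
    down)
        # A link went down: stop polling to avoid pointless timeouts.
        /usr/bin/chronyc offline > /dev/null 2>&1
        ;;
esac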
Here is the stripped-of-comments version of the config file; only the server names were edited out.

The poll interval shown by "chronyc sources" changes as a function of time: it was 7 before, then 8, now 9. It may well have been 6 initially. I'll do an extended sleep after sending this to test.

It just struck me that bug 1119054 (IPv6 addresses going away temporarily on my interfaces) may be interacting here. All the servers have IPv6 addresses, and chrony selected the IPv6 addresses. There is a 5-10 minute pause after resume before the addresses return to my interfaces. In any case, the addresses had long since returned to my interfaces and chrony was still stuck when the "chronyc -n sources" below was run.

driftfile /var/lib/chrony/drift
rtcsync
makestep 1 3
deny
bindcmdaddress 127.0.0.1
bindcmdaddress ::1
keyfile /etc/chrony.keys
commandkey 1
generatecommandkey
noclientlog
logchange 0.5
logdir /var/log/chrony
#
# all servers resolve to ipv6 addresses (chronyc -n sources); some, but
# not all, have ipv4 addresses available, but those were not selected by
# chrony.
#
server a.example.com iburst
server b.example.com iburst
server c.example.com iburst
server d.example.com iburst
server e.example.com iburst
server f.example.com iburst
server g.example.com iburst
rtconutc
cmdallow 127.0.0.1
cmdallow ::1
cmddeny

[root@ancho wolfgang]# chronyc -n sources
210 Number of sources = 7
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? A:A:A:A:A:A:A:A                0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? B:B:B:B:B:B:B:B                0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? C:C:C:C:C:C:C:C                0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? D:D:D:D:D:D:D:D                0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? E:E:E:E:E:E:E:E                0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? F:F:F:F:F:F:F:F                0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? G:G:G:G:G:G:G:G                0   9     0   10y     +0ns[   +0ns] +/-    0ns
[root@ancho wolfgang]#
Does "chronyc -a burst 1/1" or "chronyc -a -m offline online" get it going? Can you please try it with "tcpdump -i any port 123" or similar and see if chronyd is sending NTP packets, but not getting any response? Also, it might be useful to see the debugging output from chronyd started as "chronyd -d -d".
Created attachment 922768 [details]
transcript of sleep / resume, chrony burst, online toggle
Created attachment 922769 [details]
chronyd -d -d log
Created attachment 922770 [details]
tcpdump port 123 log
burst - no effect
online toggle - got it working

tcpdump and chronyd -d -d logs uploaded, along with a transcript of the shell commands and results.
Thanks for the data. It seems the problem is in connecting the client sockets: the connect fails because routing is not ready yet when the chronyc online command is issued, and chronyd doesn't try to reconnect later. As a workaround you can add "acquisitionport 123" to chrony.conf, so that chronyd always uses its server socket to query other servers, as it used to do before 1.30.
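Concretely (assuming the default Fedora config location), add the directive to /etc/chrony.conf:

# Send client requests from the NTP server port (123), as chronyd
# did before version 1.30, instead of a separate connected socket.
acquisitionport 123

and restart the service for it to take effect:

# systemctl restart chronyd.service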
(In reply to Miroslav Lichvar from comment #8)
> As a workaround you can add "acquisitionport 123" to chrony.conf, so chronyd
> always uses its server socket to query other servers like it used to do
> before 1.30.

Thanks for the analysis and workaround! Will do.
*** Bug 1126590 has been marked as a duplicate of this bug. ***
This should now be fixed in upstream git:
http://git.tuxfamily.org/chrony/chrony.git/?p=chrony/chrony.git;a=commit;h=43cca04c339ff4e919b3b51a7eac993a3bc3d060
Here is a scratch build of the current upstream git, which will probably be released as 1.30.1 soon:
http://koji.fedoraproject.org/koji/taskinfo?taskID=7243933

Can you please install the package and see if it fixes the problem for you?
Created attachment 924289 [details]
transcript of a test run of the koji 1.30-1 git chrony

I'm sorry, this code never seemed to get unstuck. The koji chrony from the above link was installed and the system rebooted.

This is a different laptop from the one in the initial report, an i686 system instead of x86_64; the other laptop is a 40-minute drive away. This machine is also an up-to-date Fedora 20 system that does a nightly yum update.
And for completeness, the journal syslog and yum info.

[root@pearl ~]# journalctl -b0 -u chronyd
-- Logs begin at Sun 2013-11-17 06:59:44 PST, end at Tue 2014-08-05 16:01:04 PDT. --
Aug 05 14:41:20 pearl.wsrcc.com chronyd[490]: chronyd version 1.30 starting
Aug 05 14:41:21 pearl.wsrcc.com chronyd[490]: Frequency -3.811 +/- 0.257 ppm read from /var/lib/chrony/drift
Aug 05 14:41:21 pearl.wsrcc.com systemd[1]: Started NTP client/server.
Aug 05 14:41:36 pearl.wsrcc.com chronyd[490]: Selected source 2601:9:1c80:aa0:a60:6eff:fe74:6fe2
Aug 05 14:48:55 pearl.wsrcc.com chronyd[490]: Source 2601:9:1c80:aa0::1 offline
Aug 05 14:48:55 pearl.wsrcc.com chronyd[490]: Source 2601:9:8580:f94::1 offline
Aug 05 14:48:55 pearl.wsrcc.com chronyd[490]: Source 2601:9:1c80:aa0:a60:6eff:fe74:6fe2 offline
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Forward time jump detected!
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Can't synchronise: no reachable sources
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Source 2601:9:1c80:aa0::1 online
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Source 2601:9:8580:f94::1 online
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Source 2601:9:1c80:aa0:a60:6eff:fe74:6fe2 online

[root@pearl ~]# grep chrony /var/log/yum.log
Aug 05 14:39:33 Updated: chrony-1.30-1.git.fc20.test.i686
Aug 05 14:40:06 Installed: chrony-debuginfo-1.30-1.git.fc20.test.i686
Hm, in my testing with IPv6 and an interface going down and up, it seems to reconnect without problems. Can you please repeat this with "chronyd -d -d" and attach the output here, so we can see whether sendmsg() or connect() is returning any errors?
Created attachment 925117 [details]
chronyd -d -d for the git test version

The ipv6 addresses came back between these two times.

[root@pearl ~]# date
Fri Aug  8 00:56:09 PDT 2014
[root@pearl ~]# ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 144  bytes 12404 (12.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 144  bytes 12404 (12.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

p1p1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:23:8b:1b:e8:41  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.73.206  netmask 255.255.255.0  broadcast 192.168.73.255
        inet6 fe80::223:4eff:fe1c:585c  prefixlen 64  scopeid 0x20<link>
        ether 00:23:4e:1c:58:5c  txqueuelen 1000  (Ethernet)
        RX packets 1180  bytes 150016 (146.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 853  bytes 112533 (109.8 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[root@pearl ~]# date
Fri Aug  8 00:58:30 PDT 2014
[root@pearl ~]# date
Fri Aug  8 01:00:36 PDT 2014
[root@pearl ~]# ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 144  bytes 12404 (12.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 144  bytes 12404 (12.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

p1p1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:23:8b:1b:e8:41  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.73.206  netmask 255.255.255.0  broadcast 192.168.73.255
        inet6 fd47:3cf0:52aa:8000:223:4eff:fe1c:585c  prefixlen 64  scopeid 0x0<global>
        inet6 2601:9:8580:f94:223:4eff:fe1c:585c  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::223:4eff:fe1c:585c  prefixlen 64  scopeid 0x20<link>
        ether 00:23:4e:1c:58:5c  txqueuelen 1000  (Ethernet)
        RX packets 1302  bytes 158312 (154.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 920  bytes 124039 (121.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
From the log it looks like the reconnection succeeded and the packets are being sent. I'm not sure what's wrong. Can you please check the tcpdump output and "netstat -n -p | grep chronyd" to see whether the local addresses are correct?
*** Bug 1128432 has been marked as a duplicate of this bug. ***
The upstream code was changed to create a new socket for each request; this should fix all the problems mentioned in this report (although I'm still curious what exactly was happening in the test from comment #16). I'll prepare a Fedora update shortly.
chrony-1.30-2.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/chrony-1.30-2.fc20
Sorry, I'm having a hard time rerunning the tests. The i686 machine used in the last few tests (including comment #16) had a bad yum update and is now killing user processes with a kprintf every few seconds. The x86_64 machine is taking an hour or so to get its IPv6 address after a restart.

As for the new socket, that was going to be my suggestion too.

The tcpdump across a sleep/resume breaks at the resume with a gripe that the interface went away. To get any data I had to put the tcpdump in a while loop, as sketched below; it took ~20 loops before the tcpdump took.
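Something like this (an illustrative sketch; the interface name is an assumption) keeps retrying until the interface is usable again after resume:

# Restart tcpdump until it can open the interface again: a failed open
# exits non-zero and retries, a successful capture runs until interrupted.
while ! tcpdump -i wlan0 -n port 123; do
    sleep 1
done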
Package chrony-1.30-2.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.

Update it with:
# su -c 'yum update --enablerepo=updates-testing chrony-1.30-2.fc20'
as soon as you are able to.

Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-9492/chrony-1.30-2.fc20
then log in and leave karma (feedback).
chrony-1.30-2.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.
chrony-1.31-1.el6 has been submitted as an update for Fedora EPEL 6. https://admin.fedoraproject.org/updates/chrony-1.31-1.el6
chrony-1.31-1.el5 has been submitted as an update for Fedora EPEL 5. https://admin.fedoraproject.org/updates/chrony-1.31-1.el5
chrony-1.31-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
chrony-1.31-1.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.