1124059 – chrony doesn't resume after a sleep.

Bug 1124059 - chrony doesn't resume after a sleep.

Summary: chrony doesn't resume after a sleep.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	chrony
Sub Component:
Version:	20
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Miroslav Lichvar
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1126590 1128432 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2014-07-28 21:47 UTC by Wolfgang Rupprecht
Modified:	2014-11-01 22:34 UTC (History)
CC List:	4 users (show)
Fixed In Version:	chrony-1.30-2.fc20
Clone Of:
Environment:
Last Closed:	2014-08-23 01:59:49 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
transcript of sleep / resume, chrony burst, online toggle (12.08 KB, text/plain) 2014-07-31 00:36 UTC, Wolfgang Rupprecht	no flags	Details
chronyd -d -d log (343.01 KB, text/plain) 2014-07-31 00:37 UTC, Wolfgang Rupprecht	no flags	Details
tcpdump port 123 log (20.00 KB, text/plain) 2014-07-31 00:38 UTC, Wolfgang Rupprecht	no flags	Details
transcript of test run of koji 1.30-1 git chrony (36.33 KB, text/x-log) 2014-08-05 22:53 UTC, Wolfgang Rupprecht	no flags	Details
chronyd -d -d for git test version (44.94 KB, text/plain) 2014-08-08 08:15 UTC, Wolfgang Rupprecht	no flags	Details
View All

Description Wolfgang Rupprecht 2014-07-28 21:47:57 UTC

Description of problem:
After a laptop sleep to ram, chrony stops tracking its upstream servers.

Version-Release number of selected component (if applicable):
chrony.x86_64                        1.30-1.fc20                        @updates
e
How reproducible:
always

Steps to Reproduce:
1.sleep a laptop
2.resume from sleep
3.chronyc sources
4. notice all the servers are marked with "10y" since the last update and null values for all the other items.

[wolfgang@ancho ~]$ chronyc sources 
210 Number of sources = 4
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? time1.example.com              0   8     0   10y     +0ns[   +0ns] +/-    0ns
^? time2.example.com              0   8     0   10y     +0ns[   +0ns] +/-    0ns
^? time3.example.com              0   8     0   10y     +0ns[   +0ns] +/-    0ns
^? time4.example.com              0   8     0   10y     +0ns[   +0ns] +/-    0ns

Actual results:
chronyd is wedged

Expected results:
chronyd is not wedged

Additional info:
"systemd restart chrony" fixes things.

Comment 1 Miroslav Lichvar 2014-07-29 09:46:30 UTC

Since 1.30, chronyd resets all source statistics when a forward time jump is detected (e.g. suspend/resume), but it should make a new measurement at most in 2 poll intervals. With the NetworkManager dispatcher script used in the Fedora package it should switch the sources from offline to online mode and make a new measurements quickly after resume.

How long did you wait before getting that "chronyc sources" output after resuming from suspend? Are you using the default chrony configuration? It might help to see chronyd messages in syslog around that suspend.

There was a similar report upstream [1], but this seems to be something different, as Poll in your sources output is 8 and not the default minpoll of 6.

[1] http://comments.gmane.org/gmane.comp.time.chrony.user/995

Comment 2 Wolfgang Rupprecht 2014-07-29 15:05:21 UTC

Here is the stripped of comments version of the file.  Only the server names were edited out.  The poll interval of chronyc sources changes as a function of time.   It was just 7 before, then it is 8, now it is 9.  It may well have been 6 initially. I'll do an extended sleep after sending this to test.   

It just struck me that BZ1119054 - IPv6 addresses going away temporarily on my interfaces may be interacting here.   All the servers all have IPv6 addresses and chrony selected the IPv6 addresses. There is a 5-10 minute pause after resume before the addresses return to my interfaces.  In any case the addresses have long returned to my interfaces and chrony is still stuck when the "chronyc -n sources" was done below.

driftfile /var/lib/chrony/drift
rtcsyncc
makestep 1 3
deny
bindcmdaddress 127.0.0.1
bindcmdaddress ::1
keyfile /etc/chrony.keys
commandkey 1
generatecommandkey
noclientlog
logchange 0.5
logdir /var/log/chrony
#
# all servers resolve to ipv6 addresses (chrony -n sources) some, but
# not all have ipv4 addresses available, but not selected by chrony.
#
server a.example.com iburst
server b.example.com iburst
server c.example.com iburst
server d.example.com iburst
server e.example.com iburst
server f.example.com iburst
server g.example.com iburst
rtconutc
cmdallow 127.0.0.1
cmdallow ::1
cmddeny

[root@ancho wolfgang]# chronyc -n sources
210 Number of sources = 7
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? A:A:A:A:A:A:A:A		 0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? B:B:B:B:B:B:B:B               0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? C:C:C:C:C:C:C:C               0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? D:D:D:D:D:D:D:D               0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? E:E:E:E:E:E:E:E               0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? F:F:F:F:F:F:F:F               0   9     0   10y     +0ns[   +0ns] +/-    0ns
^? G:G:G:G:G:G:G:G               0   9     0   10y     +0ns[   +0ns] +/-    0ns
[root@ancho wolfgang]#

Comment 3 Miroslav Lichvar 2014-07-29 15:27:40 UTC

Does "chronyc -a burst 1/1" or "chronyc -a -m offline online" get it going? Can you please try it with "tcpdump -i any port 123" or similar and see if chronyd is sending NTP packets, but not getting any response?

Also, it might be useful to see the debugging output from chronyd started as "chronyd -d -d".

Comment 4 Wolfgang Rupprecht 2014-07-31 00:36:09 UTC

Created attachment 922768 [details]
transcript of sleep / resume, chrony burst, online toggle

Comment 5 Wolfgang Rupprecht 2014-07-31 00:37:20 UTC

Created attachment 922769 [details]
chronyd -d -d log

Comment 6 Wolfgang Rupprecht 2014-07-31 00:38:14 UTC

Created attachment 922770 [details]
tcpdump port 123 log

Comment 7 Wolfgang Rupprecht 2014-07-31 00:40:40 UTC

burst  - no effect
online toggle - got it working with n

tcpdump and chronyd -d -d logs uploaded along with transcript of the shell commands and results.

Comment 8 Miroslav Lichvar 2014-07-31 14:08:40 UTC

Thanks for the data. It seems the problem is in connecting the client sockets, it fails as the routing is not ready when the chronyc online command is issued and chronyd doesn't try to reconnect later.

As a workaround you can add "acquisitionport 123" to chrony.conf, so chronyd always uses its server socket to query other servers like it used to do before 1.30.

Comment 9 Wolfgang Rupprecht 2014-07-31 14:12:02 UTC

(In reply to Miroslav Lichvar from comment #8)
> As a workaround you can add "acquisitionport 123" to chrony.conf, so chronyd
> always uses its server socket to query other servers like it used to do
> before 1.30.

Thanks for the analysis and workaround! Will do.

Comment 10 Miroslav Lichvar 2014-08-05 08:36:10 UTC

*** Bug 1126590 has been marked as a duplicate of this bug. ***

Comment 11 Miroslav Lichvar 2014-08-05 11:22:48 UTC

This should be now fixed in upstream git.

http://git.tuxfamily.org/chrony/chrony.git/?p=chrony/chrony.git;a=commit;h=43cca04c339ff4e919b3b51a7eac993a3bc3d060

Comment 12 Miroslav Lichvar 2014-08-05 14:58:01 UTC

Here is a scratch build of the current upstream git, which will be probably released as 1.30.1 soon.

http://koji.fedoraproject.org/koji/taskinfo?taskID=7243933

Can you please install the package and see if it fixes the problem for you?

Comment 13 Wolfgang Rupprecht 2014-08-05 22:53:05 UTC

Created attachment 924289 [details]
transcript of test run of koji 1.30-1 git chrony

I'm sorry, this code never seemed to get unstuck.  Koji chrony from the above link was installed and the system rebooted.   This is a different laptop from the initial report and an i686 system instead of a x86_64.  The other laptop is a 40 minute drive away.   This machine is also an up to date fedora 20 system that does a nightly yum update.

Comment 14 Wolfgang Rupprecht 2014-08-05 23:10:02 UTC

And for completeness the journal syslog and yum info.

[root@pearl ~]# journalctl -b0 -u chronyd
-- Logs begin at Sun 2013-11-17 06:59:44 PST, end at Tue 2014-08-05 16:01:04 PDT. --
Aug 05 14:41:20 pearl.wsrcc.com chronyd[490]: chronyd version 1.30 starting
Aug 05 14:41:21 pearl.wsrcc.com chronyd[490]: Frequency -3.811 +/- 0.257 ppm read from /var/lib/chrony/drift
Aug 05 14:41:21 pearl.wsrcc.com systemd[1]: Started NTP client/server.
Aug 05 14:41:36 pearl.wsrcc.com chronyd[490]: Selected source 2601:9:1c80:aa0:a60:6eff:fe74:6fe2
Aug 05 14:48:55 pearl.wsrcc.com chronyd[490]: Source 2601:9:1c80:aa0::1 offline
Aug 05 14:48:55 pearl.wsrcc.com chronyd[490]: Source 2601:9:8580:f94::1 offline
Aug 05 14:48:55 pearl.wsrcc.com chronyd[490]: Source 2601:9:1c80:aa0:a60:6eff:fe74:6fe2 offline
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Forward time jump detected!
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Can't synchronise: no reachable sources
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Source 2601:9:1c80:aa0::1 online
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Source 2601:9:8580:f94::1 online
Aug 05 15:10:52 pearl.wsrcc.com chronyd[490]: Source 2601:9:1c80:aa0:a60:6eff:fe74:6fe2 online
[root@pearl ~]# grep chrony /var/log/yum.log
Aug 05 14:39:33 Updated: chrony-1.30-1.git.fc20.test.i686
Aug 05 14:40:06 Installed: chrony-debuginfo-1.30-1.git.fc20.test.i686

Comment 15 Miroslav Lichvar 2014-08-06 08:25:08 UTC

Hm, in my testing with IPv6 and interface going down and up it seems to reconnect without problems.

Can you please repeat this with chronyd -d -d and attach the output here so we can see if sendmsg() or connect() is returning any errors?

Comment 16 Wolfgang Rupprecht 2014-08-08 08:15:32 UTC

Created attachment 925117 [details]
chronyd -d -d for git test version

The ipv6 addresses came back between these two times.

[root@pearl ~]# date
Fri Aug  8 00:56:09 PDT 2014
[root@pearl ~]# ifconfig 
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 144  bytes 12404 (12.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 144  bytes 12404 (12.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

p1p1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:23:8b:1b:e8:41  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.73.206  netmask 255.255.255.0  broadcast 192.168.73.255
        inet6 fe80::223:4eff:fe1c:585c  prefixlen 64  scopeid 0x20<link>
        ether 00:23:4e:1c:58:5c  txqueuelen 1000  (Ethernet)
        RX packets 1180  bytes 150016 (146.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 853  bytes 112533 (109.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@pearl ~]# date
Fri Aug  8 00:58:30 PDT 2014
[root@pearl ~]# date
Fri Aug  8 01:00:36 PDT 2014
[root@pearl ~]# ifconfig 
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 144  bytes 12404 (12.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 144  bytes 12404 (12.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

p1p1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:23:8b:1b:e8:41  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.73.206  netmask 255.255.255.0  broadcast 192.168.73.255
        inet6 fd47:3cf0:52aa:8000:223:4eff:fe1c:585c  prefixlen 64  scopeid 0x0<global>
        inet6 2601:9:8580:f94:223:4eff:fe1c:585c  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::223:4eff:fe1c:585c  prefixlen 64  scopeid 0x20<link>
        ether 00:23:4e:1c:58:5c  txqueuelen 1000  (Ethernet)
        RX packets 1302  bytes 158312 (154.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 920  bytes 124039 (121.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Comment 17 Miroslav Lichvar 2014-08-08 09:37:50 UTC

From the log it looks like the reconnection succeeded and the packets are being sent. I'm not sure what's wrong. Can you please check tcpdump output and "netstat -n -p | grep chronyd" if the local addresses are correct?

Comment 18 Miroslav Lichvar 2014-08-11 07:12:38 UTC

*** Bug 1128432 has been marked as a duplicate of this bug. ***

Comment 19 Miroslav Lichvar 2014-08-15 10:02:54 UTC

The upstream code was changed to create new socket for each request, this should fix all problems mentioned in this report. (although I'm still curious what exactly was happening in the test from comment #16)

I'll prepare Fedora update shortly.

Comment 20 Fedora Update System 2014-08-15 10:34:17 UTC

chrony-1.30-2.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/chrony-1.30-2.fc20

Comment 21 Wolfgang Rupprecht 2014-08-15 12:10:45 UTC

Sorry, having a hard time rerunning tests.   The i686 machine used in the last few tests (including comment#16) had a bad yum update and is now killing user processes with a kprintf every few seconds.  The x86_64 machine is taking an hour or so to get its ipv6 address after a restart.

As for the new socket, that was going to be my suggestion too.   The tcpdump across a sleep/resume breaks at the resume with a gripe that the interface went away.  To get any data I had to put the tcpdump in a while loop.   It took ~20 loops before the tcpdump took.

Comment 22 Fedora Update System 2014-08-16 00:35:27 UTC

Package chrony-1.30-2.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing chrony-1.30-2.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-9492/chrony-1.30-2.fc20
then log in and leave karma (feedback).

Comment 23 Fedora Update System 2014-08-23 01:59:49 UTC

chrony-1.30-2.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 24 Fedora Update System 2014-09-11 15:13:42 UTC

chrony-1.31-1.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/chrony-1.31-1.el6

Comment 25 Fedora Update System 2014-09-11 15:16:43 UTC

chrony-1.31-1.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/chrony-1.31-1.el5

Comment 26 Fedora Update System 2014-11-01 22:33:11 UTC

chrony-1.31-1.el5 has been pushed to the Fedora EPEL 5 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 27 Fedora Update System 2014-11-01 22:34:20 UTC

chrony-1.31-1.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.