679537 – ntpd startup troubles cause long delays in a boot sequence which uses systemd

Bug 679537 - ntpd startup troubles cause long delays in a boot sequence which uses systemd

Summary: ntpd startup troubles cause long delays in a boot sequence which uses systemd

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	ntp
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Miroslav Lichvar
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2011-02-22 19:44 UTC by Michal Jaegermann
Modified:	2013-01-15 08:41 UTC (History)
CC List:	10 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-10-23 17:33:08 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
results of 'systemctl dump' (38.90 KB, application/x-gzip) 2011-07-18 18:21 UTC, Michal Jaegermann	no flags	Details
View All

Description Michal Jaegermann 2011-02-22 19:44:35 UTC

Description of problem:

If there is any kind of a trouble with a starting of ntpd in boot sequence (misconfiguration on either end, a network not yet up, unreachable servers for whatever reasons) then a boot sequence stops for a long while until a timeout is reached and systemd will report "Unit ntpdate.service entered failed state". What is more restarting service ntpd later is often instantenous.

There was in the past bug 216351 which looked quite similar and later was resolved not to block boot.  It appears that changes in ntpd and/or a boot sequence reintroduced this bug.

Version-Release number of selected component (if applicable):
ntp-4.2.6p3-2.fc15

How reproducible:
always if there is any kind of a connection issue

Comment 1 Miroslav Lichvar 2011-02-22 20:08:55 UTC

Is this about ntpdate or ntpd service?

ntpd is started after ntpdate, so if there is a connection issue, ntpd will be delayed until the ntpdate service timeouts. The ntpdate failed state is correct in such case. The number of retries is configurable in /etc/sysconfig/ntpdate.

Comment 2 Michal Jaegermann 2011-02-22 21:39:36 UTC

(In reply to comment #1)
> Is this about ntpdate or ntpd service?

Looking at messages on a screen I see a long delay and after that a red "Failed" talking about ntpd. OTOH digging through logs for examples indeed shows that "Unit ntpdate.service entered failed state" is preceded by "systemd[1]: ntpdate.service: main process exited, code=exited, status=1" when there are troubles.

> ntpd is started after ntpdate, so if there is a connection issue, ntpd will be
> delayed until the ntpdate service timeouts.

Yes, but if there are any issues, which is not always under a local control, one gets a nasty delay on boot.

> The ntpdate failed state is correct in such case.

Nobody said otherwise.

> The number of retries is configurable in /etc/sysconfig/ntpdate.

In other words you suggest that one should really use RETRIES=0?  How often retrying here may help?

Comment 3 Miroslav Lichvar 2011-02-22 22:00:06 UTC

(In reply to comment #2)
> Yes, but if there are any issues, which is not always under a local control,
> one gets a nasty delay on boot.

But only ntpd is delayed, or not?

The main purpose of the ntpdate service is to delay starting services which need correct time on start. Only very few services should need it (and have After=ntpdate.service in their unit files).

> In other words you suggest that one should really use RETRIES=0?  How often
> retrying here may help?

I'd suggest you to disable it, do you really need it?

Comment 4 Michal Jaegermann 2011-02-22 22:57:03 UTC

(In reply to comment #3)
> (In reply to comment #2)
> > Yes, but if there are any issues, which is not always under a local control,
> > one gets a nasty delay on boot.
> 
> But only ntpd is delayed, or not?

No.  Everything stops and waits for this timeout.  It would be hard to notice
otherwise.  After all retries booting continues.

> I'd suggest you to disable it, do you really need it?

Not really; in particular if ntpd is configured to "jump" once in a presence of significant clock discrepancies (I did not check).  OTOH this looks like a default and on a test system I am trying not to stray too far from defaults.

Here is how to reproduce that without messing up all the remaining networking.
Configure ntpd to use as the only server some local machine which will not provide time; either because is not running such server or, possibly better, with a stratum too high.  Try to reboot, no plymouth, and watch what will happen.

Comment 5 Miroslav Lichvar 2011-02-23 11:57:11 UTC

Hm, it seems to work here differently. The system is usable even when the ntpdate service didn't finish yet.

# systemctl list-jobs 
   1 graphical.target          start           waiting
   2 multi-user.target         start           waiting
  72 systemd-...nlevel.service start           waiting
  73 ntpd.service              start           waiting
  75 ntpdate.service           start           running
 129 systemd-...ead-done.timer start           waiting

Is this expected?

Comment 6 Michal Jaegermann 2011-02-23 19:53:23 UTC

(In reply to comment #5)
> Hm, it seems to work here differently. The system is usable even when the
> ntpdate service didn't finish yet.

Not sure what I can say.  This is a sample fragment from /var/log/messages recording what happens during a boot:
....
14:02:26 localhost abrtd: Init complete, entering main loop
14:02:32 localhost ntpdate[1352]: no servers can be used, exiting
 ... (other retries and later ntpd restarts) ...
14:02:53 localhost ntpd[1358]: 0.0.0.0 c016 06 restart
14:02:53 localhost ntpd[1358]: 0.0.0.0 c012 02 freq_set kernel -17.894 PPM
14:02:54 localhost /usr/sbin/gpm[1367]: *** info [daemon/startup.c(136)]:
 ... (the rest of boot continues and the next will be smartd) ...
 
> # systemctl list-jobs 
>    1 graphical.target          start           waiting
>    2 multi-user.target         start           waiting
>   72 systemd-...nlevel.service start           waiting
>   73 ntpd.service              start           waiting
>   75 ntpdate.service           start           running
>  129 systemd-...ead-done.timer start           waiting

In what moment you issued this command?  When I am trying there is nothing in an output.  OTOH the manpage says "List jobs that are in progress" and if I booted and have a shell prompt then they are all done.

At this moment I have configured everything in such way that ntpd does NOT have this delay.

Comment 7 Michal Jaegermann 2011-03-28 18:46:19 UTC

I see that this report was flagged with NEEDINFO.  I am not sure what kind of info I can provide above of what I wrote.  If ntpd/ntpdate are getting required connections then there is no issue.  If something went wrong then the whole boot process is blocked for a long time waiting for this "ntpdate.service entered failed state".  I thought that a systemd promise was to avoid such issues.

Again - I do not understand when you are getting this 'systemctl list-jobs' in comment #5. If booting is blocked then a shell prompt is not coming yet for a looong while. With a shell prompt present the wait is clearly over.

Comment 8 Miroslav Lichvar 2011-03-29 11:57:59 UTC

I don't see the blocking here. I can login before the ntpdate service has finished and monitor it with the list-jobs command.

I'm not sure if it's a bug in the ntpd/ntpdate unit file or systemd or something else.

Comment 9 Aissen 2011-05-13 19:41:40 UTC

I'm hit by the exact same bug. New F15 install which I upgraded just after:

See /var/log/messages excerpt:

May 13 21:11:00 lisa kernel: [   69.492164] All bugs added by David S. Miller <davem>
May 13 21:11:00 lisa sm-notify[1161]: Version 1.2.3 starting
May 13 21:11:04 lisa systemd[1]: Startup finished in 3s 889ms 789us (kernel) + 5s 62ms 313us (initrd) + 1min 5s 291ms 301us (userspace) = 1min 14s 243ms 
403us.
May 13 21:12:00 lisa ntpd_intres[889]: host name not found: 0.fedora.pool.ntp.org
May 13 21:12:00 lisa ntpd_intres[889]: host name not found: 1.fedora.pool.ntp.org
May 13 21:12:00 lisa ntpd_intres[889]: host name not found: 2.fedora.pool.ntp.org
May 13 21:12:00 lisa ntpd_intres[889]: host name not found: 3.fedora.pool.ntp.org
May 13 21:14:02 lisa ntpd_intres[889]: host name not found: 0.fedora.pool.ntp.org
May 13 21:14:02 lisa ntpd_intres[889]: host name not found: 1.fedora.pool.ntp.org
May 13 21:14:02 lisa ntpd_intres[889]: host name not found: 2.fedora.pool.ntp.org
May 13 21:14:02 lisa ntpd_intres[889]: host name not found: 3.fedora.pool.ntp.org
May 13 21:15:26 lisa dbus: [system] Activating service name='net.reactivated.Fprint' (using servicehelper)


Either it's a coincidence or ntp is eating 4 minutes of boot time.

Comment 10 Aissen 2011-05-13 21:55:05 UTC

I take back my claims backed only by log messages.
After I disabled ntpd.service, boot was still slow.
systemd-analyse blame showed the culprit: udev-settle.service is taking 70 seconds to start and is blocking every other service. I'm of course using LVM on this laptop (for root and home), but it's clearly a regression over F13.
This might be material for another bug report.

Comment 11 Michal Schmidt 2011-06-07 09:57:51 UTC

Please attach the output of 'systemctl dump' after boot.

Comment 12 Michal Jaegermann 2011-06-12 08:46:29 UTC

(In reply to comment #11)
> Please attach the output of 'systemctl dump' after boot.

That would be difficult for the moment.  Right now I am "on the road" and I will not see my test system for the next month or slightly longer.  I will try to keep that in mind for a time I will be back.

Comment 13 Michal Schmidt 2011-06-14 07:28:43 UTC

OK, take your time. Let's keep the NEEDINFO flag until then.

Comment 14 Michal Jaegermann 2011-07-18 18:21:57 UTC

Created attachment 513662 [details]
results of 'systemctl dump'

> Please attach the output of 'systemctl dump' after boot.

It is attached.  Apologies that it took so long.  There is some change in a comparison with a situation from February when this bug was reported.  With the current rawhide I made a local time server to reject requests and rebooted.  This time I got:

ntpdate[1227]: no server suitable for synchronization found 
ntpdate[1405]: no server suitable for synchronization found
ntpdate[1450]: no server suitable for synchronization found
systemd[1]: ntpdate.service: main process exited, code=exited, status=1
systemd[1]: Unit ntpdate.service entered failed state.
systemd[1]: Startup finished in 1s 691ms 44us (kernel) + 10s 484ms 918us (initrd) + 1min 10s 902ms 448us (userspace) = 1min 23s 78ms 410us.

That is still substantially longer than with a normal boot where time spent
in "userspace" part will be in an order of 30+ seconds but now I do not have a "smoking gun" pointing at a party responsible for a delay (with this exception that a time stamp on the first quoted message is 12:02:38 and 12:03:16 on
'Startup finished').

Something did change although I am not entirely sure what.

Comment 15 Jóhann B. Guðmundsson 2011-07-19 14:38:15 UTC

Could you replace the legacy sysv script with this unit file and replace the tickers with what ever you have configure and reboot and measure


# cat /lib/systemd/system/ntpdate.service

[Unit]
Description=Setting Local Clock By Polling NTP servers
After=syslog.target network.target NetworkManager-wait-online.service

[Service]
Type=oneshot
EnvironmentFile=-/etc/sysconfig/ntpdate
ExecStart=/usr/sbin/ntpdate -U ntp -s -b $OPTIONS 0.fedora.pool.ntp.org 1.fedora.pool.ntp.org 2.fedora.pool.ntp.org 3.fedora.pool.ntp.org
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

run

# systemctl daemon-reload 
# systemctl enable ntpdate.target 
# reboot 

Hum I guess you could run time on the command for gestimated time it should take as in ...

# time /usr/sbin/ntpdate -U ntp -s -b -p 2 0.fedora.pool.ntp.org 1.fedora.pool.ntp.org 2.fedora.pool.ntp.org 3.fedora.pool.ntp.org

Comment 16 Michal Jaegermann 2011-07-19 22:29:26 UTC

(In reply to comment #15)

> ... and replace the
> tickers with what ever you have configure and reboot and measure
.....

> 0.fedora.pool.ntp.org
> 1.fedora.pool.ntp.org 2.fedora.pool.ntp.org 3.fedora.pool.ntp.org

The above means that you absolutely, totally and entirely miss the point.

Again!  If so happens that a time server or servers which you are trying to use are not responding _then_ an ntp service adds a substantial chunk (like over 50% of time used by a user space) to a boot time.  It is not going into background, it does not schedule a retry for later, but blocks the whole boot sequence. Indeed for test purposes I "misconfigured" my installation to cause such delay but those servers can be unreachable, even if only sometimes, for assorted reasons.

If there are no troubles getting a response from time servers then there is no delay.  Hardly surprising but not what this report is about.

> Could you replace the legacy sysv script with this unit file ...

I can try that but clearly for this test I cannot use there "reachable" servers.

Comment 17 Michal Jaegermann 2011-07-19 23:33:14 UTC

(In reply to comment #15)

> # cat /lib/systemd/system/ntpdate.service

I do have such file which belongs to ntp-4.2.6p3-4.fc16.x86_64
package

> # systemctl daemon-reload
That is ok, but then next command
> # systemctl enable ntpdate.target 
responds with: "Couldn't find ntpdate.target.".
'systemctl enable ntpdate.service' does not have problems, though.

> # reboot

Does not change anything essential in the whole picture.  It is not the end of the world, though.

Comment 18 Jóhann B. Guðmundsson 2011-07-20 00:13:25 UTC

(In reply to comment #17)
> (In reply to comment #15)
> 
> > # cat /lib/systemd/system/ntpdate.service
> 
> I do have such file which belongs to ntp-4.2.6p3-4.fc16.x86_64
> package

It's not the same file as I posted. 

The unit file in that package call the legacy sysv init script which then calls the legacy sysv init script to rule the legacy sysv unit out I created a clean unit file 

> > # systemctl daemon-reload > That is ok, but then next command
> > # systemctl enable ntpdate.target 
> responds with: "Couldn't find ntpdate.target.".
> 'systemctl enable ntpdate.service' does not have problems, though.

The target ending is my fail

> > # reboot
> 
> Does not change anything essential in the whole picture.  It is not the end of
> the world, though.

Just out of curiosity is this an upgrade or an clean install with old config file ?

It would be good if you would attache your /etc/ntp.conf then atleast we should be able to duplicated this

Comment 19 Michal Jaegermann 2011-07-20 02:25:00 UTC

(In reply to comment #18)
> (In reply to comment #17)
> > 
> > I do have such file which belongs to ntp-4.2.6p3-4.fc16.x86_64
> > package
> 
> It's not the same file as I posted.

Indeed.  That is because your file is useless for reproducing the
problem as hopefully explained in comment #16.

> Just out of curiosity is this an upgrade or an clean install with old config
> file ?

This is a "continuous upgrade" of rawhide.

> It would be good if you would attache your /etc/ntp.conf then atleast we should
> be able to duplicated this

This is just /etc/ntp.conf as included in ntp package with that difference that
instead of 'server 0.fedora.pool.ntp.org iburst', and so on, it uses, while testing, only a "bad" server on my internal network.  Useless for you as this is a non-routable address.

While not running that test a server used is also on an internal network but in a "good" state (so there is no delay this report was about).

Comment 20 Tore Anderson 2011-08-06 11:35:49 UTC

I've also had problems with long boot times since the upgrade to F15. What appear to happen for me is that both the ntpdate service and a NFS mount service gets started pretty much at the same time as NetworkManager, and way before the network has actually settled and are ready for use.

I assume that this in turn causes the NTP/NFS tasks fail on their first attempt(s), reverting to a exponential timeout-and-retry loop. When the network gets ready, the services are probably waiting for some timeout before they try again, causing a long boot delay.

The solution was to enable the NetworkManager-wait-online service:

systemctl enable NetworkManager-wait-online.service

This appears to cause the boot process to wait for the network to settle before attempting the NTP/NFS tasks, which dramatically cuts down the overall boot time.

Tore

Comment 21 Michal Jaegermann 2011-08-06 20:34:04 UTC

(In reply to comment #20)
> 
> The solution was to enable the NetworkManager-wait-online service:

Will this work if NetworkManager is NOT in use?  Apparently fixing NM so it will work always is not in cards.  Also this will not help in the original situations
when time server(s) are not responding (for whatever reasons which may, or may not, include a non-working at this moment network).

NTP used to, at least from some moment, to go into a background and retry from there.  I am not sure why it blocks now.

Comment 22 Miroslav Lichvar 2012-10-23 15:42:30 UTC

Is anyone still seeing this?

Comment 23 Michal Jaegermann 2012-10-23 17:01:48 UTC

(In reply to comment #22)
> Is anyone still seeing this?

In the meantime I switched from ntp to chrony, just to avoid nasty surprises, and distributions as well, so I am not seeing that any longer.

Comment 24 Miroslav Lichvar 2012-10-23 17:33:08 UTC

Ok, let's close this bug then.

Please reopen if you see it again.

Comment 25 Schoschi Decker 2013-01-13 03:46:08 UTC

Yes, I have seen it on Fedora 17 ARM, and Fedora 17 x86_64.  Most recently on Fedora 17 ARM, running on a Raspberry Pi.  It happens consistently, after each reboot.  Have to execute # systemctl restart ntpd.service after every reboot.

# systemctl status ntpd.service
ntpd.service - Network Time Service
          Loaded: loaded (/usr/lib/systemd/system/ntpd.service; enabled)
          Active: inactive (dead)
          CGroup: name=systemd:/system/ntpd.service

# systemctl restart ntpd.service

# systemctl status ntpd.service
ntpd.service - Network Time Service
          Loaded: loaded (/usr/lib/systemd/system/ntpd.service; enabled)
          Active: active (running) since Sat, 12 Jan 2013 19:41:43 -0800; 2s ago
         Process: 4783 ExecStart=/usr/sbin/ntpd -u ntp:ntp $OPTIONS (code=exited, status=0/SUCCESS)
        Main PID: 4784 (ntpd)
          CGroup: name=systemd:/system/ntpd.service
                  â 4784 /usr/sbin/ntpd -u ntp:ntp -g

Comment 26 Miroslav Lichvar 2013-01-14 09:22:50 UTC

I think you (In reply to comment #25)
> Yes, I have seen it on Fedora 17 ARM, and Fedora 17 x86_64.  Most recently
> on Fedora 17 ARM, running on a Raspberry Pi.  It happens consistently, after
> each reboot.  Have to execute # systemctl restart ntpd.service after every
> reboot.

The ntpd service wasn't started at all, that looks like a different bug. Please see bug #821813, if it's not a conflict in systemd with the chronyd.service.

Comment 27 Schoschi Decker 2013-01-15 02:23:12 UTC

I don't have chronyd installed all, so I don't see the two issues as related in the cases I referenced.

On my Fedora 17 ARM system... No chronyd installed...

# yum erase chronyd
Loaded plugins: langpacks, presto, refresh-packagekit
No Match for argument: chronyd
No Packages marked for removal

More over, I explicitly set NTPD to start...

# systemctl enable ntpd.service

And finally wrote a small script to validate NTPD has started, no reboot.  But that is a kludge to an issue that I think should just not happen.

Moreover, chronyd is not part of the official Fedora 17 distribution?  It is not in the default repos, where NTPD is, why would anyone think to use chronyd?  When NTPD is so universal to the greater linux world?

Is there some official best practice documentation to use chrony?

Comment 28 Miroslav Lichvar 2013-01-15 08:41:29 UTC

(In reply to comment #27)
> I don't have chronyd installed all, so I don't see the two issues as related
> in the cases I referenced.

The package is called chrony, the service is chronyd.service.

Note You need to log in before you can comment on or make changes to this bug.