1485712 – NetworkManager-wait-online.service does not wait for network to be online so nfs mounts fail

Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	NetworkManager
Version:	26
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Lubomir Rintel
QA Contact:	Fedora Extras Quality Assurance

Reported:	2017-08-27 12:43 UTC by Barry Scott
Modified:	2018-05-29 11:31 UTC
Last Closed:	2018-05-29 11:31:32 UTC
Type:	Bug

Description Barry Scott 2017-08-27 12:43:13 UTC

Description of problem:

NetworkManager-wait-online.service does not wait for network to be online.

This is because it has the following line:

     ExecStart=/usr/bin/nm-online -s -q --timeout=30 

The -s says wait for NetworkManager to be running. It does not
wait for any network interface to be usable. This means that
any service that needs the network, such as an NFS mount, will
fail, and in my setup they are failing.

Version-Release number of selected component (if applicable):

NetworkManager-1.8.2-1.fc26.x86_64

How reproducible:

Add the following unit and see that its output shows errors:

[Unit]
Description=Check DNS working

Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStartPre=/usr/sbin/ip addr
ExecStartPre=-/usr/bin/cat /etc/resolv.conf
ExecStart=-/usr/bin/host fender


Actual results:

$ systemctl status check-dns-working.service
● check-dns-working.service - Check DNS working
   Loaded: loaded (/etc/systemd/system/check-dns-working.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Sun 2017-08-27 12:55:17 BST; 36min ago
  Process: 1237 ExecStart=/usr/bin/host fender (code=exited, status=1/FAILURE)
  Process: 1236 ExecStartPre=/usr/bin/cat /etc/resolv.conf (code=exited, status=1/FAILURE)
  Process: 1231 ExecStartPre=/usr/sbin/ip addr (code=exited, status=0/SUCCESS)
 Main PID: 1237 (code=exited, status=1/FAILURE)

Aug 27 12:55:04 varric.chelsea.private ip[1231]:     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Aug 27 12:55:04 varric.chelsea.private ip[1231]:     inet 127.0.0.1/8 scope host lo
Aug 27 12:55:04 varric.chelsea.private ip[1231]:        valid_lft forever preferred_lft forever
Aug 27 12:55:04 varric.chelsea.private ip[1231]:     inet6 ::1/128 scope host
Aug 27 12:55:04 varric.chelsea.private ip[1231]:        valid_lft forever preferred_lft forever
Aug 27 12:55:04 varric.chelsea.private systemd[1]: Started Check DNS working.
Aug 27 12:55:04 varric.chelsea.private ip[1231]: 2: enp3s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default 
Aug 27 12:55:04 varric.chelsea.private ip[1231]:     link/ether 14:dd:a9:dc:52:da brd ff:ff:ff:ff:ff:ff
Aug 27 12:55:04 varric.chelsea.private cat[1236]: /usr/bin/cat: /etc/resolv.conf: No such file or directory
Aug 27 12:55:17 varric.chelsea.private host[1237]: ;; connection timed out; no servers could be reached


Expected results:

The ip addr and host commands work,
and the NFS mount can then succeed.

Additional info:

Comment 1 Thomas Haller 2017-08-28 08:38:10 UTC

> The -s says wait for NetworkManager to be running. It does not
> wait for any network interface to be usable. 

This is not true. From `man nm-online`:

       -s | --wait-for-startup
           Wait for NetworkManager startup to complete, rather than waiting
           for network connectivity specifically. Startup is considered
           complete once NetworkManager has activated (or attempted to
           activate) every auto-activate connection which is available given
           the current network state. (This is generally only useful at boot
           time; after startup has completed, nm-online -s will just return
           immediately, regardless of the current network state.)


See https://bugzilla.redhat.com/show_bug.cgi?id=1483343#c3

Comment 2 Timo Ballin 2018-03-07 11:15:01 UTC

We also have serious problems with this behaviour.

As you quoted:

"(or attempted to activate)"

which means, as described in the referenced bug (https://bugzilla.redhat.com/show_bug.cgi?id=1483343#c3), that this happens after 5 seconds.

We have some new workstations with a 5GBASE-T network adapter connected to a 1GBASE-T switch. Obviously the negotiation takes some time. Since the system boots from an M.2 NVMe drive, everything else is very fast.

I have several NFS4 mounts in the fstab like:

# admin utility mount
adminutil:/adminutil /root/adminutil nfs4 defaults,auto,_netdev 0 0
# home dirs
sr-nethomes:/nethomes /rakete/home/ldap nfs4 defaults,auto,_netdev 0 0
# tools mount
sr-tools:/tools /rakete/tools nfs4 defaults,_netdev,auto,lookupcache=positive 0 0

What I tried to express with "auto" is: I need these mounts. Really. I would even have expected something like NetworkManager-wait-online.service to be enabled automatically, since the whole remote-fs-pre.target is absolutely useless without it.

Even with NetworkManager-wait-online.service enabled manually, the workstation is useless in, let's say, one out of four boot attempts. But it boots quite fast, I have to admit. Useless, but fast.

I clearly see the default timeout of 5 seconds in the journal. Then the NFS mounts fail and things go down the drain:

Mar 06 14:49:59 almach.pma.lan systemd[1]: Starting Network Manager Wait Online...
:
Mar 06 14:49:59 almach.pma.lan NetworkManager[1230]: <info>  [1520344199.7260] manager: (eno1): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Mar 06 14:49:59 almach.pma.lan NetworkManager[1230]: <info>  [1520344199.7269] device (eno1): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
Mar 06 14:49:59 almach.pma.lan kernel: IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
Mar 06 14:49:59 almach.pma.lan kernel: IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
:
Mar 06 14:50:04 almach.pma.lan NetworkManager[1230]: <info>  [1520344204.6456] manager: startup complete
Mar 06 14:50:04 almach.pma.lan systemd[1]: Started Network Manager Wait Online.
Mar 06 14:50:04 almach.pma.lan systemd[1]: Starting LSB: Bring up/down networking...
Mar 06 14:50:04 almach.pma.lan network[1473]: Bringing up loopback interface:  [  OK  ]
Mar 06 14:50:04 almach.pma.lan NetworkManager[1230]: <info>  [1520344204.9976] audit: op="connection-activate" uuid="51e24d69-bace-47ba-807f-f5d3f314bd25" name="eno1" result="fail" reason="No suitable device found for this connection."
Mar 06 14:50:04 almach.pma.lan network[1473]: Bringing up interface eno1:  Error: Connection activation failed: No suitable device found for this connection.
Mar 06 14:50:05 almach.pma.lan network[1473]: [FAILED]
:
Mar 06 14:50:05 almach.pma.lan mount[1646]: mount.nfs4: Failed to resolve server sr-nethomes: Name or service not known
Mar 06 14:50:05 almach.pma.lan mount[1646]: mount.nfs4: Operation already in progress
:
Mar 06 14:50:05 almach.pma.lan NetworkManager[1230]: <info>  [1520344205.7534] device (eno1): link connected
Mar 06 14:50:05 almach.pma.lan kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
Mar 06 14:50:05 almach.pma.lan kernel: warning: `NetworkManager' uses legacy ethtool link settings API, link modes are only partially reported
Mar 06 14:50:05 almach.pma.lan NetworkManager[1230]: <info>  [1520344205.7538] device (eno1): state change: unavailable -> disconnected (reason 'carrier-changed') [20 30 40]
Mar 06 14:50:05 almach.pma.lan NetworkManager[1230]: <info>  [1520344205.7546] policy: auto-activating connection 'eno1'
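The timing in the excerpt above can be checked from the epoch timestamps NetworkManager prints in square brackets. The following is only a rough sketch: the function name `nm_gap` is made up for illustration, and it assumes the log lines are fed in on stdin in the same format as shown above.

```shell
# Print the seconds between the first line containing FIRST and the next
# line containing SECOND, using the bracketed epoch stamps in the log.
# Usage: nm_gap FIRST SECOND < logfile      (hypothetical helper)
nm_gap() {
    awk -v first="$1" -v second="$2" '
        # Return the bracketed epoch stamp of a log line, or -1 if none is found.
        function stamp(line) {
            if (match(line, /\[[0-9]+\.[0-9]+\]/))
                return substr(line, RSTART + 1, RLENGTH - 2) + 0
            return -1
        }
        !seen && index($0, first) { t0 = stamp($0); seen = 1; next }
        seen && index($0, second) { printf "%.3f\n", stamp($0) - t0; exit }
    '
}
```

Fed the lines above, `nm_gap 'state change: unmanaged -> unavailable' 'manager: startup complete'` gives about 4.919 seconds, and `nm_gap 'manager: startup complete' 'link connected'` gives about 1.108 seconds. In other words, the link came up roughly one second after startup was declared complete.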


If I deploy my own copy of the unit file with the '-s' switch removed, everything works fine. If you say NetworkManager-wait-online.service is working as "intended", I am at a loss as to how to force systemd to wait until I have a working network connection using the provided tools. Of course I can poke around in systemd unit files...

But this unit file is called NetworkManager-wait-online.service and not NetworkManager-wait-max5s-and-then-maybe-online.service.

There are lots of reasons a system needs to wait until it has a working network connection. If this takes 6 hours, I will think about changing the hardware, the configuration, whatever. But systemd should wait, forever if necessary, before proceeding with network stuff. Especially since this service is off by default.

Comment 3 Thomas Haller 2018-03-07 23:15:04 UTC

The interface has no carrier for more than five seconds. NetworkManager doesn't assume that anything will still happen, and declares that startup is complete. When carrier comes later, it doesn't matter, because NM-w-o has already completed.

Since 1.10 you can configure the wait-time for carrier per-device, see https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=b595a80977193c7dd2a79ab5bd3caaa28bb88252
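For reference, a minimal sketch of what such a per-device setting could look like on 1.10 or later. It assumes the option is the `carrier-wait-timeout` key (in milliseconds) of a `[device-*]` section in NetworkManager.conf. The helper name `write_carrier_dropin`, the file name and the section name are illustrative choices, not something prescribed in this report.

```shell
# Write a NetworkManager drop-in that lengthens the carrier wait for one device.
# Usage: write_carrier_dropin IFACE TIMEOUT_MS [CONF_DIR]   (hypothetical helper)
write_carrier_dropin() {
    iface="$1"
    timeout_ms="$2"
    # Assumed default location for NetworkManager configuration snippets.
    conf_dir="${3:-/etc/NetworkManager/conf.d}"

    cat > "$conf_dir/90-carrier-wait-$iface.conf" <<EOF
[device-carrier-wait-$iface]
match-device=interface-name:$iface
carrier-wait-timeout=$timeout_ms
EOF
}
```

Running `write_carrier_dropin eno1 30000` as root and then restarting NetworkManager would ask it to wait up to 30 seconds for carrier on eno1 before declaring startup complete. The 30 second value is only an example.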

You can replace NetworkManager-wait-online.service with any service of your choosing to block network-online.target. For example, a shell script that polls `nmcli general status`. Or even use "nm-online" without the -s option, if that works for you. NetworkManager-wait-online.service is a very simple hammer. It cannot be perfect, nor suitable for everybody.
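A rough sketch of such a polling script, for illustration only. It assumes that `nmcli -t -f STATE general` prints the bare state string and that "connected" is the state worth waiting for. The names `wait_online` and `STATE_CMD` are made up here, and the one-second poll interval is arbitrary.

```shell
# Command used to read the overall NetworkManager state. It can be overridden,
# for example to try the function on a machine without NetworkManager.
: "${STATE_CMD:=nmcli -t -f STATE general}"

# Block until the state is exactly "connected" or TIMEOUT seconds have passed.
# Usage: wait_online [TIMEOUT]   (returns 0 when connected, 1 on timeout)
wait_online() {
    timeout="${1:-60}"
    elapsed=0
    while :; do
        # Unquoted on purpose so the command and its arguments are split.
        state=$($STATE_CMD 2>/dev/null)
        [ "$state" = "connected" ] && return 0
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

A script that ends with `wait_online 120` could then serve as the ExecStart of a oneshot unit ordered before network-online.target. Unlike `nm-online -s`, it keeps waiting after NetworkManager's startup has completed, so it delays boot for the full timeout when the cable really is unplugged.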

But the -s option is precisely there for NM-w-o. If it doesn't work, it should be fixed (for example by making the carrier-wait-timeout configurable, which was done in newer versions). But saying that the -s option is wrong for NM-w-o is not correct.

Whether NM-wait-online is enabled by default depends on the systemd presets. Most systems don't need this, that is why it's disabled by default. It's intended to be there for you to enable as a simple solution for a particular problem. But the default configuration cannot be optimal out of the box for every user.

I would close this as NEXTRELEASE, because we are not upgrading Fedora 27 or older to 1.10, and because we probably won't backport the configuration option. Opinions welcome.

Comment 4 Timo Ballin 2018-03-08 14:03:30 UTC

I perfectly understand how it works and how to fix it myself.

What I do not understand is what the purpose of the NM-wait-online service is if it does not wait until NM is online. It doesn't even fail if NM is not online! So the status of NM is random after NM-wait-online.

Actually, I am not interested in how NM-wait-online checks whether NM is online. But after a successful run of NM-wait-online, NM has to be online.

As documented here:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

"If you use NetworkManager you can do this by enabling NetworkManager-wait-online.service:

systemctl enable NetworkManager-wait-online.service"
:
:
"This will ensure that all configured network devices are up and have an IP address assigned before boot continues. This service will time out after 90s. Enabling this service might considerably delay your boot even if the timeout is not reached. Both services are disabled by default."

In the current state this does NOT work.

Comment 5 Thomas Haller 2018-03-08 14:33:13 UTC

> What I do not understand is what the purpose of the NM-wait-online service is
> if it does not wait until NM is online. It doesn't even fail if NM is not
> online! So the status of NM is random after NM-wait-online.

The purpose of NM-wait-online is to delay network-online.target (and indirectly other units).

NM-w-o completing means that startup is complete. A bit like `udevadm settle`.
Note that `udevadm settle` doesn't guarantee that all devices are discovered. Instead, it guarantees that udev has finished processing all currently found devices.
NM-w-o completing means that NetworkManager did all initial activations to a point where no further actions are expected.

Of course, this involves guessing, because NM can never know whether a second later something would happen that requires additional actions. In your case, NM determines that the cable is probably unplugged and declares startup complete. In reality, the device just takes that long to initialize. The problem is the 5 second timeout. A timeout cannot be perfect for the wide range of hardware. It's either too long (waiting needlessly) or too short, like in your case.


The documentation you quote is not accurate (it's not NM documentation, fwiw). It's badly worded in claiming that when NM-w-o completes, all interfaces will have an IP address (they might have failed to activate). If you boot the machine with the cable unplugged, you obviously won't be online no matter how long you wait. But that won't delay boot any longer than it takes NM to determine that the cable is probably really unplugged and nothing is going to happen. If you boot with the cable unplugged, you don't want to wait 30 seconds. You wait 5 seconds until NM is convinced the cable is unplugged.


You are asking for a different meaning of NM-w-o. You are free to implement any kind of service that suits your expectations. NM-w-o isn't doing what you ask, but it does what makes sense in a lot of cases. In fact, your issue is with the 5 second timeout waiting for carrier. If that timeout were longer (it's configurable in 1.10+), then NM-w-o would work just fine for you. That doesn't mean something is fundamentally wrong with how NM-w-o works.

Comment 6 Timo Ballin 2018-03-08 16:48:32 UTC

Perhaps you are right and this is the correct way to do it.

When you say "working as intended" what can I say...

But I am quite sure that this is very confusing for a lot of admins and users. I do not see the harm in increasing the default timeout to a safe value, because it will not affect anybody in a negative way. And yes, if I manually enable NM-w-o, this means I want NM to wait so that I could climb under the desk and plug in the cable.
My English may not be perfect, but obviously my understanding of "wait online" differs from yours. I am not talking about NetworkManager, I am only talking about the meaning of "wait online". I think you will need to explain your point of view to everyone who depends on a working network after boot, as soon as they have trouble with it. In my personal opinion this is another point making systemd and NM even more complicated.

When people read NM-wait-online.service they think they understand what it does; after reading your argument I am quite sure they do not. And they will only notice their misunderstanding when things break.

So, as a workaround for everyone who has bought a 5GBase-T interface and connected it to a 1GBase-T switch port, or who just needs a reliable "online status":

cp /usr/lib/systemd/system/NetworkManager-wait-online.service /etc/systemd/system/NetworkManager-wait-online.service

Then, as stated in the first post, remove the "-s" in the copied NetworkManager-wait-online.service; this works fine.

Reload with
systemctl daemon-reload

/etc/systemd/system/NetworkManager-wait-online.service can be deployed on every workstation without any boot-time impact, unless you have removed the LAN cable.
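An alternative to copying the whole unit, sketched here under assumptions, is a systemd drop-in that replaces only the ExecStart line and leaves the rest of the packaged unit untouched. The helper name `write_wait_online_override` and the file name `override.conf` are illustrative, and the nm-online path and options are taken from the ExecStart line quoted in the description.

```shell
# Write a drop-in that runs nm-online without -s for the wait-online unit.
# Usage: write_wait_online_override [TIMEOUT_S] [UNIT_DIR]   (hypothetical helper)
write_wait_online_override() {
    timeout_s="${1:-30}"
    unit_dir="${2:-/etc/systemd/system}"
    dropin_dir="$unit_dir/NetworkManager-wait-online.service.d"

    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<EOF
[Service]
# An empty ExecStart= clears the command inherited from the packaged unit.
ExecStart=
ExecStart=/usr/bin/nm-online -q --timeout=$timeout_s
EOF
}
```

After running `write_wait_online_override 30` as root, `systemctl daemon-reload` is still needed, as above. A drop-in survives package updates of the original unit, whereas a full copy in /etc/systemd/system shadows any later changes to it.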

Comment 7 Thomas Haller 2018-03-09 10:37:08 UTC

What exactly is supposed to happen on this bug?

Do you request that the timeout of 5 seconds be increased?

In newer versions it has already been increased to 6 seconds
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=156344b8beec88b68f335fe13c5db91d62fcb3fc
and additionally it is made configurable (per device).


I think there is nothing left to do, except, that this is not backported to Fedora 26. Is that what you request?

Comment 8 Barry Scott 2018-03-10 14:11:44 UTC

Best is to make the timeout a config option.
Second best is to increase to say 30s.

Comment 9 Fedora End Of Life 2018-05-03 08:06:57 UTC

This message is a reminder that Fedora 26 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 26. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '26'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 26 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 10 Fedora End Of Life 2018-05-29 11:31:32 UTC

Fedora 26 changed to end-of-life (EOL) status on 2018-05-29. Fedora 26
is no longer maintained, which means that it will not receive any
further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
