Bug 1277257 - NetworkManager-wait-online fails on boot when 2 NIC are active
Summary: NetworkManager-wait-online fails on boot when 2 NIC are active
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: bind
Version: 23
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: Tomáš Hozza
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
: 1316687 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-11-02 20:11 UTC by Frank Ansari
Modified: 2021-02-28 09:17 UTC (History)
11 users (show)

Fixed In Version: bind-9.10.3-13.P4.fc24 bind-9.10.3-13.P4.fc23
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-16 16:19:08 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
p4p1 (384 bytes, text/plain)
2015-11-03 18:33 UTC, Frank Ansari
no flags Details
enp4s0 (336 bytes, text/plain)
2015-11-03 18:34 UTC, Frank Ansari
no flags Details
journalctl -b -u NetworkManager (13.29 KB, text/x-vhdl)
2015-11-22 15:42 UTC, Frank Ansari
no flags Details
journalctl -b -u NetworkManager-wait-online (609 bytes, text/x-vhdl)
2015-11-22 15:43 UTC, Frank Ansari
no flags Details
journalctl -b -u NetworkManager-dispatcher (18.07 KB, text/x-vhdl)
2015-12-17 20:00 UTC, Frank Ansari
no flags Details

Description Frank Ansari 2015-11-02 20:11:54 UTC
Description of problem:
When I have 2 network interfaces in my PC (with ONBOOT=yes) NetworkManager-wait-online fails at boot.

Version-Release number of selected component (if applicable):
Name        : NetworkManager
Arch        : x86_64
Epoch       : 1
Version     : 1.0.6
Release     : 7.fc23

How reproducible:
Put 2 NICs into the PC. Then boot it.

Steps to Reproduce:
1. Put 2 NIC into the PC. Boot. => NetworkManager-wait-online fails
2. Disable one one them (ONBOOT=no) => NetworkManager-wait-online is ok
3.

Actual results:
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2015-11-02 21:05:06 CET; 1min 4s ago
  Process: 1676 ExecStart=/usr/bin/nm-online -s -q --timeout=30 (code=exited, status=1/FAILURE)
 Main PID: 1676 (code=exited, status=1/FAILURE)

Nov 02 21:04:36 bat.localdomain systemd[1]: Starting Network Manager Wait Online...
Nov 02 21:05:06 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Nov 02 21:05:06 bat.localdomain systemd[1]: Failed to start Network Manager Wait Online.
Nov 02 21:05:06 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Unit entered failed state.
Nov 02 21:05:06 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.


Expected results:
Service should start properly.

Additional info:

Comment 1 Frank Ansari 2015-11-02 20:13:50 UTC
To be more clear about this:
2. Disable one one them (ONBOOT=no). Boot. => NetworkManager-wait-online is ok

Comment 2 Dan Williams 2015-11-02 22:18:42 UTC
Could you paste the ifcfg files for each interface into a reply?

Also, after boot when it fails, can you paste the result of "nmcli dev"?

Comment 3 Frank Ansari 2015-11-03 18:33:25 UTC
After the boot is completed it looks like this:

[fansari@bat ~]$ nmcli dev
DEVICE      TYPE      STATE      CONNECTION 
virbr0      bridge    connected  virbr0     
enp4s0      ethernet  connected  enp4s0     
p4p1        ethernet  connected  p4p1       
virbr0-nic  tap       connected  virbr0-nic 
lo          loopback  unmanaged  --

Comment 4 Frank Ansari 2015-11-03 18:33:57 UTC
Created attachment 1089164 [details]
p4p1

Comment 5 Frank Ansari 2015-11-03 18:34:25 UTC
Created attachment 1089165 [details]
enp4s0

Comment 6 Frank Ansari 2015-11-21 14:12:05 UTC
I there any progress here? I still have this issue.

Comment 7 Jirka Klimes 2015-11-21 17:13:37 UTC
So, comment #3 shows that both interfaces were connected successfully.

Would you post NetworkManager logs when you have the problem (both files ONBOOT=yes).
Which of the files do you disable ONBOOT in?

Comment 8 Frank Ansari 2015-11-22 15:41:49 UTC
Yes - the problem is not that the connections are not coming up. The problem is that the NetworkManager-wait is always hanging and this costs time at every boot.

Regarding the disabling: if I disable either of these interfaces the NetworkManager-wait service works normally and does not hang.

After boot it looks like this:


fansari@bat ~]$ systemctl status NetworkManager-wait-online
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sun 2015-11-22 16:31:21 CET; 8min ago
  Process: 1598 ExecStart=/usr/bin/nm-online -s -q --timeout=30 (code=exited, status=1/FAILURE)
 Main PID: 1598 (code=exited, status=1/FAILURE)

Nov 22 16:30:51 bat.localdomain systemd[1]: Starting Network Manager Wait Online...
Nov 22 16:31:21 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Nov 22 16:31:21 bat.localdomain systemd[1]: Failed to start Network Manager Wait Online.
Nov 22 16:31:21 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Unit entered failed state.
Nov 22 16:31:21 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.

Comment 9 Frank Ansari 2015-11-22 15:42:45 UTC
Created attachment 1097537 [details]
journalctl -b -u NetworkManager

Comment 10 Frank Ansari 2015-11-22 15:43:25 UTC
Created attachment 1097538 [details]
journalctl -b -u NetworkManager-wait-online

Comment 11 Frank Ansari 2015-12-15 20:47:06 UTC
Is there a fix for this?

Simply disabling this service did not help because it seems that it is called from somewhere else. I guessed this might be NetworkManager.dispatcher so I disabled this also but this did not help either.

Currently the only solution (which is not a solution of course) I found is to "shortcut" the service by placing a copy of NetworkManager-wait-online.service to /etc/systemd/system and to set

ExecStart=/usr/bin/true

Comment 12 Dan Williams 2015-12-15 21:15:46 UTC
It appears that your interfaces are taking 31 seconds to come up, but NM-wait-online's timeout is 30s by default.  That's why it's failing.

The interfaces are taking 31s to come up because of:

Nov 22 16:30:56 bat.localdomain NetworkManager[1441]: <info>  (enp4s0): device state change: ip-config -> ip-check (reason 'none') [70 80 0]
Nov 22 16:31:22 bat.localdomain NetworkManager[1441]: <info>  (enp4s0): device state change: ip-check -> secondaries (reason 'none') [80 90 0]

which is quite a long delay during the ip-check phase.  Some things that happen during this stage are running dispatcher scripts, sending neighbor notifications (ARPs) and pinging the gateway.  

Could you modify /lib/systemd/system/NetworkManager-dispatcher.service so that the ExecStart= line looks like this:

ExecStart=/usr/libexec/nm-dispatcher --debug

then run 'systemctl daemon-reload', then restart and reproduce the problem.  Then run 'journalctl -b -u NetworkManager-dispatcher' and attach the output here and we'll see which script is taking too long.  You'll see output like:

Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Dispatching action 'dhcp4-change' for wlp4s0
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/00-netreport'
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Script '/etc/NetworkManager/dispatcher.d/00-netreport' complete
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/10-ifcfg-rh-routes.sh'
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Script '/etc/NetworkManager/dispatcher.d/10-ifcfg-rh-routes.sh' complete
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/11-dhclient'
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Script '/etc/NetworkManager/dispatcher.d/11-dhclient' complete
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/20-chrony'
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Script '/etc/NetworkManager/dispatcher.d/20-chrony' complete
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Dispatch 'dhcp4-change' on wlp4s0 complete

And I'll bet one of your scripts is taking too long.  If that's true, we need to file a bug against the package that provides that script.

Comment 13 Frank Ansari 2015-12-17 20:00:12 UTC
Created attachment 1106830 [details]
journalctl -b -u NetworkManager-dispatcher

Comment 14 Frank Ansari 2015-12-17 20:15:40 UTC
From what I see from the output of this logs it is more clear to me where the problem is located:

It is in /etc/NetworkManager/dispatcher.d/13-named. When I disable the first line like this

if [ "$2" = 'up' -o "$2" = 'down' ]; then
#	/usr/bin/systemctl reload named.service > /dev/null 2>&1 || true
	/usr/bin/systemctl reload named-chroot.service > /dev/null 2>&1 || true
	/usr/bin/systemctl reload named-sdb.service > /dev/null 2>&1 || true
	/usr/bin/systemctl reload named-sdb-chroot.service > /dev/null 2>&1 || true
fi


then there is no startup delay anymore.

But this again is only a workaround not a final solution. If named is meant to be reloaded at this point then I conclude that it is not expected to be a problem.

In my scenario I use named. The other services mentioned here are not used by me.

When I check the output of the NetworkManager-dispatcher logs and compare them with the named logs I find something interesting:

Dec 17 20:20:21 bat.localdomain nm-dispatcher[1909]: ** Message: Script '/etc/NetworkManager/dispatcher.d/13-named' complete

Dec 17 20:20:20 bat.localdomain systemd[1]: Starting Berkeley Internet Name Domain (DNS)...

It seems that the 13-named dispatcher waits until named is started.

But even this is true: what has this to do with the number of NICs?

Comment 15 Frank Ansari 2015-12-17 21:05:47 UTC
I did some more tests. The problem is not the number of NICs.

I can only reproduce this behaviour when the second interface is using IPv6 autoconf and also have named. There is another machine (which is up all the time) where raded is running and pushes the autoconf to my machines.

When I change the config to static IPv6 there is no delay with the 13-named dispatcher.

But autoconf is a useful thing for IPv6 and I don't want to remove it.

So I can reproduce this behaviour in this scenario:

- one NIC with IPv4
- seond NIC with IPv6 and autoconf via radvd on second machine
- named locally running

Comment 16 Dan Williams 2015-12-18 23:05:14 UTC
Dec 17 20:19:55 bat.localdomain nm-dispatcher[1909]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/13-named'
Dec 17 20:20:21 bat.localdomain nm-dispatcher[1909]: ** Message: Script '/etc/NetworkManager/dispatcher.d/13-named' complete

The named script is taking 26 seconds to complete, and that's running every time an interface comes up.  This is a problem with the named dispatcher script, and that's where the problem needs to be fixed.  Alternatively, you could remove named if you don't need its functionality.  If you do, then the script should get fixed :(

Or you can also increase the NetworkManager-wait-online timeout if you like by modifying the systemd until file.

Comment 17 Tomáš Hozza 2016-05-12 10:39:06 UTC
*** Bug 1316687 has been marked as a duplicate of this bug. ***

Comment 18 Tomáš Hozza 2016-05-12 10:57:33 UTC
Thanks all.

I'm going to remove the dispatcher script completely, as it is a relic from bind 9.9.x times. BIND 9.10.x and newer now listens on NETLINK for network configuration changes and can react to them.

Comment 19 Fedora Update System 2016-05-12 12:24:32 UTC
bind-9.10.3-13.P4.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-79cad60d9c

Comment 20 Fedora Update System 2016-05-12 12:24:43 UTC
bind-9.10.3-13.P4.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-05d4c87988

Comment 21 Fedora Update System 2016-05-15 06:57:43 UTC
bind-9.10.3-13.P4.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-79cad60d9c

Comment 22 Fedora Update System 2016-05-16 15:24:34 UTC
bind-9.10.3-13.P4.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-05d4c87988

Comment 23 Fedora Update System 2016-05-16 16:19:03 UTC
bind-9.10.3-13.P4.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 24 Fedora Update System 2016-05-18 02:52:35 UTC
bind-9.10.3-13.P4.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 25 Jack 2018-06-10 13:35:02 UTC
I met this issue again in Fedora 27 and Fedora 28 with my Laptop.

Comment 26 Charlie Candergart 2020-10-23 11:58:24 UTC
I have this issue in Fedora 32.
Even if I manually start the service after boot, the service failes.

Comment 27 Charlie Candergart 2020-10-31 07:29:35 UTC
So, I troubleshooted this a bit more.

For me the situation was always that I had at least one wireless adapter installed on my system.

If I disabled the wireless adapter, the problem went away.

However, I wanted to be able to use my wireless adapters.
So, I found that if I install a package named wpa_supplicant and made sure that the wpa_supplicant service was started, then this NetworkManager-wait-online service would start and work just fine.


I know that this may not be the solution for everyone but it worked out for me.
Hoping this at least helps someone in some way.


/Charlie

Ps. if this works for you, do not forget to set the wpa_supplicant service to autostart.

Comment 28 Sampson Fung 2021-02-28 09:17:58 UTC
This problem is due to use of IPv6 Automatic for me.

It is the same for me using Fedora 32, 33, Silverblue 33.

Enable of wpa-supplicant do not help for me.

Once I turn off IPv6 from all of my virtual bridge interfaces, the wait-online service started without issue.


Note You need to log in before you can comment on or make changes to this bug.