Bug 1277257 - NetworkManager-wait-online fails on boot when 2 NIC are active
NetworkManager-wait-online fails on boot when 2 NIC are active
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: bind (Show other bugs)
23
x86_64 Linux
unspecified Severity medium
: ---
: ---
Assigned To: Tomáš Hozza
Fedora Extras Quality Assurance
:
: 1316687 (view as bug list)
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-11-02 15:11 EST by Frank Ansari
Modified: 2016-05-17 22:52 EDT (History)
8 users (show)

See Also:
Fixed In Version: bind-9.10.3-13.P4.fc24 bind-9.10.3-13.P4.fc23
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-16 12:19:08 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
p4p1 (384 bytes, text/plain)
2015-11-03 13:33 EST, Frank Ansari
no flags Details
enp4s0 (336 bytes, text/plain)
2015-11-03 13:34 EST, Frank Ansari
no flags Details
journalctl -b -u NetworkManager (13.29 KB, text/x-vhdl)
2015-11-22 10:42 EST, Frank Ansari
no flags Details
journalctl -b -u NetworkManager-wait-online (609 bytes, text/x-vhdl)
2015-11-22 10:43 EST, Frank Ansari
no flags Details
journalctl -b -u NetworkManager-dispatcher (18.07 KB, text/x-vhdl)
2015-12-17 15:00 EST, Frank Ansari
no flags Details

  None (edit)
Description Frank Ansari 2015-11-02 15:11:54 EST
Description of problem:
When I have 2 network interfaces in my PC (with ONBOOT=yes) NetworkManager-wait-online fails at boot.

Version-Release number of selected component (if applicable):
Name        : NetworkManager
Arch        : x86_64
Epoch       : 1
Version     : 1.0.6
Release     : 7.fc23

How reproducible:
Put 2 NICs into the PC. Then boot it.

Steps to Reproduce:
1. Put 2 NIC into the PC. Boot. => NetworkManager-wait-online fails
2. Disable one one them (ONBOOT=no) => NetworkManager-wait-online is ok
3.

Actual results:
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2015-11-02 21:05:06 CET; 1min 4s ago
  Process: 1676 ExecStart=/usr/bin/nm-online -s -q --timeout=30 (code=exited, status=1/FAILURE)
 Main PID: 1676 (code=exited, status=1/FAILURE)

Nov 02 21:04:36 bat.localdomain systemd[1]: Starting Network Manager Wait Online...
Nov 02 21:05:06 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Nov 02 21:05:06 bat.localdomain systemd[1]: Failed to start Network Manager Wait Online.
Nov 02 21:05:06 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Unit entered failed state.
Nov 02 21:05:06 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.


Expected results:
Service should start properly.

Additional info:
Comment 1 Frank Ansari 2015-11-02 15:13:50 EST
To be more clear about this:
2. Disable one one them (ONBOOT=no). Boot. => NetworkManager-wait-online is ok
Comment 2 Dan Williams 2015-11-02 17:18:42 EST
Could you paste the ifcfg files for each interface into a reply?

Also, after boot when it fails, can you paste the result of "nmcli dev"?
Comment 3 Frank Ansari 2015-11-03 13:33:25 EST
After the boot is completed it looks like this:

[fansari@bat ~]$ nmcli dev
DEVICE      TYPE      STATE      CONNECTION 
virbr0      bridge    connected  virbr0     
enp4s0      ethernet  connected  enp4s0     
p4p1        ethernet  connected  p4p1       
virbr0-nic  tap       connected  virbr0-nic 
lo          loopback  unmanaged  --
Comment 4 Frank Ansari 2015-11-03 13:33 EST
Created attachment 1089164 [details]
p4p1
Comment 5 Frank Ansari 2015-11-03 13:34 EST
Created attachment 1089165 [details]
enp4s0
Comment 6 Frank Ansari 2015-11-21 09:12:05 EST
I there any progress here? I still have this issue.
Comment 7 Jirka Klimes 2015-11-21 12:13:37 EST
So, comment #3 shows that both interfaces were connected successfully.

Would you post NetworkManager logs when you have the problem (both files ONBOOT=yes).
Which of the files do you disable ONBOOT in?
Comment 8 Frank Ansari 2015-11-22 10:41:49 EST
Yes - the problem is not that the connections are not coming up. The problem is that the NetworkManager-wait is always hanging and this costs time at every boot.

Regarding the disabling: if I disable either of these interfaces the NetworkManager-wait service works normally and does not hang.

After boot it looks like this:


fansari@bat ~]$ systemctl status NetworkManager-wait-online
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sun 2015-11-22 16:31:21 CET; 8min ago
  Process: 1598 ExecStart=/usr/bin/nm-online -s -q --timeout=30 (code=exited, status=1/FAILURE)
 Main PID: 1598 (code=exited, status=1/FAILURE)

Nov 22 16:30:51 bat.localdomain systemd[1]: Starting Network Manager Wait Online...
Nov 22 16:31:21 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Nov 22 16:31:21 bat.localdomain systemd[1]: Failed to start Network Manager Wait Online.
Nov 22 16:31:21 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Unit entered failed state.
Nov 22 16:31:21 bat.localdomain systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Comment 9 Frank Ansari 2015-11-22 10:42 EST
Created attachment 1097537 [details]
journalctl -b -u NetworkManager
Comment 10 Frank Ansari 2015-11-22 10:43 EST
Created attachment 1097538 [details]
journalctl -b -u NetworkManager-wait-online
Comment 11 Frank Ansari 2015-12-15 15:47:06 EST
Is there a fix for this?

Simply disabling this service did not help because it seems that it is called from somewhere else. I guessed this might be NetworkManager.dispatcher so I disabled this also but this did not help either.

Currently the only solution (which is not a solution of course) I found is to "shortcut" the service by placing a copy of NetworkManager-wait-online.service to /etc/systemd/system and to set

ExecStart=/usr/bin/true
Comment 12 Dan Williams 2015-12-15 16:15:46 EST
It appears that your interfaces are taking 31 seconds to come up, but NM-wait-online's timeout is 30s by default.  That's why it's failing.

The interfaces are taking 31s to come up because of:

Nov 22 16:30:56 bat.localdomain NetworkManager[1441]: <info>  (enp4s0): device state change: ip-config -> ip-check (reason 'none') [70 80 0]
Nov 22 16:31:22 bat.localdomain NetworkManager[1441]: <info>  (enp4s0): device state change: ip-check -> secondaries (reason 'none') [80 90 0]

which is quite a long delay during the ip-check phase.  Some things that happen during this stage are running dispatcher scripts, sending neighbor notifications (ARPs) and pinging the gateway.  

Could you modify /lib/systemd/system/NetworkManager-dispatcher.service so that the ExecStart= line looks like this:

ExecStart=/usr/libexec/nm-dispatcher --debug

then run 'systemctl daemon-reload', then restart and reproduce the problem.  Then run 'journalctl -b -u NetworkManager-dispatcher' and attach the output here and we'll see which script is taking too long.  You'll see output like:

Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Dispatching action 'dhcp4-change' for wlp4s0
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/00-netreport'
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Script '/etc/NetworkManager/dispatcher.d/00-netreport' complete
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/10-ifcfg-rh-routes.sh'
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Script '/etc/NetworkManager/dispatcher.d/10-ifcfg-rh-routes.sh' complete
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/11-dhclient'
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Script '/etc/NetworkManager/dispatcher.d/11-dhclient' complete
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/20-chrony'
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Script '/etc/NetworkManager/dispatcher.d/20-chrony' complete
Dec 15 15:12:21 localhost.localdomain nm-dispatcher[18636]: ** Message: Dispatch 'dhcp4-change' on wlp4s0 complete

And I'll bet one of your scripts is taking too long.  If that's true, we need to file a bug against the package that provides that script.
Comment 13 Frank Ansari 2015-12-17 15:00 EST
Created attachment 1106830 [details]
journalctl -b -u NetworkManager-dispatcher
Comment 14 Frank Ansari 2015-12-17 15:15:40 EST
From what I see from the output of this logs it is more clear to me where the problem is located:

It is in /etc/NetworkManager/dispatcher.d/13-named. When I disable the first line like this

if [ "$2" = 'up' -o "$2" = 'down' ]; then
#	/usr/bin/systemctl reload named.service > /dev/null 2>&1 || true
	/usr/bin/systemctl reload named-chroot.service > /dev/null 2>&1 || true
	/usr/bin/systemctl reload named-sdb.service > /dev/null 2>&1 || true
	/usr/bin/systemctl reload named-sdb-chroot.service > /dev/null 2>&1 || true
fi


then there is no startup delay anymore.

But this again is only a workaround not a final solution. If named is meant to be reloaded at this point then I conclude that it is not expected to be a problem.

In my scenario I use named. The other services mentioned here are not used by me.

When I check the output of the NetworkManager-dispatcher logs and compare them with the named logs I find something interesting:

Dec 17 20:20:21 bat.localdomain nm-dispatcher[1909]: ** Message: Script '/etc/NetworkManager/dispatcher.d/13-named' complete

Dec 17 20:20:20 bat.localdomain systemd[1]: Starting Berkeley Internet Name Domain (DNS)...

It seems that the 13-named dispatcher waits until named is started.

But even this is true: what has this to do with the number of NICs?
Comment 15 Frank Ansari 2015-12-17 16:05:47 EST
I did some more tests. The problem is not the number of NICs.

I can only reproduce this behaviour when the second interface is using IPv6 autoconf and also have named. There is another machine (which is up all the time) where raded is running and pushes the autoconf to my machines.

When I change the config to static IPv6 there is no delay with the 13-named dispatcher.

But autoconf is a useful thing for IPv6 and I don't want to remove it.

So I can reproduce this behaviour in this scenario:

- one NIC with IPv4
- seond NIC with IPv6 and autoconf via radvd on second machine
- named locally running
Comment 16 Dan Williams 2015-12-18 18:05:14 EST
Dec 17 20:19:55 bat.localdomain nm-dispatcher[1909]: ** Message: Running script '/etc/NetworkManager/dispatcher.d/13-named'
Dec 17 20:20:21 bat.localdomain nm-dispatcher[1909]: ** Message: Script '/etc/NetworkManager/dispatcher.d/13-named' complete

The named script is taking 26 seconds to complete, and that's running every time an interface comes up.  This is a problem with the named dispatcher script, and that's where the problem needs to be fixed.  Alternatively, you could remove named if you don't need its functionality.  If you do, then the script should get fixed :(

Or you can also increase the NetworkManager-wait-online timeout if you like by modifying the systemd until file.
Comment 17 Tomáš Hozza 2016-05-12 06:39:06 EDT
*** Bug 1316687 has been marked as a duplicate of this bug. ***
Comment 18 Tomáš Hozza 2016-05-12 06:57:33 EDT
Thanks all.

I'm going to remove the dispatcher script completely, as it is a relic from bind 9.9.x times. BIND 9.10.x and newer now listens on NETLINK for network configuration changes and can react to them.
Comment 19 Fedora Update System 2016-05-12 08:24:32 EDT
bind-9.10.3-13.P4.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-79cad60d9c
Comment 20 Fedora Update System 2016-05-12 08:24:43 EDT
bind-9.10.3-13.P4.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-05d4c87988
Comment 21 Fedora Update System 2016-05-15 02:57:43 EDT
bind-9.10.3-13.P4.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-79cad60d9c
Comment 22 Fedora Update System 2016-05-16 11:24:34 EDT
bind-9.10.3-13.P4.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-05d4c87988
Comment 23 Fedora Update System 2016-05-16 12:19:03 EDT
bind-9.10.3-13.P4.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
Comment 24 Fedora Update System 2016-05-17 22:52:35 EDT
bind-9.10.3-13.P4.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.