Red Hat Bugzilla – Bug 1333983
Restarting NetworkManager causes devices to be lost from the network connections
Last modified: 2017-06-02 11:25:44 EDT
ENVIRONMENT:
------------
Red Hat Enterprise Linux Server release 7.2 (Maipo)
3.10.0-327.10.1.el7.x86_64

ISSUE:
------
When we restart NetworkManager the devices are lost from the network connections and we are unable to get online. The only way we have been able to get back online is by rebooting.

Customer is using VLAN tagging over a team running in active-backup mode.

NAME       UUID                                  TYPE            DEVICE
team1.153  1795ff88-3800-47ac-a46a-52b1a53c81f0  vlan            team1.153
team1      3bf46f44-412c-47cc-a811-3b1994352c91  team            team1
team1      26c8dd05-a8a1-44d9-8969-8298f8e6bc7d  team            --
em2_1      163b49ce-053a-4d75-a4b1-553da2c1eb53  802-3-ethernet  em2_1
em1_1      21859b1d-aa32-4118-a308-2ea904844d49  802-3-ethernet  em1_1

- Then we restarted NetworkManager with this command at which point we immediately lost our connection:

[root@eng-vocngdbdrs71 ~]# systemctl restart NetworkManager
Write failed: Broken pipe

System had to be rebooted to recover connections.

CUSTOMER CONFIGS:
-----------------
4 -rw-r--r--. 1 root root 115 May 3 13:14 etc/sysconfig/network-scripts/ifcfg-em1_1
4 -rw-r--r--. 1 root root 115 May 3 13:14 etc/sysconfig/network-scripts/ifcfg-em2_1
4 -rw-r--r--. 1 root root 254 Sep 16 2015 etc/sysconfig/network-scripts/ifcfg-lo
4 -rw-r--r--. 1 root root 319 May 3 13:12 etc/sysconfig/network-scripts/ifcfg-team1
4 -rw-r--r--. 1 root root 371 May 3 13:34 etc/sysconfig/network-scripts/ifcfg-team1.153

etc/sysconfig/network-scripts/ifcfg-em1_1
------------------------------------------
NAME=em1_1
UUID=21859b1d-aa32-4118-a308-2ea904844d49
DEVICE=em1_1
ONBOOT=yes
TEAM_MASTER=team1
DEVICETYPE=TeamPort

etc/sysconfig/network-scripts/ifcfg-em2_1
------------------------------------------
NAME=em2_1
UUID=163b49ce-053a-4d75-a4b1-553da2c1eb53
DEVICE=em2_1
ONBOOT=yes
TEAM_MASTER=team1
DEVICETYPE=TeamPort

etc/sysconfig/network-scripts/ifcfg-team1
------------------------------------------
DEVICE=team1
TEAM_CONFIG="{\"runner\": {\"name\": \"activebackup\"}}"
DEVICETYPE=Team
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=no
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
NAME=team1
UUID=3bf46f44-412c-47cc-a811-3b1994352c91
ONBOOT=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes

LOGGING DURING NM RESTART:
--------------------------
May 4 17:25:55 eng-vocngdbdrs71 systemd: Stopping Network Manager...
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> caught SIGTERM, shutting down normally.
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> (team1): team port em2_1 was released
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> (em2_1): released from master team1
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> (team1): device state change: activated -> deactivating (reason 'unmanaged') [100 110 3]
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> (team1): device state change: deactivating -> unmanaged (reason 'unmanaged') [110 10 3]
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> (team1): deactivation: stopping teamd...
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager: Got SIGINT, SIGQUIT or SIGTERM.
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager: Exiting...
May 4 17:25:55 eng-vocngdbdrs71 kernel: be2net 0000:01:00.0: MAC address changed to b4:e1:0f:9f:0d:d5
May 4 17:25:55 eng-vocngdbdrs71 kernel: team1: Port device em1_1 removed
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> (team1): released team port em1_1
May 4 17:25:55 eng-vocngdbdrs71 kernel: be2net 0000:01:00.0 em1_1: Link is Up
May 4 17:25:55 eng-vocngdbdrs71 kernel: 8021q: adding VLAN 0 to HW filter on device em1_1
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> (team1.153): device state change: activated -> unmanaged (reason 'unmanaged') [100 10 3]
May 4 17:25:55 eng-vocngdbdrs71 kernel: IPv6: ADDRCONF(NETDEV_UP): team1: link is not ready
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> NetworkManager state is now CONNECTED_LOCAL
May 4 17:25:55 eng-vocngdbdrs71 dbus-daemon: dbus[1526]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
May 4 17:25:55 eng-vocngdbdrs71 dbus[1526]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[1639]: <info> exiting (success)
May 4 17:25:55 eng-vocngdbdrs71 systemd: Starting Network Manager Script Dispatcher Service...
May 4 17:25:55 eng-vocngdbdrs71 dbus-daemon: dbus[1526]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
May 4 17:25:55 eng-vocngdbdrs71 dbus[1526]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
May 4 17:25:55 eng-vocngdbdrs71 systemd: Started Network Manager Script Dispatcher Service.
May 4 17:25:55 eng-vocngdbdrs71 systemd: Starting Network Manager...
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> NetworkManager (version 1.0.6-27.el7) is starting...
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Read config: /etc/NetworkManager/NetworkManager.conf and conf.d: 00-server.conf, 10-ibft-plugin.conf
May 4 17:25:55 eng-vocngdbdrs71 systemd: Started Network Manager.
May 4 17:25:55 eng-vocngdbdrs71 systemd: Starting Network Manager Wait Online...
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded settings plugin ifcfg-rh: (c) 2007 - 2015 Red Hat, Inc. To report bugs please use the NetworkManager mailing list. (/usr/lib64/NetworkManager/libnm-settings-plugin-ifcfg-rh.so)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded settings plugin iBFT: (c) 2014 Red Hat, Inc. To report bugs please use the NetworkManager mailing list. (/usr/lib64/NetworkManager/libnm-settings-plugin-ibft.so)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded settings plugin keyfile: (c) 2007 - 2015 Red Hat, Inc. To report bugs please use the NetworkManager mailing list.
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-team1-1 (26c8dd05-a8a1-44d9-8969-8298f8e6bc7d,"team1")
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-team1.153 (1795ff88-3800-47ac-a46a-52b1a53c81f0,"team1.153")
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em2_1 (163b49ce-053a-4d75-a4b1-553da2c1eb53,"em2_1")
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em1_1 (21859b1d-aa32-4118-a308-2ea904844d49,"em1_1")
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-team1 (3bf46f44-412c-47cc-a811-3b1994352c91,"team1")
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> monitoring kernel firmware directory '/lib/firmware'.
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMVxlanFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMVlanFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMVethFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMTunFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMMacvlanFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMInfinibandFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMGreFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMEthernetFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMBridgeFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMBondFactory (internal)
May 4 17:25:55 eng-vocngdbdrs71 NetworkManager[4451]: <info> Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/libnm-device-plugin-team.so)
Customers are badly impacted. Changed severity to Urgent.
Hm, there's a vlan on a team, but the team doesn't get assumed on restart because it has no L3 configuration. What makes this have an urgent severity? Could we find a workaround until we fix this? If you restart NetworkManager to reload the configuration you could instead do a "nmcli c reload".
Can we please get the intended configuration of the VLAN and the team? Ideally, include the commands that were used to configure them initially, a description of how they were intended to be configured, and perhaps a diagram of the intended configuration. That way we can see whether there is a gap between what the commands actually did and what the configuration was meant to achieve.
Rashid, I've forwarded your inquiries to the customer. I'll update you as soon as I receive their feedback.
Cut and paste from the email
================================
On Tue, May 10, 2016 at 5:19 PM, James Mills <james.mills@redhat.com> wrote:

The test of the nmcli con reload command worked as expected. No network services or devices were lost.
This is definitely a bug in NetworkManager. However, the fix is not straightforward. I'm working on it.

To find a suitable workaround I need to know why the customer is restarting NetworkManager. Comment #8 indicates that "nmcli c reload" works. If the customer restarts NetworkManager in order to reload the configuration, they should do a "nmcli c reload" instead.
Hello,

I have been poking at this a bit and I think I have found at least part of the problem.

Normally, when the NetworkManager service is stopped, it is supposed to leave the active network configuration in place. This can be confirmed by stopping the service on a system with a more basic configuration. However, in the customer's case here, they are using a team. For some reason, when the NetworkManager service is stopped, it deactivates team1 and team1.153. The nm_device_cleanup() function removes IP addressing from interfaces:

May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.321530] [devices/nm-device.c:7600] nm_device_set_unmanaged(): [0x7f6f200c2420] (team1): now unmanaged
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <info> (team1): device state change: deactivating -> unmanaged (reason 'unmanaged') [110 10 3]
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.321549] [devices/nm-device.c:8155] nm_device_cleanup(): [0x7f6f200c2420] (team1): deactivating device (reason 'unmanaged') [3]
--
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.340104] [devices/nm-device.c:7600] nm_device_set_unmanaged(): [0x7f6f200c27d0] (team1.153): now unmanaged
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <info> (team1.153): device state change: activated -> unmanaged (reason 'unmanaged') [100 10 3]
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.341589] [devices/nm-device.c:8155] nm_device_cleanup(): [0x7f6f200c27d0] (team1.153): deactivating device (reason 'unmanaged') [3]

In the case of the team device, it stops the teamd process as well, but leaves the actual teamd interface in place:

May 11 10:18:38 rhel72.example.com NetworkManager[637]: <info> (team1): deactivation: stopping teamd...
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.323319] [NetworkManagerUtils.c:667] nm_utils_kill_child_async(): kill child process 'teamd' (700): wait for process to terminate after sending SIGTERM (15) (send SIGKILL in 2000 milliseconds)...

This leaves the team in an unmanageable state. When NetworkManager starts back up, it can't talk to the team. It would have to launch a new teamd process to take over the leftover team interface, which I do not believe it is smart enough to do.

Perhaps some of the logic in remove_device() is incorrect? It is supposed to skip tearing down active interfaces on shutdown:

    static void
    remove_device (NMManager *manager,
                   NMDevice *device,
                   gboolean quitting,
                   gboolean allow_unmanage)
    {
        NMManagerPrivate *priv = NM_MANAGER_GET_PRIVATE (manager);

        nm_log_dbg (LOGD_DEVICE, "(%s): removing device (allow_unmanage %d, managed %d)",
                    nm_device_get_iface (device), allow_unmanage, nm_device_get_managed (device));

        if (allow_unmanage && nm_device_get_managed (device)) {
            NMActRequest *req = nm_device_get_act_request (device);
            gboolean unmanage = FALSE;

            /* Leave activated interfaces up when quitting so their configuration
             * can be taken over when NM restarts. This ensures connectivity while
             * NM is stopped. Devices which do not support connection assumption
             * cannot be left up.
             */
            if (!quitting) /* Forced removal; device already gone */
                unmanage = TRUE;
            else if (!nm_device_can_assume_active_connection (device))
                unmanage = TRUE;
            else if (!req)
                unmanage = TRUE;

            if (unmanage) {
                if (quitting)
                    nm_device_set_unmanaged_quitting (device);
                else
                    nm_device_set_unmanaged (device, NM_UNMANAGED_INTERNAL, TRUE, NM_DEVICE_STATE_REASON_REMOVED);
            } else if (quitting && nm_config_get_configure_and_quit (nm_config_get ())) {
                nm_device_spawn_iface_helper (device);
            }
        }

        g_signal_handlers_disconnect_matched (device, G_SIGNAL_MATCH_DATA, 0, 0, NULL, NULL, manager);

        nm_settings_device_removed (priv->settings, device, quitting);
        priv->devices = g_slist_remove (priv->devices, device);

        g_signal_emit (manager, signals[DEVICE_REMOVED], 0, device);
        g_object_notify (G_OBJECT (manager), NM_MANAGER_DEVICES);
        nm_device_removed (device);

        nm_dbus_manager_unregister_object (priv->dbus_mgr, device);
        g_object_unref (device);

        check_if_startup_complete (manager);
    }
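To make the branch structure of the excerpt above easier to follow, here is a minimal standalone sketch of just the unmanage decision. It is a model, not NetworkManager code: struct device_model, should_unmanage() and check_examples() are hypothetical names, and the three booleans merely stand in for nm_device_get_managed(), nm_device_can_assume_active_connection() and a non-NULL nm_device_get_act_request(). Which branch actually fired for team1 in the customer's case is not established by this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state remove_device() inspects. */
struct device_model {
    bool managed;          /* models nm_device_get_managed() */
    bool can_assume;       /* models nm_device_can_assume_active_connection() */
    bool has_act_request;  /* models nm_device_get_act_request() != NULL */
};

/* Mirrors the if / else-if chain in the quoted remove_device():
 * returns true when the device would be unmanaged (torn down). */
static bool
should_unmanage (const struct device_model *dev, bool quitting, bool allow_unmanage)
{
    if (!allow_unmanage || !dev->managed)
        return false;   /* outer condition not met; device is not touched */
    if (!quitting)
        return true;    /* forced removal; device already gone */
    if (!dev->can_assume)
        return true;    /* cannot be assumed after restart */
    if (!dev->has_act_request)
        return true;    /* nothing active on the device */
    return false;       /* active and assumable: left up while quitting */
}

/* Walks a few representative cases of the model. */
static void
check_examples (void)
{
    struct device_model assumable = { true, true, true };
    struct device_model not_assumable = { true, false, true };

    assert (!should_unmanage (&assumable, true, true));
    assert (should_unmanage (&not_assumable, true, true));
    assert (should_unmanage (&assumable, false, true));
}
```

In this model, an active device survives a NetworkManager shutdown only when it reports that its connection can be assumed, which matches the intent stated in the source comment.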
Hello Lubomir, Thank you for this. Do you really mean the bug #1311988? Bug 1311988 - NetworkManager shuts down if the dbus.service is stopped Thank you, Patrick
(In reply to Patrick Talbert from comment #18)
> Hello Lubomir,
>
> Thank you for this.
>
> Do you really mean the bug #1311988?
>
> Bug 1311988 - NetworkManager shuts down if the dbus.service is stopped
>
>
> Thank you,
>
> Patrick

Yes
*** Bug 1325811 has been marked as a duplicate of this bug. ***
*** Bug 1331009 has been marked as a duplicate of this bug. ***
Bronce, we need this fixed in 7.2.z as well, and ASAP. Can you escalate?
Created attachment 1204250 [details] [PATCH] device: consider a device with slaves configured
(In reply to Lubomir Rintel from comment #41)
> Created attachment 1204250 [details]
> [PATCH] device: consider a device with slaves configured

Seems OK to me, though obviously we should make sure the test suite passes. Though for this part:

+ if (!(priv->slaves || nm_platform_link_can_assume (NM_PLATFORM_GET, nm_device_get_ifindex (self)))) {
+     /* The device has no layer 3 configuration and no slaves. Leave it up. */
      return FALSE;

I'd rather see that as "!priv->slaves && !nm_platform_link_can_assume()", I think that's clearer... but doesn't this mean that master devices that could be managed (e.g. where nm_platform_link_can_assume() returns TRUE) would return TRUE from unmanaged_on_quit() and then get unmanaged, where before they wouldn't get touched?

I'm probably looking at it wrong, it's late on a Friday...
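On the readability point only: the condition in the patch and the form suggested in review are the same predicate by De Morgan's law, so swapping one for the other would not change behavior. The sketch below checks that over all four input combinations. The function names are hypothetical, and the two booleans merely stand in for a non-NULL priv->slaves and the result of nm_platform_link_can_assume(); it says nothing about the separate question of what unmanaged_on_quit() should return.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition as written in the patch: !(slaves || can_assume). */
static bool
cond_patch_form (bool has_slaves, bool can_assume)
{
    return !(has_slaves || can_assume);
}

/* Condition as suggested in review: !slaves && !can_assume. */
static bool
cond_review_form (bool has_slaves, bool can_assume)
{
    return !has_slaves && !can_assume;
}

/* Counts the input combinations on which the two forms disagree. */
static int
count_mismatches (void)
{
    int mismatches = 0;

    for (int s = 0; s <= 1; s++) {
        for (int a = 0; a <= 1; a++) {
            if (cond_patch_form (s != 0, a != 0) != cond_review_form (s != 0, a != 0))
                mismatches++;
        }
    }
    return mismatches;
}

/* Asserts that the two forms agree on every input. */
static void
check_equivalence (void)
{
    assert (count_mismatches () == 0);
}
```

Both forms are true only when the device has neither slaves nor assumable link configuration, which is the case the patch comment describes.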
Created attachment 1204862 [details]
[PATCH] device: consider a device with slaves configured

(In reply to Dan Williams from comment #43)
> (In reply to Lubomir Rintel from comment #41)
> > Created attachment 1204250 [details]
> > [PATCH] device: consider a device with slaves configured
>
> Seems OK to me, though obviously we should make sure the test suite passes.
> Though for this part:
>
> + if (!(priv->slaves || nm_platform_link_can_assume (NM_PLATFORM_GET,
> nm_device_get_ifindex (self)))) {
> + /* The device has no layer 3 configuration and no slaves. Leave it up. */
> return FALSE;
>
> I'd rather see that as "!priv->slaves && !nm_platform_link_can_assume()", I
> think that's clearer... but doens't this mean that master devices that
> could be managed (eg where nm_platform_link_can_assume() returns TRUE) would
> return TRUE from unmanaged_on_quit() and then get unmanaged, where before
> they wouldn't get touched?
>
> I'm probably looking at it wrong, it's late on a Friday...

You're reading that right -- I got that wrong, thinking that unmanaging prior to removing leaves the device up.

I'm not sure when NMDevice's unmanaged_on_quit() returns FALSE, though -- !nm_platform_link_can_assume() along with nm_device_can_assume_active_connection() return TRUE for all cases I can imagine. It's perhaps safer to leave it alone, though.
LGTM
QE: This is the setup needed to test (assuming two ethernets, ens11 and ens12)

# nmcli c add con-name james type bond ifname bond0 ipv4.method \
    disabled ipv6.method ignore autoconnect no
# nmcli c add type ethernet con-name slave-ens11 ifname ens11 \
    master bond0 slave-type bond autoconnect no
# nmcli c add type ethernet con-name slave-ens12 ifname ens12 \
    master bond0 slave-type bond autoconnect no
# nmcli c add type vlan dev bond0 id 153 autoconnect no \
    ip4 10.66.66.1/24

The important part is the bond0 bond ("james" connection) having no L3 configuration.

To test the scenario, please activate all the connections (they're autoconnect=no, so that autoconnect won't interfere on restart):

# nmcli c up james
# nmcli c up slave-ens11
# nmcli c up slave-ens12
# nmcli c up vlan

Then stop NetworkManager and check that all of the devices are still UP and have configuration (using "ip addr" or "ip link"). Then start NetworkManager again and check that all connections have been assumed correctly with "nmcli d".
-12 version works as expected
Done.
this works as expected
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2581.html