Bug 1333983
Summary: Restarting NetworkManager causes devices to be lost from the network connections

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | NetworkManager |
| Version | 7.2 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Reporter | Saul Serna <sserna> |
| Assignee | Lubomir Rintel <lrintel> |
| QA Contact | Desktop QE <desktop-qa-list> |
| CC | aloughla, atragler, bgalvani, bmcclain, danken, dpathak, hasuzuki, kwalker, lkundrak, lrintel, mburman, mleitner, msugaya, ptalbert, rhepner, rkhan, snagar, sserna, sukulkar, thaller, tlavigne, vanhoof, vbenes, ylavi |
| Target Milestone | rc |
| Keywords | ZStream |
| Type | Bug |
| Doc Type | Bug Fix |
| Cloned to | 1379711 (view as bug list) |
| Bug Blocks | 1203710, 1301628, 1304509, 1379711 |
| Last Closed | 2016-11-03 19:09:26 UTC |
Description (Saul Serna, 2016-05-06 22:19:48 UTC)
Customers are impacted in a bad way. Changed to Urgent.

Hm, there's a VLAN on a team, but the team doesn't get assumed on restart because it has no L3 configuration.

What makes this have an urgent severity? Could we find a workaround until we fix this? If you restart NetworkManager to reload the configuration, you could instead do a "nmcli c reload".

Can we please get the intended configuration of the VLAN and team? Ideally the commands that were used to configure them initially, how they were intended to be configured, and perhaps a diagram of the intended setup. That way we can see if there is a gap between what was executed via the commands and the intention of the configuration.

Rashid, I've forwarded your inquiries to the customer. I'll update you as soon as I receive their feedback. Cut and paste from the email:

> On Tue, May 10, 2016 at 5:19 PM, James Mills <james.mills> wrote:
> The test of the nmcli con reload command worked as expected. No network services or devices were lost.

This is definitely a bug in NetworkManager. However, the fix is not straightforward. I'm working on it. To find a suitable workaround I need to know why the customer is restarting NetworkManager. Comment #8 indicates that "nmcli c reload" works; if the customer restarts NetworkManager in order to reload the configuration, they should do a "nmcli c reload" instead.
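For illustration, a minimal sketch of that workaround, assuming the profiles live as ifcfg files under /etc/sysconfig/network-scripts (the ifcfg-team1 filename is hypothetical):

```sh
# Re-read all connection profiles from disk without restarting the
# daemon; active devices (the team and its VLAN) are left untouched.
nmcli connection reload

# A single changed profile can also be reloaded selectively:
nmcli connection load /etc/sysconfig/network-scripts/ifcfg-team1
```

Unlike "systemctl restart NetworkManager", neither command tears down active devices, which is what triggers this bug.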
Hello, I have been poking at this a bit and I think I have found at least part of the problem. Normally, when the NetworkManager service is stopped, it is supposed to leave the active network configuration in place. This can be confirmed by stopping the service on a system with a more basic configuration. However, in the customer's case here, they are using a team. For some reason, when the NetworkManager service is stopped, it deactivates team1 and team1.153. The nm_device_cleanup() function removes IP addressing from the interfaces:

```
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.321530] [devices/nm-device.c:7600] nm_device_set_unmanaged(): [0x7f6f200c2420] (team1): now unmanaged
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <info>  (team1): device state change: deactivating -> unmanaged (reason 'unmanaged') [110 10 3]
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.321549] [devices/nm-device.c:8155] nm_device_cleanup(): [0x7f6f200c2420] (team1): deactivating device (reason 'unmanaged') [3]
--
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.340104] [devices/nm-device.c:7600] nm_device_set_unmanaged(): [0x7f6f200c27d0] (team1.153): now unmanaged
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <info>  (team1.153): device state change: activated -> unmanaged (reason 'unmanaged') [100 10 3]
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.341589] [devices/nm-device.c:8155] nm_device_cleanup(): [0x7f6f200c27d0] (team1.153): deactivating device (reason 'unmanaged') [3]
```

In the case of the team device, it stops the teamd process as well, but leaves the actual team interface in place:

```
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <info>  (team1): deactivation: stopping teamd...
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.323319] [NetworkManagerUtils.c:667] nm_utils_kill_child_async(): kill child process 'teamd' (700): wait for process to terminate after sending SIGTERM (15) (send SIGKILL in 2000 milliseconds)...
```

This leaves the team in an unmanageable state. When NetworkManager starts back up, it can't talk to the team. It would have to launch a new teamd process to take over the left-over team interface, which I do not believe it is smart enough to do. Perhaps some of the logic in remove_device() is incorrect? It is supposed to skip tearing down active interfaces on shutdown:

```c
static void
remove_device (NMManager *manager,
               NMDevice *device,
               gboolean quitting,
               gboolean allow_unmanage)
{
	NMManagerPrivate *priv = NM_MANAGER_GET_PRIVATE (manager);

	nm_log_dbg (LOGD_DEVICE, "(%s): removing device (allow_unmanage %d, managed %d)",
	            nm_device_get_iface (device), allow_unmanage, nm_device_get_managed (device));

	if (allow_unmanage && nm_device_get_managed (device)) {
		NMActRequest *req = nm_device_get_act_request (device);
		gboolean unmanage = FALSE;

		/* Leave activated interfaces up when quitting so their configuration
		 * can be taken over when NM restarts. This ensures connectivity while
		 * NM is stopped. Devices which do not support connection assumption
		 * cannot be left up.
		 */
		if (!quitting)  /* Forced removal; device already gone */
			unmanage = TRUE;
		else if (!nm_device_can_assume_active_connection (device))
			unmanage = TRUE;
		else if (!req)
			unmanage = TRUE;

		if (unmanage) {
			if (quitting)
				nm_device_set_unmanaged_quitting (device);
			else
				nm_device_set_unmanaged (device, NM_UNMANAGED_INTERNAL, TRUE, NM_DEVICE_STATE_REASON_REMOVED);
		} else if (quitting && nm_config_get_configure_and_quit (nm_config_get ())) {
			nm_device_spawn_iface_helper (device);
		}
	}

	g_signal_handlers_disconnect_matched (device, G_SIGNAL_MATCH_DATA, 0, 0, NULL, NULL, manager);

	nm_settings_device_removed (priv->settings, device, quitting);
	priv->devices = g_slist_remove (priv->devices, device);

	g_signal_emit (manager, signals[DEVICE_REMOVED], 0, device);
	g_object_notify (G_OBJECT (manager), NM_MANAGER_DEVICES);
	nm_device_removed (device);

	nm_dbus_manager_unregister_object (priv->dbus_mgr, device);
	g_object_unref (device);

	check_if_startup_complete (manager);
}
```

Hello Lubomir,

Thank you for this. Do you really mean bug #1311988?

Bug 1311988 - NetworkManager shuts down if the dbus.service is stopped

Thank you,

Patrick

(In reply to Patrick Talbert from comment #18)
> Do you really mean bug #1311988?
>
> Bug 1311988 - NetworkManager shuts down if the dbus.service is stopped

Yes

*** Bug 1325811 has been marked as a duplicate of this bug. ***

*** Bug 1331009 has been marked as a duplicate of this bug. ***

Bronce, we need this fixed in 7.2.z as well, and ASAP. Can you escalate?
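For reference, a reproducer matching the reported topology might look like the sketch below. The interface name enp1s0 and the profile names are hypothetical; the essential detail is that the team master carries no L3 configuration.

```sh
# Team master with no L3 configuration (the condition that prevents
# the device from being assumed on restart):
nmcli c add type team con-name team1 ifname team1 \
    ipv4.method disabled ipv6.method ignore
nmcli c add type ethernet con-name team1-port1 ifname enp1s0 \
    master team1 slave-type team
nmcli c add type vlan con-name team1.153 dev team1 id 153 \
    ip4 10.66.66.1/24

nmcli c up team1
nmcli c up team1-port1
nmcli c up team1.153

# With the bug present, this tears down team1 and team1.153 instead
# of leaving them up for assumption:
systemctl restart NetworkManager
nmcli device
```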
Created attachment 1204250 [details]
[PATCH] device: consider a device with slaves configured
(In reply to Lubomir Rintel from comment #41)
> Created attachment 1204250 [details]
> [PATCH] device: consider a device with slaves configured

Seems OK to me, though obviously we should make sure the test suite passes. Though for this part:

```
+	if (!(priv->slaves || nm_platform_link_can_assume (NM_PLATFORM_GET, nm_device_get_ifindex (self)))) {
+		/* The device has no layer 3 configuration and no slaves. Leave it up. */
 		return FALSE;
```

I'd rather see that as "!priv->slaves && !nm_platform_link_can_assume()", I think that's clearer... but doesn't this mean that master devices that could be managed (e.g. where nm_platform_link_can_assume() returns TRUE) would return TRUE from unmanaged_on_quit() and then get unmanaged, where before they wouldn't get touched?

I'm probably looking at it wrong, it's late on a Friday...

Created attachment 1204862 [details]
[PATCH] device: consider a device with slaves configured

(In reply to Dan Williams from comment #43)
> I'd rather see that as "!priv->slaves && !nm_platform_link_can_assume()", I
> think that's clearer... but doesn't this mean that master devices that
> could be managed (e.g. where nm_platform_link_can_assume() returns TRUE)
> would return TRUE from unmanaged_on_quit() and then get unmanaged, where
> before they wouldn't get touched?

You're reading that right -- I got that wrong, thinking that unmanaging prior to removing leaves the device up. I'm not sure when NMDevice's unmanaged_on_quit() returns FALSE, though -- !nm_platform_link_can_assume() along with nm_device_can_assume_active_connection() return TRUE for all cases I can imagine. It's perhaps safer to leave it alone though.

LGTM

QE: This is the setup needed to test (assuming two ethernets, ens11 and ens12):

```
# nmcli c add con-name james type bond ifname bond0 ipv4.method \
    disabled ipv6.method ignore autoconnect no
# nmcli c add type ethernet con-name slave-ens11 ifname ens11 \
    master bond0 slave-type bond autoconnect no
# nmcli c add type ethernet con-name slave-ens12 ifname ens12 \
    master bond0 slave-type bond autoconnect no
# nmcli c add type vlan dev bond0 id 153 autoconnect no \
    ip4 10.66.66.1/24
```

The important part is the bond0 bond (the "james" connection) having no L3 configuration.

To test the scenario, activate all the connections (they're autoconnect=no, so that autoconnect won't interfere on restart):

```
# nmcli c up james
# nmcli c up slave-ens11
# nmcli c up slave-ens12
# nmcli c up vlan
```

Then stop NetworkManager and check that all of the devices are still UP and still have their configuration (using "ip addr" or "ip link"). Then start NetworkManager back up and check that all connections have been assumed correctly with "nmcli d".

The -12 version works as expected.

Done. This works as expected.
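A minimal script condensing the verification steps above, assuming the four profiles from the QE setup already exist (bond0.153 is the device name nmcli assigns by default for the "vlan" profile):

```sh
#!/bin/bash
# Sketch of the QE verification flow; assumes the james, slave-ens11,
# slave-ens12 and vlan profiles from the setup above have been created.
set -e

for c in james slave-ens11 slave-ens12 vlan; do
    nmcli connection up "$c"
done

systemctl stop NetworkManager
ip link show bond0          # must still be UP
ip addr show bond0.153      # must still carry 10.66.66.1/24

systemctl start NetworkManager
sleep 2
nmcli device                # all devices should show as connected (assumed)
```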
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2581.html