Bug 1333983
Summary: Restarting NetworkManager causes devices to be lost from the network connections

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | NetworkManager |
| Version | 7.2 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Reporter | Saul Serna <sserna> |
| Assignee | Lubomir Rintel <lrintel> |
| QA Contact | Desktop QE <desktop-qa-list> |
| CC | aloughla, atragler, bgalvani, bmcclain, danken, dpathak, hasuzuki, kwalker, lkundrak, lrintel, mburman, mleitner, msugaya, ptalbert, rhepner, rkhan, snagar, sserna, sukulkar, thaller, tlavigne, vanhoof, vbenes, ylavi |
| Target Milestone | rc |
| Keywords | ZStream |
| Type | Bug |
| Doc Type | Bug Fix |
| Cloned to | 1379711 (view as bug list) |
| Bug Blocks | 1203710, 1301628, 1304509, 1379711 |
| Last Closed | 2016-11-03 19:09:26 UTC |
Description (Saul Serna, 2016-05-06 22:19:48 UTC)
Customers are impacted in a bad way. Changed to Urgent.

Hm, there's a VLAN on a team, but the team doesn't get assumed on restart because it has no L3 configuration.

What makes this have an urgent severity? Could we find a workaround until we fix this? If you restart NetworkManager to reload the configuration, you could instead do a "nmcli c reload".

Can we please get the intended configuration of the VLAN and team? Ideally the commands that were used to configure them initially, how they were intended to be configured, and perhaps a diagram of the intended setup. That way we can see if there is a gap between what was executed via the commands and the intention of the configuration.

Rashid, I've forwarded your inquiries to the customer. I'll update you as soon as I receive their feedback. Cut and paste from the email:

> On Tue, May 10, 2016 at 5:19 PM, James Mills <james.mills> wrote:
> The test of the nmcli con reload command worked as expected. No network services or devices were lost.

This is definitely a bug in NetworkManager. However, the fix is not straightforward. I'm working on it. To find a suitable workaround I need to know why the customer is restarting NetworkManager. Comment #8 indicates that "nmcli c reload" works; if the customer restarts NetworkManager in order to reload the configuration, they should do a "nmcli c reload" instead.
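For illustration, a minimal sketch of that workaround, assuming the profiles live as ifcfg files under /etc/sysconfig/network-scripts (the ifcfg-team1 filename is hypothetical):

```sh
# Re-read all connection profiles from disk without restarting the
# daemon; active devices (the team and its VLAN) are left untouched.
nmcli connection reload

# A single changed profile can also be reloaded selectively:
nmcli connection load /etc/sysconfig/network-scripts/ifcfg-team1
```

Unlike "systemctl restart NetworkManager", neither command tears down active devices, which is what triggers this bug.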
Hello, I have been poking at this a bit and I think I have found at least part of the problem. Normally, when the NetworkManager service is stopped, it is supposed to leave the active network configuration in place. This can be confirmed by stopping the service on a system with a more basic configuration. However, in the customer's case here, they are using a team. For some reason, when the NetworkManager service is stopped, it deactivates team1 and team1.153. The nm_device_cleanup() function removes IP addressing from the interfaces:

```
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.321530] [devices/nm-device.c:7600] nm_device_set_unmanaged(): [0x7f6f200c2420] (team1): now unmanaged
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <info>  (team1): device state change: deactivating -> unmanaged (reason 'unmanaged') [110 10 3]
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.321549] [devices/nm-device.c:8155] nm_device_cleanup(): [0x7f6f200c2420] (team1): deactivating device (reason 'unmanaged') [3]
--
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.340104] [devices/nm-device.c:7600] nm_device_set_unmanaged(): [0x7f6f200c27d0] (team1.153): now unmanaged
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <info>  (team1.153): device state change: activated -> unmanaged (reason 'unmanaged') [100 10 3]
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.341589] [devices/nm-device.c:8155] nm_device_cleanup(): [0x7f6f200c27d0] (team1.153): deactivating device (reason 'unmanaged') [3]
```

In the case of the team device, it stops the teamd process as well, but leaves the actual team interface in place:

```
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <info>  (team1): deactivation: stopping teamd...
May 11 10:18:38 rhel72.example.com NetworkManager[637]: <debug> [1462976318.323319] [NetworkManagerUtils.c:667] nm_utils_kill_child_async(): kill child process 'teamd' (700): wait for process to terminate after sending SIGTERM (15) (send SIGKILL in 2000 milliseconds)...
```

This leaves the team in an unmanageable state. When NetworkManager starts back up, it can't talk to the team. It would have to launch a new teamd process to take over the left-over team interface, which I do not believe it is smart enough to do. Perhaps some of the logic in remove_device() is incorrect? It is supposed to skip tearing down active interfaces on shutdown:

```c
static void
remove_device (NMManager *manager,
               NMDevice *device,
               gboolean quitting,
               gboolean allow_unmanage)
{
	NMManagerPrivate *priv = NM_MANAGER_GET_PRIVATE (manager);

	nm_log_dbg (LOGD_DEVICE, "(%s): removing device (allow_unmanage %d, managed %d)",
	            nm_device_get_iface (device), allow_unmanage, nm_device_get_managed (device));

	if (allow_unmanage && nm_device_get_managed (device)) {
		NMActRequest *req = nm_device_get_act_request (device);
		gboolean unmanage = FALSE;

		/* Leave activated interfaces up when quitting so their configuration
		 * can be taken over when NM restarts. This ensures connectivity while
		 * NM is stopped. Devices which do not support connection assumption
		 * cannot be left up.
		 */
		if (!quitting)  /* Forced removal; device already gone */
			unmanage = TRUE;
		else if (!nm_device_can_assume_active_connection (device))
			unmanage = TRUE;
		else if (!req)
			unmanage = TRUE;

		if (unmanage) {
			if (quitting)
				nm_device_set_unmanaged_quitting (device);
			else
				nm_device_set_unmanaged (device, NM_UNMANAGED_INTERNAL, TRUE, NM_DEVICE_STATE_REASON_REMOVED);
		} else if (quitting && nm_config_get_configure_and_quit (nm_config_get ())) {
			nm_device_spawn_iface_helper (device);
		}
	}

	g_signal_handlers_disconnect_matched (device, G_SIGNAL_MATCH_DATA, 0, 0, NULL, NULL, manager);

	nm_settings_device_removed (priv->settings, device, quitting);
	priv->devices = g_slist_remove (priv->devices, device);

	g_signal_emit (manager, signals[DEVICE_REMOVED], 0, device);
	g_object_notify (G_OBJECT (manager), NM_MANAGER_DEVICES);
	nm_device_removed (device);

	nm_dbus_manager_unregister_object (priv->dbus_mgr, device);
	g_object_unref (device);

	check_if_startup_complete (manager);
}
```

Hello Lubomir,

Thank you for this. Do you really mean bug #1311988?

Bug 1311988 - NetworkManager shuts down if the dbus.service is stopped

Thank you,

Patrick

(In reply to Patrick Talbert from comment #18)
> Do you really mean bug #1311988?
>
> Bug 1311988 - NetworkManager shuts down if the dbus.service is stopped

Yes

*** Bug 1325811 has been marked as a duplicate of this bug. ***

*** Bug 1331009 has been marked as a duplicate of this bug. ***

Bronce, we need this fixed in 7.2.z as well, and ASAP. Can you escalate?
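For reference, a reproducer matching the reported topology might look like the sketch below. The interface name enp1s0 and the profile names are hypothetical; the essential detail is that the team master carries no L3 configuration.

```sh
# Team master with no L3 configuration (the condition that prevents
# the device from being assumed on restart):
nmcli c add type team con-name team1 ifname team1 \
    ipv4.method disabled ipv6.method ignore
nmcli c add type ethernet con-name team1-port1 ifname enp1s0 \
    master team1 slave-type team
nmcli c add type vlan con-name team1.153 dev team1 id 153 \
    ip4 10.66.66.1/24

nmcli c up team1
nmcli c up team1-port1
nmcli c up team1.153

# With the bug present, this tears down team1 and team1.153 instead
# of leaving them up for assumption:
systemctl restart NetworkManager
nmcli device
```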
Created attachment 1204250 [details]
[PATCH] device: consider a device with slaves configured
(In reply to Lubomir Rintel from comment #41)
> Created attachment 1204250 [details]
> [PATCH] device: consider a device with slaves configured

Seems OK to me, though obviously we should make sure the test suite passes. Though for this part:

```
+	if (!(priv->slaves || nm_platform_link_can_assume (NM_PLATFORM_GET, nm_device_get_ifindex (self)))) {
+		/* The device has no layer 3 configuration and no slaves. Leave it up. */
 		return FALSE;
```

I'd rather see that as "!priv->slaves && !nm_platform_link_can_assume()", I think that's clearer... but doesn't this mean that master devices that could be managed (e.g. where nm_platform_link_can_assume() returns TRUE) would return TRUE from unmanaged_on_quit() and then get unmanaged, where before they wouldn't get touched?

I'm probably looking at it wrong, it's late on a Friday...

Created attachment 1204862 [details]
[PATCH] device: consider a device with slaves configured

(In reply to Dan Williams from comment #43)
> I'd rather see that as "!priv->slaves && !nm_platform_link_can_assume()", I
> think that's clearer... but doesn't this mean that master devices that
> could be managed (e.g. where nm_platform_link_can_assume() returns TRUE)
> would return TRUE from unmanaged_on_quit() and then get unmanaged, where
> before they wouldn't get touched?

You're reading that right -- I got that wrong, thinking that unmanaging prior to removing leaves the device up. I'm not sure when NMDevice's unmanaged_on_quit() returns FALSE, though -- !nm_platform_link_can_assume() along with nm_device_can_assume_active_connection() return TRUE for all cases I can imagine. It's perhaps safer to leave it alone though.

LGTM

QE: This is the setup needed to test (assuming two ethernets, ens11 and ens12):

```
# nmcli c add con-name james type bond ifname bond0 ipv4.method \
    disabled ipv6.method ignore autoconnect no
# nmcli c add type ethernet con-name slave-ens11 ifname ens11 \
    master bond0 slave-type bond autoconnect no
# nmcli c add type ethernet con-name slave-ens12 ifname ens12 \
    master bond0 slave-type bond autoconnect no
# nmcli c add type vlan dev bond0 id 153 autoconnect no \
    ip4 10.66.66.1/24
```

The important part is the bond0 bond (the "james" connection) having no L3 configuration.

To test the scenario, activate all the connections (they're autoconnect=no, so that autoconnect won't interfere on restart):

```
# nmcli c up james
# nmcli c up slave-ens11
# nmcli c up slave-ens12
# nmcli c up vlan
```

Then stop NetworkManager and check that all of the devices are still UP and still have their configuration (using "ip addr" or "ip link"). Then start NetworkManager back up and check that all connections have been assumed correctly with "nmcli d".

The -12 version works as expected.

Done. This works as expected.
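A minimal script condensing the verification steps above, assuming the four profiles from the QE setup already exist (bond0.153 is the device name nmcli assigns by default for the "vlan" profile):

```sh
#!/bin/bash
# Sketch of the QE verification flow; assumes the james, slave-ens11,
# slave-ens12 and vlan profiles from the setup above have been created.
set -e

for c in james slave-ens11 slave-ens12 vlan; do
    nmcli connection up "$c"
done

systemctl stop NetworkManager
ip link show bond0          # must still be UP
ip addr show bond0.153      # must still carry 10.66.66.1/24

systemctl start NetworkManager
sleep 2
nmcli device                # all devices should show as connected (assumed)
```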
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2581.html