Bug 1433303

Summary: NetworkManager leaks NMDevice objects for enslaved veth devices
Product: Red Hat Enterprise Linux 7
Reporter: Sergio Lopez <slopezpa>
Component: NetworkManager
Assignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA
QA Contact: Desktop QE <desktop-qa-list>
Severity: urgent
Priority: high
Version: 7.3
CC: ajawarka, alanm, atragler, bgalvani, fgiudici, lrintel, rkhan, rki, sukulkar, thaller, tpelka, vbenes
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Unspecified
Fixed In Version: NetworkManager-1.8.0-0.4.rc1.el7
Clone Of:
: 1436650
Last Closed: 2017-08-01 09:24:38 UTC
Type: Bug
Bug Blocks: 1436650
Attachments:
[PATCH] manager: ensure proper disposal of unrealized devices (flags: none)
[PATCH v2] manager: ensure proper disposal of unrealized devices (flags: none)

Description Sergio Lopez 2017-03-17 11:01:50 UTC
Description of problem:

For containers using the "bridge" driver, Docker creates veth devices, one of which is enslaved to an existing network bridge.

When the container is stopped, those devices are removed, but a reference to the enslaved one is leaked, causing NetworkManager's VSZ and RSS to increase slowly but steadily over time.


Version-Release number of selected component (if applicable):

Tested with NetworkManager-1.4.0-17.el7_3.x86_64.
Upstream (1e4f1892e052c69983245b14e17a88dec6e5d138 2017-03-17) is also affected.


How reproducible:

Always.


Steps to Reproduce:

1. Run containers in a loop (e.g. "n=0; while echo $((++n)); docker run --rm busybox /bin/true; do :; done").
2. Wait for a few iterations.
3. Observe NetworkManager's VSZ and RSS increasing over time.
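
One way to observe step 3 is to sample the process's memory at a fixed interval. The following is a minimal sketch, not part of the original report: the helper names mem_kb and sample_mem are made up for illustration, and it assumes a ps that accepts the POSIX "-o vsz= -o rss=" output format (procps on RHEL does).

```shell
#!/bin/sh
# Hypothetical helpers, not part of NetworkManager or of the original report.

# Print "VSZ RSS" (both in KiB) for the process with the given PID.
mem_kb() {
    ps -o vsz= -o rss= -p "$1" | awk '{ print $1, $2 }'
}

# Sample a process COUNT times, INTERVAL seconds apart.
# Output: one line per sample, "<epoch-seconds> <vsz-kib> <rss-kib>".
sample_mem() {
    pid=$1
    interval=${2:-5}
    count=${3:-12}
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$(date +%s) $(mem_kb "$pid")"
        i=$((i + 1))
        # Do not sleep after the final sample.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, while the container loop is running in another terminal, `sample_mem "$(pidof NetworkManager)" 5 12` prints one sample every 5 seconds for a minute. With the leak present, both columns should trend upward rather than level off.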


Actual results:

NetworkManager keeps allocating and using more and more memory.


Expected results:

NetworkManager's memory usage should remain reasonably stable over time.


Additional info:

At nm-manager.c:2252, when a link is being removed from a software device, nm_device_unrealize is called instead of remove_device (which is used for hardware devices).

As a consequence, the device fails the condition at nm-manager.c:977 (nm_device_unrealize sets NMDevicePriv->real to FALSE), so nm_device_removed is never called. That is the function that would eventually remove the slave from its master, releasing the otherwise pseudo-leaked reference.

Unconditionally calling nm_device_removed, even for real == FALSE devices, seems to fix the problem, but I'm not sure if that's the proper solution.

Comment 2 Beniamino Galvani 2017-03-17 23:10:00 UTC
Created attachment 1264326
[PATCH] manager: ensure proper disposal of unrealized devices

Thank you for the detailed analysis. I can reproduce the leak on 1.4
and git master with this script:

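        # Run as root. Create the bridge that will act as master.
        ip l add br1 type bridge
        for i in $(seq 1 1000); do
                echo "$i"
                # Create a veth pair, bring one end up and give it an address
                ip l add veth$i type veth peer name vethp$i
                ip l set veth$i up
                ip a a dev veth$i 9.9.9.9
                # Enslave it to the bridge, then delete it again
                ip l set veth$i master br1
                ip l del veth$i
        done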

The attached patch against git master fixes the problem. It works for
1.4 too, but has to be applied manually there (a trivial change).
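
To compare a patched and an unpatched build, it can help to put a number on how much NetworkManager grows across one run of the script above. This is a rough sketch under stated assumptions: rss_kb and rss_delta_kb are hypothetical helper names, the VmRSS field is read from /proc (Linux only), and the reproducer is assumed to be saved as repro.sh, a file name chosen here for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print the resident set size (KiB) of PID, as reported by /proc/PID/status.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Run a command and print how much the RSS of PID changed while it ran (KiB).
# Usage: rss_delta_kb PID command [args...]
rss_delta_kb() {
    pid=$1
    shift
    before=$(rss_kb "$pid")
    "$@" >/dev/null      # discard the command's own output
    after=$(rss_kb "$pid")
    echo $((after - before))
}
```

Usage, as root: `rss_delta_kb "$(pidof NetworkManager)" sh repro.sh`. RSS fluctuates for unrelated reasons, so a single small delta proves little. Growth that scales with the number of loop iterations is the signal to look for, and it should disappear with the fix applied.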

Comment 3 Beniamino Galvani 2017-03-17 23:15:14 UTC
Created attachment 1264327
[PATCH v2] manager: ensure proper disposal of unrealized devices

Please ignore the previous patch.

Comment 4 Lubomir Rintel 2017-03-21 11:32:40 UTC
Looks good to me

Comment 5 Thomas Haller 2017-03-21 11:41:47 UTC
lgtm

Comment 15 errata-xmlrpc 2017-08-01 09:24:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2299