Bug 1141256

Summary: Out of memory (-5) calling nl_recvmsgs_default()
Product: Red Hat Enterprise Linux 7
Reporter: Dan Williams <dcbw>
Component: NetworkManager
Assignee: Lubomir Rintel <lrintel>
Status: CLOSED ERRATA
QA Contact: Desktop QE <desktop-qa-list>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.1
CC: danw, dcbw, hof, jklimes, lpeer, lrintel, rkhan, s+redhatbugzilla, tgummels, thaller, vbenes, vhumpa
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: NetworkManager-0.995.0.0-1.el7
Doc Type: Bug Fix
Doc Text:
* NetworkManager could not receive notifications from the kernel when the network configuration changed heavily in quick succession, for example when changes to a bridge affected a large number of ports. The configuration is now synchronized properly if the kernel indicates that events have been missed. (BZ#1141256)
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-03-05 13:53:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1179614
Attachments:
Description            Flags
Suggested fix (el7)    none
Suggested fix (el7)    none

Description Dan Williams 2014-09-12 14:26:48 UTC
Sep  4 09:00:51 rose11 NetworkManager[682]: <error> [1409810451.326147] [platform/nm-linux-platform.c:3161] event_handler(): Failed to retrieve incoming events: Out of memory (-5)

which corresponds to:

	int nle;

	nle = nl_recvmsgs_default (priv->nlh_event);
	if (nle < 0)
		switch (nle) {
		case -NLE_DUMP_INTR:
			/* this most likely happens due to our request (RTM_GETADDR, AF_INET6, NLM_F_DUMP)
			 * to detect support for support_kernel_extended_ifa_flags. This is not critical
			 * and can happen easily. */
			debug ("Uncritical failure to retrieve incoming events: %s (%d)", nl_geterror (nle), nle);
			break;
		default:
---->			error ("Failed to retrieve incoming events: %s (%d)", nl_geterror (nle), nle);
			break;
	}

This system has ~180 network interfaces (it's an OVS system) so I can only assume there are a lot of messages going around.  However, since:

MemTotal:       16238772 kB
MemFree:          322568 kB
MemAvailable:    4009156 kB
Buffers:          165368 kB
Cached:          4414100 kB
SwapCached:         6728 kB

there is apparently still a ton of free/cached memory, so my assumption right now is that libnl has some upper bound on internal buffers that it's using.  NM is setting up the libnl socket buffer with 128K, which perhaps is not enough:

	/* The default buffer size wasn't enough for the testsuites. It might just
	 * as well happen with NetworkManager itself. For now let's hope 128KB is
	 * good enough.
	 */
	nle = nl_socket_set_buffer_size (priv->nlh_event, 131072, 0);

Perhaps NM should adjust the libnl3 buffer size based on the amount of memory in the system or, perhaps better, increase the buffer size when it notices that there are more than 50 interfaces on the system.
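
A minimal sketch of the second idea, scaling the buffer with the number of interfaces. Everything here is hypothetical: event_buffer_size() and the EVENT_BUF_* constants are not NetworkManager names, the 128 KiB base is the value quoted above, and the 4 KiB per-interface increment and the 8 MiB cap are assumptions picked only for illustration.

```c
#include <stddef.h>

/* Hypothetical tuning constants, not taken from NetworkManager. */
#define EVENT_BUF_BASE      (128 * 1024)       /* the 128K NM uses today */
#define EVENT_BUF_PER_LINK  (4 * 1024)         /* assumed extra room per interface */
#define EVENT_BUF_MAX       (8 * 1024 * 1024)  /* assumed upper bound */

/* Return a receive buffer size in bytes that grows linearly with the
 * number of interfaces and is clamped to EVENT_BUF_MAX. */
static int
event_buffer_size (size_t n_links)
{
	/* Number of interfaces that fit before the cap is reached. */
	size_t room = (EVENT_BUF_MAX - EVENT_BUF_BASE) / EVENT_BUF_PER_LINK;

	if (n_links >= room)
		return EVENT_BUF_MAX;
	return (int) (EVENT_BUF_BASE + n_links * EVENT_BUF_PER_LINK);
}
```

The result would be passed as the rxbuf argument of nl_socket_set_buffer_size() in place of the fixed 131072. Note that the kernel may still cap the effective size (net.core.rmem_max), so a larger request only makes an overrun less likely rather than impossible.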

Comment 2 Lubomir Rintel 2014-09-15 05:50:10 UTC
Created attachment 937446
Suggested fix (el7)

Would this make sense? (Patch that applies to el7 attached, master commit here: https://github.com/lkundrak/NetworkManager/commit/d54ff8b43983e2dce22e9c08402fefaf5a10a56f)

Comment 3 Dan Winship 2014-09-17 16:30:35 UTC
> Seems like 128k is not enough for systems with many interfaces. This adds 4096k

4k, not 4096k :)

>+	g_assert (!nle);

I wouldn't assert here; we don't know why that might fail. g_warning() or nm_log_warn() instead.

Comment 4 Travis Gummels 2014-09-18 16:05:48 UTC
Dan,

Partner Stratus would like visibility on this bug; they are seeing it in their lab.  They would like to follow the bug and contribute any relevant reproduction information.  Let me know if you approve.

Thank you,

Travis

Comment 5 Lubomir Rintel 2014-09-18 16:17:44 UTC
Created attachment 938978
Suggested fix (el7)

(In reply to Dan Winship from comment #3)
> > Seems like 128k is not enough for systems with many interfaces. This adds 4096k
> 
> 4k, not 4096k :)

Good catch. Corrected.

> >+	g_assert (!nle);
> 
> I wouldn't assert here; we don't know why that might fail. g_warning() or
> nm_log_warn() instead.

Done.

master: https://github.com/lkundrak/NetworkManager/commit/d262cab30a90d93148287a137c0af6b75fa133d3
el7: (attached)

Comment 6 Dan Williams 2014-09-19 21:22:18 UTC
d262cab30a90d93148287a137c0af6b75fa133d3 looks good to me

Comment 7 Dan Williams 2014-09-19 21:22:40 UTC
(In reply to Travis Gummels from comment #4)
> Dan,
> 
> Partner Stratus would like visibility on this bug; they are seeing it in
> their lab.  They would like to follow the bug and contribute any relevant
> reproduction information.  Let me know if you approve.
> 
> Thank you,
> 
> Travis

This bug is now public.

Comment 8 Jirka Klimes 2014-09-22 10:46:10 UTC
Pushed upstream:
efd0984 platform: increase NL buffer for systems with lots of interfaces (rh #1141256)

Comment 12 Lubomir Rintel 2014-12-09 11:01:43 UTC
Different fix via bug #1141266.

QA, here's how to test it:

0.) Create a bridge
# ip link add bridge0 type bridge

1.) Create a large number of interfaces and enslave them
# for i in $(seq 0 1000); do ip link add port$i type dummy; ip link set port$i master bridge0; done

2.) Delete the bridge (generates a link change event for each port)
# ip link del bridge0

Now you should see <error> messages about out of memory conditions. Check that NM recovered from them; this tool should generate empty output:

http://people.freedesktop.org/~lkundrak/nm-rtnl-diff.py

It would be awesome if you could integrate this into automated testing.

Thank you!
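
For reference, the recovery this test checks for can be pictured roughly as below. This is only a sketch under stated assumptions, not the code that went into NetworkManager: it assumes the "Out of memory (-5)" result is how an overrun of the event socket shows up, and it treats that result as "events were missed, re-read the state from the kernel" instead of as a plain error. classify_recv_result(), EventAction and the SKETCH_* constants are hypothetical names; the constants are local stand-ins so the block compiles without libnl.

```c
/* Local stand-ins for libnl3 error codes (hypothetical names).
 * 5 matches the "(-5)" in the log above; 33 is assumed to match
 * libnl3's NLE_DUMP_INTR. */
#define SKETCH_NLE_NOMEM      5
#define SKETCH_NLE_DUMP_INTR  33

typedef enum {
	EVENT_OK,      /* messages were processed normally */
	EVENT_IGNORE,  /* uncritical failure, only worth a debug message */
	EVENT_RESYNC,  /* events were lost, the cache must be refreshed */
	EVENT_ERROR    /* anything else, log as an error */
} EventAction;

/* Map the return value of a receive call on the event socket to the
 * action the event handler should take. */
static EventAction
classify_recv_result (int nle)
{
	if (nle >= 0)
		return EVENT_OK;

	switch (nle) {
	case -SKETCH_NLE_DUMP_INTR:
		return EVENT_IGNORE;
	case -SKETCH_NLE_NOMEM:
		return EVENT_RESYNC;
	default:
		return EVENT_ERROR;
	}
}
```

With a dispatch like this, the <error> messages in step 2 would become the trigger for a full refresh of links and addresses, which is why the diff tool above is expected to print nothing once NM has caught up.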

Comment 16 errata-xmlrpc 2015-03-05 13:53:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0311.html