1141256 – Out of memory (-5) calling nl_recvmsgs_default()

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	NetworkManager
Sub Component:
Version:	7.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Lubomir Rintel
QA Contact:	Desktop QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1179614

Reported:	2014-09-12 14:26 UTC by Dan Williams
Modified:	2015-03-05 13:53 UTC
CC List:	12 users
Fixed In Version:	NetworkManager-0.995.0.0-1.el7
Doc Type:	Bug Fix
Doc Text:	* NetworkManager could fail to receive notifications from the kernel when the network configuration changed heavily in quick succession, for example when changes to a bridge affected a large number of ports. NetworkManager now resynchronizes its configuration when the kernel indicates that events have been missed. (BZ#1141256)
Clone Of:
Environment:
Last Closed:	2015-03-05 13:53:06 UTC
Target Upstream Version:
Embargoed:

Attachments
Suggested fix (el7) (1.50 KB, text/plain) 2014-09-15 05:50 UTC, Lubomir Rintel
Suggested fix (el7) (1.58 KB, text/plain) 2014-09-18 16:17 UTC, Lubomir Rintel

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2015:0311	0	normal	SHIPPED_LIVE	NetworkManager bug fix and enhancement update	2015-03-05 17:35:10 UTC

Description Dan Williams 2014-09-12 14:26:48 UTC

Sep  4 09:00:51 rose11 NetworkManager[682]: <error> [1409810451.326147] [platform/nm-linux-platform.c:3161] event_handler(): Failed to retrieve incoming events: Out of memory (-5)

which corresponds to:

	int nle;

	nle = nl_recvmsgs_default (priv->nlh_event);
	if (nle < 0)
		switch (nle) {
		case -NLE_DUMP_INTR:
			/* this most likely happens due to our request (RTM_GETADDR, AF_INET6, NLM_F_DUMP)
			 * to detect support for support_kernel_extended_ifa_flags. This is not critical
			 * and can happen easily. */
			debug ("Uncritical failure to retrieve incoming events: %s (%d)", nl_geterror (nle), nle);
			break;
		default:
---->			error ("Failed to retrieve incoming events: %s (%d)", nl_geterror (nle), nle);
			break;
	}

This system has ~180 network interfaces (it's an OVS system) so I can only assume there are a lot of messages going around.  However, since:

MemTotal:       16238772 kB
MemFree:          322568 kB
MemAvailable:    4009156 kB
Buffers:          165368 kB
Cached:          4414100 kB
SwapCached:         6728 kB

there is apparently still a ton of free/cached memory, so my assumption right now is that libnl has some upper bound on internal buffers that it's using.  NM is setting up the libnl socket buffer with 128K, which perhaps is not enough:

	/* The default buffer size wasn't enough for the testsuites. It might just
	 * as well happen with NetworkManager itself. For now let's hope 128KB is
	 * good enough.
	 */
	nle = nl_socket_set_buffer_size (priv->nlh_event, 131072, 0);

Perhaps NM should scale the libnl3 buffer size with the amount of memory in the system or, perhaps better, increase the buffer size when it notices that there are more than 50 interfaces on the system.
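
The second suggestion could be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: the 128 KiB base and the 50-interface threshold come from the description above, while the per-link increment (EXTRA_BYTES_PER_LINK) and the function name suggest_rcvbuf_bytes() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* From the report: the size NM passes to nl_socket_set_buffer_size()
 * and the "> 50 interfaces" threshold proposed in the description. */
#define BASE_RCVBUF_BYTES    ((size_t) 131072)
#define LINK_THRESHOLD       ((size_t) 50)

/* Hypothetical: extra bytes reserved for each link above the threshold.
 * The real value would have to be tuned. */
#define EXTRA_BYTES_PER_LINK ((size_t) 4096)

/* Return a receive buffer size that grows with the number of links. */
size_t
suggest_rcvbuf_bytes (size_t n_links)
{
	size_t extra_links;

	if (n_links <= LINK_THRESHOLD)
		return BASE_RCVBUF_BYTES;

	extra_links = n_links - LINK_THRESHOLD;

	/* Refuse inputs that would overflow the multiplication below. */
	assert (extra_links <= (SIZE_MAX - BASE_RCVBUF_BYTES) / EXTRA_BYTES_PER_LINK);

	return BASE_RCVBUF_BYTES + extra_links * EXTRA_BYTES_PER_LINK;
}
```

With these assumed constants, the roughly 180 interfaces on this system would give 131072 + 130 * 4096 = 663552 bytes. The result would be passed as the receive size to nl_socket_set_buffer_size(); note that the kernel normally caps an unprivileged request at net.core.rmem_max, so a larger value may not take full effect.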

Comment 2 Lubomir Rintel 2014-09-15 05:50:10 UTC

Created attachment 937446
Suggested fix (el7)

Would this make sense? (Patch that applies to el7 attached, master commit here: https://github.com/lkundrak/NetworkManager/commit/d54ff8b43983e2dce22e9c08402fefaf5a10a56f)

Comment 3 Dan Winship 2014-09-17 16:30:35 UTC

> Seems like 128k is not enough for systems with many interfaces. This adds 4096k

4k, not 4096k :)

>+	g_assert (!nle);

I wouldn't assert here; we don't know why that might fail. g_warning() or nm_log_warn() instead.

Comment 4 Travis Gummels 2014-09-18 16:05:48 UTC

Dan,

Partner Stratus would like visibility on this bug, they are seeing this in their lab.  They would like to follow the bug and contribute any relevant reproduction information.  Let me know if you approve.

Thank you,

Travis

Comment 5 Lubomir Rintel 2014-09-18 16:17:44 UTC

Created attachment 938978
Suggested fix (el7)

(In reply to Dan Winship from comment #3)
> > Seems like 128k is not enough for systems with many interfaces. This adds 4096k
> 
> 4k, not 4096k :)

Good catch. Corrected.

> >+	g_assert (!nle);
> 
> I wouldn't assert here; we don't know why that might fail. g_warning() or
> nm_log_warn() instead.

Done.

master: https://github.com/lkundrak/NetworkManager/commit/d262cab30a90d93148287a137c0af6b75fa133d3
el7: (attached)

Comment 6 Dan Williams 2014-09-19 21:22:18 UTC

d262cab30a90d93148287a137c0af6b75fa133d3 looks good to me

Comment 7 Dan Williams 2014-09-19 21:22:40 UTC

(In reply to Travis Gummels from comment #4)
> Dan,
> 
> Partner Stratus would like visibility on this bug, they are seeing this in
> their lab.  They would like to follow the bug and contribute any relevant
> reproduction information.  Let me know if you approve.
> 
> Thank you,
> 
> Travis

This bug is now public.

Comment 8 Jirka Klimes 2014-09-22 10:46:10 UTC

Pushed upstream:
efd0984 platform: increase NL buffer for systems with lots of interfaces (rh #1141256)

Comment 12 Lubomir Rintel 2014-12-09 11:01:43 UTC

Different fix via bug #1141266.

QA, here's how to test it:

0.) Create a bridge
# ip link add bridge0 type bridge

1.) Create a large number of interfaces and enslave them
# for i in $(seq 0 1000); do ip link add port$i type dummy; ip link set port$i master bridge0; done

2.) Delete the bridge (generates a link change event for each port)
# ip link del bridge0

Now you should see <error> messages about out-of-memory conditions. Check that NM recovered from them; this tool should generate empty output:

http://people.freedesktop.org/~lkundrak/nm-rtnl-diff.py
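
For readers following the recovery path, here is a minimal sketch of the idea stated in the Doc Text: resynchronize when the kernel indicates that events were missed. It assumes that the -5 return seen in the log is that indication. The struct, the function name, and the EX_ constant are hypothetical and are not NetworkManager or libnl identifiers; the actual fix is tracked in bug #1141266.

```c
#include <assert.h>
#include <stdbool.h>

/* Matches the "(-5)" in the log line from the description. */
#define EX_NLE_NOMEM 5

/* Hypothetical platform state: tracks only whether a full re-read of
 * kernel state is pending and how often events were lost. */
struct ex_platform {
	bool resync_needed;
	unsigned int lost_event_bursts;
};

/* Inspect the return value of a receive call such as
 * nl_recvmsgs_default(). Returns true if the caller should re-read
 * links, addresses and routes from the kernel instead of trusting
 * its cached view. */
bool
ex_handle_recv_result (struct ex_platform *p, int nle)
{
	assert (p);

	if (nle >= 0)
		return false;   /* events were received normally */

	if (nle == -EX_NLE_NOMEM) {
		/* Events were dropped: the cache may be stale. */
		p->resync_needed = true;
		p->lost_event_bursts++;
		return true;
	}

	/* Other errors are left to the existing logging path. */
	return false;
}
```

In this sketch the error message would still be logged, which is consistent with the test above expecting <error> lines followed by a clean comparison against kernel state.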

It would be awesome if you could integrate this into automated testing.

Thank you!

Comment 16 errata-xmlrpc 2015-03-05 13:53:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0311.html
