Bug 1059597

Summary:

"Out of memory" loglines when receiving ipv6 router advertisements

Product:

[Fedora] Fedora

Reporter:

Shivaram Lingamneni <slingamn>

Component:

NetworkManager

Assignee:

Lubomir Rintel <lrintel>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

CC:

awilliam, dcbw, ed.greshko, gary.buhrmaster, hcamp, lkundrak, lrintel, luton, paulds, raorn, rmj, rocketraman, sly.midnight

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

NetworkManager-0.9.9.0-45.git20131003.fc20

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-09-28 04:27:03 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
debug log of NetworkManager during 100% CPU usage / Out of memory	none
output of gstack for NetworkManager	none
tcpdump during error	none

Description Shivaram Lingamneni 2014-01-30 09:00:01 UTC

Description of problem:

NetworkManager can get into a state where an "Out of memory" error line is logged during every router advertisement. See #1048046 and #1044757 for similar but unrelated issues.

Version-Release number of selected component (if applicable):

0.9.9.0-27.git20131003.fc20

How reproducible:

Nondeterministic, but it's not resolved by `systemctl restart NetworkManager.service`.

Actual results:

Loglines like this:

Jan 30 00:52:44 good-fortune.localdomain NetworkManager[20674]: <error> [1391071964.648812] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:52:54 good-fortune.localdomain NetworkManager[20674]: <error> [1391071974.537399] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:53:01 good-fortune.localdomain NetworkManager[20674]: <error> [1391071981.667015] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:53:11 good-fortune.localdomain NetworkManager[20674]: <error> [1391071991.220960] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:53:20 good-fortune.localdomain NetworkManager[20674]: <error> [1391072000.447611] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:53:27 good-fortune.localdomain NetworkManager[20674]: <error> [1391072007.868005] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:53:31 good-fortune.localdomain NetworkManager[20674]: <error> [1391072011.285654] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:53:37 good-fortune.localdomain NetworkManager[20674]: <error> [1391072017.647993] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:53:42 good-fortune.localdomain NetworkManager[20674]: <error> [1391072022.681125] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:53:50 good-fortune.localdomain NetworkManager[20674]: <error> [1391072030.771302] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:54:00 good-fortune.localdomain NetworkManager[20674]: <error> [1391072040.517687] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:54:10 good-fortune.localdomain NetworkManager[20674]: <error> [1391072050.280287] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:54:19 good-fortune.localdomain NetworkManager[20674]: <error> [1391072059.660305] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:54:26 good-fortune.localdomain NetworkManager[20674]: <error> [1391072066.998106] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:54:34 good-fortune.localdomain NetworkManager[20674]: <error> [1391072074.410892] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:54:38 good-fortune.localdomain NetworkManager[20674]: <error> [1391072078.461932] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:54:45 good-fortune.localdomain NetworkManager[20674]: <error> [1391072085.923642] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:54:49 good-fortune.localdomain NetworkManager[20674]: <error> [1391072089.288010] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 30 00:54:56 good-fortune.localdomain NetworkManager[20674]: <error> [1391072096.186190] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory

Expected results:

Nothing should be logged during the receipt of a "typical" router advertisement.

Comment 1 Ed Greshko 2014-01-30 09:05:46 UTC

I get the same messages.  One thing I noticed is that they don't start appearing until sometime after boot.  Also, if I "disable" network and then "enable" it will be sometime before the messages start showing up in the logs again.

Comment 2 Ed Greshko 2014-01-31 13:36:24 UTC

FWIW, I rebooted my system this morning while working on an NFS issue and it took 2.5 hrs before the message started to appear in the logs.

Jan 31 10:29:24 meimei NetworkManager[781]: <error> [1391135364.40025] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory
Jan 31 10:29:28 meimei NetworkManager[781]: <error> [1391135368.830350] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory

Comment 3 woky 2014-02-02 16:42:50 UTC

Created attachment 858283 [details]
debug log of NetworkManager during 100% CPU usage / Out of memory

Hello,

I had transmission-gtk running downloading several ISOs and I experienced this behavior as well. It wasn't constant 100% CPU usage, but it was happening for few seconds every few seconds. In attached log every DISTINCT_IPV6_ADDRESS refers to different IPv6 address.

Comment 4 Jason Haar 2014-02-09 00:51:23 UTC

I'm seeing it too, but the one extra data point I can add is that I'm seeing it on networks that don't have ipv6 enabled. I appreciate the F20 always has ipv6 going at least as link-local, but just in case there's something to do with network-assigned ipv6 addresses, well that can't be it :-)

I also don't see the 100% CPU load problem

Finally, I have 8G RAM on my laptop with 8G of swap, and see these "out of memory" errors even though I'm not swapping, and have free RAM and RAM allocated to buffers  - ie the system has free RAM - so I can't figure out why anything would say it's out of memory

Comment 5 Jason Haar 2014-02-09 00:53:34 UTC

Whoops, I did forget to say my laptop moves between work and home and I am running ipv6 at home, but not at work. So I guess my home ipv6 RA events could trigger the bug, which then lives into the at-work-with-no-ipv6 phase?

Comment 6 Raman Gupta 2014-03-05 13:43:51 UTC

+1 Same thing here, Fedora 20, network with ipv6.

Comment 7 Shivaram Lingamneni 2014-05-02 08:36:03 UTC

I Googled a little and it looks like the problem has to do with resource exhaustion on the kernel side:

http://stackoverflow.com/questions/8478231/netlink-sending-from-kernel-to-user-eagain-and-enobufs

which would explain why restarting the userspace NetworkManager process doesn't fix the issue, also why the issue persists even after moving to a network without ipv6 (cf. comment 5).

Comment 8 Reilly Hall 2014-06-04 13:44:36 UTC

I am experiencing this issue with my system as well.  IPv6 via RA and also appears some abitrary amount of time after boot.  The affected system is also accessing a file server somewhat heavily over NFS via IPv6.

I have no other data points to add though.

Comment 9 Alexey I. Froloff 2014-07-07 13:49:34 UTC

I also observe this bug. NetworkManager-0.9.9.0-40.git20131003.fc20.x86_64

Comment 10 Ed Greshko 2014-09-12 22:17:13 UTC

A *very* annoying side effect of this bug is that sometime after these error messages start appearing it is impossible to create a VPN connection (openVPN and PPTP, at least) on the IPV4 side.

Comment 11 Ed Greshko 2014-09-15 04:19:12 UTC

I have recompiled NetworkManager-0.9.10.0-6.git20140704 from F21 and installed it on my F20 system.  After installing and rebooting I have not seen these error messages.

[root@meimei log]# uptime
 12:17:32 up 1 day,  5:45,  4 users,  load average: 2.75, 2.92, 2.70

Previously, these messages would start to appear within 4~6 hours.

Can NetworkManager-0.9.10.0-6 be officially added to F20?

Comment 12 Adam Williamson 2014-09-17 07:16:30 UTC

That's probably too big a change for a backport - note the date different, there's 9 months of development in there.

If we can identify the specific fix, though, backporting that might well be possible.

Comment 13 Adam Williamson 2014-09-17 07:21:27 UTC

Just doing a dumb search of the NM git log for 'memory':

http://cgit.freedesktop.org/NetworkManager/NetworkManager/log/?qt=grep&q=memory

http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=a032d82e2ade03ac0393c6d2e80558221944c6e1

looks like a possible candidate, maybe? Nothing else jumps out.

Comment 14 Shivaram Lingamneni 2014-09-17 08:03:54 UTC

The proximate cause of the error is ENOBUFS when reading from a netlink socket, so I looked for commits referencing netlink, but I didn't see anything that looked like a slam dunk. There are some changes referencing netlink and ipv6 address management, though.

I was briefly excited about this:

http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=38a9ac5cc22a39506453a238406bd7f9d9205943

as per this answer:

http://stackoverflow.com/a/18152453

but it looks as though it's already included in the F20 build.

Comment 15 Ed Greshko 2014-09-17 13:30:25 UTC

Well, if someone would add suspect patches and put them up for testing I'd be happy to give them a shot.  Otherwise, I'll continue to go with the F21 version as it has been working well in my environment.  It also fixes another issue where if a wired link is disabled it will totally disable the link.  In the existing F20 NetworkManager the wired link's IPV6 address is still live and functional.  So, the link is only partially disabled in F20.

Comment 16 Lubomir Rintel 2014-09-19 08:12:07 UTC

I'm having trouble reproducing this, though I've seen this myself in past. I've tried bombing NM with all sorts of big and malformed RAs without much success.

If I won't manage to reproduce this myself, I'd be very thankful if someone could provide some more traces:

1.) The actual network traffic (if you could obtain a caputure with tcpdump -w file.pcap it would be awesome)
2.) strace of the what's NM doing at the time (strace -s1024 -tt -o file.strace -f -p $(pidof NetworkManager))
3.) A traceback (gstack $(pidof NetworkManager))
4.) If you could abort NM and get ABRT create its report it would be awesome (pkill -ABRT $(pidof NetworkManager))

Thank you!

Comment 17 Ed Greshko 2014-09-19 11:00:59 UTC

Couple of questions/comments....

Should tcpdump be capturing all traffic?  I get this and after letting it sit for a minute the file.pcap is still 0.

[root@meimei ~]# tcpdump -w file.pcap
tcpdump: WARNING: SIOCGIFADDR: nflog: No such device
tcpdump: listening on nflog, link-type NFLOG (Linux netfilter log messages), capture size 65535 bytes
^C0 packets captured
0 packets received by filter
0 packets dropped by kernel


On my network, I have an ASUS RT-AC66U wireless router connected to the internet.  It is configured to create a 6in4 tunnel connecting to a tunnel server at Hurricane Electric.  It is also configured to Enable Router Advertisement. And, of course, the problem will not arise if the tunnel is disabled.

It is also impossible to determine when the problem will present itself.  It can take anywhere between 2 and 6 hours.  I don't know how large a file tcpdump may produce before I notice the occurrence of the problem.

Comment 18 Paul Stauffer 2014-09-19 11:40:15 UTC

Ed, try using the "-i" option to tcpdump to tell it which interface to capture on.  For example, if the interface exhibiting the problem is named wlan0, you would do "tcpdump -i wlan0 -w file.pcap".

You're right that this file could grow large if you leave it running for a long time.  Probably you'll just want to wait until the problem is already happening and then run the capture for a minute or so.

Comment 19 Ed Greshko 2014-09-19 13:41:05 UTC

Well, DUH, on the "-i" option.

OK, I've downgraded the NetworkManager packages to the official F20 version and will wait to see when it starts happening.  Will check it again in my morning.

Comment 20 Ed Greshko 2014-09-20 08:17:59 UTC

The problem started up at about 16:00 today.  I'm attaching the requested traces.  The "pkill" did not produce an ABRT.

Comment 21 Ed Greshko 2014-09-20 08:19:10 UTC

Created attachment 939485 [details]
output of gstack for NetworkManager

Comment 22 Ed Greshko 2014-09-20 08:19:55 UTC

Created attachment 939486 [details]
tcpdump during error

Comment 23 Ed Greshko 2014-09-20 08:24:27 UTC

The strace output was rather long.  It can be downloaded from here....

https://drive.google.com/file/d/0B2H9v1dYNcvpMVdiWVpoYk03aTg/edit?usp=sharing

Comment 24 Ed Greshko 2014-09-23 23:32:44 UTC

Another FWIW.  Doing this....

[root@meimei log]# echo 1 > /proc/sys/net/ipv6/conf/p128p1/disable_ipv6
[root@meimei log]# echo 0 > /proc/sys/net/ipv6/conf/p128p1/disable_ipv6

is enough to stop the "out of memory" condition temporarily.

Comment 25 Lubomir Rintel 2014-09-24 13:51:16 UTC

It seems to me that NM was removing re-adding all routes upon RA receipt and then running out of memory consuming the events notifications generated from that.

I'm wondering if you could try out a scratch build that adds two patches cherry-picked from Git master that address this issue and let me know if it helps?

http://koji.fedoraproject.org/koji/taskinfo?taskID=7676645

Thank you!

Comment 26 Ed Greshko 2014-09-24 14:21:39 UTC

If you would point me to the rpm for NetworkManager and NetworkManager-glib
that you would like me to install/test I'd be happy to oblige.

When I follow the link that you supply, I come upon 

NetworkManager-0.9.9.0-43.git20131003.fc20.lr2.x86_64.rpm 
and
NetworkManager-glib-0.9.9.0-43.git20131003.fc20.lr2.x86_64.rpm 

which appears to me to be a step back from 

NetworkManager-0.9.9.0-44.git20131003.fc20.x86_64
and
NetworkManager-glib-0.9.9.0-44.git20131003.fc20.x86_64

43 v.s. 44

Comment 27 Lubomir Rintel 2014-09-24 14:30:55 UTC

(In reply to Ed Greshko from comment #26)
...
> 43 v.s. 44

Rebased on top of -44: http://koji.fedoraproject.org/koji/taskinfo?taskID=7676959

Comment 28 Ed Greshko 2014-09-24 14:41:02 UTC

OK....  If I understand you correctly, I'll install (upgrade to?) the rpm's from....

http://koji.fedoraproject.org/koji/taskinfo?taskID=7676961

in my AM (now 22:38) and see how things go.

Comment 29 Lubomir Rintel 2014-09-24 14:43:07 UTC

(In reply to Ed Greshko from comment #28)
> OK....  If I understand you correctly, I'll install (upgrade to?) the rpm's
> from....
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=7676961
> 
> in my AM (now 22:38) and see how things go.

Correct.

(Please don't cancel the needinfo flag needlessly, it's used to track whether the feedback is already available)

Comment 30 Ed Greshko 2014-09-24 14:46:13 UTC

Ooopss....  Sorry about that.  I'm afraid I've had a bit too many Gin Tonics tonight.   The other reason why I'm going to wait until tomorrow to install.

Comment 31 Ed Greshko 2014-09-25 00:33:37 UTC

[root@meimei log]# rpm -q NetworkManager-glib
NetworkManager-glib-0.9.9.0-44.git20131003.fc20.lr2.x86_64
[root@meimei log]# rpm -q NetworkManager
NetworkManager-0.9.9.0-44.git20131003.fc20.lr2.x86_64
[root@meimei log]# uptime
 08:32:37 up  1:57,  4 users,  load average: 0.20, 0.31, 0.41

And now we wait.....

Comment 32 Lubomir Rintel 2014-09-25 15:47:05 UTC

Any news here? :)

Comment 33 Gary Buhrmaster 2014-09-25 16:16:08 UTC

Initial testing indicates that the cpu spike (and messages) are resolved with your scratch build (-44 based).  I will (of course) continue to monitor.  Thanks.

Comment 34 Fedora Update System 2014-09-25 19:31:00 UTC

NetworkManager-0.9.9.0-45.git20131003.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/NetworkManager-0.9.9.0-45.git20131003.fc20

Comment 35 Ed Greshko 2014-09-25 23:17:31 UTC

[root@meimei log]# uptime
 07:15:54 up 1 day, 41 min,  4 users,  load average: 0.18, 0.29, 0.34

And the problem has not shown.  So, I'm very confident the bug has been squashed.

Comment 36 Lubomir Rintel 2014-09-26 05:48:07 UTC

Thank you Ed, Gary & others involved!

Comment 37 Fedora Update System 2014-09-26 09:03:40 UTC

Package NetworkManager-0.9.9.0-45.git20131003.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing NetworkManager-0.9.9.0-45.git20131003.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-11534/NetworkManager-0.9.9.0-45.git20131003.fc20
then log in and leave karma (feedback).

Comment 38 Fedora Update System 2014-09-28 04:27:03 UTC

NetworkManager-0.9.9.0-45.git20131003.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 39 Luke 2015-03-19 03:23:37 UTC

I've recently started seeing this bug in my system (MacBook Pro 4,1), with F20 and

NetworkManager.x86_64 1:0.9.9.0-46.git20131003.fc20

The appropriate lines in journal log:

Mar 19 04:10:52 mcb NetworkManager[872]: <error> [1426734652.632793] [platform/nm-linux-platform.c:2736] event_handler(): Failed to retrieve incoming events: Out of memory
Mar 19 04:11:22 mcb NetworkManager[872]: <error> [1426734682.579452] [platform/nm-linux-platform.c:2736] event_handler(): Failed to retrieve incoming events: Out of memory
Mar 19 04:11:52 mcb NetworkManager[872]: <error> [1426734712.595163] [platform/nm-linux-platform.c:2736] event_handler(): Failed to retrieve incoming events: Out of memory
Mar 19 04:13:22 mcb NetworkManager[872]: <error> [1426734802.647432] [platform/nm-linux-platform.c:2736] event_handler(): Failed to retrieve incoming events: Out of memory
Mar 19 04:13:48 mcb kdeinit4[4825]: DIGEST-MD5 common mech free