Bug 1059597
Summary: | "Out of memory" loglines when receiving ipv6 router advertisements | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Shivaram Lingamneni <slingamn> | ||||||||
Component: | NetworkManager | Assignee: | Lubomir Rintel <lrintel> | ||||||||
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||
Severity: | medium | Docs Contact: | |||||||||
Priority: | unspecified | ||||||||||
Version: | 20 | CC: | awilliam, dcbw, ed.greshko, gary.buhrmaster, hcamp, lkundrak, lrintel, luton, paulds, raorn, rmj, rocketraman, sly.midnight | ||||||||
Target Milestone: | --- | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | x86_64 | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | NetworkManager-0.9.9.0-45.git20131003.fc20 | Doc Type: | Bug Fix | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2014-09-28 04:27:03 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Attachments: |
|
Description
Shivaram Lingamneni
2014-01-30 09:00:01 UTC
I get the same messages. One thing I noticed is that they don't start appearing until sometime after boot. Also, if I "disable" network and then "enable" it will be sometime before the messages start showing up in the logs again. FWIW, I rebooted my system this morning while working on an NFS issue and it took 2.5 hrs before the message started to appear in the logs. Jan 31 10:29:24 meimei NetworkManager[781]: <error> [1391135364.40025] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory Jan 31 10:29:28 meimei NetworkManager[781]: <error> [1391135368.830350] [platform/nm-linux-platform.c:2538] event_handler(): Failed to retrieve incoming events: Out of memory Created attachment 858283 [details]
debug log of NetworkManager during 100% CPU usage / Out of memory
Hello,
I had transmission-gtk running downloading several ISOs and I experienced this behavior as well. It wasn't constant 100% CPU usage, but it was happening for few seconds every few seconds. In attached log every DISTINCT_IPV6_ADDRESS refers to different IPv6 address.
I'm seeing it too, but the one extra data point I can add is that I'm seeing it on networks that don't have ipv6 enabled. I appreciate the F20 always has ipv6 going at least as link-local, but just in case there's something to do with network-assigned ipv6 addresses, well that can't be it :-) I also don't see the 100% CPU load problem Finally, I have 8G RAM on my laptop with 8G of swap, and see these "out of memory" errors even though I'm not swapping, and have free RAM and RAM allocated to buffers - ie the system has free RAM - so I can't figure out why anything would say it's out of memory Whoops, I did forget to say my laptop moves between work and home and I am running ipv6 at home, but not at work. So I guess my home ipv6 RA events could trigger the bug, which then lives into the at-work-with-no-ipv6 phase? +1 Same thing here, Fedora 20, network with ipv6. I Googled a little and it looks like the problem has to do with resource exhaustion on the kernel side: http://stackoverflow.com/questions/8478231/netlink-sending-from-kernel-to-user-eagain-and-enobufs which would explain why restarting the userspace NetworkManager process doesn't fix the issue, also why the issue persists even after moving to a network without ipv6 (cf. comment 5). I am experiencing this issue with my system as well. IPv6 via RA and also appears some abitrary amount of time after boot. The affected system is also accessing a file server somewhat heavily over NFS via IPv6. I have no other data points to add though. I also observe this bug. NetworkManager-0.9.9.0-40.git20131003.fc20.x86_64 A *very* annoying side effect of this bug is that sometime after these error messages start appearing it is impossible to create a VPN connection (openVPN and PPTP, at least) on the IPV4 side. I have recompiled NetworkManager-0.9.10.0-6.git20140704 from F21 and installed it on my F20 system. After installing and rebooting I have not seen these error messages. [root@meimei log]# uptime 12:17:32 up 1 day, 5:45, 4 users, load average: 2.75, 2.92, 2.70 Previously, these messages would start to appear within 4~6 hours. Can NetworkManager-0.9.10.0-6 be officially added to F20? That's probably too big a change for a backport - note the date different, there's 9 months of development in there. If we can identify the specific fix, though, backporting that might well be possible. Just doing a dumb search of the NM git log for 'memory': http://cgit.freedesktop.org/NetworkManager/NetworkManager/log/?qt=grep&q=memory http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=a032d82e2ade03ac0393c6d2e80558221944c6e1 looks like a possible candidate, maybe? Nothing else jumps out. The proximate cause of the error is ENOBUFS when reading from a netlink socket, so I looked for commits referencing netlink, but I didn't see anything that looked like a slam dunk. There are some changes referencing netlink and ipv6 address management, though. I was briefly excited about this: http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=38a9ac5cc22a39506453a238406bd7f9d9205943 as per this answer: http://stackoverflow.com/a/18152453 but it looks as though it's already included in the F20 build. Well, if someone would add suspect patches and put them up for testing I'd be happy to give them a shot. Otherwise, I'll continue to go with the F21 version as it has been working well in my environment. It also fixes another issue where if a wired link is disabled it will totally disable the link. In the existing F20 NetworkManager the wired link's IPV6 address is still live and functional. So, the link is only partially disabled in F20. I'm having trouble reproducing this, though I've seen this myself in past. I've tried bombing NM with all sorts of big and malformed RAs without much success. If I won't manage to reproduce this myself, I'd be very thankful if someone could provide some more traces: 1.) The actual network traffic (if you could obtain a caputure with tcpdump -w file.pcap it would be awesome) 2.) strace of the what's NM doing at the time (strace -s1024 -tt -o file.strace -f -p $(pidof NetworkManager)) 3.) A traceback (gstack $(pidof NetworkManager)) 4.) If you could abort NM and get ABRT create its report it would be awesome (pkill -ABRT $(pidof NetworkManager)) Thank you! Couple of questions/comments.... Should tcpdump be capturing all traffic? I get this and after letting it sit for a minute the file.pcap is still 0. [root@meimei ~]# tcpdump -w file.pcap tcpdump: WARNING: SIOCGIFADDR: nflog: No such device tcpdump: listening on nflog, link-type NFLOG (Linux netfilter log messages), capture size 65535 bytes ^C0 packets captured 0 packets received by filter 0 packets dropped by kernel On my network, I have an ASUS RT-AC66U wireless router connected to the internet. It is configured to create a 6in4 tunnel connecting to a tunnel server at Hurricane Electric. It is also configured to Enable Router Advertisement. And, of course, the problem will not arise if the tunnel is disabled. It is also impossible to determine when the problem will present itself. It can take anywhere between 2 and 6 hours. I don't know how large a file tcpdump may produce before I notice the occurrence of the problem. Ed, try using the "-i" option to tcpdump to tell it which interface to capture on. For example, if the interface exhibiting the problem is named wlan0, you would do "tcpdump -i wlan0 -w file.pcap". You're right that this file could grow large if you leave it running for a long time. Probably you'll just want to wait until the problem is already happening and then run the capture for a minute or so. Well, DUH, on the "-i" option. OK, I've downgraded the NetworkManager packages to the official F20 version and will wait to see when it starts happening. Will check it again in my morning. The problem started up at about 16:00 today. I'm attaching the requested traces. The "pkill" did not produce an ABRT. Created attachment 939485 [details]
output of gstack for NetworkManager
Created attachment 939486 [details]
tcpdump during error
The strace output was rather long. It can be downloaded from here.... https://drive.google.com/file/d/0B2H9v1dYNcvpMVdiWVpoYk03aTg/edit?usp=sharing Another FWIW. Doing this.... [root@meimei log]# echo 1 > /proc/sys/net/ipv6/conf/p128p1/disable_ipv6 [root@meimei log]# echo 0 > /proc/sys/net/ipv6/conf/p128p1/disable_ipv6 is enough to stop the "out of memory" condition temporarily. It seems to me that NM was removing re-adding all routes upon RA receipt and then running out of memory consuming the events notifications generated from that. I'm wondering if you could try out a scratch build that adds two patches cherry-picked from Git master that address this issue and let me know if it helps? http://koji.fedoraproject.org/koji/taskinfo?taskID=7676645 Thank you! If you would point me to the rpm for NetworkManager and NetworkManager-glib that you would like me to install/test I'd be happy to oblige. When I follow the link that you supply, I come upon NetworkManager-0.9.9.0-43.git20131003.fc20.lr2.x86_64.rpm and NetworkManager-glib-0.9.9.0-43.git20131003.fc20.lr2.x86_64.rpm which appears to me to be a step back from NetworkManager-0.9.9.0-44.git20131003.fc20.x86_64 and NetworkManager-glib-0.9.9.0-44.git20131003.fc20.x86_64 43 v.s. 44 (In reply to Ed Greshko from comment #26) ... > 43 v.s. 44 Rebased on top of -44: http://koji.fedoraproject.org/koji/taskinfo?taskID=7676959 OK.... If I understand you correctly, I'll install (upgrade to?) the rpm's from.... http://koji.fedoraproject.org/koji/taskinfo?taskID=7676961 in my AM (now 22:38) and see how things go. (In reply to Ed Greshko from comment #28) > OK.... If I understand you correctly, I'll install (upgrade to?) the rpm's > from.... > > http://koji.fedoraproject.org/koji/taskinfo?taskID=7676961 > > in my AM (now 22:38) and see how things go. Correct. (Please don't cancel the needinfo flag needlessly, it's used to track whether the feedback is already available) Ooopss.... Sorry about that. I'm afraid I've had a bit too many Gin Tonics tonight. The other reason why I'm going to wait until tomorrow to install. [root@meimei log]# rpm -q NetworkManager-glib NetworkManager-glib-0.9.9.0-44.git20131003.fc20.lr2.x86_64 [root@meimei log]# rpm -q NetworkManager NetworkManager-0.9.9.0-44.git20131003.fc20.lr2.x86_64 [root@meimei log]# uptime 08:32:37 up 1:57, 4 users, load average: 0.20, 0.31, 0.41 And now we wait..... Any news here? :) Initial testing indicates that the cpu spike (and messages) are resolved with your scratch build (-44 based). I will (of course) continue to monitor. Thanks. NetworkManager-0.9.9.0-45.git20131003.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/NetworkManager-0.9.9.0-45.git20131003.fc20 [root@meimei log]# uptime 07:15:54 up 1 day, 41 min, 4 users, load average: 0.18, 0.29, 0.34 And the problem has not shown. So, I'm very confident the bug has been squashed. Thank you Ed, Gary & others involved! Package NetworkManager-0.9.9.0-45.git20131003.fc20: * should fix your issue, * was pushed to the Fedora 20 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing NetworkManager-0.9.9.0-45.git20131003.fc20' as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/FEDORA-2014-11534/NetworkManager-0.9.9.0-45.git20131003.fc20 then log in and leave karma (feedback). NetworkManager-0.9.9.0-45.git20131003.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report. I've recently started seeing this bug in my system (MacBook Pro 4,1), with F20 and NetworkManager.x86_64 1:0.9.9.0-46.git20131003.fc20 The appropriate lines in journal log: Mar 19 04:10:52 mcb NetworkManager[872]: <error> [1426734652.632793] [platform/nm-linux-platform.c:2736] event_handler(): Failed to retrieve incoming events: Out of memory Mar 19 04:11:22 mcb NetworkManager[872]: <error> [1426734682.579452] [platform/nm-linux-platform.c:2736] event_handler(): Failed to retrieve incoming events: Out of memory Mar 19 04:11:52 mcb NetworkManager[872]: <error> [1426734712.595163] [platform/nm-linux-platform.c:2736] event_handler(): Failed to retrieve incoming events: Out of memory Mar 19 04:13:22 mcb NetworkManager[872]: <error> [1426734802.647432] [platform/nm-linux-platform.c:2736] event_handler(): Failed to retrieve incoming events: Out of memory Mar 19 04:13:48 mcb kdeinit4[4825]: DIGEST-MD5 common mech free |