Description of problem:
OOM kernel panic with no significant memory used by processes.

Version-Release number of selected component (if applicable):
kernel-4.3.3-301.fc23.x86_64
kernel-4.3.3-303.fc23.x86_64

How reproducible:
always

Steps to Reproduce:
1. Load batman-adv
2. Note call trace tainting kernel
3. Start getting OOM after about 12 hours, finally panic

Actual results:
OOM kernel panic

Expected results:
system stays up

Additional info:
I would guess it is related to batman-adv module, since that is where the taint happens.
Created attachment 1119563 [details]
tainting Call Trace

Loading batman-adv taints kernel
Created attachment 1119564 [details]
OOM before kernel panics

This is an OOM that gets logged before the kernel is totally wedged and panics.
Created attachment 1119565 [details] photo of console with kernel panic
When hitting the kernel panic, all other devices on the same switch become unreachable. Disconnecting the panicked system from the switch instantly restores normal operation to the other devices. Reconnecting the panicked system to the switch disables all other devices on the switch within seconds. There is no DoS until the kernel panic. At this point, I would like to increase the severity, but it is apparently too late.

The batman-adv taint is unrelated. When I disable loading batman-adv, the OOM still happens within 12 hours (with no user programs running, just sitting at the login screen). The kernel panic is now "Not tainted".
Created attachment 1119675 [details]
Not tainted OOM log before kernel panic

The batman-adv taint was a red herring. Here is an untainted OOM log before the panic.
Tested 4.3.4-300 - get OOM panic even faster, only 8 hours!
There's not really enough information here. You're the only person to report this issue. A large number of people have been running just fine, so there is something specific to your machine and/or workload combination. What else was updated on your system with 4.3.3-301? Do you have the full yum/dnf log? What is the machine actually doing? Do you have an older "working" kernel that continues to work, or if you boot into that does it also now exhibit OOMs?
Created attachment 1119931 [details] installed package list
(In reply to Josh Boyer from comment #7)
> What is the machine actually doing?

Nothing. Sitting at lightdm login prompt. Background daemons include nut-server, cjdroute in addition to stuff any PC would run (cron, chrony, etc).

> Do you have an older "working" kernel that continues to work, or if you boot
> into that does it also now exhibit OOMs?

I believe 4.2.8 works, but let me downgrade (has already been deleted) and check.
Created attachment 1119932 [details] dnf update log
I have confirmed that 4.2.8-300.fc23.x86_64 works correctly. This regression begins with 4.3.3 and continues with 4.3.4.
Created attachment 1120161 [details]
fpaste --sysinfo

Since a commenter believes the problem is system specific.
I'm not a kernel expert, but could this be a clue? slab_unreclaimable:388260kB Having nearly 4G of kernel memory unreclaimable seems like a recipe for OOM.
Just booted kernel-4.3.5-300 from koji, and slab_unreclaimable grows steadily by 100 50kB/sec. Gonna OOM panic in a few hours.
Modules loaded:

Module                  Size    Used by
xt_CHECKSUM             16384   1
ipt_MASQUERADE          16384   3
nf_nat_masquerade_ipv4  16384   1 ipt_MASQUERADE
tun                     28672   1
nf_conntrack_netbios_ns 16384   0
nf_conntrack_broadcast  16384   1 nf_conntrack_netbios_ns
ip6t_rpfilter           16384   1
ip6t_REJECT             16384   2
nf_reject_ipv6          16384   1 ip6t_REJECT
xt_conntrack            16384   29
ebtable_filter          16384   1
ebtable_broute          16384   1
bridge                  114688  1 ebtable_broute
stp                     16384   1 bridge
llc                     16384   2 stp,bridge
ebtable_nat             16384   1
ebtables                32768   3 ebtable_broute,ebtable_nat,ebtable_filter
ip6table_nat            16384   1
nf_conntrack_ipv6       20480   16
nf_defrag_ipv6          36864   1 nf_conntrack_ipv6
nf_nat_ipv6             16384   1 ip6table_nat
ip6table_raw            16384   1
ip6table_security       16384   1
ip6table_mangle         16384   1
ip6table_filter         16384   1
ip6_tables              28672   5 ip6table_filter,ip6table_mangle,ip6table_security,ip6table_nat,ip6table_raw
iptable_nat             16384   1
nf_conntrack_ipv4       16384   15
nf_defrag_ipv4          16384   1 nf_conntrack_ipv4
nf_nat_ipv4             16384   1 iptable_nat
nf_nat                  24576   3 nf_nat_ipv4,nf_nat_ipv6,nf_nat_masquerade_ipv4
nf_conntrack            106496  9 nf_conntrack_netbios_ns,nf_nat,nf_nat_ipv4,nf_nat_ipv6,xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_broadcast,nf_conntrack_ipv4,nf_conntrack_ipv6
iptable_raw             16384   1
iptable_security        16384   1
iptable_mangle          16384   1
arc4                    16384   2
rtl8192cu               69632   0
rtl_usb                 20480   1 rtl8192cu
rtl8192c_common         53248   1 rtl8192cu
rtlwifi                 73728   3 rtl_usb,rtl8192c_common,rtl8192cu
mac80211                696320  3 rtl_usb,rtlwifi,rtl8192cu
intel_rapl              20480   0
cfg80211                536576  2 mac80211,rtlwifi
rfkill                  24576   2 cfg80211
iosf_mbi                16384   1 intel_rapl
x86_pkg_temp_thermal    16384   0
coretemp                16384   0
kvm_intel               167936  0
kvm                     503808  1 kvm_intel
snd_hda_codec_hdmi      49152   1
snd_usb_audio           176128  1
crct10dif_pclmul        16384   0
crc32_pclmul            16384   0
cm109                   24576   0
crc32c_intel            24576   0
uvcvideo                90112   0
videobuf2_vmalloc       16384   1 uvcvideo
videobuf2_core          49152   1 uvcvideo
videobuf2_memops        16384   1 videobuf2_vmalloc
v4l2_common             16384   1 videobuf2_core
snd_hda_codec_realtek   81920   1
snd_hda_codec_generic   69632   1 snd_hda_codec_realtek
snd_usbmidi_lib         36864   1 snd_usb_audio
snd_rawmidi             32768   1 snd_usbmidi_lib
videodev                163840  3 uvcvideo,v4l2_common,videobuf2_core
iTCO_wdt                16384   0
iTCO_vendor_support     16384   1 iTCO_wdt
ppdev                   20480   0
snd_hda_intel           36864   1
snd_hda_codec           126976  4 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_intel
joydev                  20480   0
media                   24576   2 uvcvideo,videodev
dcdbas                  16384   0
snd_hda_core            61440   5 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_codec,snd_hda_intel
snd_hwdep               16384   2 snd_usb_audio,snd_hda_codec
snd_seq                 69632   0
snd_seq_device          16384   2 snd_seq,snd_rawmidi
snd_pcm                 114688  5 snd_usb_audio,snd_hda_codec_hdmi,snd_hda_codec,snd_hda_intel,snd_hda_core
snd_timer               32768   2 snd_pcm,snd_seq
parport_pc              28672   0
mei_me                  32768   0
lpc_ich                 24576   0
parport                 49152   2 ppdev,parport_pc
mei                     98304   1 mei_me
i2c_i801                20480   0
snd                     73728   17 snd_hda_codec_realtek,snd_usb_audio,snd_hwdep,snd_timer,snd_hda_codec_hdmi,snd_pcm,snd_seq,snd_rawmidi,snd_hda_codec_generic,snd_usbmidi_lib,snd_hda_codec,snd_hda_intel,snd_seq_device
soundcore               16384   1 snd
shpchp                  36864   0
tpm_tis                 20480   0
tpm                     36864   1 tpm_tis
nfsd                    311296  1
auth_rpcgss             61440   1 nfsd
nfs_acl                 16384   1 nfsd
lockd                   94208   1 nfsd
grace                   16384   2 nfsd,lockd
sunrpc                  315392  7 nfsd,auth_rpcgss,lockd,nfs_acl
hid_microsoft           16384   0
i915                    1138688 3
i2c_algo_bit            16384   1 i915
drm_kms_helper          122880  1 i915
drm                     335872  5 i915,drm_kms_helper
serio_raw               16384   0
e1000e                  237568  0
ptp                     20480   1 e1000e
pps_core                20480   1 ptp
fjes                    28672   0
video                   36864   1 i915
I have confirmed that other people are also encountering this bug - but with a slower rate of growth. Typically, a 4G system will stay up 120 hours. This is why you haven't gotten more complaints - they just don't know it's broke yet.
Running slabtop shows that the kmalloc-256 pool is ramping, adding about 15 objects per update.
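For anyone who wants to log the same ramp without watching slabtop, here is a minimal Python sketch that samples /proc/slabinfo and reports how many objects one cache gained between samples. It is not a tool used in this report; the function names, the 3 second sample interval, and the choice to multiply by the object size are my own assumptions. It relies only on the first four columns of the slabinfo format (name, active_objs, num_objs, objsize).

```python
import time


def parse_slabinfo(text):
    """Return {cache_name: (active_objs, objsize_bytes)} from /proc/slabinfo text."""
    caches = {}
    for line in text.splitlines():
        # Skip the "slabinfo - version" banner and the "# name ..." header.
        if line.startswith("slabinfo") or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name, active_objs, _num_objs, objsize = fields[:4]
        caches[name] = (int(active_objs), int(objsize))
    return caches


def growth(before, after, cache="kmalloc-256"):
    """Return (objects gained, bytes gained) for one cache between two snapshots."""
    objs_before, _ = before[cache]
    objs_after, objsize = after[cache]
    gained = objs_after - objs_before
    return gained, gained * objsize


def snapshot(path="/proc/slabinfo"):
    # Reading /proc/slabinfo normally requires root.
    with open(path) as f:
        return parse_slabinfo(f.read())


if __name__ == "__main__":
    previous = snapshot()
    while True:
        time.sleep(3)  # sample interval; an arbitrary choice
        current = snapshot()
        objs, nbytes = growth(previous, current)
        print("kmalloc-256: %+d objects (%+d bytes)" % (objs, nbytes))
        previous = current
```

A cache whose object count only ever goes up across many samples on an idle machine is the pattern described above.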
Likely dup: https://bugzilla.redhat.com/show_bug.cgi?id=1296972
Created attachment 1120230 [details]
Script to monitor SUnreclaimable

On my system:

32.528585279 Kb/s 32.528585279 Kb/s 12.7215287657 Hr
34.0999938048 Kb/s 34.0999938048 Kb/s 12.1123840839 Hr
31.7421681716 Kb/s 31.7421681716 Kb/s 12.9892758286 Hr
32.2198732604 Kb/s 32.2198732604 Kb/s 12.6846833191 Hr

Columns are the 10s rate, the cumulative rate, and the estimated time to OOM (optimistic).
Created attachment 1120302 [details]
Script to monitor SUnreclaimable

1.9979281453 Kb/s 3.14192024846 Kb/s 272.855076226 Hr
10.7887796079 Kb/s 3.39684378822 Kb/s 252.342745893 Hr
1.99792495711 Kb/s 3.3517125995 Kb/s 255.717032579 Hr
2.79709633895 Kb/s 3.33437907319 Kb/s 257.032363304 Hr
18.3809538213 Kb/s 3.79038014751 Kb/s 226.042410872 Hr
2.79710246789 Kb/s 3.76116335357 Kb/s 227.759022504 Hr
-13.9854327301 Kb/s 3.2540686734 Kb/s 263.185731587 Hr

Notice the lower average rate. I booted with ethernet disconnected.
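The attached script is not reproduced in this thread. As a rough illustration only, a monitor that prints the same three columns could look like the Python sketch below. Everything in it is an assumption on my part: the helper names, the use of MemFree as the memory still available to leak into, and the 10 second interval inferred from the "10s" column description.

```python
import time


def meminfo_kb(text, field):
    """Return one field (e.g. 'SUnreclaim') in kB from /proc/meminfo text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name == field:
            return int(rest.split()[0])
    raise KeyError(field)


def rate_kb_per_s(start_kb, end_kb, seconds):
    """Average growth in kB/s between two samples."""
    return (end_kb - start_kb) / float(seconds)


def hours_to_oom(free_kb, cum_rate):
    """Optimistic estimate: all currently free memory is consumed at cum_rate."""
    if cum_rate <= 0:
        return float("inf")
    return free_kb / cum_rate / 3600.0


def read_meminfo(path="/proc/meminfo"):
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    interval = 10  # seconds between samples (assumed)
    first = last = meminfo_kb(read_meminfo(), "SUnreclaim")
    elapsed = 0
    while True:
        time.sleep(interval)
        elapsed += interval
        text = read_meminfo()
        now = meminfo_kb(text, "SUnreclaim")
        cum = rate_kb_per_s(first, now, elapsed)
        print("%.2f kB/s  %.2f kB/s  %.2f Hr" % (
            rate_kb_per_s(last, now, interval),
            cum,
            hours_to_oom(meminfo_kb(text, "MemFree"), cum)))
        last = now
```

The estimate is optimistic in the same sense as above: it assumes the leak can use every free kilobyte before the OOM killer fires, which will not be true in practice.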
*** Bug 1303979 has been marked as a duplicate of this bug. ***
How did you determine bug#1303979 was actually a dup? There is no steady ramp of SUnreclaim there.
(In reply to Stuart D Gathman from comment #22)
> How did you determine bug#1303979 was actually a dup? There is no steady
> ramp of SUnreclaim there.

Didn't, but there's no reason to track unknown memory leak bugs multiple times when the original report is unresolved. If this one gets resolved and the other one doesn't, then we can reopen it.

You're still the only person reporting major issues with this. I know you found an earlier unrelated report and I'll be following up with that reporter later this week.
At least 2 people on #fedora are getting the ramp up of SUnreclaim with eventual OOM in est 120 hours - they just haven't bothered to file a bug report (or comment on this one).
(In reply to Stuart D Gathman from comment #20)
> Created attachment 1120302 [details]
> Script to monitor SUnreclaimable
>
> 1.9979281453 Kb/s 3.14192024846 Kb/s 272.855076226 Hr
> 10.7887796079 Kb/s 3.39684378822 Kb/s 252.342745893 Hr
> 1.99792495711 Kb/s 3.3517125995 Kb/s 255.717032579 Hr
> 2.79709633895 Kb/s 3.33437907319 Kb/s 257.032363304 Hr
> 18.3809538213 Kb/s 3.79038014751 Kb/s 226.042410872 Hr
> 2.79710246789 Kb/s 3.76116335357 Kb/s 227.759022504 Hr
> -13.9854327301 Kb/s 3.2540686734 Kb/s 263.185731587 Hr
>
> Notice the lower average rate. I booted with ethernet disconnected.

Just so I understand your numbers here, if you boot without ethernet, your leak drastically decreased? Is wifi in use still? If you boot with ethernet connected but do not load batman-adv, does the issue persist?

The problem here is narrowing down exactly what is causing the memory usage to grow unbounded. Your OOMs keep killing lightdm, but if that is the only "running" userspace process that isn't necessarily unusual.

As an aside, responses will be delayed this week.
(In reply to Stuart D Gathman from comment #24)
> At least 2 people on #fedora are getting the ramp up of SUnreclaim with
> eventual OOM in est 120 hours - they just haven't bothered to file a bug
> report (or comment on this one).

That's fine. There is no data that says their issue is the same as yours. There's also tons of people that run this kernel on their machine for a long time with no noticeable issues, including everyone I talked to.

I'm not dismissing your issue or pretending it doesn't exist, but I'm not interested in anecdotal data either. Let's not seek out people and have a bunch of pile on comments that may or may not be related at all. Let's just focus on the issue.
(In reply to Josh Boyer from comment #25)
> Just so I understand your numbers here, if you boot without ethernet, your
> leak drastically decreased? Is wifi in use still? If you boot with

Yes and yes.

Maybe boot with USB wifi disconnected also and see what the numbers do?

> ethernet connected but do not load batman-adv, does the issue persist?

As already mentioned in comment#5, batman-adv is disabled as it was tainting the kernel. It was a red herring. I considered refiling the bug with no mention of batman-adv - would that have been a good idea?

> The problem here is narrowing down exactly what is causing the memory usage
> to grow unbounded. Your OOMs keep killing lightdm, but if that is the only
> "running" userspace process that isn't necessarily unusual.

Yes, that is the only running userspace process other than chrony, systemd, etc.
(In reply to Stuart D Gathman from comment #27)
> (In reply to Josh Boyer from comment #25)
> > Just so I understand your numbers here, if you boot without ethernet, your
> > leak drastically decreased? Is wifi in use still? If you boot with
>
> Yes and yes.
>
> Maybe boot with USB wifi disconnected also and see what the numbers do?

Maybe?

> > ethernet connected but do not load batman-adv, does the issue persist?
>
> As already mentioned in comment#5, batman-adv is disabled as it was tainting

OK.

> the kernel. It was a red herring. I considered refiling the bug with no
> mention of batman-adv - would that have been a good idea?

Nah, no need for another bug.

> > The problem here is narrowing down exactly what is causing the memory usage
> > to grow unbounded. Your OOMs keep killing lightdm, but if that is the only
> > "running" userspace process that isn't necessarily unusual.
>
> Yes, that is the only running userspace process other than chrony, systemd,
> etc.

Since we're going with the assumption this is a memory leak, can you install the debug version of this kernel and boot with:

kmemleak=on

on the kernel command line? That might help point out where things are leaking. I will note that things will run slower due to the debug options being enabled, but it might provide some valuable insight.
I reported the relatively minor batman-adv problem separately in bug#1304428
Booted 4.3.4+debug. Leak is nicely repeatable:

30.3684745371 Kb/s 36.1620860164 Kb/s 26.8190956685 Hr
10.388982631 Kb/s 27.5709826689 Kb/s 35.155036022 Hr
27.1714183098 Kb/s 27.4710916208 Kb/s 35.2514001923 Hr
40.7569925372 Kb/s 30.1282779554 Kb/s 32.1187365456 Hr
33.9642494286 Kb/s 30.7676063718 Kb/s 31.4337455902 Hr
25.5982233899 Kb/s 30.0297450278 Kb/s 32.1900835024 Hr

Still converges quickly to 30 Kb/s.

Forgot to specify kmemleak=on, so will reboot with that.
With kmemleak=on it leaks slightly faster:

31.9854886088 Kb/s 38.9708032838 Kb/s 24.7616428146 Hr
32.366051291 Kb/s 36.7687797769 Kb/s 26.2369024684 Hr
37.5902786791 Kb/s 36.9740636134 Kb/s 26.0019807111 Hr
34.7634049268 Kb/s 36.5318077455 Kb/s 26.3660036889 Hr
19.1853997523 Kb/s 33.640765529 Kb/s 28.6122105586 Hr
42.7551460281 Kb/s 34.9431311854 Kb/s 27.5223379829 Hr
31.5666466932 Kb/s 34.5209795612 Kb/s 27.8416845188 Hr

kmemleak ran out of memory! I'll attach journalctl -b0
Created attachment 1120822 [details] journalctl -b0 with 4.3.4-300+debug
(In reply to Stuart D Gathman from comment #32)
> Created attachment 1120822 [details]
> journalctl -b0 with 4.3.4-300+debug

Hm. Not all that helpful. Can you give https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/kmemleak.txt a read and see if you can play around with it to get some more useful info?
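As a starting point for playing with kmemleak, the Python sketch below triggers a scan through the debugfs file described in that document and counts suspected leaks per allocation size. Treat it as a sketch under assumptions: it expects debugfs mounted at /sys/kernel/debug, root privileges, and report entries that begin with a line of the form "unreferenced object 0x... (size N):". The function names are hypothetical, not part of any kernel tooling.

```python
import re
from collections import Counter

# Assumed location; requires debugfs to be mounted and root access.
KMEMLEAK = "/sys/kernel/debug/kmemleak"

# First line of each report entry, as I understand the kmemleak report format.
HEADER = re.compile(r"^unreferenced object 0x[0-9a-f]+ \(size (\d+)\):", re.M)


def leaks_by_size(report):
    """Count suspected leaks per allocation size in a kmemleak report."""
    return Counter(int(size) for size in HEADER.findall(report))


def scan_and_read(path=KMEMLEAK):
    """Ask kmemleak for an immediate scan, then return the report text."""
    with open(path, "w") as f:
        f.write("scan")
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    for size, count in leaks_by_size(scan_and_read()).most_common():
        print("%6d bytes: %d unreferenced objects" % (size, count))
```

If one size dominates the summary, the backtraces of those entries in the raw report are the place to look for the allocating code path.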
Booted with both ether and wifi physically disconnected - memory leak disappears! I think (not being a kernel expert) that confirms this is related to networking.
I physically connected ethernet, and there is still no memory leak. Then with the monitor running, I inserted the USB wifi dongle. Memory starts leaking again!

To summarize:

net connected   leak rate
-------------   ---------
nothing         0 kb/s
ether           0 kb/s
wifi            5 kb/s
ether+wifi      30 kb/s (34kb/s with debug kernel)
(In reply to Stuart D Gathman from comment #35)
> I physically connected ethernet, and there is still no memory leak.
> Then with the monitor running, I inserted the USB wifi dongle. Memory
> starts leaking again!
>
> To summarize:
>
> net connected   leak rate
> -------------   ---------
> nothing         0 kb/s
> ether           0 kb/s
> wifi            5 kb/s
> ether+wifi      30 kb/s (34kb/s with debug kernel)

Hm, ok. Well, digging upstream in the rtlwifi driver (which I'm pretty sure is what you're using), we are likely missing this commit:

commit 17bc55864f81dd730d05f09b1641312a7990d636
Author: Peter Wu <peter>
Date:   Mon Dec 7 01:07:31 2015 +0100

    rtlwifi: fix memory leak for USB device

    Free skb for received frames with a wrong checksum. This can happen
    pretty rapidly, exhausting all memory.

    This fixes a memleak (detected with kmemleak). Originally found while
    using monitor mode, but it also appears during managed mode (once the
    link is up).

    Cc: stable.org
    Signed-off-by: Peter Wu <peter>
    ACKed-by: Larry Finger <Larry.Finger>
    Signed-off-by: Kalle Valo <kvalo>

I can build a kernel with that included for you to test. It will be 4.3.5 based, but that shouldn't be an issue.
http://koji.fedoraproject.org/koji/taskinfo?taskID=12806302 Please test that when it completes. As noted earlier, further replies might be delayed for a while.
That would explain bug#1303979 as well, as I have a dongle connected to the laptop. Let me reboot the laptop and see if it is stable with 4.3.5 and no dongle. (The dongle is to connect to a mesh network simultaneously with managed wifi.)
The memory leak goes away with kernel-4.3.5-300.bz1303270.fc23.x86_64 Also, the batman-adv oops is gone. Not sure if that is related.
While we seem to have explained the memory leak, what about the DoS when the kernel finally panics? Do I assume the driver left the NIC in a transmit state or something? Is it worth investigating?
Thanks for testing. I'll get the patch committed today.

As for the DoS, I have no idea, but it is certainly plausible the kernel left the card in a bad state. However, I probably won't have time to chase down that issue because it's in an error path that shouldn't happen to begin with.

Your help on this bug has been great. Thanks for putting in the effort.
kernel-4.3.6-201.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-e7162262b0
kernel-4.4.2-300.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-ec8b4ce774
kernel-4.3.6-201.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-e7162262b0
kernel-4.4.2-301.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-7e12ae5359
kernel-4.4.2-301.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-7e12ae5359
I've tested both 4.4.2-301 for f23 and 4.3.6-201 for f22 - both of which fix the problem, and don't seem to make other problems any worse. I left karma on bodhi. Do I close the bug now? Or will you?
Our update tooling will close the bug when the update reaches the stable repositories. Thank you.
kernel-4.4.3-200.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-a5ac00e07c
kernel-4.3.6-201.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.
kernel-4.4.2-301.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.
kernel-4.4.3-200.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-a5ac00e07c
kernel-4.4.3-201.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-9fbe2c258b
kernel-4.4.3-201.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-9fbe2c258b
kernel-4.4.3-201.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.