Bug 1152848 - kernel BUG at include/linux/mm.h:321! [NEEDINFO]
Summary: kernel BUG at include/linux/mm.h:321!
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 20
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-10-15 04:10 UTC by Larkin Lowrey
Modified: 2014-12-10 15:01 UTC
CC List: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-12-10 15:01:41 UTC
Type: Bug
Embargoed:
jforbes: needinfo?


Attachments
page allocation failure: order:6, mode:0x4020 (1.76 KB, text/plain)
2014-10-15 08:14 UTC, Larkin Lowrey

Description Larkin Lowrey 2014-10-15 04:10:06 UTC
Description of problem:
Under high 10GbE network load, stack traces are dumped to the console and the process performing the transfer stops sending and receiving.

I can usually kill the process and restart it, but at one point I did get a "Kernel panic - not syncing: Fatal exception in interrupt".

Stack traces showing both failures are at the bottom of this entry.

Version-Release number of selected component (if applicable):


How reproducible:
Every time I run a large network file transfer.


Steps to Reproduce:
1. Run bbcp to transfer a large amount of data over 10GbE
2. Wait for error messages on the console

Actual results:
Transfer halts, process needs to be restarted, stack trace on the console.

Expected results:
No error


Additional info:

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors

[ 2629.313687] page:ffffea00048cb000 count:0 mapcount:-127 mapping:          (null) index:0x0
[ 2629.322002] page flags: 0x5ffff800004000(head)
[ 2629.326523] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[ 2629.333960] ------------[ cut here ]------------
[ 2629.338582] kernel BUG at include/linux/mm.h:321!
[ 2629.343278] invalid opcode: 0000 [#3] SMP
[ 2629.347408] Modules linked in: tn4022(OE) binfmt_misc bonding bridge stp llc it87 hwmon_vid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx bcache kvm_amd kvm crct10dif_pclmul eeepc_wmi snd_hda_codec_realtek snd_hda_codec_generic crc32_pclmul crc32c_intel asus_wmi snd_hda_intel sparse_keymap snd_hda_controller rfkill ghash_clmulni_intel snd_hda_codec amd64_edac_mod microcode serio_raw snd_hwdep edac_core fam15h_power edac_mce_amd k10temp snd_seq snd_seq_device snd_pcm snd_timer sp5100_tco snd soundcore raid0 i2c_piix4 tpm_tis tpm_infineon nfsd tpm auth_rpcgss shpchp nfs_acl acpi_cpufreq lockd sunrpc btrfs xor raid6_pq uas nouveau raid1 r8169 video mii i2c_algo_bit usb_storage drm_kms_helper mvsas ttm libsas mxm_wmi scsi_transport_sas drm i2c_core wmi [last unloaded: iptable_raw]
[ 2629.419808] CPU: 4 PID: 5674 Comm: bbcp Tainted: G      D    OE 3.16.4-200.fc20.x86_64 #1
[ 2629.428018] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99FX PRO R2.0, BIOS 2501 04/07/2014
[ 2629.438100] task: ffff880444eb4f00 ti: ffff88011a114000 task.ti: ffff88011a114000
[ 2629.445574] RIP: 0010:[<ffffffff8170475a>]  [<ffffffff8170475a>] put_page_testzero.part.16+0x10/0x12
[ 2629.454703] RSP: 0018:ffff88011a117b68  EFLAGS: 00010246
[ 2629.460006] RAX: 0000000000000000 RBX: ffff88011ae8b880 RCX: 0000000000000000
[ 2629.467127] RDX: 0000000000000000 RSI: ffff88045ed0e718 RDI: 0000000001232c00
[ 2629.474255] RBP: ffff88011a117b68 R08: 000000000000000a R09: 0000000000000728
[ 2629.481384] R10: 0000000000000000 R11: ffff88011a117836 R12: 0000000000000002
[ 2629.488503] R13: ffff8804450b0700 R14: 0000000000000020 R15: 0000000000000020
[ 2629.495632] FS:  00007fabc1540700(0000) GS:ffff88045ed00000(0000) knlGS:0000000000000000
[ 2629.503714] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2629.509456] CR2: 00007f700a787000 CR3: 00000001215d3000 CR4: 00000000000407e0
[ 2629.516577] Stack:
[ 2629.518585]  ffff88011a117b78 ffffffff8118bf06 ffff88011a117ba0 ffffffff815e7cb8
[ 2629.526034]  ffff8804450b0700 0000000000000001 ffff8804450b0700 ffff88011a117bb8
[ 2629.533493]  ffffffff815e7d64 ffff8804450b0700 ffff88011a117bd0 ffffffff815e8007
[ 2629.540968] Call Trace:
[ 2629.543440]  [<ffffffff8118bf06>] put_page+0x36/0x50
[ 2629.548394]  [<ffffffff815e7cb8>] skb_release_data+0x88/0x110
[ 2629.554128]  [<ffffffff815e7d64>] skb_release_all+0x24/0x30
[ 2629.559689]  [<ffffffff815e8007>] kfree_skb_partial+0x17/0x40
[ 2629.565424]  [<ffffffff8164fe84>] tcp_rcv_established+0x414/0x700
[ 2629.571513]  [<ffffffff81659eb5>] tcp_v4_do_rcv+0x1b5/0x4c0
[ 2629.577073]  [<ffffffff815e3089>] release_sock+0x99/0x160
[ 2629.582471]  [<ffffffff81646e88>] tcp_recvmsg+0x5c8/0xbd0
[ 2629.587867]  [<ffffffff816709bb>] inet_recvmsg+0x7b/0xa0
[ 2629.593167]  [<ffffffff815dd54e>] sock_aio_read.part.10+0x10e/0x150
[ 2629.599439]  [<ffffffff815dd5b1>] sock_aio_read+0x21/0x40
[ 2629.604827]  [<ffffffff811f2f87>] do_sync_read+0x67/0xa0
[ 2629.610127]  [<ffffffff811f3965>] vfs_read+0x135/0x170
[ 2629.615254]  [<ffffffff811f45a5>] SyS_read+0x55/0xd0
[ 2629.620220]  [<ffffffff8170efa9>] system_call_fastpath+0x16/0x1b
[ 2629.626264] Code: 48 c8 48 85 d2 48 0f 49 c2 48 01 c8 49 89 06 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 c7 c6 88 8b a3 81 48 89 e5 e8 36 06 a8 ff <0f> 0b 55 48 c7 c6 c8 8c a3 81 48 89 e5 e8 24 06 a8 ff 0f 0b 55
[ 2629.646445] RIP  [<ffffffff8170475a>] put_page_testzero.part.16+0x10/0x12
[ 2629.653245]  RSP <ffff88011a117b68>
[ 2629.656779] ---[ end trace a03a93c2cb977cc1 ]---
[ 2629.661427] Kernel panic - not syncing: Fatal exception in interrupt
[ 2629.667879] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 2629.678063] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
[ 2629.685197] ------------[ cut here ]------------
[ 2629.689811] WARNING: CPU: 4 PID: 5674 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
[ 2629.699362] Modules linked in: tn4022(OE) binfmt_misc bonding bridge stp llc it87 hwmon_vid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx bcache kvm_amd kvm crct10dif_pclmul eeepc_wmi snd_hda_codec_realtek snd_hda_codec_generic crc32_pclmul crc32c_intel asus_wmi snd_hda_intel sparse_keymap snd_hda_controller rfkill ghash_clmulni_intel snd_hda_codec amd64_edac_mod microcode serio_raw snd_hwdep edac_core fam15h_power edac_mce_amd k10temp snd_seq snd_seq_device snd_pcm snd_timer sp5100_tco snd soundcore raid0 i2c_piix4 tpm_tis tpm_infineon nfsd tpm auth_rpcgss shpchp nfs_acl acpi_cpufreq lockd sunrpc btrfs xor raid6_pq uas nouveau raid1 r8169 video mii i2c_algo_bit usb_storage drm_kms_helper mvsas ttm libsas mxm_wmi scsi_transport_sas drm i2c_core wmi [last unloaded: iptable_raw]
[ 2629.771509] CPU: 4 PID: 5674 Comm: bbcp Tainted: G      D    OE 3.16.4-200.fc20.x86_64 #1
[ 2629.779675] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99FX PRO R2.0, BIOS 2501 04/07/2014
[ 2629.789749]  0000000000000000 00000000a78e5c67 ffff88045ed03d90 ffffffff81707965
[ 2629.797199]  0000000000000000 ffff88045ed03dc8 ffffffff8108d0ad 0000000000000000
[ 2629.804665]  ffff88045ec145c0 0000000000000004 0000000000000004 000000000000e608
[ 2629.812139] Call Trace:
[ 2629.814600]  <IRQ>  [<ffffffff81707965>] dump_stack+0x45/0x56
[ 2629.820362]  [<ffffffff8108d0ad>] warn_slowpath_common+0x7d/0xa0
[ 2629.826365]  [<ffffffff8108d1da>] warn_slowpath_null+0x1a/0x20
[ 2629.832194]  [<ffffffff8104489d>] native_smp_send_reschedule+0x5d/0x60
[ 2629.838709]  [<ffffffff810cf0e4>] trigger_load_balance+0x144/0x1b0
[ 2629.844886]  [<ffffffff810bf3b7>] scheduler_tick+0x97/0xd0
[ 2629.850367]  [<ffffffff8109b2c0>] update_process_times+0x60/0x70
[ 2629.856404]  [<ffffffff810fdce5>] tick_sched_handle.isra.17+0x25/0x60
[ 2629.862840]  [<ffffffff810fdd61>] tick_sched_timer+0x41/0x60
[ 2629.868496]  [<ffffffff810b3794>] __run_hrtimer+0x74/0x1d0
[ 2629.873970]  [<ffffffff810fdd20>] ? tick_sched_handle.isra.17+0x60/0x60
[ 2629.880571]  [<ffffffff810b3b97>] hrtimer_interrupt+0x107/0x250
[ 2629.886479]  [<ffffffff810476a7>] local_apic_timer_interrupt+0x37/0x60
[ 2629.893002]  [<ffffffff81711d9f>] smp_apic_timer_interrupt+0x3f/0x60
[ 2629.899351]  [<ffffffff8170fe9d>] apic_timer_interrupt+0x6d/0x80
[ 2629.905350]  <EOI>  [<ffffffff817031c1>] ? panic+0x1c8/0x20c
[ 2629.911030]  [<ffffffff810173a3>] oops_end+0xd3/0xe0
[ 2629.916032]  [<ffffffff8101781b>] die+0x4b/0x70
[ 2629.920556]  [<ffffffff81013eb0>] do_trap+0xb0/0x150
[ 2629.925519]  [<ffffffff81014455>] do_error_trap+0x95/0x130
[ 2629.931000]  [<ffffffff8170475a>] ? put_page_testzero.part.16+0x10/0x12
[ 2629.937601]  [<ffffffff810e66d8>] ? vprintk_emit+0x1e8/0x550
[ 2629.943249]  [<ffffffff81014ab0>] do_invalid_op+0x20/0x30
[ 2629.948637]  [<ffffffff8171095e>] invalid_op+0x1e/0x30
[ 2629.953774]  [<ffffffff8170475a>] ? put_page_testzero.part.16+0x10/0x12
[ 2629.960383]  [<ffffffff8170475a>] ? put_page_testzero.part.16+0x10/0x12
[ 2629.966992]  [<ffffffff8118bf06>] put_page+0x36/0x50
[ 2629.971954]  [<ffffffff815e7cb8>] skb_release_data+0x88/0x110
[ 2629.977689]  [<ffffffff815e7d64>] skb_release_all+0x24/0x30
[ 2629.983251]  [<ffffffff815e8007>] kfree_skb_partial+0x17/0x40
[ 2629.989037]  [<ffffffff8164fe84>] tcp_rcv_established+0x414/0x700
[ 2629.995126]  [<ffffffff81659eb5>] tcp_v4_do_rcv+0x1b5/0x4c0
[ 2630.000695]  [<ffffffff815e3089>] release_sock+0x99/0x160
[ 2630.006091]  [<ffffffff81646e88>] tcp_recvmsg+0x5c8/0xbd0
[ 2630.011480]  [<ffffffff816709bb>] inet_recvmsg+0x7b/0xa0
[ 2630.016789]  [<ffffffff815dd54e>] sock_aio_read.part.10+0x10e/0x150
[ 2630.023052]  [<ffffffff815dd5b1>] sock_aio_read+0x21/0x40
[ 2630.028447]  [<ffffffff811f2f87>] do_sync_read+0x67/0xa0
[ 2630.033749]  [<ffffffff811f3965>] vfs_read+0x135/0x170
[ 2630.038919]  [<ffffffff811f45a5>] SyS_read+0x55/0xd0
[ 2630.043874]  [<ffffffff8170efa9>] system_call_fastpath+0x16/0x1b
[ 2630.049877] ---[ end trace a03a93c2cb977cc2 ]---
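
For triage, the page dump and the call chain can be pulled out of the oops text mechanically. The following Python sketch is illustrative and not part of the original report; the helper names (`parse_page_dump`, `reliable_frames`) are made up for this example. It also assumes two things about 3.16-era kernels that the report itself does not state: that the `mapcount` printed in the page dump is the raw `_mapcount` field plus one, and that a raw value of -128 (`PAGE_BUDDY_MAPCOUNT_VALUE`) marks a page sitting free in the buddy allocator. If both assumptions hold, `count:0 mapcount:-127` would suggest that `put_page()` was called on a page that had already been freed, but that is a reading of the log, not a confirmed diagnosis.

```python
import re

# Sample lines copied from the oops above.
OOPS = """\
[ 2629.313687] page:ffffea00048cb000 count:0 mapcount:-127 mapping:          (null) index:0x0
[ 2629.543440]  [<ffffffff8118bf06>] put_page+0x36/0x50
[ 2629.548394]  [<ffffffff815e7cb8>] skb_release_data+0x88/0x110
[ 2629.554128]  [<ffffffff815e7d64>] skb_release_all+0x24/0x30
[ 2629.559689]  [<ffffffff815e8007>] kfree_skb_partial+0x17/0x40
[ 2629.565424]  [<ffffffff8164fe84>] tcp_rcv_established+0x414/0x700
"""

# ASSUMPTION: value of PAGE_BUDDY_MAPCOUNT_VALUE in 3.16-era kernels.
PAGE_BUDDY_MAPCOUNT_VALUE = -128

PAGE_RE = re.compile(
    r"page:(?P<addr>[0-9a-f]+) count:(?P<count>-?\d+) mapcount:(?P<mapcount>-?\d+)"
)
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<maybe>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_page_dump(text):
    """Return the refcount/mapcount from the first 'page:' dump line, or None."""
    m = PAGE_RE.search(text)
    if m is None:
        return None
    count = int(m.group("count"))
    mapcount = int(m.group("mapcount"))
    # ASSUMPTION: the dump prints _mapcount + 1, so undo the offset.
    raw_mapcount = mapcount - 1
    return {
        "addr": m.group("addr"),
        "count": count,
        "mapcount": mapcount,
        "looks_free": count == 0 and raw_mapcount == PAGE_BUDDY_MAPCOUNT_VALUE,
    }


def reliable_frames(text):
    """List call-trace function names, skipping frames marked '?' (unreliable)."""
    return [m.group("func") for m in FRAME_RE.finditer(text) if not m.group("maybe")]


if __name__ == "__main__":
    print(parse_page_dump(OOPS))
    print(" <- ".join(reliable_frames(OOPS)))
```

Running both helpers over the two traces in this report gives the same chain from `put_page` up through `tcp_rcv_established`, which is why the second trace reads as fallout from the first rather than a separate failure.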

Comment 1 Larkin Lowrey 2014-10-15 04:11:00 UTC
The kernel is 3.16.4-200.fc20.x86_64; the problem was observed with 3.16.3-200.fc20.x86_64 as well.

Comment 2 Larkin Lowrey 2014-10-15 08:14:32 UTC
Created attachment 947136
page allocation failure: order:6, mode:0x4020

Comment 3 Larkin Lowrey 2014-10-15 08:32:02 UTC
Also, I've been getting a lot of "page allocation failure: order:6, mode:0x4020" messages (attached) unless I set vm.min_free_kbytes very high (e.g., 1048576). These occur only during periods of high network load.

Perhaps they are related?
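
To make the failure message easier to read, here is a small Python sketch that decodes the `order` and `mode` fields. It is not from the reporter, and it rests on two labeled assumptions: 4 KiB base pages on x86_64, and the GFP flag bit values as I recall them from the 3.16-era `include/linux/gfp.h` (they differ in other kernel versions). The function name `decode_failure` is hypothetical. Under those assumptions, `order:6` is a request for 64 physically contiguous pages (256 KiB), and `mode:0x4020` is `__GFP_HIGH | __GFP_COMP` with no `__GFP_WAIT`, i.e. an atomic allocation that cannot sleep to reclaim memory. That would fit a network receive path, and it would explain why a large `vm.min_free_kbytes` reserve hides the symptom, but whether it is related to the BUG above is not established here.

```python
import re

PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB base pages on x86_64

# ASSUMPTION: flag bit values from 3.16-era include/linux/gfp.h.
GFP_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x100: "__GFP_COLD",
    0x200: "__GFP_NOWARN",
    0x400: "__GFP_REPEAT",
    0x800: "__GFP_NOFAIL",
    0x1000: "__GFP_NORETRY",
    0x2000: "__GFP_MEMALLOC",
    0x4000: "__GFP_COMP",
    0x8000: "__GFP_ZERO",
}

FAIL_RE = re.compile(r"order:(?P<order>\d+), mode:0x(?P<mode>[0-9a-fA-F]+)")


def decode_failure(line):
    """Decode a 'page allocation failure' line into size and flag names."""
    m = FAIL_RE.search(line)
    if m is None:
        raise ValueError("not a page allocation failure line")
    order = int(m.group("order"))
    mode = int(m.group("mode"), 16)
    pages = 1 << order  # the buddy allocator hands out 2**order pages
    flags = [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]
    known = sum(bit for bit in GFP_BITS if mode & bit)
    return {
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        "flags": flags,
        "unknown_bits": mode & ~known,  # bits not covered by the table above
        "can_sleep": bool(mode & 0x10),  # __GFP_WAIT set?
    }


if __name__ == "__main__":
    info = decode_failure("page allocation failure: order:6, mode:0x4020")
    print(info)
```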

Comment 4 Justin M. Forbes 2014-11-13 16:02:18 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.17.2-200.fc20.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 21, and are still experiencing this issue, please change the version to Fedora 21.

If you experience different issues, please open a new bug report for those.

Comment 5 Justin M. Forbes 2014-12-10 15:01:41 UTC
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in over 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

