Bug 626851
Summary: | Server freeze when multiple clients are writing large files | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Jonathan Dieter <jonathan> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED NOTABUG | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | medium | Docs Contact: | |
Priority: | low | ||
Version: | 13 | CC: | anton, aquini, dougsland, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, toby |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2010-08-27 16:23:27 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Jonathan Dieter
2010-08-24 14:56:22 UTC
Is it really completely locked up? Do the numlock/capslock keys stop lighting their LEDs when you press them? Does it respond to SysRq? (Add "sysrq_always_enabled ignore_loglevel" to the boot options and see if alt-sysrq-p shows any output.) Also, can you try a Fedora 12 kernel? Ok, thanks for the boot options, I got something different this time. It looks like I'm running out of RAM, though I'm not sure why, given that nfs on btrfs on drbd is the only thing running on this server. I've pasted the first crash from dmesg (don't know why it gave me a few minutes of network access this time before it completely froze). The server is now completely frozen and num lock and caps lock don't change the light indicators or anything. The display has been blanked because I wasn't on it for 15 minutes or whatever it is, but alt-sysrq-b doesn't reboot the system. Let me know if you need the full dmesg output, or if you still want me to try with a F12 kernel. swapper: page allocation failure. order:0, mode:0x4020 Pid: 0, comm: swapper Not tainted 2.6.34.4-41.fc13.x86_64 #1 Call Trace: <IRQ> [<ffffffff810d1424>] __alloc_pages_nodemask+0x5b2/0x635 [<ffffffff810f82b7>] alloc_pages_current+0x95/0x9e [<ffffffff810ff828>] alloc_slab_page+0x1b/0x5b [<ffffffff810ff8d7>] new_slab+0x6f/0x1d4 [<ffffffff81100266>] __slab_alloc+0x220/0x3ae [<ffffffff8139dafe>] ? __netdev_alloc_skb+0x34/0x51 [<ffffffff8110112c>] __kmalloc_node_track_caller+0xb8/0x118 [<ffffffff8139dafe>] ? __netdev_alloc_skb+0x34/0x51 [<ffffffff8139d6b8>] __alloc_skb+0x80/0x170 [<ffffffff8139dafe>] __netdev_alloc_skb+0x34/0x51 [<ffffffffa0128883>] rtl8169_rx_interrupt+0x223/0x4e0 [r8169] [<ffffffffa012b560>] rtl8169_poll+0x3c/0x1a6 [r8169] [<ffffffff8106b3ae>] ? sched_clock_cpu+0xc3/0xce [<ffffffff813a57af>] net_rx_action+0xaf/0x1ca [<ffffffff81010261>] ? sched_clock+0x9/0xd [<ffffffff81053245>] __do_softirq+0xe5/0x1a6 [<ffffffff8109e115>] ? handle_IRQ_event+0x60/0x121 [<ffffffff8100ab5c>] call_softirq+0x1c/0x30 [<ffffffff8100c342>] do_softirq+0x46/0x83 [<ffffffff810530b6>] irq_exit+0x3b/0x7d [<ffffffff814525b4>] do_IRQ+0xac/0xc3 [<ffffffff8144cd13>] ret_from_intr+0x0/0x11 <EOI> [<ffffffff810117c5>] ? mwait_idle+0x7a/0x87 [<ffffffff81011777>] ? mwait_idle+0x2c/0x87 [<ffffffff81008c22>] cpu_idle+0xaa/0xe4 [<ffffffff81445695>] start_secondary+0x253/0x294 Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 CPU 1: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 178 CPU 1: hi: 186, btch: 31 usd: 197 active_anon:6377 inactive_anon:2136 isolated_anon:0 active_file:207231 inactive_file:208770 isolated_file:0 unevictable:0 dirty:4681 writeback:98529 unstable:0 free:2474 slab_reclaimable:12039 slab_unreclaimable:10757 mapped:698 shmem:1405 pagetables:790 bounce:0 Node 0 DMA free:7876kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:1308kB inactive_file:6736kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15356kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:20kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 1966 1966 1966 Node 0 DMA32 free:2268kB min:5652kB low:7064kB high:8476kB active_anon:25508kB inactive_anon:8544kB active_file:827616kB inactive_file:828088kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2014096kB mlocked:0kB dirty:18724kB writeback:394116kB mapped:2792kB shmem:5620kB slab_reclaimable:48136kB slab_unreclaimable:43008kB kernel_stack:1728kB pagetables:3160kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 2*4kB 5*8kB 5*16kB 4*32kB 3*64kB 4*128kB 1*256kB 1*512kB 2*1024kB 2*2048kB 0*4096kB = 7872kB Node 0 DMA32: 455*4kB 12*8kB 0*16kB 3*32kB 0*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2268kB 417348 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 515327 pages RAM 10651 pages reserved 424422 pages shared 84890 pages non-shared SLUB: Unable to allocate memory on node -1 (gfp=0x20) cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 3, min order: 0 node 0: slabs: 1390, objs: 7274, free: 0 Oh, it's an r8169 card. I would recommend trying the boot option "pcie_aspm=off" first, and then try replacing the card if that doesn't help. The driver for that card has had lots of problems in the past. Swapped the hard drives into a different machine with an e1000e card. Here's the output from the first crash. However, the machine hasn't frozen yet. It just keeps going, while these errors keep showing up on the screen. swapper: page allocation failure. order:0, mode:0x4020 Pid: 0, comm: swapper Not tainted 2.6.34.4-41.fc13.x86_64 #1 Call Trace: <IRQ> [<ffffffff810d1424>] __alloc_pages_nodemask+0x5b2/0x635 [<ffffffff810f82b7>] alloc_pages_current+0x95/0x9e [<ffffffff810ff828>] alloc_slab_page+0x1b/0x5b [<ffffffff810ff8d7>] new_slab+0x6f/0x1d4 [<ffffffff81100266>] __slab_alloc+0x220/0x3ae [<ffffffff8139dafe>] ? __netdev_alloc_skb+0x34/0x51 [<ffffffff8110112c>] __kmalloc_node_track_caller+0xb8/0x118 [<ffffffff8139dafe>] ? __netdev_alloc_skb+0x34/0x51 [<ffffffff8139d6b8>] __alloc_skb+0x80/0x170 [<ffffffff8139dafe>] __netdev_alloc_skb+0x34/0x51 [<ffffffffa0104e78>] netdev_alloc_skb_ip_align+0x16/0x2c [e1000e] [<ffffffffa0104f0a>] e1000_alloc_rx_buffers+0x7c/0x15e [e1000e] [<ffffffffa01052be>] e1000_clean_rx_irq+0x260/0x29b [e1000e] [<ffffffff813a6a83>] ? dev_kfree_skb_any+0x38/0x3a [<ffffffffa010657d>] e1000_clean+0x7a/0x224 [e1000e] [<ffffffff813a57af>] net_rx_action+0xaf/0x1ca [<ffffffff81053245>] __do_softirq+0xe5/0x1a6 [<ffffffff8109e115>] ? handle_IRQ_event+0x60/0x121 [<ffffffff8100ab5c>] call_softirq+0x1c/0x30 [<ffffffff8100c342>] do_softirq+0x46/0x83 [<ffffffff810530b6>] irq_exit+0x3b/0x7d [<ffffffff814525b4>] do_IRQ+0xac/0xc3 [<ffffffff8144cd13>] ret_from_intr+0x0/0x11 <EOI> [<ffffffff810117c5>] ? mwait_idle+0x7a/0x87 [<ffffffff81011777>] ? mwait_idle+0x2c/0x87 [<ffffffff81008c22>] cpu_idle+0xaa/0xe4 [<ffffffff81434e9a>] rest_init+0x7e/0x80 [<ffffffff81baee6f>] start_kernel+0x413/0x41e [<ffffffff81bae2c8>] x86_64_start_reservations+0xb3/0xb7 [<ffffffff81bae3c4>] x86_64_start_kernel+0xf8/0x107 Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 CPU 1: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 105 CPU 1: hi: 186, btch: 31 usd: 92 active_anon:4192 inactive_anon:4226 isolated_anon:0 active_file:84917 inactive_file:97670 isolated_file:0 unevictable:0 dirty:5399 writeback:40684 unstable:0 free:1349 slab_reclaimable:7161 slab_unreclaimable:7232 mapped:495 shmem:1374 pagetables:776 bounce:0 Node 0 DMA free:3960kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:3816kB inactive_file:8032kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15288kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:68kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 984 984 984 Node 0 DMA32 free:1436kB min:3980kB low:4972kB high:5968kB active_anon:16768kB inactive_anon:16904kB active_file:335852kB inactive_file:382648kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1007976kB mlocked:0kB dirty:21596kB writeback:162736kB mapped:1980kB shmem:5496kB slab_reclaimable:28576kB slab_unreclaimable:28908kB kernel_stack:1688kB pagetables:3104kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:96 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 4*4kB 9*8kB 8*16kB 5*32kB 2*64kB 3*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3960kB Node 0 DMA32: 165*4kB 7*8kB 5*16kB 0*32kB 3*64kB 2*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1500kB 183950 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 259839 pages RAM 6616 pages reserved 190261 pages shared 67349 pages non-shared SLUB: Unable to allocate memory on node -1 (gfp=0x20) cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 3, min order: 0 node 0: slabs: 561, objs: 5966, free: 0 Just to update this, the server is still up a couple of hours later, but most of the clients been disconnected, which gives bonnie++ stale NFS handles when they reconnect. I'm assuming the clients are disconnecting because when the server runs out of RAM, it's unable to send some packets. I can go ahead and throw loads more RAM at this, but I'd like to know where it's going. Output of free right now: total used free shared buffers cached Mem: 1012896 1002664 10232 0 2776 738348 -/+ buffers/cache: 261540 751356 Swap: 0 0 0 So at least the e1000e driver can deal with memory allocation failures. /proc/meminfo should give you an overview of memory usage, and slabtop should show a breakdown of what is using kernel memory allocations. Oh wait, you have lots of free memory there. Try increasing /proc/sys/vm/min_free_kbytes to 32000 or more. Ok, setting /proc/sys/vm/min_free_kbytes to 64000 seems to have fixed the problem. I'm not getting anything in the logs about being unable to allocate RAM. Is this one of those "that's a knob you change on servers" things, or is there a real bug here? I'm am still running into an issue where bonnie++ on several of the clients is unable to open the file it has just created, but there's no error message on the server or client. I'm assuming this would be best reported as a separate bug, seeing as it probably has nothing to do with memory allocation problems. (In reply to comment #8) > Ok, setting /proc/sys/vm/min_free_kbytes to 64000 seems to have fixed the > problem. I'm not getting anything in the logs about being unable to allocate > RAM. Is this one of those "that's a knob you change on servers" things, or is > there a real bug here? > Yes, this is one of those things that heavily-loaded servers need to change from the default. I'll make a note to add it to the KernelCommonProblems wiki page. > I'm am still running into an issue where bonnie++ on several of the clients is > unable to open the file it has just created, but there's no error message on > the server or client. > > I'm assuming this would be best reported as a separate bug, seeing as it > probably has nothing to do with memory allocation problems. Report a new bug. Ok, thanks much for helping me track this down. I really appreciate it. |