Description of problem:
When our web server comes under heavier load (a load of about 2-5 in top), it usually crashes with the following messages:

Aug 8 12:08:11 post kernel: httpd: page allocation failure. order:1, mode:0x20
Aug 8 12:08:11 post kernel: Pid: 20403, comm: httpd Tainted: G ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Aug 8 12:08:11 post kernel: Call Trace:
Aug 8 12:08:11 post kernel: <IRQ> [<ffffffff811200be>] ? __alloc_pages_nodemask+0x71e/0x8b0
Aug 8 12:08:11 post kernel: [<ffffffff81276500>] ? percpu_counter_compare+0x10/0x90
Aug 8 12:08:11 post kernel: [<ffffffff81159942>] ? kmem_getpages+0x62/0x170
Aug 8 12:08:11 post kernel: [<ffffffff8115a55a>] ? fallback_alloc+0x1ba/0x270
Aug 8 12:08:11 post kernel: [<ffffffff81159faf>] ? cache_grow+0x2cf/0x320
Aug 8 12:08:11 post kernel: [<ffffffff8115a2d9>] ? ____cache_alloc_node+0x99/0x160
Aug 8 12:08:11 post kernel: [<ffffffff8115b09b>] ? kmem_cache_alloc+0x11b/0x190
Aug 8 12:08:11 post kernel: [<ffffffff81411808>] ? sk_prot_alloc+0x48/0x180
Aug 8 12:08:11 post kernel: [<ffffffff81411a52>] ? sk_clone+0x22/0x2c0
Aug 8 12:08:11 post kernel: [<ffffffff8145c736>] ? inet_csk_clone+0x16/0xd0
Aug 8 12:08:11 post kernel: [<ffffffff81475833>] ? tcp_create_openreq_child+0x23/0x450
Aug 8 12:08:11 post kernel: [<ffffffff8147320d>] ? tcp_v4_syn_recv_sock+0x4d/0x2a0
Aug 8 12:08:11 post kernel: [<ffffffff814755f1>] ? tcp_check_req+0x201/0x420
Aug 8 12:08:11 post kernel: [<ffffffff81472c2b>] ? tcp_v4_do_rcv+0x35b/0x430
Aug 8 12:08:11 post kernel: [<ffffffffa0170557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
Aug 8 12:08:11 post kernel: [<ffffffff814743a3>] ? tcp_v4_rcv+0x4e3/0x870
Aug 8 12:08:11 post kernel: [<ffffffff81451fe0>] ? ip_local_deliver_finish+0x0/0x2d0
Aug 8 12:08:11 post kernel: [<ffffffff81447444>] ? nf_hook_slow+0x74/0x100
Aug 8 12:08:11 post kernel: [<ffffffff81451fe0>] ? ip_local_deliver_finish+0x0/0x2d0
Aug 8 12:08:11 post kernel: [<ffffffff814520bd>] ? ip_local_deliver_finish+0xdd/0x2d0
Aug 8 12:08:11 post kernel: [<ffffffff81452348>] ? ip_local_deliver+0x98/0xa0
Aug 8 12:08:11 post kernel: [<ffffffff8145180d>] ? ip_rcv_finish+0x12d/0x440
Aug 8 12:08:11 post kernel: [<ffffffff81451d95>] ? ip_rcv+0x275/0x350
Aug 8 12:08:11 post kernel: [<ffffffff8141d90b>] ? __netif_receive_skb+0x39b/0x6b0
Aug 8 12:08:11 post kernel: [<ffffffff8141fc18>] ? netif_receive_skb+0x58/0x60
Aug 8 12:08:11 post kernel: [<ffffffff8141fd20>] ? napi_skb_finish+0x50/0x70
Aug 8 12:08:11 post kernel: [<ffffffff81422059>] ? napi_gro_receive+0x39/0x50
Aug 8 12:08:11 post kernel: [<ffffffffa011a121>] ? tg3_poll_work+0x6b1/0xdf0 [tg3]
Aug 8 12:08:11 post kernel: [<ffffffff8141fc18>] ? netif_receive_skb+0x58/0x60
Aug 8 12:08:11 post kernel: [<ffffffff8141fdbd>] ? napi_gro_complete+0x7d/0xd0
Aug 8 12:08:11 post kernel: [<ffffffffa011a8c4>] ? tg3_poll+0x64/0x210 [tg3]
Aug 8 12:08:11 post kernel: [<ffffffff81422173>] ? net_rx_action+0x103/0x2f0
Aug 8 12:08:11 post kernel: [<ffffffff8106f6e1>] ? __do_softirq+0xc1/0x1d0
Aug 8 12:08:11 post kernel: [<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
Aug 8 12:08:11 post kernel: [<ffffffff8100df05>] ? do_softirq+0x65/0xa0
Aug 8 12:08:11 post kernel: [<ffffffff8106f4c5>] ? irq_exit+0x85/0x90
Aug 8 12:08:11 post kernel: [<ffffffff814e2e15>] ? do_IRQ+0x75/0xf0
Aug 8 12:08:11 post kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
Aug 8 12:08:11 post kernel: <EOI>

Version-Release number of selected component (if applicable):
kernel-2.6.32-131.6.1.el6.x86_64 (but the problem also exists in at least two previous versions)

How reproducible:
From time to time.

Steps to Reproduce:
1. Just leave the server for a while.

Actual results:
The httpd process hangs and the kernel crashes.

Expected results:
It should not crash.

Additional info:
The web server runs *some* files from an NFS volume.
Hi, on which kernel version did you find this issue? Thanks!
As I said in the original report, this is with kernel-2.6.32-131.6.1.el6.x86_64. I can also confirm the issue with at least two earlier kernel RPM packages, and with the latest kernel-2.6.32-131.12.1.el6.x86_64 too.
Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
I think this is related to Bug #713546: https://bugzilla.redhat.com/show_bug.cgi?id=713546
The warning in this bug report is merely that - a warning from the kernel stating that it could not allocate an order 1 (2 contiguous pages) buffer for a network packet. This should not cause either the system or httpd to hang or crash. Vilius, does the system actually hang or crash? If so, what is the kernel message (if any) showing the hang or crash? If you have trouble identifying the information that will help engineering fix the issue, please contact Red Hat support so things can get properly analyzed before they get escalated to engineering.
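To make the "order 1 (2 contiguous pages)" arithmetic concrete, here is a minimal Python sketch that pulls the order out of a failure line like the ones in this report and converts it to a page count and a size in bytes. The helper name `describe_failure` and the regular expression are illustrative assumptions, and the 4 KiB page size is assumed (the usual base page size on x86_64), not something read from the affected system.

```python
import re

# Assumption: 4 KiB base pages, the usual setting on x86_64.
PAGE_SIZE = 4096

# Matches e.g. "httpd: page allocation failure. order:1, mode:0x20"
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def describe_failure(line):
    """Return details of a page allocation failure line, or None if no match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    pages = 1 << order  # an order-N request asks for 2**N contiguous pages
    return {
        "comm": match.group("comm"),
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        "mode": int(match.group("mode"), 16),  # raw allocation flags, not decoded
    }


info = describe_failure(
    "Aug 8 12:08:11 post kernel: httpd: page allocation failure. order:1, mode:0x20"
)
print(info)
```

For the line from the original report this gives 2 pages, i.e. 8192 bytes under the 4 KiB assumption, which matches the description above.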
Basically, when this starts, I see more and more such messages. Sometimes the messages are sent "from" httpd, sometimes from other software on the server, or for example from the swapper:

Nov 4 22:21:04 post kernel: swapper: page allocation failure. order:1, mode:0x20
Nov 4 22:21:04 post kernel: Pid: 0, comm: swapper Not tainted 2.6.32-131.17.1.el6.x86_64 #1
Nov 4 22:21:04 post kernel: Call Trace:
Nov 4 22:21:04 post kernel: <IRQ> [<ffffffff8112016e>] ? __alloc_pages_nodemask+0x71e/0x8b0
Nov 4 22:21:04 post kernel: [<ffffffff81159a52>] ? kmem_getpages+0x62/0x170
Nov 4 22:21:04 post kernel: [<ffffffff8115a66a>] ? fallback_alloc+0x1ba/0x270
Nov 4 22:21:04 post kernel: [<ffffffff8115a0bf>] ? cache_grow+0x2cf/0x320
Nov 4 22:21:04 post kernel: [<ffffffff8115a3e9>] ? ____cache_alloc_node+0x99/0x160
Nov 4 22:21:04 post kernel: [<ffffffff8115b1ab>] ? kmem_cache_alloc+0x11b/0x190
Nov 4 22:21:04 post kernel: [<ffffffff81411ba8>] ? sk_prot_alloc+0x48/0x1a0
Nov 4 22:21:04 post kernel: [<ffffffff81411e12>] ? sk_clone+0x22/0x2c0
Nov 4 22:21:04 post kernel: [<ffffffff8145caf6>] ? inet_csk_clone+0x16/0xd0
Nov 4 22:21:04 post kernel: [<ffffffff81475be3>] ? tcp_create_openreq_child+0x23/0x450
Nov 4 22:21:04 post kernel: [<ffffffff814735cd>] ? tcp_v4_syn_recv_sock+0x4d/0x2a0
Nov 4 22:21:04 post kernel: [<ffffffff814759a1>] ? tcp_check_req+0x201/0x420
Nov 4 22:21:04 post kernel: [<ffffffff81472feb>] ? tcp_v4_do_rcv+0x35b/0x430
Nov 4 22:21:04 post kernel: [<ffffffffa0261557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
Nov 4 22:21:04 post kernel: [<ffffffff81474760>] ? tcp_v4_rcv+0x4e0/0x860
Nov 4 22:21:04 post kernel: [<ffffffff814523a0>] ? ip_local_deliver_finish+0x0/0x2d0
Nov 4 22:21:04 post kernel: [<ffffffff81447804>] ? nf_hook_slow+0x74/0x100
Nov 4 22:21:04 post kernel: [<ffffffff814523a0>] ? ip_local_deliver_finish+0x0/0x2d0
Nov 4 22:21:04 post kernel: [<ffffffff8145247d>] ? ip_local_deliver_finish+0xdd/0x2d0
Nov 4 22:21:04 post kernel: [<ffffffff81452708>] ? ip_local_deliver+0x98/0xa0
Nov 4 22:21:04 post kernel: [<ffffffff81451bcd>] ? ip_rcv_finish+0x12d/0x440
Nov 4 22:21:04 post kernel: [<ffffffff81452155>] ? ip_rcv+0x275/0x350
Nov 4 22:21:04 post kernel: [<ffffffff8141dccb>] ? __netif_receive_skb+0x39b/0x6b0
Nov 4 22:21:04 post kernel: [<ffffffff8141ffd8>] ? netif_receive_skb+0x58/0x60
Nov 4 22:21:04 post kernel: [<ffffffff814200e0>] ? napi_skb_finish+0x50/0x70
Nov 4 22:21:04 post kernel: [<ffffffff81422419>] ? napi_gro_receive+0x39/0x50
Nov 4 22:21:04 post kernel: [<ffffffffa0169121>] ? tg3_poll_work+0x6b1/0xdf0 [tg3]
Nov 4 22:21:04 post kernel: [<ffffffffa02358b6>] ? destroy_conntrack+0xd6/0x150 [nf_conntrack]
Nov 4 22:21:04 post kernel: [<ffffffffa01698c4>] ? tg3_poll+0x64/0x210 [tg3]
Nov 4 22:21:04 post kernel: [<ffffffff81422533>] ? net_rx_action+0x103/0x2f0
Nov 4 22:21:04 post kernel: [<ffffffff8106f6e1>] ? __do_softirq+0xc1/0x1d0
Nov 4 22:21:04 post kernel: [<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
Nov 4 22:21:04 post kernel: [<ffffffff8100df05>] ? do_softirq+0x65/0xa0
Nov 4 22:21:04 post kernel: [<ffffffff8106f4c5>] ? irq_exit+0x85/0x90
Nov 4 22:21:04 post kernel: [<ffffffff814e3195>] ? do_IRQ+0x75/0xf0
Nov 4 22:21:04 post kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
Nov 4 22:21:04 post kernel: <EOI> [<ffffffff81014197>] ? mwait_idle+0x77/0xd0
Nov 4 22:21:04 post kernel: [<ffffffff814e09ea>] ? atomic_notifier_call_chain+0x1a/0x20
Nov 4 22:21:04 post kernel: [<ffffffff81009e86>] ? cpu_idle+0xb6/0x110
Nov 4 22:21:04 post kernel: [<ffffffff814d45da>] ? start_secondary+0x202/0x245

The server doesn't hang on every such message, but after a while it does. I will take a screenshot of the last message the next time this happens.
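Since the failures are reported against httpd, the swapper and other processes, it can help to count them per process before the next hang. The following is a rough Python sketch of such a tally. The function name `tally_failures` and the regular expression are illustrative assumptions based only on the message format shown in this report.

```python
import re
from collections import Counter

# Matches the first line of each warning as it appears in the syslog output above.
FAILURE_RE = re.compile(r"kernel: (\S+): page allocation failure\. order:(\d+)")


def tally_failures(lines):
    """Count page allocation failures per (process name, order)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Sample input copied from the two traces in this report.
sample = [
    "Aug 8 12:08:11 post kernel: httpd: page allocation failure. order:1, mode:0x20",
    "Aug 8 12:08:11 post kernel: Call Trace:",
    "Nov 4 22:21:04 post kernel: swapper: page allocation failure. order:1, mode:0x20",
]

for (comm, order), count in tally_failures(sample).most_common():
    print(f"{comm} order:{order} -> {count}")
```

On a live system the same function could be fed an open log file instead of the sample list, for example /var/log/messages if that is where syslog writes kernel messages on the machine in question (an assumption about the local setup).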
OK, I finally solved this myself. I had to upgrade the firmware of the RAID controller (04:03.0 RAID bus controller: Compaq Computer Corporation Smart Array 64xx (rev 01)). Now the system is stable. I'm not sure whether this is a bug in the controller or in the kernel, so I don't know how to close this bug report correctly. I'm leaving that for others to do.
Since all you ever got from the kernel was warnings that a network buffer could not be allocated - which results in a lost packet, that can be retransmitted later - I suspect that you simply had a correlation going on. Lost network packets happen when system load is very high. Having the RAID controller hang the system also happens when system load is very high. If there was another bug in the kernel somewhere, you would have seen an error message to that effect. Closing the bug, since it appears to have been a hardware issue.
I see this happening on RHEL 6.5 too, running as a VM under ESXi 5.1. See http://karlsbakk.net/tmp/kernel-errors.gz for the errors. roy