Bug 729229

Summary: kernel crash under heavier load
Product: Red Hat Enterprise Linux 6
Reporter: Vilius Šumskas <vilius>
Component: kernel
Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 6.1
CC: bernhard.furtmueller, kzhang, michael.hagmann, riel, roy, watanabe.yu
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2012-02-13 17:50:46 UTC

Description Vilius Šumskas 2011-08-09 06:41:56 UTC
Description of problem:
When our web server comes under heavier load (around 2-5 in top), it usually crashes with the following messages:

Aug  8 12:08:11 post kernel: httpd: page allocation failure. order:1, mode:0x20
Aug  8 12:08:11 post kernel: Pid: 20403, comm: httpd Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Aug  8 12:08:11 post kernel: Call Trace:
Aug  8 12:08:11 post kernel: <IRQ>  [<ffffffff811200be>] ? __alloc_pages_nodemask+0x71e/0x8b0
Aug  8 12:08:11 post kernel: [<ffffffff81276500>] ? percpu_counter_compare+0x10/0x90
Aug  8 12:08:11 post kernel: [<ffffffff81159942>] ? kmem_getpages+0x62/0x170
Aug  8 12:08:11 post kernel: [<ffffffff8115a55a>] ? fallback_alloc+0x1ba/0x270
Aug  8 12:08:11 post kernel: [<ffffffff81159faf>] ? cache_grow+0x2cf/0x320
Aug  8 12:08:11 post kernel: [<ffffffff8115a2d9>] ? ____cache_alloc_node+0x99/0x160
Aug  8 12:08:11 post kernel: [<ffffffff8115b09b>] ? kmem_cache_alloc+0x11b/0x190
Aug  8 12:08:11 post kernel: [<ffffffff81411808>] ? sk_prot_alloc+0x48/0x180
Aug  8 12:08:11 post kernel: [<ffffffff81411a52>] ? sk_clone+0x22/0x2c0
Aug  8 12:08:11 post kernel: [<ffffffff8145c736>] ? inet_csk_clone+0x16/0xd0
Aug  8 12:08:11 post kernel: [<ffffffff81475833>] ? tcp_create_openreq_child+0x23/0x450
Aug  8 12:08:11 post kernel: [<ffffffff8147320d>] ? tcp_v4_syn_recv_sock+0x4d/0x2a0
Aug  8 12:08:11 post kernel: [<ffffffff814755f1>] ? tcp_check_req+0x201/0x420
Aug  8 12:08:11 post kernel: [<ffffffff81472c2b>] ? tcp_v4_do_rcv+0x35b/0x430
Aug  8 12:08:11 post kernel: [<ffffffffa0170557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
Aug  8 12:08:11 post kernel: [<ffffffff814743a3>] ? tcp_v4_rcv+0x4e3/0x870
Aug  8 12:08:11 post kernel: [<ffffffff81451fe0>] ? ip_local_deliver_finish+0x0/0x2d0
Aug  8 12:08:11 post kernel: [<ffffffff81447444>] ? nf_hook_slow+0x74/0x100
Aug  8 12:08:11 post kernel: [<ffffffff81451fe0>] ? ip_local_deliver_finish+0x0/0x2d0
Aug  8 12:08:11 post kernel: [<ffffffff814520bd>] ? ip_local_deliver_finish+0xdd/0x2d0
Aug  8 12:08:11 post kernel: [<ffffffff81452348>] ? ip_local_deliver+0x98/0xa0
Aug  8 12:08:11 post kernel: [<ffffffff8145180d>] ? ip_rcv_finish+0x12d/0x440
Aug  8 12:08:11 post kernel: [<ffffffff81451d95>] ? ip_rcv+0x275/0x350
Aug  8 12:08:11 post kernel: [<ffffffff8141d90b>] ? __netif_receive_skb+0x39b/0x6b0
Aug  8 12:08:11 post kernel: [<ffffffff8141fc18>] ? netif_receive_skb+0x58/0x60
Aug  8 12:08:11 post kernel: [<ffffffff8141fd20>] ? napi_skb_finish+0x50/0x70
Aug  8 12:08:11 post kernel: [<ffffffff81422059>] ? napi_gro_receive+0x39/0x50
Aug  8 12:08:11 post kernel: [<ffffffffa011a121>] ? tg3_poll_work+0x6b1/0xdf0 [tg3]
Aug  8 12:08:11 post kernel: [<ffffffff8141fc18>] ? netif_receive_skb+0x58/0x60
Aug  8 12:08:11 post kernel: [<ffffffff8141fdbd>] ? napi_gro_complete+0x7d/0xd0
Aug  8 12:08:11 post kernel: [<ffffffffa011a8c4>] ? tg3_poll+0x64/0x210 [tg3]
Aug  8 12:08:11 post kernel: [<ffffffff81422173>] ? net_rx_action+0x103/0x2f0
Aug  8 12:08:11 post kernel: [<ffffffff8106f6e1>] ? __do_softirq+0xc1/0x1d0
Aug  8 12:08:11 post kernel: [<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
Aug  8 12:08:11 post kernel: [<ffffffff8100df05>] ? do_softirq+0x65/0xa0
Aug  8 12:08:11 post kernel: [<ffffffff8106f4c5>] ? irq_exit+0x85/0x90
Aug  8 12:08:11 post kernel: [<ffffffff814e2e15>] ? do_IRQ+0x75/0xf0
Aug  8 12:08:11 post kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
Aug  8 12:08:11 post kernel: <EOI>


Version-Release number of selected component (if applicable):
kernel-2.6.32-131.6.1.el6.x86_64 (but it also exists in at least two previous versions)

How reproducible:
From time to time.

Steps to Reproduce:
1. Just leave the server for a while.
  
Actual results:
The httpd process hangs and the kernel crashes.

Expected results:
It should not crash.

Additional info:
The web server serves *some* files from an NFS volume.

Comment 2 Zhang Kexin 2011-08-25 08:48:52 UTC
Hi, on which kernel version did you find this issue? Thanks!

Comment 3 Vilius Šumskas 2011-08-25 09:14:41 UTC
As I said in the original report, this is with kernel-2.6.32-131.6.1.el6.x86_64. I can also confirm the issue with at least two previous kernel RPM packages and with the latest kernel-2.6.32-131.12.1.el6.x86_64.

Comment 4 RHEL Program Management 2011-10-07 15:44:02 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 5 Yu Watanabe 2011-10-11 05:03:00 UTC
I think this is related to Bug #713546.
https://bugzilla.redhat.com/show_bug.cgi?id=713546

Comment 6 Rik van Riel 2011-11-04 19:07:59 UTC
The warning in this bug report is merely that - a warning from the kernel stating that it could not allocate an order 1 (2 contiguous pages) buffer for a network packet.

This should not cause either the system or httpd to hang or crash.

Vilius, does the system actually hang or crash?

If so, what is the kernel message (if any) showing the hang or crash?

If you have trouble identifying the information that will help engineering fix the issue, please contact Red Hat support so things can get properly analyzed before they get escalated to engineering.
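
The "order 1 (2 contiguous pages)" arithmetic mentioned above can be illustrated with a minimal sketch. It parses a failure line in the format shown in this report and computes the size of the request as 2**order pages. The function name `describe_failure` is hypothetical, and the 4 KiB page size is an assumption that holds for base pages on x86_64.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB base pages, as on x86_64

# Matches lines such as:
#   "httpd: page allocation failure. order:1, mode:0x20"
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def describe_failure(line):
    """Return details of a page allocation failure line, or None if no match."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    pages = 1 << order  # an order-n request asks for 2**n contiguous pages
    return {
        "comm": m.group("comm"),
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        "mode": int(m.group("mode"), 16),
    }


line = ("Aug  8 12:08:11 post kernel: httpd: page allocation failure. "
        "order:1, mode:0x20")
print(describe_failure(line))
```

For the line from the original description this reports 2 pages, i.e. 8192 bytes under the 4 KiB assumption.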

Comment 7 Vilius Šumskas 2011-11-04 22:25:11 UTC
Basically when this starts, I see more and more such messages. Sometimes messages are sent "from" httpd, sometimes from other software on the server, or for example the swapper:

Nov  4 22:21:04 post kernel: swapper: page allocation failure. order:1, mode:0x20
Nov  4 22:21:04 post kernel: Pid: 0, comm: swapper Not tainted 2.6.32-131.17.1.el6.x86_64 #1
Nov  4 22:21:04 post kernel: Call Trace:
Nov  4 22:21:04 post kernel: <IRQ>  [<ffffffff8112016e>] ? __alloc_pages_nodemask+0x71e/0x8b0
Nov  4 22:21:04 post kernel: [<ffffffff81159a52>] ? kmem_getpages+0x62/0x170
Nov  4 22:21:04 post kernel: [<ffffffff8115a66a>] ? fallback_alloc+0x1ba/0x270
Nov  4 22:21:04 post kernel: [<ffffffff8115a0bf>] ? cache_grow+0x2cf/0x320
Nov  4 22:21:04 post kernel: [<ffffffff8115a3e9>] ? ____cache_alloc_node+0x99/0x160
Nov  4 22:21:04 post kernel: [<ffffffff8115b1ab>] ? kmem_cache_alloc+0x11b/0x190
Nov  4 22:21:04 post kernel: [<ffffffff81411ba8>] ? sk_prot_alloc+0x48/0x1a0
Nov  4 22:21:04 post kernel: [<ffffffff81411e12>] ? sk_clone+0x22/0x2c0
Nov  4 22:21:04 post kernel: [<ffffffff8145caf6>] ? inet_csk_clone+0x16/0xd0
Nov  4 22:21:04 post kernel: [<ffffffff81475be3>] ? tcp_create_openreq_child+0x23/0x450
Nov  4 22:21:04 post kernel: [<ffffffff814735cd>] ? tcp_v4_syn_recv_sock+0x4d/0x2a0
Nov  4 22:21:04 post kernel: [<ffffffff814759a1>] ? tcp_check_req+0x201/0x420
Nov  4 22:21:04 post kernel: [<ffffffff81472feb>] ? tcp_v4_do_rcv+0x35b/0x430
Nov  4 22:21:04 post kernel: [<ffffffffa0261557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
Nov  4 22:21:04 post kernel: [<ffffffff81474760>] ? tcp_v4_rcv+0x4e0/0x860
Nov  4 22:21:04 post kernel: [<ffffffff814523a0>] ? ip_local_deliver_finish+0x0/0x2d0
Nov  4 22:21:04 post kernel: [<ffffffff81447804>] ? nf_hook_slow+0x74/0x100
Nov  4 22:21:04 post kernel: [<ffffffff814523a0>] ? ip_local_deliver_finish+0x0/0x2d0
Nov  4 22:21:04 post kernel: [<ffffffff8145247d>] ? ip_local_deliver_finish+0xdd/0x2d0
Nov  4 22:21:04 post kernel: [<ffffffff81452708>] ? ip_local_deliver+0x98/0xa0
Nov  4 22:21:04 post kernel: [<ffffffff81451bcd>] ? ip_rcv_finish+0x12d/0x440
Nov  4 22:21:04 post kernel: [<ffffffff81452155>] ? ip_rcv+0x275/0x350
Nov  4 22:21:04 post kernel: [<ffffffff8141dccb>] ? __netif_receive_skb+0x39b/0x6b0
Nov  4 22:21:04 post kernel: [<ffffffff8141ffd8>] ? netif_receive_skb+0x58/0x60
Nov  4 22:21:04 post kernel: [<ffffffff814200e0>] ? napi_skb_finish+0x50/0x70
Nov  4 22:21:04 post kernel: [<ffffffff81422419>] ? napi_gro_receive+0x39/0x50
Nov  4 22:21:04 post kernel: [<ffffffffa0169121>] ? tg3_poll_work+0x6b1/0xdf0 [tg3]
Nov  4 22:21:04 post kernel: [<ffffffffa02358b6>] ? destroy_conntrack+0xd6/0x150 [nf_conntrack]
Nov  4 22:21:04 post kernel: [<ffffffffa01698c4>] ? tg3_poll+0x64/0x210 [tg3]
Nov  4 22:21:04 post kernel: [<ffffffff81422533>] ? net_rx_action+0x103/0x2f0
Nov  4 22:21:04 post kernel: [<ffffffff8106f6e1>] ? __do_softirq+0xc1/0x1d0
Nov  4 22:21:04 post kernel: [<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
Nov  4 22:21:04 post kernel: [<ffffffff8100df05>] ? do_softirq+0x65/0xa0
Nov  4 22:21:04 post kernel: [<ffffffff8106f4c5>] ? irq_exit+0x85/0x90
Nov  4 22:21:04 post kernel: [<ffffffff814e3195>] ? do_IRQ+0x75/0xf0
Nov  4 22:21:04 post kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
Nov  4 22:21:04 post kernel: <EOI>  [<ffffffff81014197>] ? mwait_idle+0x77/0xd0
Nov  4 22:21:04 post kernel: [<ffffffff814e09ea>] ? atomic_notifier_call_chain+0x1a/0x20
Nov  4 22:21:04 post kernel: [<ffffffff81009e86>] ? cpu_idle+0xb6/0x110
Nov  4 22:21:04 post kernel: [<ffffffff814d45da>] ? start_secondary+0x202/0x245


The server doesn't hang on every such message, but after a while it does. I will take a screenshot of the last message the next time this happens.
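
To quantify the observation that these messages come from different processes (httpd, swapper, and others), a rough sketch like the following could tally failures per process name from syslog lines. The function name `count_failures_by_process` is hypothetical, and the pattern assumes the "kernel: <comm>: page allocation failure." format seen in the logs above.

```python
import re
from collections import Counter

# Assumes the syslog format shown in this report:
#   "<date> <host> kernel: <comm>: page allocation failure. order:N, mode:0x.."
FAILURE_LINE_RE = re.compile(r"kernel: (?P<comm>\S+): page allocation failure\.")


def count_failures_by_process(lines):
    """Count page allocation failure messages per reporting process name."""
    counts = Counter()
    for line in lines:
        m = FAILURE_LINE_RE.search(line)
        if m:
            counts[m.group("comm")] += 1
    return counts


sample = [
    "Aug  8 12:08:11 post kernel: httpd: page allocation failure. order:1, mode:0x20",
    "Aug  8 12:08:11 post kernel: Call Trace:",
    "Nov  4 22:21:04 post kernel: swapper: page allocation failure. order:1, mode:0x20",
]

for comm, n in count_failures_by_process(sample).most_common():
    print(comm, n)
```

The same function accepts an open file object, so it could be pointed at the system log (the log path is not stated in this report and would need to be supplied).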

Comment 8 Vilius Šumskas 2012-02-10 13:35:55 UTC
OK, I finally solved this myself. I had to upgrade the firmware of the RAID controller (04:03.0 RAID bus controller: Compaq Computer Corporation Smart Array 64xx (rev 01)). Now the system is stable. I am not sure whether this was a bug in the controller or in the kernel, so I don't know how to close this bug report correctly. Leaving that for others to do.

Comment 9 Rik van Riel 2012-02-13 17:50:46 UTC
Since all you ever got from the kernel was warnings that a network buffer could not be allocated - which results in a lost packet, that can be retransmitted later - I suspect that you simply had a correlation going on.

Lost network packets happen when system load is very high.

Having the RAID controller hang the system also happens when system load is very high.

If there was another bug in the kernel somewhere, you would have seen an error message to that effect.

Closing the bug, since it appears to have been a hardware issue.

Comment 10 Roy Sigurd Karlsbakk 2014-05-27 10:12:26 UTC
I see this happening on RHEL 6.5 too, running as a VM under ESXi 5.1. See http://karlsbakk.net/tmp/kernel-errors.gz for the errors.

roy