Bug 729229
Summary: | kernel crash under heavier load | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Vilius Šumskas <vilius> |
Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | | |
Version: | 6.1 | CC: | bernhard.furtmueller, kzhang, michael.hagmann, riel, roy, watanabe.yu |
Target Milestone: | rc | | |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2012-02-13 17:50:46 UTC | Type: | --- |
Description
Vilius Šumskas
2011-08-09 06:41:56 UTC
Hi, on which kernel version did you find this issue? Thanks!

As I said in the original report, this is under kernel-2.6.32-131.6.1.el6.x86_64. But I can also confirm this issue in at least two previous kernel RPM packages and in the latest kernel-2.6.32-131.12.1.el6.x86_64 too.

Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

I think this is related to Bug #713546. https://bugzilla.redhat.com/show_bug.cgi?id=713546

The warning in this bug report is merely that - a warning from the kernel stating that it could not allocate an order 1 (2 contiguous pages) buffer for a network packet. This should not cause either the system or httpd to hang or crash.

Vilius, does the system actually hang or crash? If so, what is the kernel message (if any) showing the hang or crash?

If you have trouble identifying the information that will help engineering fix the issue, please contact Red Hat support so things can get properly analyzed before they get escalated to engineering.

Basically when this starts, I see more and more such messages. Sometimes messages are sent "from" httpd, sometimes from other software on the server, or for example the swapper:

Nov 4 22:21:04 post kernel: swapper: page allocation failure. order:1, mode:0x20
Nov 4 22:21:04 post kernel: Pid: 0, comm: swapper Not tainted 2.6.32-131.17.1.el6.x86_64 #1
Nov 4 22:21:04 post kernel: Call Trace:
Nov 4 22:21:04 post kernel: <IRQ> [<ffffffff8112016e>] ? __alloc_pages_nodemask+0x71e/0x8b0
Nov 4 22:21:04 post kernel: [<ffffffff81159a52>] ? kmem_getpages+0x62/0x170
Nov 4 22:21:04 post kernel: [<ffffffff8115a66a>] ? fallback_alloc+0x1ba/0x270
Nov 4 22:21:04 post kernel: [<ffffffff8115a0bf>] ? cache_grow+0x2cf/0x320
Nov 4 22:21:04 post kernel: [<ffffffff8115a3e9>] ? ____cache_alloc_node+0x99/0x160
Nov 4 22:21:04 post kernel: [<ffffffff8115b1ab>] ? kmem_cache_alloc+0x11b/0x190
Nov 4 22:21:04 post kernel: [<ffffffff81411ba8>] ? sk_prot_alloc+0x48/0x1a0
Nov 4 22:21:04 post kernel: [<ffffffff81411e12>] ? sk_clone+0x22/0x2c0
Nov 4 22:21:04 post kernel: [<ffffffff8145caf6>] ? inet_csk_clone+0x16/0xd0
Nov 4 22:21:04 post kernel: [<ffffffff81475be3>] ? tcp_create_openreq_child+0x23/0x450
Nov 4 22:21:04 post kernel: [<ffffffff814735cd>] ? tcp_v4_syn_recv_sock+0x4d/0x2a0
Nov 4 22:21:04 post kernel: [<ffffffff814759a1>] ? tcp_check_req+0x201/0x420
Nov 4 22:21:04 post kernel: [<ffffffff81472feb>] ? tcp_v4_do_rcv+0x35b/0x430
Nov 4 22:21:04 post kernel: [<ffffffffa0261557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
Nov 4 22:21:04 post kernel: [<ffffffff81474760>] ? tcp_v4_rcv+0x4e0/0x860
Nov 4 22:21:04 post kernel: [<ffffffff814523a0>] ? ip_local_deliver_finish+0x0/0x2d0
Nov 4 22:21:04 post kernel: [<ffffffff81447804>] ? nf_hook_slow+0x74/0x100
Nov 4 22:21:04 post kernel: [<ffffffff814523a0>] ? ip_local_deliver_finish+0x0/0x2d0
Nov 4 22:21:04 post kernel: [<ffffffff8145247d>] ? ip_local_deliver_finish+0xdd/0x2d0
Nov 4 22:21:04 post kernel: [<ffffffff81452708>] ? ip_local_deliver+0x98/0xa0
Nov 4 22:21:04 post kernel: [<ffffffff81451bcd>] ? ip_rcv_finish+0x12d/0x440
Nov 4 22:21:04 post kernel: [<ffffffff81452155>] ? ip_rcv+0x275/0x350
Nov 4 22:21:04 post kernel: [<ffffffff8141dccb>] ? __netif_receive_skb+0x39b/0x6b0
Nov 4 22:21:04 post kernel: [<ffffffff8141ffd8>] ? netif_receive_skb+0x58/0x60
Nov 4 22:21:04 post kernel: [<ffffffff814200e0>] ? napi_skb_finish+0x50/0x70
Nov 4 22:21:04 post kernel: [<ffffffff81422419>] ? napi_gro_receive+0x39/0x50
Nov 4 22:21:04 post kernel: [<ffffffffa0169121>] ? tg3_poll_work+0x6b1/0xdf0 [tg3]
Nov 4 22:21:04 post kernel: [<ffffffffa02358b6>] ? destroy_conntrack+0xd6/0x150 [nf_conntrack]
Nov 4 22:21:04 post kernel: [<ffffffffa01698c4>] ? tg3_poll+0x64/0x210 [tg3]
Nov 4 22:21:04 post kernel: [<ffffffff81422533>] ? net_rx_action+0x103/0x2f0
Nov 4 22:21:04 post kernel: [<ffffffff8106f6e1>] ? __do_softirq+0xc1/0x1d0
Nov 4 22:21:04 post kernel: [<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
Nov 4 22:21:04 post kernel: [<ffffffff8100df05>] ? do_softirq+0x65/0xa0
Nov 4 22:21:04 post kernel: [<ffffffff8106f4c5>] ? irq_exit+0x85/0x90
Nov 4 22:21:04 post kernel: [<ffffffff814e3195>] ? do_IRQ+0x75/0xf0
Nov 4 22:21:04 post kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
Nov 4 22:21:04 post kernel: <EOI> [<ffffffff81014197>] ? mwait_idle+0x77/0xd0
Nov 4 22:21:04 post kernel: [<ffffffff814e09ea>] ? atomic_notifier_call_chain+0x1a/0x20
Nov 4 22:21:04 post kernel: [<ffffffff81009e86>] ? cpu_idle+0xb6/0x110
Nov 4 22:21:04 post kernel: [<ffffffff814d45da>] ? start_secondary+0x202/0x245

The server doesn't hang on every such message, but after a while it does. I will make a screenshot of the last message next time this happens.

OK, I finally solved this myself. I had to upgrade the firmware of the 04:03.0 RAID bus controller: Compaq Computer Corporation Smart Array 64xx (rev 01). Now the system is stable. Not sure if this is a bug in the controller or in the kernel, so I don't know how to close this bug report properly. Leaving this for others to do.

Since all you ever got from the kernel was warnings that a network buffer could not be allocated - which results in a lost packet that can be retransmitted later - I suspect that you simply had a correlation going on. Lost network packets happen when system load is very high. Having the RAID controller hang the system also happens when system load is very high.

If there was another bug in the kernel somewhere, you would have seen an error message to that effect.

Closing the bug, since it appears to have been a hardware issue.

I see this happening on RHEL6.5 too, running as a VM under ESXi 5.1. See http://karlsbakk.net/tmp/kernel-errors.gz for the errors.

roy
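
For anyone comparing their own logs against the warnings in this report, the "order:N" field can be read as a request for 2^N physically contiguous pages, which is why order 1 was described above as 2 contiguous pages. The following is only a rough sketch for tallying such warnings from a syslog file. The function names are made up for illustration, the regular expression is an assumption based on the single message format quoted in this report, and the 4 KiB page size is assumed (the usual base page size on x86_64).

```python
import re
from collections import Counter

PAGE_SIZE = 4096  # assumption: 4 KiB base pages, as is usual on x86_64

# Assumed format, taken from the one message quoted in this report:
#   "... kernel: swapper: page allocation failure. order:1, mode:0x20"
FAILURE_RE = re.compile(
    r"kernel: (?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def parse_failures(lines):
    """Yield (comm, order, mode) for every page allocation failure line."""
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            yield (
                match.group("comm"),
                int(match.group("order")),
                int(match.group("mode"), 16),
            )


def contiguous_bytes(order, page_size=PAGE_SIZE):
    """An order-N request asks for 2**N contiguous pages."""
    return (1 << order) * page_size


def summarize(lines):
    """Count failures per (process name, order) pair."""
    return Counter((comm, order) for comm, order, _mode in parse_failures(lines))


if __name__ == "__main__":
    # Two lines copied from the trace in this report; only the first is a failure line.
    sample = [
        "Nov 4 22:21:04 post kernel: swapper: page allocation failure. order:1, mode:0x20",
        "Nov 4 22:21:04 post kernel: Call Trace:",
    ]
    for (comm, order), count in summarize(sample).items():
        print(f"{comm}: {count} failure(s), order {order} "
              f"= {contiguous_bytes(order)} contiguous bytes")
```

To run it over a real log, pass an open file object (for example `open("/var/log/messages")`) to `summarize` instead of the sample list. A growing count would only confirm that the warnings are becoming more frequent; as noted in the comments above, it would not by itself explain a hang.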