Bug 718628

Summary: NMI Watchdog detected LOCKUP on CPU 14
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.7
Hardware: Unspecified
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Larry Woodman <lwoodman>
QA Contact: Zhang Kexin <kzhang>
Reporter: Liang Zheng <lzheng>
CC: aquini, ccui, kzhang, lzheng
Flags: pm-rhel: needinfo? (lzheng)
Doc Type: Bug Fix
Last Closed: 2014-06-02 13:21:15 UTC

Description Liang Zheng 2011-07-04 04:25:28 UTC
Description of problem:


Version-Release number of selected component (if applicable):
kernel 2.6.18-269.el5

How reproducible:
Random

Steps to Reproduce:
Run netperf on vlan over bonding for 3 days.
  
Actual results:
[root@hp-dl580g7-01 ~]# NMI Watchdog detected LOCKUP on CPU 14
CPU 14 
Modules linked in: bonding autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev be2net sr_mod cdrom sg tpm_tis tpm i7core_edac tpm_bios ixgbe edac_mc serio_raw pcspkr 8021q netxen_nic hpilo dca dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc scsi_transport_fc ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 6324, comm: hald Not tainted 2.6.18-269.el5 #1
RIP: 0010:[<ffffffff80159086>]  [<ffffffff80159086>] list_del+0x8/0x6b
RSP: 0018:ffff81043a717ab8  EFLAGS: 00000096
RAX: 0000000000000008 RBX: 0000000000000011 RCX: ffff810107e4e3c0
RDX: 0000000000000092 RSI: ffff810238ddd240 RDI: ffff81023f9f5840
RBP: ffff81023f9f5840 R08: ffff81043fd04600 R09: ffff810107e6e000
R10: ffff81043a717be8 R11: 0000000000000048 R12: ffff81043fd04600
R13: ffff810107e4e3c0 R14: 000000000000000a R15: ffff810107e54240
FS:  00002af1e9654d60(0000) GS:ffff810107f1e1c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b2b30bb6000 CR3: 000000043973f000 CR4: 00000000000006a0
Process hald (pid: 6324, threadinfo ffff81043a716000, task ffff81043fc89040)
Stack:  0000000000000092 ffffffff8005bd87 000004d03a717bc8 ffff810107e54240
 0000000000000246 00000000000004d0 0000000000000080 ffff81023f463340
 0000000000000000 ffffffff800de770 ffff810222acc9c0 00000000000004d0
Call Trace:
 [<ffffffff8005bd87>] cache_alloc_refill+0xf3/0x188
 [<ffffffff800de770>] __kmalloc+0x95/0x9f
 [<ffffffff8002decd>] __alloc_skb+0x5c/0x12e
 [<ffffffff8022f30e>] sock_alloc_send_pskb+0x7d/0x282
 [<ffffffff8012e677>] avc_has_perm+0x46/0x58
 [<ffffffff8004a20f>] unix_stream_sendmsg+0x15f/0x35b
 [<ffffffff80037b0a>] do_sock_write+0xc6/0x102
 [<ffffffff8022e689>] sock_writev+0xb7/0xd1
 [<ffffffff8012e677>] avc_has_perm+0x46/0x58
 [<ffffffff800a2e4e>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8022cb9a>] sock_aio_read+0x4f/0x5e
 [<ffffffff8000cfdf>] do_sync_read+0xc7/0x104
 [<ffffffff800e3358>] do_readv_writev+0x172/0x291
 [<ffffffff800b9cb1>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff800e3501>] sys_writev+0x45/0x93
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 48 89 fe 48 8b 11 48 39 fa 74 1a 48 c7 c7 d4 77 2c 80 31 c0 
Kernel panic - not syncing: nmi watchdog


Expected results:


Additional info:

Comment 1 Liang Zheng 2011-07-04 04:27:04 UTC
There is a similar bug in RHEL4
https://bugzilla.redhat.com/show_bug.cgi?id=460935

Comment 2 RHEL Program Management 2011-08-12 01:10:58 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Larry Woodman 2012-01-18 16:52:20 UTC
Sorry about the delay here. Can anyone reproduce this and get a crash dump, so I can see what all the CPUs are doing and who holds the spinlock that is being taken with spin_lock_irq()?

Thanks, Larry

Comment 6 Rafael Aquini 2012-10-26 18:33:29 UTC
Sorry for the late update as well, but I believe this issue is due to a known cache_alloc_refill() infinite-loop condition that usually results from slab corruption introduced elsewhere in the code. The cache_alloc_refill() code itself is not at fault; unfortunately, the real offender is hidden somewhere among all the other slab users.

In fact, upstream added the following change to catch that exceptional condition and break the loop (crashing the box) when it strikes:

----
commit 714b8171af9c930a59a0da8f6fe50518e70ab035
Author: Pekka Enberg <penberg.fi>
Date:   Sun May 6 14:49:03 2007 -0700

    slab: ensure cache_alloc_refill terminates
    
    If slab->inuse is corrupted, cache_alloc_refill can enter an infinite
    loop as detailed by Michael Richardson in the following post:
    <http://lkml.org/lkml/2007/2/16/292>. This adds a BUG_ON to catch
    those cases.
    
    Cc: Michael Richardson <mcr>
    Acked-by: Christoph Lameter <clameter>
    Signed-off-by: Pekka Enberg <penberg.fi>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

diff --git a/mm/slab.c b/mm/slab.c
index 8b71a9c..21b2aef 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2990,6 +2990,14 @@ retry:
                slabp = list_entry(entry, struct slab, list);
                check_slabp(cachep, slabp);
                check_spinlock_acquired(cachep);
+
+               /*
+                * The slab was either on partial or free list so
+                * there must be at least one object available for
+                * allocation.
+                */
+               BUG_ON(slabp->inuse < 0 || slabp->inuse >= cachep->num);
+
                while (slabp->inuse < cachep->num && batchcount--) {
                        STATS_INC_ALLOCED(cachep);
                        STATS_INC_ACTIVE(cachep);
----
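
For illustration only, here is a minimal, self-contained sketch of why a corrupted inuse counter turns this refill path into a busy loop under the node's list lock, and how a check like the BUG_ON above turns the silent hang into an immediate, diagnosable crash. This is purely hypothetical code: struct fake_slab and refill_from() are invented names, and this is not the RHEL5 mm/slab.c source.

----
/* Illustrative sketch -- NOT the RHEL5 mm/slab.c code.
 * "inuse" counts objects already handed out from a slab and "num" is the
 * slab's capacity.  If corruption pushes inuse to (or past) num while the
 * slab is still linked on the partial list, the inner while loop never
 * takes an object, the slab never moves off the list, and the outer loop
 * re-selects it forever -- with the node list lock held and IRQs off,
 * which is what the NMI watchdog reports as a LOCKUP. */
#include <assert.h>

struct fake_slab {
	int inuse;       /* objects handed out from this slab   */
	int num;         /* total objects the slab can hold     */
	int on_partial;  /* still linked on the partial list?   */
};

/* Returns how many objects were grabbed; 0 means no forward progress. */
static int refill_from(struct fake_slab *s, int batchcount)
{
	/* The upstream BUG_ON plays the role of this assert: a corrupted
	 * counter crashes loudly instead of hanging the CPU. */
	assert(s->inuse >= 0 && s->inuse < s->num);

	int grabbed = 0;
	while (s->inuse < s->num && batchcount--) {
		s->inuse++;          /* hand one object to the caller */
		grabbed++;
	}
	if (s->inuse == s->num)
		s->on_partial = 0;   /* a full slab leaves the partial list */
	return grabbed;
}

int main(void)
{
	struct fake_slab corrupted = { .inuse = 37, .num = 32, .on_partial = 1 };

	/* Without the assert in refill_from() this would spin forever,
	 * because no progress is ever made and the slab is never unlinked. */
	while (corrupted.on_partial)
		refill_from(&corrupted, 16);

	return 0;
}
----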

If this sort of condition (hang/crash) is being observed frequently on the system, we might want to grab a vmcore while running the -debug kernel, as it has VM / SLAB instrumentation to help us identify who is causing the slab corruption that leads to this undesirable cache_alloc_refill() infinite loop.
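
To make that last point concrete, here is a rough sketch of the redzone/poison idea that slab debug instrumentation relies on. Everything in it is invented for illustration (debug_obj, REDZONE_MAGIC, POISON_BYTE, check_on_free are hypothetical names and values, not the kernel's real CONFIG_DEBUG_SLAB code):

----
/* Rough illustration only -- not the kernel's slab debug implementation.
 * The idea: guard words (redzones) around each object, plus a poison
 * pattern written into freed objects, let the allocator detect an
 * out-of-bounds or use-after-free writer at the offending call site,
 * instead of the damage surfacing much later as a refill hang. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define REDZONE_MAGIC 0xdeadbeefcafef00dULL  /* invented guard value   */
#define POISON_BYTE   0x6b                   /* invented poison filler */

struct debug_obj {
	uint64_t before;            /* guard word in front of the payload */
	unsigned char payload[64];  /* the object handed to callers       */
	uint64_t after;             /* guard word behind the payload      */
};

/* Check performed when an object is freed back to its cache. */
static int check_on_free(struct debug_obj *o)
{
	if (o->before != REDZONE_MAGIC || o->after != REDZONE_MAGIC) {
		fprintf(stderr, "slab corruption: redzone clobbered at %p\n",
			(void *)o);
		return -1;
	}
	memset(o->payload, POISON_BYTE, sizeof(o->payload)); /* poison freed data */
	return 0;
}

int main(void)
{
	struct debug_obj o = { REDZONE_MAGIC, {0}, REDZONE_MAGIC };
	unsigned char *p = o.payload;

	/* Deliberate one-byte overflow: lands in 'after' (no padding here,
	 * since 64 is a multiple of 8), so the free-time check catches it. */
	p[sizeof(o.payload)] = 0;

	return check_on_free(&o) ? 1 : 0;
}
----

With checks of this kind running on every allocation and free, the vmcore or console log points at the code path that actually clobbered the slab, rather than at cache_alloc_refill(), which is only the victim.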

Comment 7 RHEL Program Management 2012-10-30 06:12:09 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 10 RHEL Program Management 2014-03-07 12:17:03 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the last planned RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX. To request that Red Hat re-consider this request, please re-open the bugzilla via appropriate support channels and provide additional business and/or technical details about its importance to you.

Comment 11 RHEL Program Management 2014-06-02 13:21:15 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).