Description of problem:

Version-Release number of selected component (if applicable):
kernel 2.6.18-269.el5

How reproducible:
Random

Steps to Reproduce:
Run netperf on vlan over bonding for 3 days.

Actual results:
[root@hp-dl580g7-01 ~]# NMI Watchdog detected LOCKUP on CPU 14
CPU 14
Modules linked in: bonding autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev be2net sr_mod cdrom sg tpm_tis tpm i7core_edac tpm_bios ixgbe edac_mc serio_raw pcspkr 8021q netxen_nic hpilo dca dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc scsi_transport_fc ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 6324, comm: hald Not tainted 2.6.18-269.el5 #1
RIP: 0010:[<ffffffff80159086>] [<ffffffff80159086>] list_del+0x8/0x6b
RSP: 0018:ffff81043a717ab8 EFLAGS: 00000096
RAX: 0000000000000008 RBX: 0000000000000011 RCX: ffff810107e4e3c0
RDX: 0000000000000092 RSI: ffff810238ddd240 RDI: ffff81023f9f5840
RBP: ffff81023f9f5840 R08: ffff81043fd04600 R09: ffff810107e6e000
R10: ffff81043a717be8 R11: 0000000000000048 R12: ffff81043fd04600
R13: ffff810107e4e3c0 R14: 000000000000000a R15: ffff810107e54240
FS: 00002af1e9654d60(0000) GS:ffff810107f1e1c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b2b30bb6000 CR3: 000000043973f000 CR4: 00000000000006a0
Process hald (pid: 6324, threadinfo ffff81043a716000, task ffff81043fc89040)
Stack: 0000000000000092 ffffffff8005bd87 000004d03a717bc8 ffff810107e54240
 0000000000000246 00000000000004d0 0000000000000080 ffff81023f463340
 0000000000000000 ffffffff800de770 ffff810222acc9c0 00000000000004d0
Call Trace:
 [<ffffffff8005bd87>] cache_alloc_refill+0xf3/0x188
 [<ffffffff800de770>] __kmalloc+0x95/0x9f
 [<ffffffff8002decd>] __alloc_skb+0x5c/0x12e
 [<ffffffff8022f30e>] sock_alloc_send_pskb+0x7d/0x282
 [<ffffffff8012e677>] avc_has_perm+0x46/0x58
 [<ffffffff8004a20f>] unix_stream_sendmsg+0x15f/0x35b
 [<ffffffff80037b0a>] do_sock_write+0xc6/0x102
 [<ffffffff8022e689>] sock_writev+0xb7/0xd1
 [<ffffffff8012e677>] avc_has_perm+0x46/0x58
 [<ffffffff800a2e4e>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8022cb9a>] sock_aio_read+0x4f/0x5e
 [<ffffffff8000cfdf>] do_sync_read+0xc7/0x104
 [<ffffffff800e3358>] do_readv_writev+0x172/0x291
 [<ffffffff800b9cb1>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff800e3501>] sys_writev+0x45/0x93
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 48 89 fe 48 8b 11 48 39 fa 74 1a 48 c7 c7 d4 77 2c 80 31 c0
Kernel panic - not syncing: nmi watchdog

Expected results:

Additional info:
There is a similar bug in RHEL4 https://bugzilla.redhat.com/show_bug.cgi?id=460935
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Sorry about the delay here. Can anyone reproduce this and get a crash dump, so I can see what all the CPUs are doing and who holds the spinlock that is being taken with spin_lock_irq()?

Thanks, Larry
Sorry for the late update as well, but I believe this issue is due to a known cache_alloc_refill() infinite-loop condition, which usually results from slab corruption that struck elsewhere during code execution. There is nothing wrong with cache_alloc_refill() itself; unfortunately, the real offender is hidden among all the other slab users. As a matter of fact, upstream added the following check to catch that exceptional condition and break the loop (crashing the box) when it strikes:

----
commit 714b8171af9c930a59a0da8f6fe50518e70ab035
Author: Pekka Enberg <penberg.fi>
Date:   Sun May 6 14:49:03 2007 -0700

    slab: ensure cache_alloc_refill terminates

    If slab->inuse is corrupted, cache_alloc_refill can enter an infinite
    loop as detailed by Michael Richardson in the following post:
    <http://lkml.org/lkml/2007/2/16/292>. This adds a BUG_ON to catch
    those cases.

    Cc: Michael Richardson <mcr>
    Acked-by: Christoph Lameter <clameter>
    Signed-off-by: Pekka Enberg <penberg.fi>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

diff --git a/mm/slab.c b/mm/slab.c
index 8b71a9c..21b2aef 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2990,6 +2990,14 @@ retry:
 		slabp = list_entry(entry, struct slab, list);
 		check_slabp(cachep, slabp);
 		check_spinlock_acquired(cachep);
+
+		/*
+		 * The slab was either on partial or free list so
+		 * there must be at least one object available for
+		 * allocation.
+		 */
+		BUG_ON(slabp->inuse < 0 || slabp->inuse >= cachep->num);
+
 		while (slabp->inuse < cachep->num && batchcount--) {
 			STATS_INC_ALLOCED(cachep);
 			STATS_INC_ACTIVE(cachep);
----

If this sort of condition (hang/crash) is observed quite often on the system, we might want to grab a vmcore while running the -debug kernel, as it has VM / SLAB instrumentation that helps identify who is causing the slab corruption that leads to this undesirable cache_alloc_refill() infinite loop.
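To illustrate why a corrupted inuse counter keeps the refill loop from terminating, here is a minimal user-space sketch. It is not the kernel code: struct toy_slab, toy_refill(), the free_left field and the max_passes cap are all hypothetical names invented for this model. It assumes a single slab that sits on the partial list and is picked again on every pass, which is roughly what happens when inuse claims the slab is full while the free list still says objects are available.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct slab. */
struct toy_slab {
    int inuse;      /* objects currently handed out */
    int free_left;  /* objects the free list still claims to hold */
};

enum refill_result { REFILL_OK, REFILL_STUCK, REFILL_CORRUPT };

/*
 * Toy model of the refill loop for one slab on the partial list.
 * num        - objects per slab
 * batchcount - objects the caller wants to pull
 * guard      - apply a check modelled on the upstream BUG_ON
 * max_passes - artificial cap so the demo terminates; the real loop
 *              has no such cap and would spin until the NMI watchdog fires
 */
enum refill_result toy_refill(struct toy_slab *slab, int num, int batchcount,
                              bool guard, int max_passes)
{
    assert(num > 0 && batchcount > 0 && max_passes > 0);

    int passes = 0;
    while (batchcount > 0) {
        if (passes++ >= max_passes)
            return REFILL_STUCK;        /* no progress: would loop forever */

        if (guard && (slab->inuse < 0 || slab->inuse >= num))
            return REFILL_CORRUPT;      /* the kernel would BUG() here */

        /* Pull objects; skipped entirely if inuse already says "full". */
        while (slab->inuse < num && batchcount > 0) {
            slab->inuse++;
            slab->free_left--;
            batchcount--;
        }

        if (slab->free_left <= 0)
            break;                      /* slab would move to the full list */
        /* Otherwise the slab goes back on the partial list and is retried. */
    }
    return REFILL_OK;
}
```

With a healthy slab (inuse = 2, free_left = 6, num = 8) a request for 4 objects completes in one pass. With inuse corrupted to 8 while free_left is still 3, the unguarded loop makes no progress on any pass, and the guarded variant reports the corruption immediately instead of spinning.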
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the last planned RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX. To request that Red Hat re-consider this request, please re-open the bugzilla via appropriate support channels and provide additional business and/or technical details about its importance to you.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).