Bug 508014 - PANIC: 'Kernel panic - not syncing: nmi watchdog' due to deadlock in do_cciss_intr()
Keywords:
Status: CLOSED DUPLICATE of bug 509816
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.7
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 4.9
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-06-25 07:03 UTC by Lachlan McIlroy
Modified: 2015-04-12 23:14 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-07-09 01:40:42 UTC
Target Upstream Version:
Embargoed:



Description Lachlan McIlroy 2009-06-25 07:03:01 UTC
Description of problem:

cciss: controller cciss0 failed, stopping.
cciss0: controller not responding.
NMI Watchdog detected LOCKUP, CPU=6, registers:
CPU 6 
Modules linked in: vxodm(U) parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core sunrpc iptable_filter ip_tables ib_srp ib_sdp ib_ipoib md5 ipv6 rdma_ucm rdma_cm iw_cm ib_addr ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core ide_dump scsi_dump diskdump zlib_deflate vfat fat vxportal(U) fdd(U) vxfs(U) dmphpalua(U) dmpaaa(U) vxspec(U) vxio(U) vxdmp(U) button battery ac k8_edac edac_mc netxen_nic e1000 bnx2 bonding(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2400 cciss qla2xxx scsi_transport_fc usb_storage uhci_hcd ohci_hcd ehci_hcd sd_mod scsi_mod
Pid: 0, comm: swapper Tainted: PF     2.6.9-78.0.1.ELsmp
RIP: 0010:[<ffffffff80319600>] <ffffffff80319600>{.text.lock.spinlock+14}
RSP: 0018:000001007fd0bef8  EFLAGS: 00000082
RAX: 0000000000000026 RBX: 0000010871f238dc RCX: 0000000000000046
RDX: 000000000010a5e9 RSI: 0000000000000046 RDI: 0000010871f238dc
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000010871f20000
R10: 000001046f73e000 R11: 0000000000000000 R12: 0000000000000062
R13: 0000010c71f9fe98 R14: 0000000000000000 R15: 0000000000000002
FS:  0000002a95576b00(0000) GS:ffffffff8050d580(0000) knlGS:00000000de9f5ba0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000ad8984 CR3: 0000000c86174000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 0000010c71f9e000, task 00000104800717f0)
Stack: 0000000000000046 0000000000000002 0000010871f20000 ffffffffa00939c6 
       0000000000000062 00000104702c9a40 0000000000000001 0000000000000062 
       0000010c71f9fe98 0000010c71f9fe98 
Call Trace:<IRQ> <ffffffffa00939c6>{:cciss:do_cciss_intr+210} <ffffffff80112ff2>{handle_IRQ_event+41} 
       <ffffffff8011326c>{do_IRQ+197} <ffffffff801108bf>{ret_from_intr+0} 
        <EOI> <ffffffff8010e789>{default_idle+0} <ffffffff8010e7a9>{default_idle+32} 
       <ffffffff8010e81c>{cpu_idle+26} 

Code: 83 3b 00 7e f9 e9 ce fc ff ff e8 a9 84 ed ff e9 4a fd ff ff 
Kernel panic - not syncing: nmi watchdog



cciss: controller cciss0 failed, stopping.
cciss0: controller not responding.

From the above messages and the stack trace, we can see that do_cciss_intr() has called fail_all_cmds(), which is now trying to acquire a spinlock.

static irqreturn_t do_cciss_intr(int irq, void *dev_id, struct pt_regs *regs)
{
	ctlr_info_t *h = dev_id;
	CommandList_struct *c;
	unsigned long flags;
	__u32 a, a1, a2;
	int j;
	int start_queue = h->next_to_run;


	if (interrupt_not_for_us(h))
		return IRQ_NONE;
	/*
	 * If there are completed commands in the completion queue,
	 * we had better do something about it.
	 */
	spin_lock_irqsave(CCISS_LOCK(h->ctlr), flags);
	while (interrupt_pending(h)) {
		while((a = get_next_completion(h)) != FIFO_EMPTY) {
			a1 = a;
			if ((a & 0x04)) {
				a2 = (a >> 3);
				if (a2 >= h->max_nr_cmds) {
					printk(KERN_WARNING "cciss: controller cciss%d failed, stopping.\n", h->ctlr);
					fail_all_cmds(h->ctlr);
					return IRQ_HANDLED;
				}
...

static void fail_all_cmds(unsigned long ctlr)
{
	/* If we get here, the board is apparently dead. */
	ctlr_info_t *h = hba[ctlr];
	CommandList_struct *c;
	unsigned long flags;

	printk(KERN_WARNING "cciss%d: controller not responding.\n", h->ctlr);
	h->alive = 0;	/* the controller apparently died... */ 

	spin_lock_irqsave(CCISS_LOCK(ctlr), flags);
...

#define CCISS_LOCK(i)	(&hba[i]->lock)

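The spinlock that fail_all_cmds() is trying to acquire is already held by the same execution context: do_cciss_intr() took it with spin_lock_irqsave(CCISS_LOCK(h->ctlr), flags) before calling fail_all_cmds(), and CCISS_LOCK() resolves to the same hba[i]->lock in both functions. Kernel spinlocks are not recursive, so the CPU spins against itself with local interrupts disabled, and this is what caused the NMI watchdog to report a lockup.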
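
The self-deadlock can be modelled in user space with a few lines of C. This is only a sketch under stated assumptions: every name in it (ctlr_lock, sim_trylock(), sim_unlock(), fail_all_cmds_sim(), intr_buggy(), intr_unlock_first(), SIM_WOULD_DEADLOCK) is hypothetical and not part of the cciss driver, and a single test-and-set attempt stands in for spin_lock_irqsave() so that the model reports the problem instead of hanging. The second handler shows one common way to avoid the nested acquisition (dropping the lock before calling the helper); it is not necessarily the fix proposed in bug 509816.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical return code: a real spin_lock() would spin forever here. */
#define SIM_WOULD_DEADLOCK (-1)

/* Hypothetical stand-in for the per-controller lock, hba[i]->lock. */
static atomic_flag ctlr_lock = ATOMIC_FLAG_INIT;

/* One attempt at what a spinlock retries in a loop until it succeeds. */
static bool sim_trylock(void)
{
	return !atomic_flag_test_and_set(&ctlr_lock);
}

static void sim_unlock(void)
{
	atomic_flag_clear(&ctlr_lock);
}

/* Models fail_all_cmds(): it acquires the controller lock itself. */
static int fail_all_cmds_sim(void)
{
	if (!sim_trylock())
		return SIM_WOULD_DEADLOCK;	/* lock is already held */
	/* ... outstanding commands would be failed here ... */
	sim_unlock();
	return 0;
}

/* Models the reported path: the helper is called with the lock held. */
static int intr_buggy(void)
{
	bool ok = sim_trylock();
	int rc;

	assert(ok);			/* lock was free on entry */
	rc = fail_all_cmds_sim();	/* nested acquisition fails */
	sim_unlock();
	return rc;
}

/* One possible pattern: release the lock before calling the helper. */
static int intr_unlock_first(void)
{
	bool ok = sim_trylock();

	assert(ok);
	/* ... controller found dead at this point ... */
	sim_unlock();
	return fail_all_cmds_sim();	/* helper can now take the lock */
}
```

Calling intr_buggy() from a small main() returns SIM_WOULD_DEADLOCK, which corresponds to the point where the real handler spins in .text.lock.spinlock, while intr_unlock_first() returns 0.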

Version-Release number of selected component (if applicable):
2.6.9-78.0.1.ELsmp

How reproducible:
Unknown, but the customer is seeing this and similar bugs on 6 systems.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Prarit Bhargava 2009-07-06 12:40:36 UTC
<sheepish>

I did something really stupid here.  I didn't keep the ENTIRE boot log.  I hope that isn't an issue for you Tomas :/.  The deadlock seems pretty obvious ...

</sheepish>

P.

Comment 2 Lachlan McIlroy 2009-07-09 01:40:42 UTC
Marking duplicate of 509816 since the fix has been proposed there.

*** This bug has been marked as a duplicate of bug 509816 ***

