509816 – cciss: spinlock deadlock causes NMI on HP systems

Summary: cciss: spinlock deadlock causes NMI on HP systems

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.9
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Tomas Henzl
QA Contact:	Evan McNabb
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	508014
Depends On:
Blocks:	509818 525725

Reported:	2009-07-06 12:36 UTC by Prarit Bhargava
Modified:	2018-10-20 00:57 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	509818
Environment:
Last Closed:	2011-02-16 15:24:41 UTC
Target Upstream Version:
Embargoed:

Attachments
corrects the spinlock use (1.01 KB, patch) 2009-07-07 09:56 UTC, Tomas Henzl

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0263	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update	2011-02-16 15:14:55 UTC

Description Prarit Bhargava 2009-07-06 12:36:11 UTC

Description of problem:

While comparing the boot sequence of a 5.4 boot against that of a RHEL4 boot, the following panic occurred (see details below).  After looking into the problem, it appears to be a spinlock deadlock in the cciss driver code.

Version-Release number of selected component (if applicable): 2.6.9-78.0.1.ELsmp (the problem also seems to be present in the current kernel).


How reproducible: Unknown; very low reproducibility.  Seen once on an HP system in the lab.  On the next boot everything was okay.


Steps to Reproduce:
1.  Boot kernel.
  
Actual results:

cciss: controller cciss0 failed, stopping.
cciss0: controller not responding.
NMI Watchdog detected LOCKUP, CPU=6, registers:
CPU 6 
Modules linked in: vxodm(U) parport_pc lp parport netconsole netdump autofs4
i2c_dev i2c_core sunrpc iptable_filter ip_tables ib_srp ib_sdp ib_ipoib md5
ipv6 rdma_ucm rdma_cm iw_cm ib_addr ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad
ib_core ide_dump scsi_dump diskdump zlib_deflate vfat fat vxportal(U) fdd(U)
vxfs(U) dmphpalua(U) dmpaaa(U) vxspec(U) vxio(U) vxdmp(U) button battery ac
k8_edac edac_mc netxen_nic e1000 bnx2 bonding(U) dm_snapshot dm_zero dm_mirror
ext3 jbd dm_mod qla2400 cciss qla2xxx scsi_transport_fc usb_storage uhci_hcd
ohci_hcd ehci_hcd sd_mod scsi_mod
Pid: 0, comm: swapper Tainted: PF     2.6.9-78.0.1.ELsmp
RIP: 0010:[<ffffffff80319600>] <ffffffff80319600>{.text.lock.spinlock+14}
RSP: 0018:000001007fd0bef8  EFLAGS: 00000082
RAX: 0000000000000026 RBX: 0000010871f238dc RCX: 0000000000000046
RDX: 000000000010a5e9 RSI: 0000000000000046 RDI: 0000010871f238dc
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000010871f20000
R10: 000001046f73e000 R11: 0000000000000000 R12: 0000000000000062
R13: 0000010c71f9fe98 R14: 0000000000000000 R15: 0000000000000002
FS:  0000002a95576b00(0000) GS:ffffffff8050d580(0000) knlGS:00000000de9f5ba0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000ad8984 CR3: 0000000c86174000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 0000010c71f9e000, task 00000104800717f0)
Stack: 0000000000000046 0000000000000002 0000010871f20000 ffffffffa00939c6 
       0000000000000062 00000104702c9a40 0000000000000001 0000000000000062 
       0000010c71f9fe98 0000010c71f9fe98 
Call Trace:<IRQ> <ffffffffa00939c6>{:cciss:do_cciss_intr+210}
<ffffffff80112ff2>{handle_IRQ_event+41} 
       <ffffffff8011326c>{do_IRQ+197} <ffffffff801108bf>{ret_from_intr+0} 
        <EOI> <ffffffff8010e789>{default_idle+0}
<ffffffff8010e7a9>{default_idle+32} 
       <ffffffff8010e81c>{cpu_idle+26} 

Expected results:

No NMI lockup should be seen.

Additional info:  I looked into the code and this is what I've come up with.

do_cciss_intr acquires the CCISS_LOCK(h->ctlr) lock.

In an error-handling situation, as evidenced by the above output, fail_all_cmds() is called.

fail_all_cmds() attempts to *reacquire* the lock.

DEADLOCK.
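The self-deadlock described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the cciss driver source: `sketch_lock`, `sketch_lock_bounded`, `fail_all_cmds_sketch` and `do_intr_sketch` are hypothetical names, and the spin is bounded so that the example returns instead of hanging the way a real, non-recursive kernel spinlock would.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a non-recursive spinlock. */
struct sketch_lock {
    atomic_bool held;
};

/* Spin at most max_spins times; a real spin_lock() has no such bound. */
static bool sketch_lock_bounded(struct sketch_lock *l, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (!atomic_exchange(&l->held, true))
            return true;            /* lock was free, now ours */
    }
    return false;                   /* still held: gave up spinning */
}

static void sketch_unlock(struct sketch_lock *l)
{
    assert(atomic_load(&l->held));  /* only a held lock may be released */
    atomic_store(&l->held, false);
}

/* Mirrors the reported shape: the error path takes the lock itself. */
static bool fail_all_cmds_sketch(struct sketch_lock *l)
{
    if (!sketch_lock_bounded(l, 1000))
        return false;   /* the kernel would spin here until the NMI watchdog fires */
    /* ... fail outstanding commands ... */
    sketch_unlock(l);
    return true;
}

/* Mirrors the interrupt handler: it already holds the lock on the error path. */
static bool do_intr_sketch(struct sketch_lock *l)
{
    bool finished;

    if (!sketch_lock_bounded(l, 1000))
        return false;
    finished = fail_all_cmds_sketch(l);   /* second acquire on the same CPU */
    sketch_unlock(l);
    return finished;
}
```

Calling `do_intr_sketch()` on a free lock returns false, because the nested acquire can never succeed while the caller still holds the lock. Calling `fail_all_cmds_sketch()` on its own succeeds, which is consistent with the problem only showing up on the error path inside the interrupt handler.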

FWIW, the same code exists in RHEL5 (I will clone this to RHEL5).

Comment 1 Tomas Henzl 2009-07-07 09:56:29 UTC

Created attachment 350763
corrects the spinlock use

This patch removes the spinlock lock/unlock calls from fail_all_cmds() and adds a spin_unlock after the call to fail_all_cmds(), before the return.
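The shape of that change can be sketched as follows. This is a userspace model under stated assumptions, not the attached patch: `toy_lock`, `toy_ctlr`, `fail_all_cmds_fixed` and `intr_fixed` are hypothetical names, and the counters stand in for the driver's command queues. The point it illustrates is the locking contract: the error path no longer takes the lock, and the caller releases it before returning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a non-recursive spinlock. */
struct toy_lock {
    atomic_bool held;
};

static bool toy_trylock(struct toy_lock *l)
{
    return !atomic_exchange(&l->held, true);
}

static void toy_unlock(struct toy_lock *l)
{
    atomic_store(&l->held, false);
}

/* Hypothetical controller state: counters replace the real command queues. */
struct toy_ctlr {
    struct toy_lock lock;
    int pending;
    int failed;
};

/* Post-patch shape: no locking here; the caller must already hold h->lock. */
static void fail_all_cmds_fixed(struct toy_ctlr *h)
{
    assert(atomic_load(&h->lock.held));   /* documents the locking contract */
    h->failed += h->pending;
    h->pending = 0;
}

/* Interrupt handler model: takes the lock once and releases it on every path. */
static bool intr_fixed(struct toy_ctlr *h, bool controller_dead)
{
    if (!toy_trylock(&h->lock))
        return false;
    if (controller_dead) {
        fail_all_cmds_fixed(h);
        toy_unlock(&h->lock);   /* the unlock added before the early return */
        return true;
    }
    /* normal completion handling elided */
    toy_unlock(&h->lock);
    return true;
}
```

With a controller that has three pending commands, `intr_fixed(&h, true)` returns true, moves all three to `failed`, and leaves the lock free for the next acquirer.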

Comment 2 RHEL Program Management 2009-07-07 10:03:02 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Mike Miller (OS Dev) 2009-07-07 16:05:42 UTC

The fix looks good to me. Do I need to do anything for this bug?

Comment 4 Tomas Henzl 2009-07-08 14:17:54 UTC

(In reply to comment #3)
> The fix looks good to me. Do I need to do anything for this bug?  

No, and thanks for the review. For open issues with cciss, see for example bugs 505506, 479090, and 250485.

Comment 5 Tomas Henzl 2009-07-08 15:51:56 UTC

Posted.

Comment 7 Lachlan McIlroy 2009-07-09 01:40:42 UTC

*** Bug 508014 has been marked as a duplicate of this bug. ***

Comment 8 Vivek Goyal 2009-07-14 18:52:29 UTC

Committed in 89.6.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 16 errata-xmlrpc 2011-02-16 15:24:41 UTC

An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html
