Bug 509818 - cciss: spinlock deadlock causes NMI on HP systems
Summary: cciss: spinlock deadlock causes NMI on HP systems
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Target Milestone: rc
Assignee: Tomas Henzl
QA Contact: Red Hat Kernel QE team
Depends On: 509816
Blocks: 525728
Reported: 2009-07-06 12:38 UTC by Prarit Bhargava
Modified: 2009-09-29 10:57 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 509816
Last Closed: 2009-09-02 08:37:27 UTC
Target Upstream Version:

Attachments:
corrects the spinlock use (1.01 KB, patch)
2009-07-07 10:21 UTC, Tomas Henzl

System: Red Hat Product Errata
ID: RHSA-2009:1243
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update
Last Updated: 2009-09-01 08:53:34 UTC

Description Prarit Bhargava 2009-07-06 12:38:19 UTC
*** The stack trace below is from a RHEL4 boot, however, the same deadlock path exists in the RHEL5 codebase ****

+++ This bug was initially created as a clone of Bug #509816 +++

Description of problem:

While I was comparing the boot sequence of a 5.4 boot with that of a RHEL4 boot, the following panic occurred (see details below).  After looking into the problem, I believe there is a spinlock deadlock in the code.

Version-Release number of selected component (if applicable): 2.6.9-78.0.1.ELsmp (but seems to be in current kernel as well).

How reproducible: Unknown/very low reproducibility.  Saw this one time on an HP system in the lab.  The next boot everything was okay....

Steps to Reproduce:
1.  Boot kernel.

Actual results:

cciss: controller cciss0 failed, stopping.
cciss0: controller not responding.
NMI Watchdog detected LOCKUP, CPU=6, registers:
CPU 6 
Modules linked in: vxodm(U) parport_pc lp parport netconsole netdump autofs4
i2c_dev i2c_core sunrpc iptable_filter ip_tables ib_srp ib_sdp ib_ipoib md5
ipv6 rdma_ucm rdma_cm iw_cm ib_addr ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad
ib_core ide_dump scsi_dump diskdump zlib_deflate vfat fat vxportal(U) fdd(U)
vxfs(U) dmphpalua(U) dmpaaa(U) vxspec(U) vxio(U) vxdmp(U) button battery ac
k8_edac edac_mc netxen_nic e1000 bnx2 bonding(U) dm_snapshot dm_zero dm_mirror
ext3 jbd dm_mod qla2400 cciss qla2xxx scsi_transport_fc usb_storage uhci_hcd
ohci_hcd ehci_hcd sd_mod scsi_mod
Pid: 0, comm: swapper Tainted: PF     2.6.9-78.0.1.ELsmp
RIP: 0010:[<ffffffff80319600>] <ffffffff80319600>{.text.lock.spinlock+14}
RSP: 0018:000001007fd0bef8  EFLAGS: 00000082
RAX: 0000000000000026 RBX: 0000010871f238dc RCX: 0000000000000046
RDX: 000000000010a5e9 RSI: 0000000000000046 RDI: 0000010871f238dc
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000010871f20000
R10: 000001046f73e000 R11: 0000000000000000 R12: 0000000000000062
R13: 0000010c71f9fe98 R14: 0000000000000000 R15: 0000000000000002
FS:  0000002a95576b00(0000) GS:ffffffff8050d580(0000) knlGS:00000000de9f5ba0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000ad8984 CR3: 0000000c86174000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 0000010c71f9e000, task 00000104800717f0)
Stack: 0000000000000046 0000000000000002 0000010871f20000 ffffffffa00939c6 
       0000000000000062 00000104702c9a40 0000000000000001 0000000000000062 
       0000010c71f9fe98 0000010c71f9fe98 
Call Trace:<IRQ> <ffffffffa00939c6>{:cciss:do_cciss_intr+210}
       <ffffffff8011326c>{do_IRQ+197} <ffffffff801108bf>{ret_from_intr+0} 
        <EOI> <ffffffff8010e789>{default_idle+0}

Expected results:

No NMI lockup should be seen.

Additional info:  I looked into the code and this is what I've come up with.

do_cciss_intr acquires the CCISS_LOCK(h->ctlr) lock.

In an error-handling situation, as evidenced by the above output, fail_all_cmds() is called.

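fail_all_cmds() attempts to *reacquire* the lock.  Spinlocks are not recursive, so the CPU spins forever on a lock it already holds, and the NMI watchdog reports the lockup.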
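
The sequence described above can be modeled in userspace.  This is a minimal sketch under stated assumptions, not the cciss driver code: toy_lock, toy_trylock, toy_unlock, model_fail_all_cmds and model_intr_error_path are hypothetical names, and a try-lock stands in for the kernel spinlock so that the nested acquisition reports failure instead of spinning forever.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a non-recursive spinlock. */
struct toy_lock {
    atomic_bool held;
};

/* Returns true if the lock was taken, false if it was already held. */
static bool toy_trylock(struct toy_lock *l)
{
    return !atomic_exchange(&l->held, true);
}

static void toy_unlock(struct toy_lock *l)
{
    atomic_store(&l->held, false);
}

/* Models fail_all_cmds() as described in the report: it takes the lock itself. */
static bool model_fail_all_cmds(struct toy_lock *l)
{
    if (!toy_trylock(l))
        return false;       /* a real spinlock would spin here forever */
    /* ... fail every outstanding command ... */
    toy_unlock(l);
    return true;
}

/* Models the error path of do_cciss_intr(): the lock is held across the call. */
static bool model_intr_error_path(struct toy_lock *l)
{
    bool ok;

    if (!toy_trylock(l))
        return false;
    ok = model_fail_all_cmds(l);    /* nested acquisition of the same lock */
    toy_unlock(l);
    return ok;
}
```

Calling model_intr_error_path() on an unlocked toy_lock returns false because the inner acquisition cannot succeed, whereas model_fail_all_cmds() called on its own succeeds.  In the kernel there is no failure return; the inner lock call simply never completes.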


FWIW, the same code exists in RHEL5 (I will clone this to RHEL5).

Comment 1 Prarit Bhargava 2009-07-06 12:40:29 UTC

I did something really stupid here.  I didn't keep the ENTIRE boot log.  I hope that isn't an issue for you Tomas :/.  The deadlock seems pretty obvious ...



Comment 4 Tomas Henzl 2009-07-07 10:21:18 UTC
Created attachment 350768
corrects the spinlock use

This patch removes the spinlock lock/unlock from fail_all_cmds and adds a
spin_unlock after the call to fail_all_cmds before the return.
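
The shape of this change can be sketched in userspace.  This is only an illustration of the locking pattern, not the attached patch: toy_lock, toy_trylock, toy_unlock, fixed_fail_all_cmds, fixed_intr_error_path and the pending/failed counters are all hypothetical, and a try-lock stands in for the driver's spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a non-recursive spinlock. */
struct toy_lock {
    atomic_bool held;
};

/* Returns true if the lock was taken, false if it was already held. */
static bool toy_trylock(struct toy_lock *l)
{
    return !atomic_exchange(&l->held, true);
}

static void toy_unlock(struct toy_lock *l)
{
    atomic_store(&l->held, false);
}

/*
 * Models fail_all_cmds() after the change: no lock/unlock inside.
 * The caller must already hold the lock.  Returns the number of
 * commands failed (here simply the number that were pending).
 */
static int fixed_fail_all_cmds(struct toy_lock *l, int pending)
{
    assert(atomic_load(&l->held));  /* precondition: caller holds the lock */
    return pending;
}

/* Models the interrupt handler's error path after the change. */
static bool fixed_intr_error_path(struct toy_lock *l, int pending, int *failed)
{
    if (!toy_trylock(l))
        return false;
    *failed = fixed_fail_all_cmds(l, pending);
    toy_unlock(l);      /* the unlock added after the call, before the return */
    return true;
}
```

With the lock taken once by the caller and released once before returning, the error path completes and leaves the lock free for the next interrupt.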

Comment 5 RHEL Program Management 2009-07-07 10:23:42 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update

Comment 6 Tomas Henzl 2009-07-08 15:51:40 UTC

Comment 9 Don Zickus 2009-07-14 20:58:16 UTC
in kernel-2.6.18-158.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 15 errata-xmlrpc 2009-09-02 08:37:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

