Bug 509816
| Summary: | cciss: spinlock deadlock causes NMI on HP systems | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Prarit Bhargava <prarit> |
| Component: | kernel | Assignee: | Tomas Henzl <thenzl> |
| Status: | CLOSED ERRATA | QA Contact: | Evan McNabb <emcnabb> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.9 | CC: | coughlan, dhoward, emcnabb, jpirko, jplans, kris.strecker, lmcilroy, mike.miller, sandy.garza, tao, vgoyal |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 509818 | Environment: | |
| Last Closed: | 2011-02-16 15:24:41 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 509818, 525725 | | |

Attachments:
Created attachment 350763
corrects the spinlock use
This patch removes the spinlock lock/unlock pair from fail_all_cmds() and adds a spin_unlock after the call to fail_all_cmds(), before the return.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

The fix looks good to me. Do I need to do anything for this bug?

(In reply to comment #3)
> The fix looks good to me. Do I need to do anything for this bug?
No, and thanks for the review. For open issues with cciss, see for example bugs 505506, 479090, 250485.

Posted.

*** Bug 508014 has been marked as a duplicate of this bug. ***

Committed in 89.6.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-0263.html
Description of problem:
While comparing boot sequences from a 5.4 boot versus a RHEL4 boot, the following panic occurred (see details below). After looking into the problem it looks like there is a spinlock deadlock in the code.

Version-Release number of selected component (if applicable):
2.6.9-78.0.1.ELsmp (but seems to be in current kernel as well).

How reproducible:
Unknown/very low reproducibility. Saw this one time on an HP system in the lab. The next boot everything was okay....

Steps to Reproduce:
1. Boot kernel.

Actual results:
cciss: controller cciss0 failed, stopping.
cciss0: controller not responding.
NMI Watchdog detected LOCKUP, CPU=6, registers:
CPU 6
Modules linked in: vxodm(U) parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core sunrpc iptable_filter ip_tables ib_srp ib_sdp ib_ipoib md5 ipv6 rdma_ucm rdma_cm iw_cm ib_addr ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core ide_dump scsi_dump diskdump zlib_deflate vfat fat vxportal(U) fdd(U) vxfs(U) dmphpalua(U) dmpaaa(U) vxspec(U) vxio(U) vxdmp(U) button battery ac k8_edac edac_mc netxen_nic e1000 bnx2 bonding(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2400 cciss qla2xxx scsi_transport_fc usb_storage uhci_hcd ohci_hcd ehci_hcd sd_mod scsi_mod
Pid: 0, comm: swapper Tainted: PF 2.6.9-78.0.1.ELsmp
RIP: 0010:[<ffffffff80319600>] <ffffffff80319600>{.text.lock.spinlock+14}
RSP: 0018:000001007fd0bef8 EFLAGS: 00000082
RAX: 0000000000000026 RBX: 0000010871f238dc RCX: 0000000000000046
RDX: 000000000010a5e9 RSI: 0000000000000046 RDI: 0000010871f238dc
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000010871f20000
R10: 000001046f73e000 R11: 0000000000000000 R12: 0000000000000062
R13: 0000010c71f9fe98 R14: 0000000000000000 R15: 0000000000000002
FS: 0000002a95576b00(0000) GS:ffffffff8050d580(0000) knlGS:00000000de9f5ba0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000ad8984 CR3: 0000000c86174000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 0000010c71f9e000, task 00000104800717f0)
Stack: 0000000000000046 0000000000000002 0000010871f20000 ffffffffa00939c6 0000000000000062 00000104702c9a40 0000000000000001 0000000000000062 0000010c71f9fe98 0000010c71f9fe98
Call Trace:<IRQ> <ffffffffa00939c6>{:cciss:do_cciss_intr+210} <ffffffff80112ff2>{handle_IRQ_event+41} <ffffffff8011326c>{do_IRQ+197} <ffffffff801108bf>{ret_from_intr+0} <EOI> <ffffffff8010e789>{default_idle+0} <ffffffff8010e7a9>{default_idle+32} <ffffffff8010e81c>{cpu_idle+26}

Expected results:
No NMI lockup should be seen.

Additional info:
I looked into the code and this is what I've come up with. do_cciss_intr acquires the CCISS_LOCK(h->ctlr) lock. In an error-handling situation, as evidenced by the above output, fail_all_cmds() is called. fail_all_cmds() attempts to *reacquire* the lock. DEADLOCK.

FWIW, the same code exists in RHEL5 (I will clone this to RHEL5).
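
The locking pattern described above can be modelled in a few lines of userspace C. This is only an illustrative sketch, not the cciss driver code: `struct toy_lock`, `toy_lock_acquire()`, `toy_lock_release()`, `intr_buggy()`, `intr_fixed()` and the `_buggy`/`_fixed` helper variants are hypothetical names invented here, and the toy lock returns -1 at the point where a real non-recursive spinlock would spin forever on the same CPU.

```c
#include <assert.h>

/* Hypothetical stand-in for a non-recursive spinlock (not a kernel type). */
struct toy_lock { int held; };

/* Returns 0 on success, -1 where a real spinlock would spin forever. */
static int toy_lock_acquire(struct toy_lock *l)
{
    if (l->held)
        return -1;
    l->held = 1;
    return 0;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = 0;
}

/* Pre-patch shape: the helper takes the lock itself. */
static int fail_all_cmds_buggy(struct toy_lock *l)
{
    if (toy_lock_acquire(l) != 0)
        return -1;          /* caller already holds it: self-deadlock */
    /* ... fail outstanding commands ... */
    toy_lock_release(l);
    return 0;
}

/* Post-patch shape: the helper relies on the caller holding the lock. */
static int fail_all_cmds_fixed(struct toy_lock *l)
{
    assert(l->held);
    /* ... fail outstanding commands ... */
    return 0;
}

/* Models the interrupt handler before the patch. */
static int intr_buggy(struct toy_lock *l, int controller_failed)
{
    if (toy_lock_acquire(l) != 0)
        return -1;
    if (controller_failed)
        return fail_all_cmds_buggy(l);  /* tries to take the lock again */
    toy_lock_release(l);
    return 0;
}

/* Models the handler after the patch: unlock added before the return. */
static int intr_fixed(struct toy_lock *l, int controller_failed)
{
    if (toy_lock_acquire(l) != 0)
        return -1;
    if (controller_failed) {
        fail_all_cmds_fixed(l);
        toy_lock_release(l);
        return 0;
    }
    toy_lock_release(l);
    return 0;
}
```

In this model, `intr_buggy(&lock, 1)` reports the second acquisition as a failure, which corresponds to the CPU spinning in `.text.lock.spinlock` until the NMI watchdog fires, while `intr_fixed(&lock, 1)` completes and leaves the lock released.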