Bug 474465 - RHEL5.3: Calgary DMA errors on IBM systems
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
x86_64 Linux
Priority: high
Severity: high
: rc
: ---
Assigned To: Tomas Henzl
Martin Jenner
Depends On: 474047
Reported: 2008-12-03 17:20 EST by Prarit Bhargava
Modified: 2009-06-15 09:39 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-01-20 15:05:06 EST

Attachments
proposed patch (562 bytes, patch)
2008-12-08 11:10 EST, Tomas Henzl

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:0225 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update 2009-01-20 11:06:24 EST

Description Prarit Bhargava 2008-12-03 17:20:30 EST
Description of problem:

The following error is seen during boot of IBM systems with Calgary IOMMU:

mptbase: ioc0: PCI-MSI enabled
Calgary: DMA error on CalIOC2 PHB 0x33
Calgary: 0x02000000@CSR 0x00000000@PLSSR 0xb0008000@CSMR 0x00000000@MCK
Calgary: 0x00000000@0x810 0xfee0c000@0x820 0x00000000@0x830 0x00000000@0x840 0x03804a00@0x850 0x00000000@0x860 0x00000000@0x870 
Calgary: 0x00000000@0xcb0
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
BUG: soft lockup - CPU#16 stuck for 10s! [swapper:0]
CPU 16:
Modules linked in: mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) raid0(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 0, comm: swapper Tainted: G #6
RIP: 0010:[<ffffffff8000c6f2>]  [<ffffffff8000c6f2>] __delay+0x8/0x10
RSP: 0018:ffff81048fd6bbb8  EFLAGS: 00000283
RAX: 0000000088fce05a RBX: 000000000000001c RCX: 0000000088d03422
RDX: 00000000000000d8 RSI: 0000000000000001 RDI: 00000000002cb8f0
RBP: ffff81048fd6bb30 R08: 0000000000000005 R09: 0000000000000028
R10: ffffffff803d9520 R11: ffffffff88102a59 R12: ffffffff8005dc8e
R13: ffff81048fda1000 R14: ffffffff800774e1 R15: ffff81048fd6bb30
FS:  0000000000000000(0000) GS:ffff81048fd4a9c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000575490 CR3: 000000029f293000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff880d3bc9>] :mptbase:KickStart+0x257/0xa91
 [<ffffffff880d4550>] :mptbase:MakeIocReady+0x14d/0x2cf
 [<ffffffff880d63be>] :mptbase:mpt_do_ioc_recovery+0xc2/0x14fa
 [<ffffffff80148bd4>] __next_cpu+0x19/0x28
 [<ffffffff80088fbc>] find_busiest_group+0x20d/0x621
 [<ffffffff80089e6d>] __activate_task+0x56/0x6d
 [<ffffffff880d7900>] :mptbase:mpt_HardResetHandler+0x10a/0x196
 [<ffffffff880d798c>] :mptbase:mpt_timer_expired+0x0/0x57
 [<ffffffff880d79b2>] :mptbase:mpt_timer_expired+0x26/0x57
 [<ffffffff80094dc1>] run_timer_softirq+0x133/0x1af
 [<ffffffff8001214f>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cada>] do_softirq+0x2c/0x85
 [<ffffffff80056ceb>] mwait_idle+0x0/0x4a
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff80056d21>] mwait_idle+0x36/0x4a
 [<ffffffff80048e69>] cpu_idle+0x95/0xb8
 [<ffffffff80076bed>] start_secondary+0x45a/0x469

Version-Release number of selected component (if applicable): -125.el5 + patch for BZ 474047.

How reproducible: 100%

Steps to Reproduce:
1. Boot kernel
Actual results:

See above comment.

Expected results:  No DMA errors should be seen.

Additional info:  This happens with or without PCI_DOMAIN support.
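
The register dump in the log above is a series of value@register pairs. As a minimal sketch (not part of the original report), the pairs can be pulled into a dictionary for easier comparison between boots. The function name parse_calgary_dump and the shortened SAMPLE text are illustrative assumptions; only the line format is taken from the log.

```python
import re

# Matches one "value@register" pair, e.g. "0x02000000@CSR" or "0x00000000@0x810".
PAIR_RE = re.compile(r"(0x[0-9a-fA-F]+)@(\w+)")


def parse_calgary_dump(log_text):
    """Return {register_name: int_value} for every 'Calgary:' line in log_text."""
    registers = {}
    for line in log_text.splitlines():
        if not line.startswith("Calgary:"):
            continue  # ignore mptbase and other unrelated messages
        for value, register in PAIR_RE.findall(line):
            registers[register] = int(value, 16)
    return registers


# Shortened copy of the dump from the description (hypothetical sample input).
SAMPLE = """\
mptbase: ioc0: PCI-MSI enabled
Calgary: DMA error on CalIOC2 PHB 0x33
Calgary: 0x02000000@CSR 0x00000000@PLSSR 0xb0008000@CSMR 0x00000000@MCK
Calgary: 0x00000000@0x810 0xfee0c000@0x820 0x00000000@0x830
"""

if __name__ == "__main__":
    for register, value in parse_calgary_dump(SAMPLE).items():
        print(f"{register:>6} = {value:#010x}")
```

The "DMA error on CalIOC2 PHB 0x33" line contributes nothing to the dictionary because it contains no value@register pair.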
Comment 1 Prarit Bhargava 2008-12-03 17:25:18 EST

implies some sort of .config change is required.
Comment 2 Prarit Bhargava 2008-12-05 14:53:00 EST
Scratch comment #1 -- I tracked this down to the following driver update:

commit 7a3056c312c6a32e1028dd32f52d795e8a6b5e9d
Author: Tomas Henzl <thenzl@redhat.com>
Date:   Thu Aug 28 16:10:27 2008 +0200

     Bug 442025 - Update mpt fusion to version 3.04.07
    Message-id: 48B6B1D3.2070806@redhat.com
    O-Subject: [RHEL 5.3 PATCH] Bug 442025 - Update mpt fusion to version 3.04.0
    Bugzilla: 442025


I'm reassigning this to you.  To avoid any issues with machine reservations, I have reserved ibm-x3950m2-02.rhts.bos.redhat.com.  It is currently reserved for 99 hours, and the reservation can be extended by running extendtesttime.sh.

To reproduce the error you *MUST* build -125.el5 with the patch from 474047; otherwise you will hit a different panic earlier in the boot.

Comment 3 Prarit Bhargava 2008-12-08 08:13:42 EST
Oops -- one other thing, you must boot with "iommu=calgary" option.
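
Before trying to reproduce, it may be worth confirming that the running kernel was really booted with that option. The sketch below checks a kernel command line string for iommu=calgary; the helper names and the sample command line are hypothetical, and it assumes the usual comma-separated form of the iommu= option.

```python
def iommu_options(cmdline):
    """Collect the comma-separated values of every iommu= option in cmdline."""
    options = []
    for token in cmdline.split():
        if token.startswith("iommu="):
            options.extend(token[len("iommu="):].split(","))
    return options


def calgary_requested(cmdline):
    """True if the command line asks for the Calgary IOMMU."""
    return "calgary" in iommu_options(cmdline)


if __name__ == "__main__":
    # Hypothetical command line; on a live system read /proc/cmdline instead.
    sample = "ro root=/dev/sda1 iommu=calgary"
    print(calgary_requested(sample))
```

On a live system the string would come from /proc/cmdline rather than from the sample.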

Comment 4 Tomas Henzl 2008-12-08 09:45:47 EST
(In reply to comment #3)
> Oops -- one other thing, you must boot with "iommu=calgary" option.
> P.
Thanks, I think I've just seen the other error.
Comment 5 Tomas Henzl 2008-12-08 09:56:56 EST
(In reply to comment #3)
> Oops -- one other thing, you must boot with "iommu=calgary" option.
> P.
The problem with mptbase may be caused by the change to use MSI by default,
introduced in the 3.04.07 update. I'm going to test it.
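
One observation that fits the MSI hypothesis: on x86, MSI writes target addresses in the range 0xFEE00000 to 0xFEEFFFFF, and the Calgary dump in the description contains the value 0xfee0c000 at offset 0x820. Whether that register holds the faulting DMA address is an assumption here; the report does not say. A minimal sketch of the range check, with hypothetical names:

```python
import re

# x86 MSI messages are memory writes into this address window.
MSI_WINDOW_START = 0xFEE00000
MSI_WINDOW_END = 0xFEEFFFFF

HEX_BEFORE_AT = re.compile(r"(0x[0-9a-fA-F]+)@")


def in_msi_window(address):
    """True if address lies inside the x86 MSI address window."""
    return MSI_WINDOW_START <= address <= MSI_WINDOW_END


def msi_like_values(dump_line):
    """Return the dumped values (left of each '@') that fall in the MSI window."""
    values = [int(v, 16) for v in HEX_BEFORE_AT.findall(dump_line)]
    return [v for v in values if in_msi_window(v)]


if __name__ == "__main__":
    # Part of one dump line from the description.
    line = "Calgary: 0x00000000@0x810 0xfee0c000@0x820 0x03804a00@0x850"
    print([hex(v) for v in msi_like_values(line)])
```

A match only shows that the value looks like an MSI target address; it does not by itself prove that the MSI write caused the DMA error.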
Comment 6 Tomas Henzl 2008-12-08 11:10:09 EST
Created attachment 326137
proposed patch

This patch stops enabling MSI by default. I don't know whether the problem is in the controller or in the mainboard. I've also seen other, more sensible approaches to this issue on lkml, but given that it is too late for any extensive testing I would go with this patch, because it is very simple.
Comment 10 Don Zickus 2008-12-09 16:06:18 EST
in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 13 errata-xmlrpc 2009-01-20 15:05:06 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

Comment 14 Oliver Falk 2009-06-15 09:39:25 EDT
Note that this also happens on Fedora 10.
I have not yet tested with Fedora 11, but I guess it's the same.
