Description of problem:
The following error is seen during boot of IBM systems with Calgary IOMMU:

mptbase: ioc0: PCI-MSI enabled
Calgary: DMA error on CalIOC2 PHB 0x33
Calgary: 0x02000000@CSR 0x00000000@PLSSR 0xb0008000@CSMR 0x00000000@MCK
Calgary: 0x00000000@0x810 0xfee0c000@0x820 0x00000000@0x830 0x00000000@0x840 0x03804a00@0x850 0x00000000@0x860 0x00000000@0x870
Calgary: 0x00000000@0xcb0
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
BUG: soft lockup - CPU#16 stuck for 10s! [swapper:0]
CPU 16:
Modules linked in: mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) raid0(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 0, comm: swapper Tainted: G 2.6.18.4 #6
RIP: 0010:[<ffffffff8000c6f2>] [<ffffffff8000c6f2>] __delay+0x8/0x10
RSP: 0018:ffff81048fd6bbb8 EFLAGS: 00000283
RAX: 0000000088fce05a RBX: 000000000000001c RCX: 0000000088d03422
RDX: 00000000000000d8 RSI: 0000000000000001 RDI: 00000000002cb8f0
RBP: ffff81048fd6bb30 R08: 0000000000000005 R09: 0000000000000028
R10: ffffffff803d9520 R11: ffffffff88102a59 R12: ffffffff8005dc8e
R13: ffff81048fda1000 R14: ffffffff800774e1 R15: ffff81048fd6bb30
FS: 0000000000000000(0000) GS:ffff81048fd4a9c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000575490 CR3: 000000029f293000 CR4: 00000000000006e0

Call Trace:
 <IRQ> [<ffffffff880d3bc9>] :mptbase:KickStart+0x257/0xa91
 [<ffffffff880d4550>] :mptbase:MakeIocReady+0x14d/0x2cf
 [<ffffffff880d63be>] :mptbase:mpt_do_ioc_recovery+0xc2/0x14fa
 [<ffffffff80148bd4>] __next_cpu+0x19/0x28
 [<ffffffff80088fbc>] find_busiest_group+0x20d/0x621
 [<ffffffff80089e6d>] __activate_task+0x56/0x6d
 [<ffffffff880d7900>] :mptbase:mpt_HardResetHandler+0x10a/0x196
 [<ffffffff880d798c>] :mptbase:mpt_timer_expired+0x0/0x57
 [<ffffffff880d79b2>] :mptbase:mpt_timer_expired+0x26/0x57
 [<ffffffff80094dc1>] run_timer_softirq+0x133/0x1af
 [<ffffffff8001214f>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cada>] do_softirq+0x2c/0x85
 [<ffffffff80056ceb>] mwait_idle+0x0/0x4a
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI> [<ffffffff80056d21>] mwait_idle+0x36/0x4a
 [<ffffffff80048e69>] cpu_idle+0x95/0xb8
 [<ffffffff80076bed>] start_secondary+0x45a/0x469

Version-Release number of selected component (if applicable):
-125.el5 + patch for BZ 474047.

How reproducible:
100%

Steps to Reproduce:
1. Boot kernel
2.
3.

Actual results:
See above comment.

Expected results:
No DMA errors should be seen.

Additional info:
This happens with or without PCI_DOMAIN support.
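For anyone comparing boot logs across kernels, the "Calgary:" register dump in the description can be turned into a lookup table. The following is only a rough sketch: the function names and the sample string are made up for illustration, and the report does not say what the registers at the numeric offsets mean. The one observation it encodes is that on x86, MSI writes target addresses in the 0xFEE00000-0xFEEFFFFF window, and the value logged at offset 0x820 (0xfee0c000) falls in that window.

```python
import re

# Matches one "value@name" pair from a Calgary dump line,
# e.g. "0x02000000@CSR" or "0xfee0c000@0x820".
PAIR_RE = re.compile(r"(0x[0-9a-fA-F]+)@(\S+)")


def parse_calgary_dump(log_text):
    """Return {register name or offset: integer value} for all 'Calgary:' lines."""
    regs = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("Calgary:"):
            continue
        for value, name in PAIR_RE.findall(line):
            regs[name] = int(value, 16)
    return regs


def looks_like_msi_address(value):
    """True if value lies in the x86 MSI address window 0xFEExxxxx."""
    return (value >> 20) == 0xFEE


# Sample copied from the log in this report (hypothetical variable name).
SAMPLE = """\
Calgary: DMA error on CalIOC2 PHB 0x33
Calgary: 0x02000000@CSR 0x00000000@PLSSR 0xb0008000@CSMR 0x00000000@MCK
Calgary: 0x00000000@0x810 0xfee0c000@0x820 0x00000000@0x830 0x00000000@0x840
"""

if __name__ == "__main__":
    for name, value in parse_calgary_dump(SAMPLE).items():
        if value:
            flag = " (in MSI address window)" if looks_like_msi_address(value) else ""
            print("%s = 0x%08x%s" % (name, value, flag))
```

Whether the value at 0x820 really is the faulting DMA address is an assumption; the report itself does not document the dump layout.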
http://lists-archives.org/linux-kernel/14495489-x86-per-device-dma_mapping_ops.html implies some sort of .config change is required.
Scratch comment #1 -- I tracked this down to the following driver update:

commit 7a3056c312c6a32e1028dd32f52d795e8a6b5e9d
Author: Tomas Henzl <thenzl>
Date: Thu Aug 28 16:10:27 2008 +0200

    Bug 442025 - Update mpt fusion to version 3.04.07

    Message-id: 48B6B1D3.2070806
    O-Subject: [RHEL 5.3 PATCH] Bug 442025 - Update mpt fusion to version 3.04.0
    Bugzilla: 442025

Tomas, I'm reassigning this to you.

To avoid any issues with machine reservations, I have reserved ibm-x3950m2-02.rhts.bos.redhat.com. It is currently on reserve for 99 hours, and the reservation can be extended by running extendtesttime.sh.

To reproduce the error you *MUST* build -125.el5 with the patch from 474047; otherwise you will hit a different panic earlier in the boot.

P.
Oops -- one other thing, you must boot with "iommu=calgary" option. P.
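Before chasing the DMA error it may be worth confirming that the test kernel really was booted with that option. The following is a minimal sketch; the helper names are invented for this example, and it assumes the option appears as a plain "iommu=" token, possibly with comma-separated values, on the kernel command line.

```python
def iommu_option(cmdline):
    """Return the value of the last iommu= parameter on the command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("iommu="):
            value = token[len("iommu="):]
    return value


def booted_with_calgary(cmdline):
    """True if the command line contains iommu=calgary (alone or in a comma list)."""
    value = iommu_option(cmdline)
    return value is not None and "calgary" in value.split(",")


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

On the reserved machine this would be used as booted_with_calgary(read_cmdline()); a False result means the reproduction conditions from this comment are not met.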
(In reply to comment #3)
> Oops -- one other thing, you must boot with "iommu=calgary" option.
>
> P.

Thanks, I think I've just seen the other error.
(In reply to comment #3)
> Oops -- one other thing, you must boot with "iommu=calgary" option.
>
> P.

The problem with mptbase may be caused by the change to use MSI by default, which was introduced in the 3.04.07 update. I'm going to test it.
Created attachment 326137
proposed patch

This patch stops MSI from being switched on by default. I don't know whether the problem is in the controller or in the mainboard. I've also seen other, more sensible approaches to this issue on lkml, but given that it is too late for any thorough testing, I would go with this patch because it is very simple.
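One way to check whether the patch has the intended effect is to compare the boot log of a patched kernel against the log in the description. The following is only a sketch; the function name and the result keys are invented here, and the marker strings are taken from the log in this report. With MSI no longer enabled by default, the expectation (an assumption, not a confirmed result) is that all three flags come back False.

```python
# Marker strings copied from the boot log in this report.
MSI_MARKER = "PCI-MSI enabled"
DMA_ERROR_MARKER = "Calgary: DMA error"
LOCKUP_MARKER = "BUG: soft lockup"


def summarize_boot_log(log_text):
    """Report which of the symptoms from this bug appear in a boot log."""
    lines = [line.strip() for line in log_text.splitlines()]
    return {
        # mptbase announced that it switched the controller to MSI
        "msi_enabled": any(
            line.startswith("mptbase:") and MSI_MARKER in line for line in lines
        ),
        # the Calgary IOMMU reported a DMA error
        "dma_error": any(DMA_ERROR_MARKER in line for line in lines),
        # the recovery path spun long enough to trip the soft lockup detector
        "soft_lockup": any(LOCKUP_MARKER in line for line in lines),
    }


# Excerpt of the failing boot from the description (hypothetical variable name).
FAILING_BOOT = """\
mptbase: ioc0: PCI-MSI enabled
Calgary: DMA error on CalIOC2 PHB 0x33
mptbase: ioc0: Initiating recovery
BUG: soft lockup - CPU#16 stuck for 10s! [swapper:0]
"""

if __name__ == "__main__":
    print(summarize_boot_log(FAILING_BOOT))
```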
in kernel-2.6.18-126.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
Note that this also happens on Fedora 10. I have not yet tested with Fedora 11, but I guess it's the same.