Bug 541615

Summary: Calgary: DMA error on CalIOC2 PHB 0x3
Product: [Fedora] Fedora
Reporter: Luc Stepniewski <lior>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: low
Version: 12
CC: dougsland, dwheeler, gansalmon, itamar, kernel-maint, lior, mishu, ngaywood
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2010-12-04 02:45:07 UTC

Description Luc Stepniewski 2009-11-26 14:01:47 UTC
Description of problem:
When doing "intensive" I/O, the mpt* drivers crash the filesystem on Fedora 12.

The problem is on an IBM x3580 M2 machine, using the integrated LSI SAS1078 C1 PCI-express Fusion-MPT SAS.

Steps to Reproduce:
1. Create a large, fully allocated volume (20GB for example)
2. dd if=/dev/vg/mybigspace of=/dev/null
3. After a few minutes, filesystem access becomes impossible. dmesg then shows the following:

Calgary: DMA error on CalIOC2 PHB 0x3
Calgary: 0x80000000@CSR 0x00000000@PLSSR 0xb0008000@CSMR 0x00000000@MCK
Calgary: 0x00000000@0x810 0x00000000@0x820 0x00000000@0x830 0x00000000@0x840 0x00000000@0x850 0x00000000@0x860 0x00000000@0x870 
Calgary: 0x40000000@0xcb0
irq 46: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Not tainted 2.6.31.5-127.fc12.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8109aefc>] __report_bad_irq+0x3d/0x8c
 [<ffffffff8109b063>] note_interrupt+0x118/0x17d
 [<ffffffff8109b6f2>] handle_fasteoi_irq+0xa1/0xc6
 [<ffffffff8101463c>] handle_irq+0x8b/0x93
 [<ffffffff8141e9cc>] do_IRQ+0x5c/0xbc
 [<ffffffff810126d3>] ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff8101907f>] ? mwait_idle+0x91/0xae
 [<ffffffff8101907f>] ? mwait_idle+0x91/0xae
 [<ffffffff81019021>] ? mwait_idle+0x33/0xae
 [<ffffffff8141d079>] ? atomic_notifier_call_chain+0x13/0x15
 [<ffffffff81010bb8>] ? enter_idle+0x25/0x27
 [<ffffffff81010c60>] ? cpu_idle+0xa6/0xe9
 [<ffffffff814145be>] ? start_secondary+0x1f3/0x234
handlers:
[<ffffffffa00e3d7e>] (mpt_interrupt+0x0/0x8bb [mptbase])
Disabling IRQ #46
mptscsih: ioc0: attempting task abort! (sc=ffff880a0d8fa400)
sd 2:1:4:0: [sda] CDB: Write(10): 2a 00 00 fb 5b a7 00 00 60 00
mptscsih: ioc0: WARNING - TaskMgmt type=1: IOC Not operational (0xffffffff)!
mptscsih: ioc0: WARNING - Issuing HardReset from mptscsih_IssueTaskMgmt!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
mptbase: ioc0: WARNING - NOT READY WARNING!
mptbase: WARNING - (-1) Cannot recover ioc0
mptscsih: ioc0: WARNING - TaskMgmt HardReset FAILED!!
mptscsih: ioc0: task abort: FAILED (sc=ffff880a0d8fa400)
mptscsih: ioc0: attempting task abort! (sc=ffff880a04bb3100)
sd 2:1:4:0: [sda] CDB: Write(10): 2a 00 00 28 f5 ff 00 00 08 00
mptscsih: ioc0: WARNING - TaskMgmt type=1: IOC Not operational (0xffffffff)!
mptscsih: ioc0: WARNING - Issuing HardReset from mptscsih_IssueTaskMgmt!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
mptbase: ioc0: WARNING - NOT READY WARNING!
mptbase: WARNING - (-1) Cannot recover ioc0
mptscsih: ioc0: WARNING - TaskMgmt HardReset FAILED!!
mptscsih: ioc0: task abort: FAILED (sc=ffff880a04bb3100)
mptscsih: ioc0: attempting task abort! (sc=ffff880a04bb2600)
sd 2:1:4:0: [sda] CDB: Write(10): 2a 00 00 61 de 27 00 00 08 00
mptscsih: ioc0: WARNING - TaskMgmt type=1: IOC Not operational (0xffffffff)!
mptscsih: ioc0: WARNING - Issuing HardReset from mptscsih_IssueTaskMgmt!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
[root@flanders ubuntu]# 
Message from syslogd@mymachine at Nov 26 10:38:28 ...
 kernel:mpage_da_map_blocks block allocation failed for inode 11762 at logical offset 2 with max blocks 1 with error -30

Message from syslogd@mymachine at Nov 26 10:38:28 ...
 kernel:This should not happen.!! Data will be lost
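
For triage, the two message shapes quoted above can be picked out of a saved dmesg or syslog excerpt. The following is only a minimal sketch: the regular expressions are modelled on the lines in this report and are assumptions, not a stable kernel log format, and `triage` is a hypothetical helper name. On Linux, errno 30 is EROFS ("Read-only file system"), which would be consistent with the filesystem having gone read-only once I/O to the controller failed.

```python
import errno
import re

# Patterns modelled on the messages quoted in this report (assumed shapes).
CALGARY_RE = re.compile(r"Calgary: DMA error on (\S+) PHB (0x[0-9a-fA-F]+)")
ERRNO_RE = re.compile(r"block allocation failed .* with error (-?\d+)")


def triage(log_text):
    """Return (calgary_hits, errno_names) found in a log excerpt."""
    calgary_hits = [
        (chip, int(phb, 16)) for chip, phb in CALGARY_RE.findall(log_text)
    ]
    errno_names = []
    for raw in ERRNO_RE.findall(log_text):
        code = abs(int(raw))  # the kernel reports errno values negated
        errno_names.append(errno.errorcode.get(code, "unknown"))
    return calgary_hits, errno_names


sample = (
    "Calgary: DMA error on CalIOC2 PHB 0x3\n"
    " kernel:mpage_da_map_blocks block allocation failed for inode 11762 "
    "at logical offset 2 with max blocks 1 with error -30\n"
)
hits, names = triage(sample)
print(hits, names)
```

The sample string reuses two lines from the log above; a real run would read the text from a file or from the output of dmesg instead.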


The first error message ("Calgary: DMA error on CalIOC2 PHB 0x3") seems to be related to a bug in the Calgary code, as detailed in a thread on LKML:
"The calgary code can give drivers addresses above 4GB which is very bad for hardware that is only 32bit DMA addressable" (http://www.archivum.info/linux-kernel@vger.kernel.org/2008-06/05248/Re:_%5BPATCH_-mm%5D_x86_calgary:_fix_handling_of_devces_that_aren%27t_behind_the_Calgary ).
But that thread is from 2008; I thought this would have been corrected by now...
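
To make the quoted constraint concrete, here is a small sketch of what "32bit DMA addressable" means for a buffer. It is an illustration only: `fits_dma_mask` is a hypothetical helper, the addresses are made up, and nothing in this report establishes whether the SAS1078 is actually limited to 32-bit DMA.

```python
def fits_dma_mask(bus_addr, length, mask_bits=32):
    """True if the buffer [bus_addr, bus_addr + length) is fully reachable
    by a device that can only drive `mask_bits` address bits."""
    if length <= 0:
        raise ValueError("length must be positive")
    limit = 1 << mask_bits  # first address the device cannot reach
    return bus_addr >= 0 and bus_addr + length <= limit


FOUR_GIB = 1 << 32

# Illustrative addresses only, not taken from the log above.
print(fits_dma_mask(0x7FFF0000, 4096))       # entirely below 4 GiB
print(fits_dma_mask(FOUR_GIB - 2048, 4096))  # straddles the 4 GiB boundary
print(fits_dma_mask(FOUR_GIB + 4096, 4096))  # entirely above 4 GiB
```

Only the first buffer passes with the default 32-bit mask. An IOMMU layer that hands out an address like the second or third to such a device would cause the device to read or write the wrong memory, or to fault.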

After looking at another bug report (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343749 ), the temporary workaround seems to be to set iommu=soft at boot. But I guess this affects performance... According to that bug report, the bug seems to be corrected on RH 9 ?!


The bug exists in Fedora 12 and makes it unusable on an x3580 M2.

Comment 1 David A. Wheeler 2009-12-29 00:03:03 UTC
I also had a SERIOUS problem installing Fedora 12 on a Dell Optiplex GX620 which *appears* to be the same thing.

I tried to install Fedora 12 on a Dell Optiplex GX620 as a 64-bit OS (x86_64), using BIOS revision A10.  When the disk was busy installing things it suddenly hung with this message:
 kernel: mpage_da_map_blocks block allocation failed for inode 211 at logical offset 0 with max blocks 1 with error -30
 kernel: This should not happen.!! Data will be lost
On each boot, I needed to add this kernel parameter:
 iommu=soft
Later I modified /boot/grub/grub.conf so that every kernel line includes:
 iommu=soft

Previously I had tried to upgrade the BIOS to rev. A11; this caused complete loss of the USB keyboard/mouse, so I re-installed revision A10.

I'm in the middle of an install now that the iommu=soft setting has been added; so far, this *appears* to have solved the problem.
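
For anyone who wants to script the grub.conf change described above, a minimal sketch follows. It assumes the GRUB legacy layout, where each boot entry has an indented line beginning with the word `kernel`. `add_kernel_param` is a hypothetical helper, and the sample entry uses placeholder paths rather than a real configuration.

```python
def add_kernel_param(conf_text, param="iommu=soft"):
    """Append `param` to every GRUB-legacy `kernel` line that lacks it."""
    out = []
    for line in conf_text.splitlines():
        words = line.split()
        # Only touch active kernel lines; commented-out ones start with '#'.
        if words and words[0] == "kernel" and param not in words[1:]:
            line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder entry for illustration; not copied from a real system.
sample = (
    "title Fedora (example)\n"
    "\troot (hd0,0)\n"
    "\tkernel /vmlinuz-example ro root=/dev/example rhgb quiet\n"
    "\tinitrd /initramfs-example.img\n"
)
patched = add_kernel_param(sample)
print(patched)
```

The function works on text and returns text, so the result can be inspected before anything is written back. Running it a second time leaves the file unchanged because lines that already carry the parameter are skipped. Keep a backup of grub.conf before overwriting it.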

Comment 2 Norman Gaywood 2010-07-04 01:48:25 UTC
I notice that the Red Hat 6 Beta 2 release notes list this among the known kernel problems:

Calgary IOMMU default detection has been disabled in this release. If you require Calgary IOMMU support add 'iommu=calgary' as a boot parameter.

So perhaps the new Enterprise kernel is now hitting this problem?
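
To check which IOMMU setting a machine actually booted with (iommu=soft, iommu=calgary, or none), the kernel command line can be inspected. This is a sketch under assumptions: `iommu_options` and `read_cmdline` are hypothetical helper names, and the parser assumes the x86_64 convention that `iommu=` takes a comma-separated list of options.

```python
def iommu_options(cmdline):
    """Return the values of every iommu= parameter, in order of appearance."""
    options = []
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        if key == "iommu" and sep:
            options.extend(v for v in value.split(",") if v)
    return options


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as fh:
        return fh.read()


# Example command lines for illustration; the root device is a placeholder.
print(iommu_options("ro root=/dev/example iommu=soft quiet"))
print(iommu_options("ro root=/dev/example iommu=calgary"))
print(iommu_options("ro root=/dev/example quiet"))
```

On a live system, `iommu_options(read_cmdline())` gives the options in effect for the current boot. An empty list means the kernel's default detection was used.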

Comment 3 Bug Zapper 2010-11-04 05:16:08 UTC
This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 4 Bug Zapper 2010-12-04 02:45:07 UTC
Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.