Bug 538163 - Spurious DMAR faults on integrated Intel graphics
Summary: Spurious DMAR faults on integrated Intel graphics
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 13
Hardware: All
OS: Linux
low
high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
: 573173 (view as bug list)
Depends On:
Blocks:
Reported: 2009-11-17 19:50 UTC by Tom London
Modified: 2010-06-15 20:39 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 561267 (view as bug list)
Environment:
Last Closed: 2010-03-22 20:09:50 UTC
Type: ---
Embargoed:


Attachments
/var/log/messages from run showing DMAR spew and freeze..... (276.92 KB, text/plain)
2009-11-17 19:50 UTC, Tom London
dmesg output showing DMAR DMA read errors for audio device (67.26 KB, text/plain)
2009-11-24 17:04 UTC, Tom London
dmesg output showing DMAR spew (82.06 KB, text/plain)
2009-11-27 15:21 UTC, Tom London
Photo of screen with Oops/crash (232.65 KB, image/jpeg)
2009-11-27 19:36 UTC, Tom London
Screen photo of Oops during boot (216.46 KB, image/jpeg)
2009-11-28 00:28 UTC, Tom London
Second photo of screen showing Oops (198.93 KB, image/jpeg)
2009-11-28 00:30 UTC, Tom London
output of 'intel_gtt' (58.90 KB, text/plain)
2009-12-02 14:38 UTC, Tom London
Output from 'intel_gtt' just after gnome boot.... (8.78 KB, text/plain)
2009-12-02 17:01 UTC, Tom London
Output from 'intel_gtt' 2 minutes later... (23.25 KB, text/plain)
2009-12-02 17:01 UTC, Tom London
Output of 'intel_gtt' 13 minutes later..... (29.42 KB, text/plain)
2009-12-02 17:12 UTC, Tom London
/var/log/messages for complete run until hard crash.... (329.29 KB, text/plain)
2009-12-02 17:29 UTC, Tom London
Output of 'intel_gtt' after system has been up for about 13 minutes (117.57 KB, text/plain)
2009-12-02 21:45 UTC, Tom London
dmesg output on kernel 2.6.32.1-9.fc13.x86_64, Thinkpad T400 (55.45 KB, text/plain)
2009-12-16 20:52 UTC, MartinG
dmesg with intel_iommu=igfx_off and suspend/resume cycle (71.68 KB, text/plain)
2009-12-16 21:54 UTC, MartinG

Description Tom London 2009-11-17 19:50:16 UTC
Created attachment 369946 [details]
/var/log/messages from run showing DMAR spew and freeze.....

Description of problem:
I installed and booted kernel-2.6.32-0.48.rc7.git1.fc13.x86_64 into an otherwise fc12 system: Lenovo X200:

00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
00:19.0 Ethernet controller: Intel Corporation 82567LM Gigabit Network Connection (rev 03)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M-E LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
03:00.0 Network controller: Intel Corporation PRO/Wireless 5100 AGN [Shiloh] Network Connection


System boots up just fine, and X/gdm comes up as expected.

However, the system seems "choppy", and examining /var/log/messages, I see a steady spew of the following:

Nov 17 06:05:46 tlondon kernel: DRHD: handling fault status reg 3
Nov 17 06:05:46 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 4040000
Nov 17 06:05:46 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set

I attach the complete /var/log/messages from the run.
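
For anyone triaging a long log like this one, a small script can summarise the spew instead of reading it line by line. The following is only a sketch: it assumes the fault lines keep exactly the format quoted above, and the names FAULT_RE and count_faults are made up for illustration.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   DMAR:[DMA Write] Request device [00:02.0] fault addr 4040000
# (format taken from the log excerpt above; other kernels may print it differently)
FAULT_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-f:.]+)\] fault addr ([0-9a-f]+)"
)

def count_faults(lines):
    """Count DMAR faults per (direction, device, fault address)."""
    counts = Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            direction, device, addr = m.groups()
            counts[(direction, device, int(addr, 16))] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "Nov 17 06:05:46 tlondon kernel: DRHD: handling fault status reg 3",
        "Nov 17 06:05:46 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 4040000",
        "Nov 17 06:05:46 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set",
    ]
    for (direction, device, addr), n in count_faults(sample).items():
        print(f"{n:6d}  {direction:5s}  {device}  0x{addr:x}")
```

Feeding it the lines of /var/log/messages instead of the sample list would show at a glance whether the faults cluster on one device and one address.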

The system eventually "graphically froze": cursor froze, no response to keyboard.  I eventually did a hard reset and booted up to kernel-2.6.31.5-127.fc12.x86_64.



Version-Release number of selected component (if applicable):
kernel-2.6.32-0.48.rc7.git1.fc13.x86_64

How reproducible:
Every boot......


Comment 1 Tom London 2009-11-19 17:02:52 UTC
This spew (and the general system choppiness) continues with kernel-2.6.32-0.51.rc7.git2.fc13.x86_64


Nov 19 09:00:05 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:05 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:05 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:05 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:07 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:07 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:07 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:07 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:07 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:07 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:07 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:07 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:07 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:10 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:10 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:10 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:11 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:11 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:11 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:12 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:12 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:12 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:14 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:14 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:14 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:14 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:14 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:14 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:15 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:15 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:15 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:16 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:16 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:16 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:19 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:19 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:19 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:22 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:22 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:22 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 19 09:00:23 tlondon kernel: DRHD: handling fault status reg 3
Nov 19 09:00:23 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Nov 19 09:00:23 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set

Comment 2 Adam Williamson 2009-11-23 19:03:50 UTC
can you confirm that the issue is not present with the f12 release kernel, and the messages do not display?

-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 3 Tom London 2009-11-23 19:31:08 UTC
Yes, I can confirm.

I am running kernel-2.6.31.5-127.fc12.x86_64 with only a single occurrence of such message:

[tbl@tlondon ~]$ uname -a
Linux tlondon.innopath.com 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
[tbl@tlondon ~]$ dmesg | grep DMAR
ACPI: DMAR 00000000bd406000 00120 (v01                00000001      00000000)
DMAR:Host address width 36
DMAR:DRHD base: 0x000000feb03000 flags: 0x0
DMAR:DRHD base: 0x000000feb01000 flags: 0x0
DMAR:DRHD base: 0x000000feb00000 flags: 0x0
DMAR:DRHD base: 0x000000feb02000 flags: 0x1
DMAR:RMRR base: 0x000000f2826c00 end: 0x000000f28273ff
DMAR:RMRR base: 0x000000bdc00000 end: 0x000000bfffffff
DMAR:No ATSR found
DMAR: Forcing write-buffer flush capability
DMAR:Device scope device [0000:00:03.02] not found
DMAR:Device scope device [0000:00:03.02] not found
DMAR:Device scope device [0000:00:03.03] not found
DMAR:Device scope device [0000:00:03.03] not found
DMAR:[DMA Write] Request device [00:02.0] fault addr ffffff000 
DMAR:[fault reason 05] PTE Write access is not set
[tbl@tlondon ~]$ 

Here is a bit of context from /var/log/messages for this single occurrence:

Nov 23 08:50:48 tlondon kernel: IOMMU: Setting identity map for device 0000:00:1a.2 [0xf2826c00 - 0xf2827400]
Nov 23 08:50:48 tlondon kernel: IOMMU: Setting identity map for device 0000:00:1a.7 [0xf2826c00 - 0xf2827400]
Nov 23 08:50:48 tlondon kernel: IOMMU: Prepare 0-16MiB unity mapping for LPC
Nov 23 08:50:48 tlondon kernel: IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000]
Nov 23 08:50:48 tlondon kernel: DRHD: handling fault status reg 3
Nov 23 08:50:48 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffff000
Nov 23 08:50:48 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Nov 23 08:50:48 tlondon kernel: PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
Nov 23 08:50:48 tlondon kernel: hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0
Nov 23 08:50:48 tlondon kernel: hpet0: 4 comparators, 64-bit 14.318180 MHz counter
Nov 23 08:50:48 tlondon kernel: pnp: PnP ACPI init

Comment 4 Tom London 2009-11-24 04:59:22 UTC
I downloaded/installed the latest fc12 kernel from koji, kernel-2.6.31.6-148.fc12.x86_64, and rebooted.

I also do not see the DMAR error spew with it:

[root@tlondon ~]# uname -a
Linux tlondon.innopath.com 2.6.31.6-148.fc12.x86_64 #1 SMP Mon Nov 23 19:30:10 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@tlondon ~]# dmesg | grep DMAR
ACPI: DMAR 00000000bd406000 00120 (v01                00000001      00000000)
DMAR:Host address width 36
DMAR:DRHD base: 0x000000feb03000 flags: 0x0
DMAR:DRHD base: 0x000000feb01000 flags: 0x0
DMAR:DRHD base: 0x000000feb00000 flags: 0x0
DMAR:DRHD base: 0x000000feb02000 flags: 0x1
DMAR:RMRR base: 0x000000f2826c00 end: 0x000000f28273ff
DMAR:RMRR base: 0x000000bdc00000 end: 0x000000bfffffff
DMAR:No ATSR found
DMAR: Forcing write-buffer flush capability
DMAR:Device scope device [0000:00:03.02] not found
DMAR:Device scope device [0000:00:03.02] not found
DMAR:Device scope device [0000:00:03.03] not found
DMAR:Device scope device [0000:00:03.03] not found
DMAR:[DMA Write] Request device [00:02.0] fault addr 6f6577000 
DMAR:[fault reason 05] PTE Write access is not set
[root@tlondon ~]#

Comment 5 Tom London 2009-11-24 17:04:05 UTC
Created attachment 373488 [details]
dmesg output showing DMAR DMA read errors for audio device

Uhhh...  This may be more complicated than I reported above.....

I noticed the following DMAR DMA read errors on my current boot of kernel-2.6.31.6-148.fc12.x86_64:

<<<<SNIP>>>>
DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 
DMAR:[fault reason 06] PTE Read access is not set
DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 
DMAR:[fault reason 06] PTE Read access is not set
DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 
DMAR:[fault reason 06] PTE Read access is not set
DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 
DMAR:[fault reason 06] PTE Read access is not set
DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 
DMAR:[fault reason 06] PTE Read access is not set
DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 
DMAR:[fault reason 06] PTE Read access is not set
DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 
DMAR:[fault reason 06] PTE Read access is not set
<<<<SNIP>>>>

00:1b.0 is the Intel Audio device.....:
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)

The previously reported DMAR DMA write errors were accessing 00:02.0 which is the Intel Video controller:

00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)

I attach complete output of dmesg....

Comment 6 Tom London 2009-11-27 15:21:34 UTC
Created attachment 374239 [details]
dmesg output showing DMAR spew

kernel-2.6.32-0.55.rc8.git1.fc13.x86_64 does not fix this.

I continue to get DMAR spew, system "choppiness", and ultimately graphical (system?) freeze.

I attach here the output of "dmesg" obtained just before the system froze. The system was up for about 5 minutes before freezing.

I can attach /var/log/messages output for this run if useful....

Reverting back to kernel-2.6.31.6-148.fc12.x86_64 makes the system stable (and more responsive).

Any additional info useful here?

Comment 7 Adam Williamson 2009-11-27 18:30:22 UTC
David, is this one for you?


Comment 8 David Woodhouse 2009-11-27 18:38:19 UTC
The F12 kernel has a "workaround broken graphics drivers" option enabled, so the graphics driver is exempt from having to use the DMA API correctly.

This option is removed from later kernels; the graphics drivers are expected (and believed) to be fixed. If you see this kind of error, it's a driver problem.

Current practice is to assign kernel/drm bugs to the corresponding xorg driver, right?

Comment 9 David Woodhouse 2009-11-27 18:40:40 UTC
The one with the audio driver is interesting. That's a completely separate IOMMU unit. Can you file it as a separate bug, please?

Comment 10 David Woodhouse 2009-11-27 18:41:16 UTC
Btw, booting with 'intel_iommu=igfx_off' should still allow you to bypass the IOMMU for the integrated graphics device.
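
To double-check that the option actually reached the running kernel, the command line can be inspected after boot. This is a rough sketch, not anything from the kernel tree: the function name intel_iommu_options is hypothetical, and it assumes the values after intel_iommu= are comma-separated.

```python
def intel_iommu_options(cmdline):
    """Return the values passed via intel_iommu= on a kernel command line.

    Assumption: several values may be combined with commas,
    e.g. intel_iommu=on,igfx_off.
    """
    prefix = "intel_iommu="
    options = []
    for token in cmdline.split():
        if token.startswith(prefix):
            options.extend(v for v in token[len(prefix):].split(",") if v)
    return options

if __name__ == "__main__":
    # Illustrative command line; the root= value is made up.
    sample = "ro root=/dev/mapper/vg-root rhgb quiet intel_iommu=igfx_off"
    print("igfx_off" in intel_iommu_options(sample))
```

On a live system the string to pass in would be the contents of /proc/cmdline.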

Comment 11 Tom London 2009-11-27 19:28:53 UTC
I filed BZ for audio driver issue here: https://bugzilla.redhat.com/show_bug.cgi?id=541981

Comment 12 Tom London 2009-11-27 19:36:18 UTC
Created attachment 374296 [details]
Photo of screen with Oops/crash

(In reply to comment #10)
> Btw, booting with 'intel_iommu=igfx_off' should still allow you to bypass the
> IOMMU for the integrated graphics device.  

Arghhh.....  Booting with 'intel_iommu=igfx_off' seems to bring no joy: I get a hard crash during boot (every time).

I attach a "camera photo" of screen.

Sorry, but I don't know how to better scrape the screen in such cases. (Suggestions?)

The crash appears to be at 'notifier_call_chain', ..., i915_init, ...

Anyway, 'intel_iommu=igfx_off' appears to be a non-starter......

Comment 13 David Woodhouse 2009-11-27 22:27:35 UTC
Please try again with rc8, or with 'mem=2G'. There's a fix in 2.6.32-rc8 which makes the inteldrm driver cope with RAM above the 4GiB boundary.

And if it persists, please add 'panic_on_oops' so that the interesting part of the oops doesn't scroll off the screen.

Comment 14 David Woodhouse 2009-11-27 22:53:16 UTC
These errors are probably happening because there's noise somewhere in the GATT. I suspect we're not clearing it completely. Either we're miscalculating the size of it, or it's something to do with the 4KiB of stolen memory that we don't use on GM4X.

Comment 15 Tom London 2009-11-27 23:11:36 UTC
(In reply to comment #13)
> Please try again with rc8, or with 'mem=2G'. There's a fix in 2.6.32-rc8 which
> makes the inteldrm driver cope with RAM above the 4GiB boundary.
> 
> And if it persists, please add 'panic_on_oops' so that the interesting part of
> the oops doesn't scroll off the screen.  

Already downloaded/booted kernel-2.6.32-0.55.rc8.git1.fc13.x86_64 and it seems to fail the same way.

See comments in #6 above.

I will try with 'mem=2G panic_on_oops'.....

Comment 16 Tom London 2009-11-27 23:25:18 UTC
Sorry, I wasn't clear: the crash with intel_iommu=igfx_off was with kernel-2.6.32-0.55.rc8.git1.fc13.x86_64

Comment 17 Tom London 2009-11-28 00:28:06 UTC
Created attachment 374327 [details]
Screen photo of Oops during boot

OK. The suggested options (mem=2G, panic_on_oops) did not help with kernel-2.6.32-0.55.rc8.git1.fc13.x86_64:

With 'intel_iommu=igfx_off', the system continued to Oops and die just as dracut is about to start plymouth.

Without 'intel_iommu=igfx_off', the system boots, but I continue to get the DMAR spew along with very slow/choppy performance.

However, reading the man page for dracut, I added rd_NO_PLYMOUTH to the boot line, removed a SD Flash card, and now the screen seems to stabilize with more information:

The Oops says "unable to handle NULL pointer dereference at 00000000000037".

I attach here a photo of the screen.

I will attach another below; I can't tell which is clearer....

Comment 18 Tom London 2009-11-28 00:30:53 UTC
Created attachment 374328 [details]
Second photo of screen showing Oops

Here is a second photo.....

By the way, here are the xorg packages I'm running with:

[tbl@tlondon Download]$ rpm -qa xorg-x11-server\* xorg-x11-drv-intel
xorg-x11-server-Xorg-1.7.1-7.fc12.x86_64
xorg-x11-server-debuginfo-1.7.1-7.fc12.x86_64
xorg-x11-server-utils-7.4-15.fc13.x86_64
xorg-x11-drv-intel-2.9.1-1.fc13.x86_64
xorg-x11-server-common-1.7.1-7.fc12.x86_64
xorg-x11-server-Xephyr-1.7.1-7.fc12.x86_64
[tbl@tlondon Download]$

Comment 19 David Woodhouse 2009-11-28 09:23:50 UTC
OK, the faults ought to be fixed by http://david.woodhou.se/gtt-size.patch --
but I'd like confirmation that that's correct for i915 (and G33 in particular).
Assigning to Eric for that.

The oops with igfx_off ought to be fixed by
http://david.woodhou.se/igfx-off-fix.patch; I'll test that next, now that I'm no
longer being distracted by tracking down the _real_ problem :)

Comment 20 David Woodhouse 2009-11-28 18:04:44 UTC
Hm, gtt-size.patch may not be the right approach -- I suspect we have to do it earlier, to ensure that it's done before the IOMMU is turned on. And then we might as well leave the AGP code as it is.

Maybe a PCI quirk for clearing the GTT on startup?

I've also seen something scribbling on the GTT in random places, far _above_ the number of entries we actually use. It always seems to be the _untranslated_ address of the scratch page which gets written, strangely.

[root@dyn-236 ~]# dd if=/dev/mem bs=$((0x100000)) skip=$((0xf22)) count=3 | hexdump -C | grep '01 00 80 37'
00043790  01 f0 ff ff 01 00 80 37  01 f0 ff ff 01 f0 ff ff  |.......7........|
000691b0  01 f0 ff ff 01 00 80 37  01 f0 ff ff 01 f0 ff ff  |.......7........|
001a6860  01 f0 ff ff 01 f0 ff ff  01 00 80 37 01 f0 ff ff  |...........7....|

I reverted gtt-size.patch and instead just used a memset to zero for the rest of the GTT, in intel_i915_configure(). And then iounmapped it and did another ioremap of _only_ 256KiB, in the hope that I'd then get a fault and a backtrace from whatever was doing this. No luck so far though -- it's still all zeroes. I haven't quite worked out how to reproduce the corruption reliably though; it may yet happen.
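
The grep above works on hexdump text; the same search can be sketched on raw bytes, which also reports which 32-bit slot was hit. This is only an illustration under stated assumptions: entries are taken to be 4-byte little-endian words (which is what the '01 00 80 37' byte order suggests), and find_entries is a made-up helper name.

```python
import struct

def find_entries(buf, value, entry_size=4):
    """Return byte offsets of aligned little-endian 32-bit entries equal to value."""
    needle = struct.pack("<I", value)
    return [
        off
        for off in range(0, len(buf) - entry_size + 1, entry_size)
        if buf[off:off + entry_size] == needle
    ]

if __name__ == "__main__":
    # The 16 bytes from hexdump row 00043790 shown above.
    row = bytes.fromhex("01f0ffff 01008037 01f0ffff 01f0ffff")
    print(find_entries(row, 0x37800001))
    print(find_entries(row, 0xFFFFF001))
```

Stepping in whole entries rather than searching for a byte substring avoids false hits that straddle two neighbouring entries.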

Comment 21 Tom London 2009-11-28 19:47:23 UTC
Just for completeness, I thought I would add the obvious: if I disable Vt-d in the BIOS, this goes away: I'm running kernel-2.6.32-0.56.rc8.git1.fc13.x86_64 without problem.

With Vt-d enabled in BIOS, I get above DMAR spew/system choppiness, etc.

Comment 22 Tom London 2009-12-01 15:01:40 UTC
With kernel-2.6.32-0.61.rc8.git2.fc13.x86_64, I'm getting Oopses like the following.  Are these related/connected?

The Linux kernel

The kernel package contains the Linux kernel (vmlinuz), the core of any
Linux operating system.  The kernel handles the basic functions
of the operating system: memory allocation, process allocation, device
input and output, etc.
[root@tlondon kerneloops-1259679238-1]# cat kerneloops 
------------[ cut here ]------------
WARNING: at drivers/pci/dmar.c:611 check_zero_address+0x8f/0xb3() (Not tainted)
Hardware name: 74585FU
Your BIOS is broken; DMAR reported at address zero!
BIOS vendor: LENOVO; Ver: 6DET60WW (3.10 ); Product Version: ThinkPad X200
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.32-0.61.rc8.git2.fc13.x86_64 #1
Call Trace:
[<ffffffff81058f75>] warn_slowpath_common+0x84/0x9c
[<ffffffff81488013>] ? _etext+0x0/0x1
[<ffffffff81058fe4>] warn_slowpath_fmt+0x41/0x43
[<ffffffff81a6ba1d>] check_zero_address+0x8f/0xb3
[<ffffffff81a6ba9c>] detect_intel_iommu+0x12/0x8c
[<ffffffff81a49b33>] pci_iommu_alloc+0x5e/0x6c
[<ffffffff81a587b1>] mem_init+0x19/0xec
[<ffffffff81a42c14>] start_kernel+0x21f/0x42c
[<ffffffff81a422c1>] x86_64_start_reservations+0xac/0xb0
[<ffffffff81a423bd>] x86_64_start_kernel+0xf8/0x107

and

------------[ cut here ]------------
WARNING: at arch/x86/mm/ioremap.c:149 __ioremap_caller+0x171/0x312() (Tainted: G        W )
Hardware name: 74585FU
Modules linked in:
Pid: 1, comm: swapper Tainted: G        W  2.6.32-0.61.rc8.git2.fc13.x86_64 #1
Call Trace:
[<ffffffff81058f75>] warn_slowpath_common+0x84/0x9c
[<ffffffff81058fa1>] warn_slowpath_null+0x14/0x16
[<ffffffff8103711a>] __ioremap_caller+0x171/0x312
[<ffffffff8126a11d>] ? alloc_iommu+0x1de/0x261
[<ffffffff8103739d>] ioremap_nocache+0x17/0x19
[<ffffffff8126a11d>] alloc_iommu+0x1de/0x261
[<ffffffff81a6bc43>] ? dmar_table_init+0x12d/0x376
[<ffffffff81a6bcb9>] dmar_table_init+0x1a3/0x376
[<ffffffff81a50b21>] enable_IR_x2apic+0x23/0x21d
[<ffffffff810892a5>] ? lockstat_clock+0x11/0x13
[<ffffffff81a4ec17>] native_smp_prepare_cpus+0x12d/0x365
[<ffffffff81a425cc>] kernel_init+0x75/0x26e
[<ffffffff81012daa>] child_rip+0xa/0x20
[<ffffffff81012710>] ? restore_args+0x0/0x30
[<ffffffff81a42557>] ? kernel_init+0x0/0x26e
[<ffffffff81012da0>] ? child_rip+0x0/0x20

Comment 23 David Woodhouse 2009-12-01 17:05:08 UTC
Those aren't oopses; they're just warnings. Your BIOS is broken. The system should work fine with the exception of some spurious graphics faults which I'm still working on.

Comment 24 Tom London 2009-12-01 18:18:15 UTC
OK, thanks.  Appreciate the support!

Sigh, believe this is the latest version for X200.

Let me know how to help.....

Comment 25 Adam Williamson 2009-12-01 20:20:50 UTC
tell your motherboard vendor they're an idiot :)


Comment 26 Michael Breuer 2009-12-02 00:31:50 UTC
Adding a data point... Having similar issues on a p6t deluxe V2 (and have told Asus)...

I discovered today that there is some odd relation between the DMAR errors and KMS. Also, the DMAR errors are not specific to any card as I have managed to get them on my ethernet device as well as nvidia card.

Scenarios:

VT-D enabled + nomodeset + intel_iommu=on: DMAR errors + IRQ errors
VT-D enabled + KMS + intel_iommu=on: No errors (DMAR or IRQ)
VT-D disabled - all scenarios work w/o DMAR errors (as expected)

Also, performance with VT-D disabled seems to be about 20% higher than with VT-D enabled regardless of modeset or intel_iommu settings (not that I've tried all the options yet).

I've also not verified that I can actually bring up a VM and redirect devices using VT-D.


Comment 27 David Woodhouse 2009-12-02 09:07:35 UTC
Michael, please could you show the messages you see? Not the graphics ones which I'm aware of, but any on other devices and any IRQ errors. Show the full dmesg when the system boots.

Comment 28 Michael Breuer 2009-12-02 09:17:38 UTC
Ok- won't be until late tomorrow. Have to rerun.

Comment 29 David Woodhouse 2009-12-02 11:24:24 UTC
Tom, the 2.6.32-0.63.rc8 kernel building at http://koji.fedoraproject.org/koji/taskinfo?taskID=1843047 should fix all known issues, although it may still print nasty messages about your BIOS.

But with VT-d enabled in your BIOS, everything should work fine and you shouldn't see any DMAR faults except maybe one on the graphics device when the IOMMU is first initialised. If you could confirm that, it would be very much appreciated.

Note that I am currently chasing what seems to be a hardware problem on the x200s: After some period of operation, you do start to see DMAR faults which seem to happen when a GTT entry gets modified to point at the physical address of the scratch page instead of the virtual address (which is translated by the IOMMU to a physical address).

If you see that, please could you run http://david.woodhou.se/intel_gtt and show me its output, along with the faults you see?

Comment 30 Tom London 2009-12-02 14:38:35 UTC
Created attachment 375443 [details]
output of 'intel_gtt'

First of all, thanks!

I've downloaded/installed/booted kernel-2.6.32-0.63.rc8.git2.fc13.x86_64.

Not sure this is what you are looking for, but within a few minutes of booting up to gnome, I see the following:

------------[ cut here ]------------
WARNING: at drivers/pci/dmar.c:616 check_zero_address+0x96/0x19b() (Not tainted)
Hardware name: 74585FU
Your BIOS is broken; DMAR reported at address zero!
BIOS vendor: LENOVO; Ver: 6DET60WW (3.10 ); Product Version: ThinkPad X200
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.32-0.63.rc8.git2.fc13.x86_64 #1
Call Trace:
[<ffffffff81058f75>] warn_slowpath_common+0x84/0x9c
[<ffffffff81488113>] ? _etext+0x0/0x1
[<ffffffff81058fe4>] warn_slowpath_fmt+0x41/0x43
[<ffffffff81a6ba6d>] check_zero_address+0x96/0x19b
[<ffffffff812afd6d>] ? acpi_tb_verify_table+0x57/0x5c
[<ffffffff812af373>] ? acpi_get_table_with_size+0x5a/0xb4
[<ffffffff81488113>] ? _etext+0x0/0x1
[<ffffffff81a6bb84>] detect_intel_iommu+0x12/0x8c
[<ffffffff81a49b33>] pci_iommu_alloc+0x5e/0x6c
[<ffffffff81a587b1>] mem_init+0x19/0xec
[<ffffffff81a42c14>] start_kernel+0x21f/0x42c
[<ffffffff81a422c1>] x86_64_start_reservations+0xac/0xb0
[<ffffffff81a423bd>] x86_64_start_kernel+0xf8/0x107

Running 'intel_gtt' produces lots.... attached here.

This what you are looking for?

No explicit "DMAR" messages in log yet.

Comment 31 David Woodhouse 2009-12-02 15:46:37 UTC
OK, that warning shows that your system isn't _using_ the DMAR (aka the IOMMU), because the BIOS is claiming that it lives at physical address zero... which we don't believe.

ISTR the X200s only did that for the first boot after VT-d was enabled in the BIOS. If you enable it in the BIOS and then power cycle, it should be fine after that?

And then most of the entries in the GART should have addresses like 0xf....... -- most of them will be 0xfffff001, in the same way that most of them are 0x37800001 in the output you showed.
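
To make the relation between those entry values and the faulting address concrete, here is a minimal sketch of how an entry could be split up. It assumes, purely for illustration, that bit 0 is the valid flag and bits 12-31 carry the page address; real GTT entries on this chipset may use other bits as well, and the names below are hypothetical.

```python
PAGE_MASK = 0xFFFFF000  # assumed: bits 12-31 hold the page address
VALID_BIT = 0x1         # assumed: bit 0 marks the entry as valid

def decode_entry(entry):
    """Split a 32-bit entry into (page address, valid flag)."""
    return entry & PAGE_MASK, bool(entry & VALID_BIT)

def looks_untranslated(entry, scratch_phys):
    """True if a valid entry points straight at the scratch page's physical address."""
    addr, valid = decode_entry(entry)
    return valid and addr == scratch_phys

if __name__ == "__main__":
    for entry in (0xFFFFF001, 0x37800001):
        addr, valid = decode_entry(entry)
        print(f"0x{entry:08x} -> addr 0x{addr:08x} valid={valid}")
```

Under these assumptions 0x37800001 decodes to page address 0x37800000, which is the address reported in the DMAR fault lines.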

Comment 32 Tom London 2009-12-02 15:55:25 UTC
Yeah, you're right: this was from first boot after enabling VT-d.

I'll reboot and report.

Comment 33 Tom London 2009-12-02 17:01:10 UTC
Created attachment 375493 [details]
Output from 'intel_gtt' just after gnome boot....

After a reboot, I continue to get a "slew" of DMAR messages:

Dec  2 08:57:14 tlondon kernel: DRHD: handling fault status reg 2
Dec  2 08:57:14 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Dec  2 08:57:14 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Dec  2 08:57:15 tlondon kernel: DRHD: handling fault status reg 2
Dec  2 08:57:15 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Dec  2 08:57:15 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Dec  2 08:57:17 tlondon kernel: DRHD: handling fault status reg 2
Dec  2 08:57:17 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Dec  2 08:57:17 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Dec  2 08:57:17 tlondon kernel: DRHD: handling fault status reg 2
Dec  2 08:57:17 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Dec  2 08:57:17 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Dec  2 08:57:19 tlondon kernel: DRHD: handling fault status reg 2
Dec  2 08:57:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Dec  2 08:57:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Dec  2 08:57:19 tlondon kernel: DRHD: handling fault status reg 2
Dec  2 08:57:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Dec  2 08:57:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Dec  2 08:57:19 tlondon kernel: DRHD: handling fault status reg 2
Dec  2 08:57:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Dec  2 08:57:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set

I've captured the output of 'intel_gtt' 2 times, separated by about 2 minutes.

I attach the first here and the second below.  Output as expected?

Comment 34 Tom London 2009-12-02 17:01:57 UTC
Created attachment 375495 [details]
Output from 'intel_gtt' 2 minutes later...

Comment 35 Tom London 2009-12-02 17:12:11 UTC
Created attachment 375499 [details]
Output of 'intel_gtt' 13 minutes later.....

Ran 'intel_gtt' again 13 minutes later.

Comment 36 Tom London 2009-12-02 17:29:05 UTC
Created attachment 375503 [details]
/var/log/messages for complete run until hard crash....

After about 5 or 6 more minutes (with pretty steady DMAR message spew), the system hard froze.  I had to power cycle to reboot.

Attached is /var/log/messages for that run.

I've rebooted with VT-d off.....

Comment 37 David Woodhouse 2009-12-02 17:29:33 UTC
OK, it sounds like you're seeing the hardware issue that I found on my x200s, but you can reproduce it a _lot_ faster than I can. 

This is just a stock F12 with rawhide kernel, and you're seeing this when you just boot into runlevel 5 and immediately look in /var/log/messages? Nothing else running but the terminal you use for that?

You can avoid this by using 'intel_iommu=igfx_off' for now. The IOMMU will work for everything else, and be turned off for the graphics device.

Thank you very much for confirming this suspected hardware issue exists on more than one machine. I had been trying on an HP6930p, with the same chipset, and failing to reproduce it there.
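
To confirm that a given boot really carried the workaround, the kernel command line can be inspected. The following is only a minimal sketch in Python: it assumes 'intel_iommu=' takes a comma-separated option list, the helper names are made up for illustration, and the sample command line is invented.

```python
def intel_iommu_options(cmdline):
    """Return the set of options passed via intel_iommu= on a kernel command line."""
    options = set()
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            value = token.split("=", 1)[1]
            # Assumption: several options may be joined with commas.
            options.update(opt for opt in value.split(",") if opt)
    return options


def igfx_disabled(cmdline):
    """True if the command line asks for the graphics IOMMU to be turned off."""
    return "igfx_off" in intel_iommu_options(cmdline)


# Invented sample; on a live system pass the contents of /proc/cmdline instead.
sample = "ro root=/dev/mapper/vg-root rhgb quiet intel_iommu=igfx_off"
print(igfx_disabled(sample))
```

On a running machine the same function can be fed open("/proc/cmdline").read(). The "Intel-IOMMU: disable GFX device mapping" line in dmesg is the kernel's own confirmation.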

Comment 38 Tom London 2009-12-02 17:47:38 UTC
My Thinkpad X200 is "full rawhide" from a continuously updated f12.

Here is my "recipe": I boot into compiz enabled gnome, open a terminal window, open a second terminal window (via Shift-ctrl-N), do "sudo -i; inotail -f /var/log/messages".

I can see the DMAR messages then (per the first 2 'intel_gtt' attachments).

I then try to set up my usual desktop: a qemu-kvm VM, empathy, firefox, rhythmbox.

Believe that was what I had running when I got the crash.

Again, thanks for looking into this.

I will re-enable VT-d and boot with 'intel_iommu=igfx_off' the next reboot.

Comment 39 Adam Williamson 2009-12-02 19:01:54 UTC
General point: can anyone who's going to post on this bug and say they're seeing 'similar issues' please include the exact messages from the kernel logs instead? There are various DMAR / IOMMU related error messages, and the exact messages you're seeing are significant; in all cases we need to know exactly what you see. Thanks.

-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
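
For reporters with a lot of spew, a summary of the exact fault messages may be easier to paste than the raw log. This is a rough Python sketch, assuming the two-line "DMAR:[DMA ...]" / "DMAR:[fault reason ...]" format quoted in this bug; the function name is hypothetical and other kernels may word the messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in this bug report.
FAULT_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-f:.]+)\] fault addr ([0-9a-f]+)"
)
REASON_RE = re.compile(r"DMAR:\[fault reason (\d+)\] (.+)")


def summarize_dmar_faults(lines):
    """Count faults keyed by (direction, device, fault address, reason text)."""
    counts = Counter()
    pending = None  # a fault line still waiting for its "fault reason" line
    for line in lines:
        fault = FAULT_RE.search(line)
        if fault:
            pending = fault.groups()
            continue
        reason = REASON_RE.search(line)
        if reason and pending:
            counts[pending + (reason.group(2).strip(),)] += 1
            pending = None
    return counts


if __name__ == "__main__":
    import sys
    # Usage: dmesg | python this_script.py
    for key, n in summarize_dmar_faults(sys.stdin).most_common():
        print(n, *key)
```

The summary is a supplement, not a replacement: the full dmesg output is still what the maintainers ask for.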

Comment 40 Tom London 2009-12-02 21:45:56 UTC
Created attachment 375599 [details]
Output of 'intel_gtt' after system has been up for about 13 minutes

I re-enabled VT-d, powered off, powered on and booted up (with 'intel_iommu=igfx_off').

I do not see DMAR spew.

Here are the DMAR/IOMMU messages logged in dmesg:

[tbl@tlondon ~]$ dmesg | grep -e 'DMAR\|IOMMU'
ACPI: DMAR 00000000bd406000 00120 (v01               ? 00000001      00000000)
Intel-IOMMU: disable GFX device mapping
DMAR: Host address width 36
DMAR: DRHD base: 0x000000feb03000 flags: 0x0
IOMMU feb03000: ver 1:0 cap c9008020e30260 ecap 1000
DMAR: DRHD base: 0x000000feb01000 flags: 0x0
IOMMU feb01000: ver 1:0 cap c0000020630260 ecap 1000
DMAR: DRHD base: 0x000000feb00000 flags: 0x0
IOMMU feb00000: ver 1:0 cap c0000020630270 ecap 1000
DMAR: DRHD base: 0x000000feb02000 flags: 0x1
IOMMU feb02000: ver 1:0 cap c9008020630260 ecap 1000
DMAR: RMRR base: 0x000000f2826c00 end: 0x000000f28273ff
DMAR: RMRR base: 0x000000bdc00000 end: 0x000000bfffffff
DMAR: No ATSR found
DMAR: Forcing write-buffer flush capability
DMAR: Device scope device [0000:00:03.02] not found
DMAR: Device scope device [0000:00:03.02] not found
DMAR: Device scope device [0000:00:03.03] not found
DMAR: Device scope device [0000:00:03.03] not found
IOMMU 0xfeb00000: using Register based invalidation
IOMMU 0xfeb03000: using Register based invalidation
IOMMU 0xfeb02000: using Register based invalidation
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device 0000:00:1d.0 [0xf2826c00 - 0xf2827400]
IOMMU: Setting identity map for device 0000:00:1d.1 [0xf2826c00 - 0xf2827400]
IOMMU: Setting identity map for device 0000:00:1d.2 [0xf2826c00 - 0xf2827400]
IOMMU: Setting identity map for device 0000:00:1d.7 [0xf2826c00 - 0xf2827400]
IOMMU: Setting identity map for device 0000:00:1a.0 [0xf2826c00 - 0xf2827400]
IOMMU: Setting identity map for device 0000:00:1a.1 [0xf2826c00 - 0xf2827400]
IOMMU: Setting identity map for device 0000:00:1a.2 [0xf2826c00 - 0xf2827400]
IOMMU: Setting identity map for device 0000:00:1a.7 [0xf2826c00 - 0xf2827400]
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000]
[tbl@tlondon ~]$ 

For completeness, I attach the output of 'intel_gtt' after system has been up about 8 minutes.

Comment 41 David Woodhouse 2009-12-02 21:54:25 UTC
Thanks, Adam.

Also, please note that if they're just the harmless warnings telling you that your BIOS is broken but we coped, we don't need to know. Report it to your motherboard/system manufacturer and demand a fixed BIOS.

I only want to know if you're seeing actual problems or DMA fault reports.

Comment 42 David Woodhouse 2009-12-02 21:56:35 UTC
Thanks, Tom. Your GTT shows that you are indeed not using the IOMMU for graphics -- it's got real physical addresses in it, and you shouldn't have any faults.

If you do see any, they're likely to be a driver bug in a different device driver (like the sound one where the card seemed to be attempting to read from physical address zero, which is precisely the kind of thing the IOMMU is _supposed_ to catch).

Comment 43 Tom London 2009-12-11 14:33:25 UTC
A (small) status update:

I found a new BIOS version for X200 on Lenovo site:

Dec 11 06:12:55 tlondon kernel: thinkpad_acpi: ThinkPad BIOS 6DET61WW (3.11 ), EC 7XHT24WW-1.06

[Old one was: Dec 10 06:09:02 tlondon kernel: thinkpad_acpi: ThinkPad BIOS 6DET60WW (3.10 ), EC 7XHT22WW-1.04]

I see no change in described behavior: booting kernel-2.6.32-7.fc13.x86_64 without 'intel_iommu=igfx_off' still spews:

Dec 11 06:11:46 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Dec 11 06:11:46 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
Dec 11 06:11:47 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 
Dec 11 06:11:47 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set

Booting with 'intel_iommu=igfx_off' works and squelches spew.

I've asked my IT guy to report this through Lenovo channels.  I'll continue to monitor BIOS updates.....

Comment 44 David Woodhouse 2009-12-11 16:11:04 UTC
Thanks. I think the faults you see now are a chipset bug; not necessarily Lenovo's fault. It's being investigated.

Comment 45 MartinG 2009-12-16 20:51:30 UTC
A new datapoint here: Lenovo Thinkpad T400, Fedora 12 with kernel-2.6.32.1-9.fc13.x86_64. Locks up when GDM starts: the system gets unresponsive and then totally locks up. Ctrl-alt-del doesn't work and I need to hold the power button to restart. dmesg shows:

DRHD: handling fault status reg 2
DMAR:[DMA Write] Request device [00:02.0] fault addr 1305ef000
DMAR:[fault reason 05] PTE Write access is not set

I got the full dmesg output from a virtual terminal. i915 graphics:
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)

Comment 46 MartinG 2009-12-16 20:52:44 UTC
Created attachment 378843 [details]
dmesg output on kernel 2.6.32.1-9.fc13.x86_64, Thinkpad T400

Comment 47 MartinG 2009-12-16 21:54:41 UTC
Created attachment 378856 [details]
dmesg with intel_iommu=igfx_off and suspend/resume cycle

Just wanted to add that "intel_iommu=igfx_off" makes 2.6.32.1-9.fc13.x86_64 work on my T400, with compiz and KDE 4. Nice. It also seems that https://bugzilla.redhat.com/show_bug.cgi?id=528312 is gone for me now. Great!

Full dmesg (with working suspend/resume) attached.

Comment 48 Tomasz Torcz 2010-01-17 10:07:01 UTC
2.6.32.3-24.fc12.x86_64, Lenovo ThinkPad T400 with Intel GMA4500MHD, 4 GiB RAM. Booting with VT-d enabled spews messages in dmesg:

DMAR: Device scope device [0000:00:03.02] not found
DMAR: Device scope device [0000:00:03.03] not found
DMAR: Device scope device [0000:00:03.03] not found
IOMMU 0xfeb00000: using Register based invalidation
IOMMU 0xfeb01000: using Register based invalidation
IOMMU 0xfeb03000: using Register based invalidation
IOMMU 0xfeb02000: using Register based invalidation
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device 0000:00:02.0 [0xbdc00000 - 0xc0000000]
IOMMU: Setting identity map for device 0000:00:02.1 [0xbdc00000 - 0xc0000000]
IOMMU: Setting identity map for device 0000:00:1d.0 [0xfc226c00 - 0xfc227400]
IOMMU: Setting identity map for device 0000:00:1d.1 [0xfc226c00 - 0xfc227400]
IOMMU: Setting identity map for device 0000:00:1d.2 [0xfc226c00 - 0xfc227400]
IOMMU: Setting identity map for device 0000:00:1d.7 [0xfc226c00 - 0xfc227400]
IOMMU: Setting identity map for device 0000:00:1a.0 [0xfc226c00 - 0xfc227400]
IOMMU: Setting identity map for device 0000:00:1a.1 [0xfc226c00 - 0xfc227400]
IOMMU: Setting identity map for device 0000:00:1a.2 [0xfc226c00 - 0xfc227400]
IOMMU: Setting identity map for device 0000:00:1a.7 [0xfc226c00 - 0xfc227400]
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000]
DRHD: handling fault status reg 3
DMAR:[DMA Write] Request device [00:02.0] fault addr ffffff000 
DMAR:[fault reason 05] PTE Write access is not set
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
...
DRHD: handling fault status reg 2
DMAR:[DMA Write] Request device [00:02.0] fault addr 106265000 
DMAR:[fault reason 05] PTE Write access is not set
(few times)

Followed by hard-lock within seconds after logging into GNOME.

00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)

Comment 49 David Woodhouse 2010-01-17 17:34:48 UTC
Tomasz, please boot with 'intel_iommu=igfx_off' to disable use of the IOMMU for the graphics device. We suspect a hardware issue is causing this.

We'll work out which chipsets are affected and make a blacklist so that this happens automatically... I need to whip the hardware guys harder to give me that list; they're still dragging their heels. Sorry for the delay.

Comment 50 Laurent Le Grandois 2010-02-01 19:16:52 UTC
Same problem with kernel-2.6.32.7-37.fc12.x86_64
Mesa DRI Mobile Intel® GM45 Express Chipset GEM 20091221 2009Q4

Feb  1 11:15:03 osc-llg kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000
Feb  1 11:15:03 osc-llg kernel: DMAR:[fault reason 05] PTE Write access is not set


Regards

llg

Comment 51 David Woodhouse 2010-02-02 21:03:58 UTC
To make sure this bug doesn't affect potential F-12 updates, I've committed a patch to disable the graphics IOMMU unit on all steppings of the affected chipset.

It's in 2.6.32.7-40.fc12, building at
http://koji.fedoraproject.org/koji/taskinfo?taskID=1959601

If this issue persists with that kernel, please let me know.

Comment 52 Michal Hlavinka 2010-02-03 07:58:49 UTC
I'm getting the same messages:

Feb  3 08:38:57 krles kernel: DRHD: handling fault status reg 2
Feb  3 08:38:57 krles kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 0 
Feb  3 08:38:57 krles kernel: DMAR:[fault reason 06] PTE Read access is not set

with recent kernel build kernel-2.6.32.7-40.fc12.x86_64

Request device is VGA too, but nvidia this time (using nouveau):

02:00.0 VGA compatible controller [0300]: nVidia Corporation Quadro NVS 290 [10de:042f] (rev a1)

jftr, there were no such messages with 2.6.31.* kernel

Comment 53 David Woodhouse 2010-02-03 08:14:52 UTC
Michal, that sounds like a different problem. Do you get it once at boot time (in which case it's probably a BIOS bug), or does it happen repeatedly at run time (in which case it's likely to be a nouveau bug)? Please file a separate bug providing full dmesg output and assign it to me.

Comment 54 Michal Hlavinka 2010-02-03 08:35:27 UTC
It does happen repeatedly, around 2000 lines per minute. I've filed clone bug #561267.

Comment 55 Andreas Schneider 2010-02-04 15:06:29 UTC
I've discovered the same problem on my x200. After installing the kernel from comment #51 the problem was gone.

Comment 56 Bug Zapper 2010-03-15 13:04:10 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 57 Eric Paris 2010-03-22 19:51:34 UTC
I'm also on an x200 which was giving me stutter and lockups on f13 and f14 kernels.  I just updated BIOS (3.12 so even newer BIOS than comment #43) and booted the latest f14 kernel

2.6.34-0.13.rc1.git1.fc14.x86_64

which has a message about disabling the iommu for the graphics card.  It seems to be working just fine.  So thanks for that!

I guess my question is: Is this something that BIOS might someday fix or is it just broken hardware and this quirk is the 'fix'?  Any reason I should think about this BZ ever again?

Comment 58 David Woodhouse 2010-03-22 20:09:50 UTC
This specific problem -- where your chipset is 8086:2a40 rev 07 -- isn't the fault of the BIOS; it's hardware.

It's _only_ disabling the IOMMU for the graphics card though -- all other IOMMU functionality is working.

Once the chipset folks have worked out why it's happening, there _may_ be a workaround which allows you to use the IOMMU on the graphics device again -- but I wouldn't hold your breath.

The graphics device has its own dedicated IOMMU unit, and it's all kind of smashed into one with the graphics itself. The gfx device has a GTT which lists the pages that are visible from the GPU address space, and that contains 'virtual' addresses which need to be translated through the IOMMU page tables.

For efficiency, the hardware maintains a 'shadow' GTT which contains the real physical addresses after translation -- like a huge TLB which covers the whole of the GTT.

Unfortunately, this particular chipset sometimes reads from the GTT, does the translation, then writes the translated address back to the _original_ GTT instead of to the shadow GTT. That's why you're seeing real physical addresses where you should have 'virtual DMA addresses', and you get the faults.
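
The failure mode described above can be illustrated with a toy model. This Python sketch is not how the hardware or the kernel is implemented; the class, the addresses, and the dict used as a page table are all invented, and it only shows why a translated address written back into the original GTT faults on the next translation.

```python
class ToyGfxIommu:
    """Toy model: GTT holds DMA addresses, shadow GTT holds physical ones."""

    def __init__(self, iommu_page_table, buggy=False):
        self.page_table = dict(iommu_page_table)  # DMA address -> physical address
        self.gtt = []         # what the driver wrote: 'virtual' DMA addresses
        self.shadow_gtt = []  # hardware-side cache of translated addresses
        self.buggy = buggy

    def load_gtt(self, dma_addresses):
        self.gtt = list(dma_addresses)
        self.shadow_gtt = [None] * len(self.gtt)

    def translate(self, index):
        """Translate one GTT entry; LookupError stands in for a DMAR fault."""
        addr = self.gtt[index]
        if addr not in self.page_table:
            raise LookupError("DMAR fault: no PTE for %#x" % addr)
        phys = self.page_table[addr]
        if self.buggy:
            self.gtt[index] = phys         # the bug: clobbers the original GTT
        else:
            self.shadow_gtt[index] = phys  # intended: fill the shadow GTT only
        return phys


# Made-up mapping: DMA address 0x100000 -> physical address 0xbeef000.
hw = ToyGfxIommu({0x100000: 0xbeef000}, buggy=True)
hw.load_gtt([0x100000])
hw.translate(0)       # first translation succeeds, but overwrites the GTT entry
try:
    hw.translate(0)   # now the GTT holds a physical address with no PTE
except LookupError as err:
    print(err)
```

With buggy=False the second translation succeeds and the GTT keeps its DMA address, which is the behaviour the workaround sidesteps by not using the IOMMU for graphics at all.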

Comment 59 David Woodhouse 2010-06-15 20:39:59 UTC
*** Bug 573173 has been marked as a duplicate of this bug. ***

