Created attachment 369946 [details] /var/log/messages from run showing DMAR spew and freeze..... Description of problem: I installed and booted kernel-2.6.32-0.48.rc7.git1.fc13.x86_64 into an otherwise fc12 system: Lenovo X200: 00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07) 00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) 00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) 00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07) 00:19.0 Ethernet controller: Intel Corporation 82567LM Gigabit Network Connection (rev 03) 00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03) 00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03) 00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03) 00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03) 00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03) 00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03) 00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03) 00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03) 00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03) 00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03) 00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03) 00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03) 00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93) 00:1f.0 ISA bridge: Intel Corporation ICH9M-E LPC Interface Controller (rev 03) 00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03) 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03) 03:00.0 Network controller: Intel Corporation PRO/Wireless 5100 AGN [Shiloh] Network Connection System boots up just fine, and X/gdm comes up as expected. However, the system seems "choppy", and examining /var/log/messages, I see a steady spew of the following: Nov 17 06:05:46 tlondon kernel: DRHD: handling fault status reg 3 Nov 17 06:05:46 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 4040000 Nov 17 06:05:46 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set I attach the complete /var/log/messages from the run. The system eventually "graphically froze": cursor froze, no response to keyboard. I eventually did a hard reset and booted up to kernel-2.6.31.5-127.fc12.x86_64. Version-Release number of selected component (if applicable): kernel-2.6.32-0.48.rc7.git1.fc13.x86_64 How reproducible: Every boot...... Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
This spew (and the general system choppiness) continues with kernel-2.6.32-0.51.rc7.git2.fc13.x86_64 Nov 19 09:00:05 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:05 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:05 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:05 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:07 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:07 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:07 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:07 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:07 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:07 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:07 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:07 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:07 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:10 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:10 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:10 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:11 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:11 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:11 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:12 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:12 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:12 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:14 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:14 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:14 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:14 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:14 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:14 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:15 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:15 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:15 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:16 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:16 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:16 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:19 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:19 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:19 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:22 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:22 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:22 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 19 09:00:23 tlondon kernel: DRHD: handling fault status reg 3 Nov 19 09:00:23 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Nov 19 09:00:23 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set
can you confirm that the issue is not present with the f12 release kernel, and the messages do not display? -- Fedora Bugzappers volunteer triage team https://fedoraproject.org/wiki/BugZappers
Yes, I can firm. I am running kernel-2.6.31.5-127.fc12.x86_64 with only a single occurrence of such message: [tbl@tlondon ~]$ uname -a Linux tlondon.innopath.com 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux [tbl@tlondon ~]$ dmesg | grep DMAR ACPI: DMAR 00000000bd406000 00120 (v01 00000001 00000000) DMAR:Host address width 36 DMAR:DRHD base: 0x000000feb03000 flags: 0x0 DMAR:DRHD base: 0x000000feb01000 flags: 0x0 DMAR:DRHD base: 0x000000feb00000 flags: 0x0 DMAR:DRHD base: 0x000000feb02000 flags: 0x1 DMAR:RMRR base: 0x000000f2826c00 end: 0x000000f28273ff DMAR:RMRR base: 0x000000bdc00000 end: 0x000000bfffffff DMAR:No ATSR found DMAR: Forcing write-buffer flush capability DMAR:Device scope device [0000:00:03.02] not found DMAR:Device scope device [0000:00:03.02] not found DMAR:Device scope device [0000:00:03.03] not found DMAR:Device scope device [0000:00:03.03] not found DMAR:[DMA Write] Request device [00:02.0] fault addr ffffff000 DMAR:[fault reason 05] PTE Write access is not set [tbl@tlondon ~]$ Here is a bit of context from /var/log/messages for this single occurrence: Nov 23 08:50:48 tlondon kernel: IOMMU: Setting identity map for device 0000:00:1a.2 [0xf2826c00 - 0xf2827400] Nov 23 08:50:48 tlondon kernel: IOMMU: Setting identity map for device 0000:00:1a.7 [0xf2826c00 - 0xf2827400] Nov 23 08:50:48 tlondon kernel: IOMMU: Prepare 0-16MiB unity mapping for LPC Nov 23 08:50:48 tlondon kernel: IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000] Nov 23 08:50:48 tlondon kernel: DRHD: handling fault status reg 3 Nov 23 08:50:48 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffff000 Nov 23 08:50:48 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Nov 23 08:50:48 tlondon kernel: PCI-DMA: Intel(R) Virtualization Technology for Directed I/O Nov 23 08:50:48 tlondon kernel: hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0 Nov 23 08:50:48 tlondon kernel: hpet0: 4 comparators, 64-bit 14.318180 MHz counter Nov 23 08:50:48 tlondon kernel: pnp: PnP ACPI init
I downloaded/installed the latest fc12 kernel from koji, kernel-2.6.31.6-148.fc12.x86_64, and rebooted. I also do not see the DMAR error spew with it: [root@tlondon ~]# uname -a Linux tlondon.innopath.com 2.6.31.6-148.fc12.x86_64 #1 SMP Mon Nov 23 19:30:10 EST 2009 x86_64 x86_64 x86_64 GNU/Linux [root@tlondon ~]# dmesg | grep DMAR ACPI: DMAR 00000000bd406000 00120 (v01 00000001 00000000) DMAR:Host address width 36 DMAR:DRHD base: 0x000000feb03000 flags: 0x0 DMAR:DRHD base: 0x000000feb01000 flags: 0x0 DMAR:DRHD base: 0x000000feb00000 flags: 0x0 DMAR:DRHD base: 0x000000feb02000 flags: 0x1 DMAR:RMRR base: 0x000000f2826c00 end: 0x000000f28273ff DMAR:RMRR base: 0x000000bdc00000 end: 0x000000bfffffff DMAR:No ATSR found DMAR: Forcing write-buffer flush capability DMAR:Device scope device [0000:00:03.02] not found DMAR:Device scope device [0000:00:03.02] not found DMAR:Device scope device [0000:00:03.03] not found DMAR:Device scope device [0000:00:03.03] not found DMAR:[DMA Write] Request device [00:02.0] fault addr 6f6577000 DMAR:[fault reason 05] PTE Write access is not set [root@tlondon ~]#
Created attachment 373488 [details] dmesg output showing DMAR DMA read errors for audio device Uhhh... This may be more complicated that I reported above..... I noticed the following DMAR DMA read errors on my current boot of kernel-2.6.31.6-148.fc12.x86_64: <<<<SNIP>>>> DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 DMAR:[fault reason 06] PTE Read access is not set DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 DMAR:[fault reason 06] PTE Read access is not set DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 DMAR:[fault reason 06] PTE Read access is not set DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 DMAR:[fault reason 06] PTE Read access is not set DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 DMAR:[fault reason 06] PTE Read access is not set DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 DMAR:[fault reason 06] PTE Read access is not set DMAR:[DMA Read] Request device [00:1b.0] fault addr 0 DMAR:[fault reason 06] PTE Read access is not set <<<<SNIP>>>> 00:1b.0 is the Intel Audio device.....: 00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03) The previously reported DMAR DMA write errors were accessing 00:02.0 which is the Intel Video controller: 00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) I attach complete output of dmesg....
Created attachment 374239 [details] dmesg output showing DMAR spew kernel-2.6.32-0.55.rc8.git1.fc13.x86_64 does not fix this. I continue to get DMAR spew, system "choppiness", and ultimately graphical (system?) freeze. I attach here the output of "dmesg" obtained just before the system froze. The system was up for about 5 minutes before freezing. I can attach /var/log/messages output for this run if useful.... Reverting back to kernel-2.6.31.6-148.fc12.x86_64 makes the system stable (and more responsive). Any additional info useful here?
David, is this one for you? -- Fedora Bugzappers volunteer triage team https://fedoraproject.org/wiki/BugZappers
The F12 kernel has a "workaround broken graphics drivers" option enabled, so the graphics driver is exempt from having to use the DMA API correctly. This option is removed from later kernels; the graphics drivers are expected (and believed) to be fixed. If you see this kind of error, it's a driver problem. Current practice is to assign kernel/drm bugs to the corresponding xorg driver, right?
The one with the audio driver is interesting. That's a completely separate IOMMU unit. Can you file it as a separate bug, please?
Btw, booting with 'intel_iommu=igfx_off' should still allow you to bypass the IOMMU for the integrated graphics device.
I filed BZ for audio driver issue here: https://bugzilla.redhat.com/show_bug.cgi?id=541981
Created attachment 374296 [details] Photo of screen with Oops/crash (In reply to comment #10) > Btw, booting with 'intel_iommu=igfx_off' should still allow you to bypass the > IOMMU for the integrated graphics device. Arghhh..... Booting with 'intel_iommu=igfx_off' seems to bring no joy: I get a hard crash during boot (every time). I attach a "camera photo" of screen. Sorry, but I don't know how to better scrape the screen in such cases. (Suggestions?) The crash appears to be at 'notifier_call_chain', ..., i915_init, ... Anyway, 'intel_iommu=igfx_off' appears to be a non-starter......
Please try again with rc8, or with 'mem=2G'. There's a fix in 2.6.32-rc8 which makes the inteldrm driver cope with RAM above the 4GiB boundary. And if it persists, please add 'panic_on_oops' so that the interesting part of the oops doesn't scroll off the screen.
These errors are probably happening because there's noise somewhere in the GATT. I suspect we're not clearing it completely. Either we're miscalculating the size of it, or it's something to do with the 4KiB of stolen memory that we don't use on GM4X.
(In reply to comment #13) > Please try again with rc8, or with 'mem=2G'. There's a fix in 2.6.32-rc8 which > makes the inteldrm driver cope with RAM above the 4GiB boundary. > > And if it persists, please add 'panic_on_oops' so that the interesting part of > the oops doesn't scroll off the screen. Already downloaded/booted kernel-2.6.32-0.55.rc8.git1.fc13.x86_64 and it seems to fail the same way. See comments in #6 above. I will try with 'mem=2G panic_on_oops'.....
Sorry, I wasn't clear: the crash with intel_iommu=igfx_off was with kernel-2.6.32-0.55.rc8.git1.fc13.x86_64
Created attachment 374327 [details] Screen photo of Oops during boot OK. The suggested options (mem=2G, panic_on_oops) did not help with kernel-2.6.32-0.55.rc8.git1.fc13.x86_64: With 'intel_iommu=igfx_off', the system continued to Oops and die just as dracut is about to start plymouth. Without 'intel_iommu=igfx_off', the system boots, but I continue to get the DMAR spew along with very slow/choppy performance. However, reading the man page for dracut, I added rd_NO_PLYMOUTH to the boot line, removed a SD Flash card, and now the screen seems to stabilize with more information: The Oops says "unable to handle NULL pointer dereference at 00000000000037". I attach here a photo of the screen. I will attach another below; I can't tell which is clearer....
Created attachment 374328 [details] Second photo of screen showing Oops Here is a second photo..... By the way, here are the xorg packages I'm running with: [tbl@tlondon Download]$ rpm -qa xorg-x11-server\* xorg-x11-drv-intel xorg-x11-server-Xorg-1.7.1-7.fc12.x86_64 xorg-x11-server-debuginfo-1.7.1-7.fc12.x86_64 xorg-x11-server-utils-7.4-15.fc13.x86_64 xorg-x11-drv-intel-2.9.1-1.fc13.x86_64 xorg-x11-server-common-1.7.1-7.fc12.x86_64 xorg-x11-server-Xephyr-1.7.1-7.fc12.x86_64 [tbl@tlondon Download]$
OK, the faults ought to be fixed by http://david.woodhou.se/gtt-size.patch -- but I'd like confirmation that that's correct for i915 (and G33 in particular). Assigning to Eric for that. The oops with igfx_off ought to be fixed by http://david.woodhou.se/igfx-off-fix.patch; I'll test that next, now that I'm no longer being distracted by tracking down the _real_ problem :)
Hm, gtt-size.patch may not be the right approach -- I suspect we have to do it earlier, to ensure that it's done before the IOMMU is turned on. And then we might as well leave the AGP code as it is. Maybe a PCI quirk for clearing the GTT on startup? I've also seen something scribbling on the GTT in random places, far _above_ the number of entries we actually use. It always seems to be the _untranslated_ address of the scratch page which gets written, strangely. [root@dyn-236 ~]# dd if=/dev/mem bs=$((0x100000)) skip=$((0xf22)) count=3 | hexdump -C | grep '01 00 80 37' 00043790 01 f0 ff ff 01 00 80 37 01 f0 ff ff 01 f0 ff ff |.......7........| 000691b0 01 f0 ff ff 01 00 80 37 01 f0 ff ff 01 f0 ff ff |.......7........| 001a6860 01 f0 ff ff 01 f0 ff ff 01 00 80 37 01 f0 ff ff |...........7....| I reverted gtt-size.patch and instead just used a memset to zero for the rest of the GTT, in intel_i915_configure(). And then iounmapped it and did another ioremap of _only_ 256KiB, in the hope that I'd then get a fault and a backtrace from whatever was doing this. No luck so far though -- it's still all zeroes. I haven't quite worked out how to reproduce the corruption reliably though; it may yet happen.
Just for completeness, I thought I would add the obvious: if I disable Vt-d in the BIOS, this goes away: I'm running kernel-2.6.32-0.56.rc8.git1.fc13.x86_64 without problem. With Vt-d enabled in BIOS, I get above DMAR spew/system choppiness, etc.
With kernel-2.6.32-0.61.rc8.git2.fc13.x86_64, I'm getting Oops like the following. These related/connected? The Linux kernel The kernel package contains the Linux kernel (vmlinuz), the core of any Linux operating system. The kernel handles the basic functions of the operating system: memory allocation, process allocation, device input and output, etc.[root@tlondon kerneloops-1259679238-1]# cat kerneloops ------------[ cut here ]------------ WARNING: at drivers/pci/dmar.c:611 check_zero_address+0x8f/0xb3() (Not tainted) Hardware name: 74585FU Your BIOS is broken; DMAR reported at address zero! BIOS vendor: LENOVO; Ver: 6DET60WW (3.10 ); Product Version: ThinkPad X200 Modules linked in: Pid: 0, comm: swapper Not tainted 2.6.32-0.61.rc8.git2.fc13.x86_64 #1 Call Trace: [<ffffffff81058f75>] warn_slowpath_common+0x84/0x9c [<ffffffff81488013>] ? _etext+0x0/0x1 [<ffffffff81058fe4>] warn_slowpath_fmt+0x41/0x43 [<ffffffff81a6ba1d>] check_zero_address+0x8f/0xb3 [<ffffffff81a6ba9c>] detect_intel_iommu+0x12/0x8c [<ffffffff81a49b33>] pci_iommu_alloc+0x5e/0x6c [<ffffffff81a587b1>] mem_init+0x19/0xec [<ffffffff81a42c14>] start_kernel+0x21f/0x42c [<ffffffff81a422c1>] x86_64_start_reservations+0xac/0xb0 [<ffffffff81a423bd>] x86_64_start_kernel+0xf8/0x107 and ------------[ cut here ]------------ WARNING: at arch/x86/mm/ioremap.c:149 __ioremap_caller+0x171/0x312() (Tainted: G W ) Hardware name: 74585FU Modules linked in: Pid: 1, comm: swapper Tainted: G W 2.6.32-0.61.rc8.git2.fc13.x86_64 #1 Call Trace: [<ffffffff81058f75>] warn_slowpath_common+0x84/0x9c [<ffffffff81058fa1>] warn_slowpath_null+0x14/0x16 [<ffffffff8103711a>] __ioremap_caller+0x171/0x312 [<ffffffff8126a11d>] ? alloc_iommu+0x1de/0x261 [<ffffffff8103739d>] ioremap_nocache+0x17/0x19 [<ffffffff8126a11d>] alloc_iommu+0x1de/0x261 [<ffffffff81a6bc43>] ? dmar_table_init+0x12d/0x376 [<ffffffff81a6bcb9>] dmar_table_init+0x1a3/0x376 [<ffffffff81a50b21>] enable_IR_x2apic+0x23/0x21d [<ffffffff810892a5>] ? lockstat_clock+0x11/0x13 [<ffffffff81a4ec17>] native_smp_prepare_cpus+0x12d/0x365 [<ffffffff81a425cc>] kernel_init+0x75/0x26e [<ffffffff81012daa>] child_rip+0xa/0x20 [<ffffffff81012710>] ? restore_args+0x0/0x30 [<ffffffff81a42557>] ? kernel_init+0x0/0x26e [<ffffffff81012da0>] ? child_rip+0x0/0x20
Those aren't oopses; they're just warnings. Your BIOS is broken. The system should work fine with the exception of some spurious graphics faults which I'm still working on.
OK, thanks. Appreciate the support! Sigh, believe this is the latest version for X200. Let me know how to help.....
tell your motherboard vendor they're an idiot :) -- Fedora Bugzappers volunteer triage team https://fedoraproject.org/wiki/BugZappers
Adding a data point... Having similar issues on a p6t deluxe V2 (and have told Asus)... I discovered today that there is some odd relation between the DMAR errors and KMS. Also, the DMAR errors are not specific to any card as I have managed to get them on my ethernet device as well as nvidia card. Scenerios: VT-D enabled + nomodeset + intel_iommu=on: DMAR errors + IRQ errors VT-D enabled + KMS + intel_iommu=on: No errors (DMAR or IRQ) VD-D disabled - all scenarios work w/o DMAR errors (as expected) Also, performance with VT-D disabled seems to be about 20% higher than with VT-D enabled regardless of modeset or intel_iommu settings (not that I've tried all the options yet). I've also not verified that I can actually bring up a VM and redirect devices using VT-D. -- Fedora Bugzappers volunteer triage team https://fedoraproject.org/wiki/BugZappers
Michael, please could you show the messages you see? Not the graphics ones which I'm aware of, but any on other devices and any IRQ errors. Show the full dmesg when the system boots.
Ok- won't be until late tomorrow. Have to rerun.
Tom, the 2.6.32-0.63.rc8 kernel building at http://koji.fedoraproject.org/koji/taskinfo?taskID=1843047 should fix all known issues, although it may still print nasty messages about your BIOS. But with VT-d enabled in your BIOS, everything should work fine and you shouldn't see any DMAR faults except maybe one on the graphics device when the IOMMU is first initialised. If you could confirm that, it would be very much appreciated. Note that I am currently chasing what seems to be a hardware problem on the x200s: After some period of operation, you do start to see DMAR faults which seem to happen when a GTT entry get modified to point at the physical address of the scratch page instead of the virtual address (which is translated by the IOMMU to a physical address). If you see that, please could you run http://david.woodhou.se/intel_gtt and show me its output, along with the faults you see?
Created attachment 375443 [details] output of 'intel_gtt' First of all, thanks! I've downloaded/installed/booted kernel-2.6.32-0.63.rc8.git2.fc13.x86_64. Not sure this is what you are looking for, but within a few minutes of booting up to gnome, I see the following: ------------[ cut here ]------------ WARNING: at drivers/pci/dmar.c:616 check_zero_address+0x96/0x19b() (Not tainted) Hardware name: 74585FU Your BIOS is broken; DMAR reported at address zero! BIOS vendor: LENOVO; Ver: 6DET60WW (3.10 ); Product Version: ThinkPad X200 Modules linked in: Pid: 0, comm: swapper Not tainted 2.6.32-0.63.rc8.git2.fc13.x86_64 #1 Call Trace: [<ffffffff81058f75>] warn_slowpath_common+0x84/0x9c [<ffffffff81488113>] ? _etext+0x0/0x1 [<ffffffff81058fe4>] warn_slowpath_fmt+0x41/0x43 [<ffffffff81a6ba6d>] check_zero_address+0x96/0x19b [<ffffffff812afd6d>] ? acpi_tb_verify_table+0x57/0x5c [<ffffffff812af373>] ? acpi_get_table_with_size+0x5a/0xb4 [<ffffffff81488113>] ? _etext+0x0/0x1 [<ffffffff81a6bb84>] detect_intel_iommu+0x12/0x8c [<ffffffff81a49b33>] pci_iommu_alloc+0x5e/0x6c [<ffffffff81a587b1>] mem_init+0x19/0xec [<ffffffff81a42c14>] start_kernel+0x21f/0x42c [<ffffffff81a422c1>] x86_64_start_reservations+0xac/0xb0 [<ffffffff81a423bd>] x86_64_start_kernel+0xf8/0x107 Running 'intel_gtt' produces lots.... attached here. This what you are looking for? No explicit "DMAR" messages in log yet.
OK, that warning shows that your system isn't _using_ the DMAR (aka the IOMMU), because the BIOS is claiming that it lives at physical address zero... which we don't believe. ISTR the X200s only did that for the first boot after VT-d was enabled in the BIOS. If you enable it in the BIOS and then power cycle, it should be fine after that? And then most of the entries in the GART should have addresses like 0xf....... -- most of them will be 0xfffff001, in the same way that most of them are 0x37800001 in the output you showed.
Yeah, you're right: this was from first boot after enabling VT-d. I'll reboot and report.
Created attachment 375493 [details] Output from 'intel_gtt' just after gnome boot.... After a reboot, I continue to get a "slew" of DMAR messages: Dec 2 08:57:14 tlondon kernel: DRHD: handling fault status reg 2 Dec 2 08:57:14 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Dec 2 08:57:14 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Dec 2 08:57:15 tlondon kernel: DRHD: handling fault status reg 2 Dec 2 08:57:15 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Dec 2 08:57:15 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Dec 2 08:57:17 tlondon kernel: DRHD: handling fault status reg 2 Dec 2 08:57:17 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Dec 2 08:57:17 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Dec 2 08:57:17 tlondon kernel: DRHD: handling fault status reg 2 Dec 2 08:57:17 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Dec 2 08:57:17 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Dec 2 08:57:19 tlondon kernel: DRHD: handling fault status reg 2 Dec 2 08:57:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Dec 2 08:57:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Dec 2 08:57:19 tlondon kernel: DRHD: handling fault status reg 2 Dec 2 08:57:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Dec 2 08:57:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Dec 2 08:57:19 tlondon kernel: DRHD: handling fault status reg 2 Dec 2 08:57:19 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Dec 2 08:57:19 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set I've captured the output of 'intel_gtt' 2 times, separated by about 2 minutes. I attach the first here and the second below. Output as expected?
Created attachment 375495 [details] Output from 'intel_gtt' 2 minutes later...
Created attachment 375499 [details] Output of 'intel_gtt' 13 minutes later..... Ran 'intel_gtt' again 13 minute later.
Created attachment 375503 [details] /var/log/messages for complete run until hard crash.... After about 5 or 6 more minutes (with pretty steady DMAR message spew), the system hard froze. I had to power cycle to reboot. Attached is /var/log/messages for that run. I've rebooted with VT-d off.....
OK, it sounds like you're seeing the hardware issue that I found on my x200s, but you can reproduce it a _lot_ faster than I can. This is just a stock F12 with rawhide kernel, and you're seeing this when you just boot into runlevel 5 and immediately look in /var/log/messages? Nothing else running but the terminal you use for that? You can avoid this by using 'intel_iommu=igfx_off' for now. The IOMMU will work for everything else, and be turned off for the graphics device. Thank you very much for confirming this suspected hardware issue exists on more than one machine. I had been trying on an HP6930p, with the same chipset, and failing to reproduce it there.
My Thinkpad X200 is "full rawhide" from a continuously updated f12. Here is my "recipe": I boot into compiz enabled gnome, open a terminal window, open a second terminal window (via Shift-ctrl-N), do "sudo -i; inotail -f /var/log/messages". I can see the DMAR messages then (per the first 2 'intel_gtt' attachments). I then try to set up my usual desktop: a qemu-kvm VM, empathy, firefox, rhythmbox. Believe that was what I had running when I got the crash. Again, thanks for looking in to this. I will re-enable VT-d and boot with 'intel_iommu=igfx_off' the next reboot.
General point: can anyone who's going to post on this bug and say they're seeing 'similar issues' please just include the exact messages from the kernel logs instead. There are various DMAR / IOMMU related error messages and the exact messages you're seeing are significant, in all cases we need to know the exact messages you see. thanks. -- Fedora Bugzappers volunteer triage team https://fedoraproject.org/wiki/BugZappers
Created attachment 375599 [details] Output of 'intel_gtt' after system has been up for about 13 minutes I re-enabled VT-d, powered off, powered on and boot up (with 'intel_iommu=igfx_off'). I do not see DMAR spew. Here are the DMAR/IOMMU messages logged in dmesg: [tbl@tlondon ~]$ dmesg | grep -e 'DMAR\|IOMMU' ACPI: DMAR 00000000bd406000 00120 (v01 ? 00000001 00000000) Intel-IOMMU: disable GFX device mapping DMAR: Host address width 36 DMAR: DRHD base: 0x000000feb03000 flags: 0x0 IOMMU feb03000: ver 1:0 cap c9008020e30260 ecap 1000 DMAR: DRHD base: 0x000000feb01000 flags: 0x0 IOMMU feb01000: ver 1:0 cap c0000020630260 ecap 1000 DMAR: DRHD base: 0x000000feb00000 flags: 0x0 IOMMU feb00000: ver 1:0 cap c0000020630270 ecap 1000 DMAR: DRHD base: 0x000000feb02000 flags: 0x1 IOMMU feb02000: ver 1:0 cap c9008020630260 ecap 1000 DMAR: RMRR base: 0x000000f2826c00 end: 0x000000f28273ff DMAR: RMRR base: 0x000000bdc00000 end: 0x000000bfffffff DMAR: No ATSR found DMAR: Forcing write-buffer flush capability DMAR: Device scope device [0000:00:03.02] not found DMAR: Device scope device [0000:00:03.02] not found DMAR: Device scope device [0000:00:03.03] not found DMAR: Device scope device [0000:00:03.03] not found IOMMU 0xfeb00000: using Register based invalidation IOMMU 0xfeb03000: using Register based invalidation IOMMU 0xfeb02000: using Register based invalidation IOMMU: Setting RMRR: IOMMU: Setting identity map for device 0000:00:1d.0 [0xf2826c00 - 0xf2827400] IOMMU: Setting identity map for device 0000:00:1d.1 [0xf2826c00 - 0xf2827400] IOMMU: Setting identity map for device 0000:00:1d.2 [0xf2826c00 - 0xf2827400] IOMMU: Setting identity map for device 0000:00:1d.7 [0xf2826c00 - 0xf2827400] IOMMU: Setting identity map for device 0000:00:1a.0 [0xf2826c00 - 0xf2827400] IOMMU: Setting identity map for device 0000:00:1a.1 [0xf2826c00 - 0xf2827400] IOMMU: Setting identity map for device 0000:00:1a.2 [0xf2826c00 - 0xf2827400] IOMMU: Setting identity map for device 0000:00:1a.7 [0xf2826c00 - 0xf2827400] IOMMU: Prepare 0-16MiB unity mapping for LPC IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000] [tbl@tlondon ~]$ For completeness, I attach the output of 'intel_gtt' after system has been up about 8 minutes.
Thanks, Adam. Also, please note that if they're just the harmless warnings telling you that your BIOS is broken but we coped, we don't need to know. Report it to your motherboard/system manufacturer and demand a fixed BIOS. I only want to know if you're seeing actual problems or DMA fault reports.
Thanks, Tom. Your GTT shows that you are indeed not using the IOMMU for graphics -- it's got real physical addresses in it, and you shouldn't have any faults. If you do see any, they're likely to be a driver bug in a different device driver (like the sound one where the card seemed to be attempting to read from physical address zero, which is precisely the kind of thing the IOMMU is _supposed_ to catch).
A (small) status update: I found a new BIOS version for X200 on Lenovo site: Dec 11 06:12:55 tlondon kernel: thinkpad_acpi: ThinkPad BIOS 6DET61WW (3.11 ), EC 7XHT24WW-1.06 [Old one was: Dec 10 06:09:02 tlondon kernel: thinkpad_acpi: ThinkPad BIOS 6DET60WW (3.10 ), EC 7XHT22WW-1.04] I see no change in described behavior: booting kernel-2.6.32-7.fc13.x86_64 without 'intel_iommu=igfx_off' still spews: Dec 11 06:11:46 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Dec 11 06:11:46 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Dec 11 06:11:47 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Dec 11 06:11:47 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Booting with 'intel_iommu=igfx_off' works and squelches spew. I've asked my IT guy to report this through Lenovo channels. I'll continue to monitor BIOS updates.....
Thanks. I think the faults you see now are a chipset bug; not necessarily Lenovo's fault. It's being investigated.
A new datapoint here: Lenovo Thinkpad T400, Fedora 12 with kernel-2.6.32.1-9.fc13.x86_64. Locks up when GDM starts, the system get unresponsive and then totally locks up. Ctrl-alt-del doesn't work and I need to hold the powerbutton to restart. dmesg shows: DRHD: handling fault status reg 2 DMAR:[DMA Write] Request device [00:02.0] fault addr 1305ef000 DMAR:[fault reason 05] PTE Write access is not set I got the full dmesg output from a virtual terminal. i915 graphics: 00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
Created attachment 378843 [details] dmesg output on kernel 2.6.32.1-9.fc13.x86_64, Thinkpad T400
Created attachment 378856 [details] dmesg with intel_iommu=igfx_off and suspend/resume cycle Just wanted to add that "intel_iommu=igfx_off" makes 2.6.32.1-9.fc13.x86_64 work on my T400, with compiz and KDE 4. Nice. It also seems that https://bugzilla.redhat.com/show_bug.cgi?id=528312 is gone for me now. Great! Full dmesg (with working suspend/resume) attached.
2.6.32.3-24.fc12.x86_64, Lenovo ThinkPad T400 with Intel GMA4500MHD, 4 GiB RAM. Booting with VT-d enabled spews messages in dmesg: DMAR: Device scope device [0000:00:03.02] not found DMAR: Device scope device [0000:00:03.03] not found DMAR: Device scope device [0000:00:03.03] not found IOMMU 0xfeb00000: using Register based invalidation IOMMU 0xfeb01000: using Register based invalidation IOMMU 0xfeb03000: using Register based invalidation IOMMU 0xfeb02000: using Register based invalidation IOMMU: Setting RMRR: IOMMU: Setting identity map for device 0000:00:02.0 [0xbdc00000 - 0xc0000000] IOMMU: Setting identity map for device 0000:00:02.1 [0xbdc00000 - 0xc0000000] IOMMU: Setting identity map for device 0000:00:1d.0 [0xfc226c00 - 0xfc227400] IOMMU: Setting identity map for device 0000:00:1d.1 [0xfc226c00 - 0xfc227400] IOMMU: Setting identity map for device 0000:00:1d.2 [0xfc226c00 - 0xfc227400] IOMMU: Setting identity map for device 0000:00:1d.7 [0xfc226c00 - 0xfc227400] IOMMU: Setting identity map for device 0000:00:1a.0 [0xfc226c00 - 0xfc227400] IOMMU: Setting identity map for device 0000:00:1a.1 [0xfc226c00 - 0xfc227400] IOMMU: Setting identity map for device 0000:00:1a.2 [0xfc226c00 - 0xfc227400] IOMMU: Setting identity map for device 0000:00:1a.7 [0xfc226c00 - 0xfc227400] IOMMU: Prepare 0-16MiB unity mapping for LPC IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000] DRHD: handling fault status reg 3 DMAR:[DMA Write] Request device [00:02.0] fault addr ffffff000 DMAR:[fault reason 05] PTE Write access is not set PCI-DMA: Intel(R) Virtualization Technology for Directed I/O ... DRHD: handling fault status reg 2 DMAR:[DMA Write] Request device [00:02.0] fault addr 106265000 DMAR:[fault reason 05] PTE Write access is not set (few times) Followed by hard-lock within seconds after logging into GNOME. 00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
Tomasz, please boot with 'intel_iommu=igfx_off' to disable use of the IOMMU for the graphics device. We suspect a hardware issue is causing this. We'll work out which chipsets are affected and make a blacklist so that this happens automatically... I need to whip the hardware guys harder to give me that list; they're still dragging their heels. Sorry for the delay.
Same problem with kernel-2.6.32.7-37.fc12.x86_64 Mesa DRI Mobile Intel® GM45 Express Chipset GEM 20091221 2009Q4 Feb 1 11:15:03 osc-llg kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 37800000 Feb 1 11:15:03 osc-llg kernel: DMAR:[fault reason 05] PTE Write access is not set Regards llg
To make sure this bug doesn't affect potential F-12 updates, I've committed a patch to disable the graphics IOMMU unit on all steppings of the affected chipset. It's in 2.6.32.7-40.fc12, building at http://koji.fedoraproject.org/koji/taskinfo?taskID=1959601 If this issue persists with that kernel, please let me know.
I'm getting the same messages: Feb 3 08:38:57 krles kernel: DRHD: handling fault status reg 2 Feb 3 08:38:57 krles kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 0 Feb 3 08:38:57 krles kernel: DMAR:[fault reason 06] PTE Read access is not set with recent kernel build kernel-2.6.32.7-40.fc12.x86_64 Request device is VGA too, but nvidia this time (using nouveau): 02:00.0 VGA compatible controller [0300]: nVidia Corporation Quadro NVS 290 [10de:042f] (rev a1) jftr, there were no such messages with 2.6.31.* kernel
Michal, that sounds like a different problem. Do you get it once at boot time (in which case it's probably a BIOS bug), or does it happen repeatedly at run time (in which case it's likely to be a nouveau bug)? Please file a separate bug providing full dmesg output and assign it to me.
it does happen repeatedly, around 2000 lines per minute. I've filed clone bug #561267
I've discovered the same problem on my x200. After installing the kernel from comment #51 the problem was gone.
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle. Changing version to '13'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
I'm also on an x200 which was giving me stutter and lockups on f13 and f14 kernels. I just updated BIOS (3.12 so even newer BIOS than comment #43) and booted the latest f14 kernel 2.6.34-0.13.rc1.git1.fc14.x86_64 which has a message about disabling the iommu for the graphics card. It seems to be working just fine. So thanks for that! I guess my question is: Is this something that BIOS might someday fix or is it just broken hardware and this quirk is the 'fix'? Any reason I should think about this BZ ever again?
This specific problem -- where your chipset is 8086:2a40 rev 07 -- one isn't the fault of the BIOS; it's hardware. It's _only_ disabling the IOMMU for the graphics card though -- all other IOMMU functionality is working. Once the chipset folks have worked out why it's happening, there _may_ be a workaround which allows you to use the IOMMU on the graphics device again -- but I wouldn't hold your breath. The graphics device has its own dedicated IOMMU unit, and it's all kind of smashed into one with the graphics itself. The gfx device has a GTT which lists the pages that are visible from the GPU address space, and that contains 'virtual' addresses which need to be translated through the IOMMU page tables. For efficiency, the hardware maintains a 'shadow' GTT which contains the real physcial addresses after translation -- like a huge TLB which covers the whole of the GTT. Unfortunately, this particular chipset sometimes reads from the GTT, does the translation, then writes the translated address back to the _original_ GTT instead of to the shadow GTT. That's why you're seeing real physical addresses where you should have 'virtual DMA addresses', and you get the faults.
*** Bug 573173 has been marked as a duplicate of this bug. ***