Created attachment 328997 [details] /var/log/messages from kernel-2.6.29-0.35.rc1.git4.fc11.x86_64 boot without "nopat" showing Oops, DMAR errors, eth1 problems.... Description of problem: Booted kernel-2.6.29-0.35.rc1.git4.fc11.x86_64 on Thinkpad X200 without "nopat", I get much badness. The messages below repeat over and over. Jan 14 06:26:59 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffff000 Jan 14 06:26:59 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Wired NIC is NOT working properly. Get the following over and over: Jan 14 06:32:35 tlondon kernel: 0000:00:19.0: eth1: Detected Tx Unit Hang: Jan 14 06:32:35 tlondon kernel: TDH <0> Jan 14 06:32:35 tlondon kernel: TDT <1> Jan 14 06:32:35 tlondon kernel: next_to_use <1> Jan 14 06:32:35 tlondon kernel: next_to_clean <0> Jan 14 06:32:35 tlondon kernel: buffer_info[next_to_clean]: Jan 14 06:32:35 tlondon kernel: time_stamp <100016289> Jan 14 06:32:35 tlondon kernel: next_to_watch <0> Jan 14 06:32:35 tlondon kernel: jiffies <1000179bd> Jan 14 06:32:35 tlondon kernel: next_to_watch.status <0> Jan 14 06:32:36 tlondon dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 7 Jan 14 06:32:36 tlondon kernel: DMAR:[DMA Write] Request device [03:00.0] fault addr ffd87000 Jan 14 06:32:36 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set And get the following Oops: Jan 14 06:28:38 tlondon kernel: DMAR:[DMA Write] Request device [03:00.0] fault addr ffad3000 Jan 14 06:28:38 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not s et Jan 14 06:28:38 tlondon kernel: ------------[ cut here ]------------ Jan 14 06:28:38 tlondon kernel: WARNING: at net/sched/sch_generic.c:226 dev_watc hdog+0xcb/0x136() Jan 14 06:28:38 tlondon kernel: Hardware name: 74585FU Jan 14 06:28:38 tlondon kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit timed o ut Jan 14 06:28:38 tlondon kernel: Modules linked in: i915 drm i2c_algo_bit rfcomm sco bridge stp llc bnep l2cap sunrpc nf_conntrack_netbios_ns ip6t_REJECT nf_conn track_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq freq_ta ble fuse dm_multipath kvm_intel kvm uinput snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy arc4 snd_seq_oss snd_seq_midi_event ecb sn d_seq snd_seq_device iwlagn iwlcore snd_pcm_oss snd_mixer_oss i2c_i801 snd_pcm t hinkpad_acpi snd_timer lib80211 iTCO_wdt rfkill snd video mac80211 usblp i2c_cor e iTCO_vendor_support soundcore btusb pcspkr hwmon wmi bluetooth snd_page_alloc output e1000e cfg80211 sha256_generic cbc aes_x86_64 aes_generic dm_crypt [last unloaded: microcode] Jan 14 06:28:38 tlondon kernel: Pid: 0, comm: swapper Tainted: G W 2.6.2 9-0.35.rc1.git4.fc11.x86_64 #1 Jan 14 06:28:38 tlondon kernel: Call Trace: Jan 14 06:28:38 tlondon kernel: <IRQ> [<ffffffff8104a245>] warn_slowpath+0xb1/0 xed Jan 14 06:28:38 tlondon kernel: [<ffffffff813839c6>] ? _spin_unlock_irqrestore+0 x40/0x57 Jan 14 06:28:38 tlondon kernel: [<ffffffff8106c393>] ? trace_hardirqs_off_caller +0x1f/0xac Jan 14 06:28:38 tlondon kernel: [<ffffffff812fa630>] ? netif_tx_lock+0x48/0x7d Jan 14 06:28:38 tlondon kernel: [<ffffffff8119adf7>] ? _raw_spin_unlock+0x8e/0x9 3 Jan 14 06:28:38 tlondon kernel: [<ffffffff81383a03>] ? _spin_unlock+0x26/0x2a Jan 14 06:28:38 tlondon kernel: [<ffffffff812fa6b7>] ? dev_watchdog+0x0/0x136 Jan 14 06:28:38 tlondon kernel: [<ffffffff812fa782>] dev_watchdog+0xcb/0x136 Jan 14 06:28:38 tlondon kernel: [<ffffffff8106d2f5>] ? trace_hardirqs_on_caller+ 0x1f/0x153 Jan 14 06:28:38 tlondon kernel: [<ffffffff8106d436>] ? trace_hardirqs_on+0xd/0xf Jan 14 06:28:38 tlondon kernel: [<ffffffff8138397a>] ? _spin_unlock_irq+0x2b/0x3 7 Jan 14 06:28:38 tlondon kernel: [<ffffffff81054297>] run_timer_softirq+0x17a/0x1 f2 Jan 14 06:28:38 tlondon kernel: [<ffffffff8104f9ad>] __do_softirq+0x94/0x16f Jan 14 06:28:38 tlondon kernel: [<ffffffff8101272c>] call_softirq+0x1c/0x30 Jan 14 06:28:38 tlondon kernel: [<ffffffff81013849>] do_softirq+0x4d/0xb4 Jan 14 06:28:38 tlondon kernel: [<ffffffff8104f5fb>] irq_exit+0x4e/0x88 Jan 14 06:28:38 tlondon kernel: [<ffffffff81013b60>] do_IRQ+0x130/0x154 Jan 14 06:28:38 tlondon kernel: [<ffffffff81011e13>] ret_from_intr+0x0/0x2e Jan 14 06:28:38 tlondon kernel: <EOI> <4>---[ end trace 4eaa2a86a8e2da24 ]--- Jan 14 06:28:39 tlondon NetworkManager: <info> (eth1): carrier now OFF (device state 8) Jan 14 06:28:39 tlondon NetworkManager: <info> (eth1): device state change: 8 - > 2 Jan 14 06:28:39 tlondon NetworkManager: <info> (eth1): deactivating device (rea son: 40). Jan 14 06:28:39 tlondon NetworkManager: <info> eth1: canceled DHCP transaction, dhcp client pid 2879 I attach complete /var/log/messages for this run. This was the first time booting without "pat" on boot line "worked". Not sure that is relevant. I reverted to kernel-2.6.29-0.31.rc1.git2.fc11.x86_64 to repair / filesystem, etc. Version-Release number of selected component (if applicable): kernel-2.6.29-0.35.rc1.git4.fc11.x86_64 How reproducible: Too scared to try again.... ;) Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
A bit more on "file system errors": I noticed when the system was running with kernel-2.6.29-0.35.rc1.git4.fc11.x86_64, that rhythmbox reported some of music files were "missing". They reside in my home directory on root filesystem: /dev/mapper/VolGroup00-LogVol00 88600832 56905556 30795920 65% / When I rebooted with kernel-2.6.29-0.31.rc1.git2.fc11.x86_64 (and with "nopat"), I got "filesystem has errors, checking" message from boot. "fsck" appears to have corrected whatever problems where were. [I try each new kernel without "nopat". 0.35 is the first to actually work with compiz on my Thinkpads. I can certainly attempt a boot with "nopat" if that would be helpful.....]
Created attachment 329007 [details] /var/log/messages from kernel-2.6.29-0.35.rc1.git4.fc11.x86_64 boot with "nopat" showing Oops, DMAR errors, eth1 problems.... OK. I bit the bullet and rebooted with "nopat" on the boot line. Believe I got similar issues with DMAR: Jan 14 07:15:22 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffff000 Jan 14 07:15:22 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set and issues with my NIC: Jan 14 07:17:02 tlondon kernel: 0000:00:19.0: eth1: Detected Tx Unit Hang: Jan 14 07:17:02 tlondon kernel: TDH <13> Jan 14 07:17:02 tlondon kernel: TDT <1e> Jan 14 07:17:02 tlondon kernel: next_to_use <1e> Jan 14 07:17:02 tlondon kernel: next_to_clean <13> Jan 14 07:17:02 tlondon kernel: buffer_info[next_to_clean]: Jan 14 07:17:02 tlondon kernel: time_stamp <fffdcf34> Jan 14 07:17:02 tlondon kernel: next_to_watch <13> Jan 14 07:17:02 tlondon kernel: jiffies <fffdd428> Jan 14 07:17:02 tlondon kernel: next_to_watch.status <0> Jan 14 07:17:04 tlondon kernel: 0000:00:19.0: eth1: Detected Tx Unit Hang: Jan 14 07:17:04 tlondon kernel: TDH <13> Jan 14 07:17:04 tlondon kernel: TDT <1e> Jan 14 07:17:04 tlondon kernel: next_to_use <1e> Jan 14 07:17:04 tlondon kernel: next_to_clean <13> Jan 14 07:17:04 tlondon kernel: buffer_info[next_to_clean]: Jan 14 07:17:04 tlondon kernel: time_stamp <fffdcf34> Jan 14 07:17:04 tlondon kernel: next_to_watch <13> Jan 14 07:17:04 tlondon kernel: jiffies <fffddbf8> Jan 14 07:17:04 tlondon kernel: next_to_watch.status <0> Jan 14 07:17:06 tlondon kernel: 0000:00:19.0: eth1: Detected Tx Unit Hang: Jan 14 07:17:06 tlondon kernel: TDH <13> Jan 14 07:17:06 tlondon kernel: TDT <1e> and Oops: Jan 14 07:17:07 tlondon kernel: ------------[ cut here ]------------ Jan 14 07:17:07 tlondon kernel: WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xcb/0x136() Jan 14 07:17:07 tlondon kernel: Hardware name: 74585FU Jan 14 07:17:07 tlondon kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit timed out Jan 14 07:17:07 tlondon kernel: Modules linked in: i915 drm i2c_algo_bit rfcomm sco bridge stp llc bnep l2cap sunrpc nf_conntrack_netbios_ns ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq freq_table fuse dm_multipath kvm_intel kvm uinput snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy arc4 snd_seq_oss snd_seq_midi_event ecb snd_seq snd_seq_device iwlagn snd_pcm_oss iwlcore snd_mixer_oss snd_pcm thinkpad_acpi rfkill lib80211 snd_timer video hwmon usblp i2c_i801 mac80211 iTCO_wdt iTCO_vendor_support snd wmi i2c_core pcspkr soundcore btusb e1000e output bluetooth snd_page_alloc cfg80211 sha256_generic cbc aes_x86_64 aes_generic dm_crypt [last unloaded: microcode] Jan 14 07:17:07 tlondon kernel: Pid: 0, comm: swapper Tainted: G W 2.6.29-0.35.rc1.git4.fc11.x86_64 #1 Jan 14 07:17:07 tlondon kernel: Call Trace: Jan 14 07:17:07 tlondon kernel: <IRQ> [<ffffffff8104a245>] warn_slowpath+0xb1/0xed Jan 14 07:17:07 tlondon kernel: [<ffffffff813839c6>] ? _spin_unlock_irqrestore+0x40/0x57 Jan 14 07:17:07 tlondon kernel: [<ffffffff8106c393>] ? trace_hardirqs_off_caller+0x1f/0xac Jan 14 07:17:07 tlondon kernel: [<ffffffff812fa630>] ? netif_tx_lock+0x48/0x7d Jan 14 07:17:07 tlondon kernel: [<ffffffff8119adf7>] ? _raw_spin_unlock+0x8e/0x93 Jan 14 07:17:07 tlondon kernel: [<ffffffff81383a03>] ? _spin_unlock+0x26/0x2a Jan 14 07:17:07 tlondon kernel: [<ffffffff812fa6b7>] ? dev_watchdog+0x0/0x136 Jan 14 07:17:07 tlondon kernel: [<ffffffff812fa782>] dev_watchdog+0xcb/0x136 Jan 14 07:17:07 tlondon kernel: [<ffffffff8106d2f5>] ? trace_hardirqs_on_caller+0x1f/0x153 Jan 14 07:17:07 tlondon kernel: [<ffffffff8106d436>] ? trace_hardirqs_on+0xd/0xf Jan 14 07:17:07 tlondon kernel: [<ffffffff8138397a>] ? _spin_unlock_irq+0x2b/0x37 Jan 14 07:17:07 tlondon kernel: [<ffffffff81054297>] run_timer_softirq+0x17a/0x1f2 Jan 14 07:17:07 tlondon kernel: [<ffffffff8104f9ad>] __do_softirq+0x94/0x16f Jan 14 07:17:07 tlondon kernel: [<ffffffff8101272c>] call_softirq+0x1c/0x30 Jan 14 07:17:07 tlondon kernel: [<ffffffff81013849>] do_softirq+0x4d/0xb4 Jan 14 07:17:07 tlondon kernel: [<ffffffff8104f5fb>] irq_exit+0x4e/0x88 Jan 14 07:17:07 tlondon kernel: [<ffffffff810232a8>] smp_apic_timer_interrupt+0x93/0xac Jan 14 07:17:07 tlondon kernel: [<ffffffff81012103>] apic_timer_interrupt+0x13/0x20 Jan 14 07:17:07 tlondon kernel: <EOI> <4>---[ end trace 4eaa2a86a8e2da24 ]--- Jan 14 07:17:07 tlondon NetworkManager: <info> (eth1): carrier now OFF (device state 8) Jan 14 07:17:07 tlondon NetworkManager: <info> (eth1): device state change: 8 -> 2 Jan 14 07:17:07 tlondon NetworkManager: <info> (eth1): deactivating device (reason: 40). I rebooted very quickly.....
VT-d was enabled in the BIOS. I disabled it and looks like it "works for me now". Not sure how/what/why or how serious (or not) this is....
Hi Tom, I just tagged a build for koji that should disable this by default (and let you enable it with intel_iommu=on) Could you please test it when it's ready? http://koji.fedoraproject.org/koji/buildinfo?buildID=79960
Sure. Just to be precise, you want me to re-enable VT-d in the BIOS and boot the new kernel without "intel_iommu=on", right?
That's right, I would do it, but I don't seem to have any hardware supporting VT-d. Please feel free to turn it off again immediately after confirming the world doesn't explode. :) cheers, Kyle
OK. Downloaded/installed 0.46, enabled VT-d in the BIOS, and booted without 'intel_iommu=on': Initializing cgroup subsys cpuset Initializing cgroup subsys cpu Linux version 2.6.29-0.46.rc2.git1.fc11.x86_64 (mockbuild.phx.redha t.com) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #1 SMP Fri Jan 23 1 3:58:45 EST 2009 Command line: ro root=/dev/VolGroup00/LogVol00 vga=0x318 KERNEL supported cpus: Intel GenuineIntel AMD AuthenticAMD Centaur CentaurHauls BIOS-provided physical RAM map: The boot was fine, and I'm running now in this setup. Did spot this in /var/log/messages however: Jan 23 12:20:38 tlondon kernel: NetLabel: protocols = UNLABELED CIPSOv4 Jan 23 12:20:38 tlondon kernel: NetLabel: unlabeled traffic allowed by default Jan 23 12:20:38 tlondon kernel: DMAR:Host address width 36 Jan 23 12:20:38 tlondon kernel: DMAR:DRHD (flags: 0x00000001)base: 0x0000000000000000 Jan 23 12:20:38 tlondon kernel: ------------[ cut here ]------------ Jan 23 12:20:38 tlondon kernel: WARNING: at arch/x86/mm/ioremap.c:241 __ioremap_caller+0x150/0x2bd() (Not tainted) Jan 23 12:20:38 tlondon kernel: Hardware name: 74585FU Jan 23 12:20:38 tlondon kernel: Modules linked in: Jan 23 12:20:38 tlondon kernel: Pid: 1, comm: swapper Not tainted 2.6.29-0.46.rc2.git1.fc11.x86_64 #1 Jan 23 12:20:38 tlondon kernel: Call Trace: Jan 23 12:20:38 tlondon kernel: [<ffffffff8104a4f9>] warn_slowpath+0xb9/0xfe Jan 23 12:20:38 tlondon kernel: [<ffffffff81016e66>] ? native_sched_clock+0x32/0x66 Jan 23 12:20:38 tlondon kernel: [<ffffffff81016ea3>] ? sched_clock+0x9/0xc Jan 23 12:20:38 tlondon kernel: [<ffffffff8106ba34>] ? lock_release_holdtime+0x2c/0x123 Jan 23 12:20:38 tlondon kernel: [<ffffffff81031a0a>] __ioremap_caller+0x150/0x2bd Jan 23 12:20:38 tlondon kernel: [<ffffffff811b048a>] ? alloc_iommu+0x142/0x192 Jan 23 12:20:38 tlondon kernel: [<ffffffff8103185a>] ? iounmap+0xb4/0xbf Jan 23 12:20:38 tlondon kernel: [<ffffffff81602140>] ? early_idt_handler+0x0/0x71 Jan 23 12:20:38 tlondon kernel: [<ffffffff81031c4c>] ioremap_nocache+0x12/0x14 Jan 23 12:20:38 tlondon kernel: [<ffffffff811b048a>] alloc_iommu+0x142/0x192 Jan 23 12:20:38 tlondon kernel: [<ffffffff81629650>] dmar_table_init+0x11a/0x24a Jan 23 12:20:38 tlondon kernel: [<ffffffff8160cc39>] ? pci_iommu_init+0x0/0x21 Jan 23 12:20:38 tlondon kernel: [<ffffffff81629c0f>] intel_iommu_init+0x19/0x9ad Jan 23 12:20:38 tlondon kernel: [<ffffffff813822ec>] ? mutex_unlock+0x9/0xb Jan 23 12:20:38 tlondon kernel: [<ffffffff81151a25>] ? debugfs_create_file+0x1c8/0x238 Jan 23 12:20:38 tlondon kernel: [<ffffffff8100e871>] ? xen_debugfs_create_u32_array+0x57/0x63 Jan 23 12:20:38 tlondon kernel: [<ffffffff8160cc39>] ? pci_iommu_init+0x0/0x21 Jan 23 12:20:38 tlondon kernel: [<ffffffff81602140>] ? early_idt_handler+0x0/0x71 Jan 23 12:20:38 tlondon kernel: [<ffffffff8160cc47>] pci_iommu_init+0xe/0x21 Jan 23 12:20:38 tlondon kernel: [<ffffffff8100a070>] do_one_initcall+0x70/0x187 Jan 23 12:20:38 tlondon kernel: [<ffffffff8106cf03>] ? mark_lock+0x22/0x3ad Jan 23 12:20:38 tlondon kernel: [<ffffffff8106cf03>] ? mark_lock+0x22/0x3ad Jan 23 12:20:38 tlondon kernel: [<ffffffff81190e4f>] ? ida_get_new_above+0x18d/0x1ae Jan 23 12:20:38 tlondon kernel: [<ffffffff81016e66>] ? native_sched_clock+0x32/0x66 Jan 23 12:20:38 tlondon kernel: [<ffffffff81016ea3>] ? sched_clock+0x9/0xc Jan 23 12:20:38 tlondon kernel: [<ffffffff8106ba34>] ? lock_release_holdtime+0x2c/0x123 Jan 23 12:20:38 tlondon kernel: [<ffffffff8119ac6b>] ? _raw_spin_unlock+0x8e/0x93 Jan 23 12:20:38 tlondon kernel: [<ffffffff81383d2b>] ? _spin_unlock+0x26/0x2a Jan 23 12:20:38 tlondon kernel: [<ffffffff811278ef>] ? proc_register+0x17f/0x193 Jan 23 12:20:38 tlondon kernel: [<ffffffff81127a2b>] ? create_proc_entry+0x7e/0x94 Jan 23 12:20:38 tlondon kernel: [<ffffffff8109805e>] ? register_irq_proc+0xb0/0xcc Jan 23 12:20:38 tlondon kernel: [<ffffffff81120000>] ? qtree_read_dquot+0xfc/0x1b7 Jan 23 12:20:38 tlondon kernel: [<ffffffff81602752>] kernel_init+0x1d4/0x22a Jan 23 12:20:38 tlondon kernel: [<ffffffff8101262a>] child_rip+0xa/0x20 Jan 23 12:20:38 tlondon kernel: [<ffffffff81011f3e>] ? restore_args+0x0/0x30 Jan 23 12:20:38 tlondon kernel: [<ffffffff8160257e>] ? kernel_init+0x0/0x22a Jan 23 12:20:38 tlondon kernel: [<ffffffff81012620>] ? child_rip+0x0/0x20 Jan 23 12:20:38 tlondon kernel: ---[ end trace 4eaa2a86a8e2da22 ]--- Jan 23 12:20:38 tlondon kernel: IOMMU: can't map the region Jan 23 12:20:38 tlondon kernel: DMAR:parse DMAR table failure. Jan 23 12:20:38 tlondon kernel: hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0 Jan 23 12:20:38 tlondon kernel: hpet0: 4 comparators, 64-bit 14.318180 MHz counter Jan 23 12:20:38 tlondon kernel: ODEBUG: object is on stack, but not annotated Jan 23 12:20:38 tlondon kernel: ------------[ cut here ]------------ Jan 23 12:20:38 tlondon kernel: WARNING: at lib/debugobjects.c:253 __debug_object_init+0x2a8/0x353() (Tainted: G W ) Jan 23 12:20:38 tlondon kernel: Hardware name: 74585FU Jan 23 12:20:38 tlondon kernel: Modules linked in: System seems stable, and I don't see any of the "DMAR" messages spewing out.... [Of course, I don't know if the above is important.] Let me know if you need more test....
Posted to here: https://lists.linux-foundation.org/pipermail/iommu/2009-January/001028.html (Just tested out intel_iommu=on here and it works fine - i.e. Kyle's patch to change the default works fine) Tom's latest try shows this: DMAR:DRHD (flags: 0x00000001)base: 0x0000000000000000 .. WARNING: at arch/x86/mm/ioremap.c:241 __ioremap_caller+0x150/0x2bd() ... IOMMU: can't map the region DMAR:parse DMAR table failure. Here's the bits of code in question: static int __init dmar_parse_one_drhd(struct acpi_dmar_header *header) { ... drhd = (struct acpi_dmar_hardware_unit *)header; dmaru->reg_base_addr = drhd->address; ... ret = alloc_iommu(dmaru); ... } int alloc_iommu(struct dmar_drhd_unit *drhd) { ... iommu->reg = ioremap(drhd->reg_base_addr, VTD_PAGE_SIZE); if (!iommu->reg) { printk(KERN_ERR "IOMMU: can't map the region\n"); goto error; ... } static void __iomem *__ioremap_caller(resource_size_t phys_addr, unsigned long size, unsigned long prot_val, void *caller) { ... * Don't allow anybody to remap normal RAM that we're using */ for (pfn = phys_addr >> PAGE_SHIFT; (pfn << PAGE_SHIFT) < (last_addr & PAGE_MASK); pfn++) { int is_ram = page_is_ram(pfn); if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn))) return NULL; WARN_ON_ONCE(is_ram); ... } drhd->flags == 0x1 is INCLUDE_PCI_ALL Looks like drhd->address == NULL, leading us to ioremap(0x00) i.e. this BIOS is reporting a re-mapping unit, but not supplying a base address for the unit's register set However, originally Tom's bios reported: DMAR:Host address width 36 DMAR:DRHD (flags: 0x00000000)base: 0x00000000feb03000 DMAR:DRHD (flags: 0x00000000)base: 0x00000000feb01000 DMAR:DRHD (flags: 0x00000000)base: 0x00000000feb00000 DMAR:DRHD (flags: 0x00000001)base: 0x00000000feb02000 Tom: can you try fiddling with your BIOS settings to see if it can go back to properly reporting the remapping hardware units? Perhaps make sure to power off completely after changing the settings?
OK. I rebooted to BIOS setup, re-enabled VT-d, saved/exited and power cycled. As you conjectured, this time I got something different: Jan 26 06:35:17 tlondon kernel: DMAR:Host address width 36 Jan 26 06:35:17 tlondon kernel: DMAR:DRHD (flags: 0x00000000)base: 0x00000000feb03000 Jan 26 06:35:17 tlondon kernel: DMAR:DRHD (flags: 0x00000000)base: 0x00000000feb01000 Jan 26 06:35:17 tlondon kernel: DMAR:DRHD (flags: 0x00000000)base: 0x00000000feb00000 Jan 26 06:35:17 tlondon kernel: DMAR:DRHD (flags: 0x00000001)base: 0x00000000feb02000 Jan 26 06:35:17 tlondon kernel: DMAR:RMRR base: 0x00000000f2826c00 end: 0x00000000f28273ff Jan 26 06:35:17 tlondon kernel: DMAR:RMRR base: 0x000000007cc00000 end: 0x000000007effffff Jan 26 06:35:17 tlondon kernel: DMAR:Device scope device [0000:00:03.02] not found Jan 26 06:35:17 tlondon kernel: DMAR:Device scope device [0000:00:03.02] not found Jan 26 06:35:17 tlondon kernel: DMAR:Device scope device [0000:00:03.03] not found Jan 26 06:35:17 tlondon kernel: DMAR:Device scope device [0000:00:03.03] not found Jan 26 06:35:17 tlondon kernel: hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0 So I must have done something "less perfect" last time. I will leave VT-d enabled and boot (without 'intel_iommu=on') and see if the 'DMAR:DRHD (flags: 0x00000001)base: 0x0000000000000000' comes back..... I've not detected any issues booting without 'intel_iommu=on' when VT-d is enabled. Any reason for me not to "just leave it this way"?
Created attachment 329990 [details] /var/log/messages from boot with VT-d enabled and without 'intel_iommu=on' I forgot to attach the complete "/var/log/messages".
(In reply to comment #9) > OK. I rebooted to BIOS setup, re-enabled VT-d, saved/exited and power cycled. > > As you conjectured, this time I got something different: > > Jan 26 06:35:17 tlondon kernel: DMAR:Host address width 36 > Jan 26 06:35:17 tlondon kernel: DMAR:DRHD (flags: 0x00000000)base: > 0x00000000feb03000 > Jan 26 06:35:17 tlondon kernel: DMAR:DRHD (flags: 0x00000000)base: > 0x00000000feb01000 > Jan 26 06:35:17 tlondon kernel: DMAR:DRHD (flags: 0x00000000)base: > 0x00000000feb00000 > Jan 26 06:35:17 tlondon kernel: DMAR:DRHD (flags: 0x00000001)base: > 0x00000000feb02000 Great. Stupid BIOS! :-) > I will leave VT-d enabled and boot (without 'intel_iommu=on') and see if the > 'DMAR:DRHD (flags: 0x00000001)base: 0x0000000000000000' comes back..... It shouldn't. > I've not detected any issues booting without 'intel_iommu=on' when VT-d is > enabled. Any reason for me not to "just leave it this way"? Oh yes, please do leave it that way. If your "PTE Write access is not set" issue doesn't re-appear, that means we can try making "intel_iommu=on" the default for everyone again. Perhaps your issue got resolved between rc1 and rc2.
Created attachment 329992 [details] /var/log/messages booting with VT-d and with 'intel_iommu=on' Based on your optimism ( ;) ), I tried booting with 'intel_iommu=on' to see if any changes 'fixed this'. The short answer is "no". Get many: Jan 26 07:32:03 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffff000 Jan 26 07:32:03 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Noticed this one: Jan 26 07:32:17 tlondon kernel: Uhhuh. NMI received for unknown reason b1 on CPU 0. Jan 26 07:32:17 tlondon kernel: You have some hardware problem, likely on the PCI bus. Jan 26 07:32:17 tlondon kernel: Dazed and confused, but trying to continue Jan 26 07:32:17 tlondon kernel: DMAR:[DMA Write] Request device [03:00.0] fault addr ff9df000 Jan 26 07:32:17 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Jan 26 07:32:17 tlondon kernel: DMAR:[DMA Write] Request device [03:00.0] fault addr fff8c000 Jan 26 07:32:17 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Jan 26 07:32:17 tlondon kernel: DMAR:[DMA Write] Request device [03:00.0] fault addr ff9df000 Jan 26 07:32:17 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set I rebooted quickly. So I'm back to booting without 'intel_iommu=on'. Noticed that there is a BIOS update from Lenovo: http://www-307.ibm.com/pc/support/site.wss/MIGR-70347.html But doesn't sound like anything related to VT-d.....
(In reply to comment #12) > Created an attachment (id=329992) [details] > /var/log/messages booting with VT-d and with 'intel_iommu=on' > > Based on your optimism ( ;) ), I tried booting with 'intel_iommu=on' to see if > any changes 'fixed this'. > > The short answer is "no". I'm sorry - I misread your other comment as saying you had booted *with* intel_iommu=on. That explains it - I was very surprised that the issue seemed to be fixed. > > Jan 26 07:32:03 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault > addr ffffff000 > Jan 26 07:32:03 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not > set Thanks for confirming it's still there. > Noticed that there is a BIOS update from Lenovo: > http://www-307.ibm.com/pc/support/site.wss/MIGR-70347.html > > But doesn't sound like anything related to VT-d..... It's definitely worth a try, IMHO.
Created attachment 330160 [details] ACPI DMAR table from x200
FWIW I'm debugging the same e1000e "Tx Unit Hang" after the DMAR faults on my Lenovo X200 with a vanilla kernel.org 2.6.28 kernel. I also tried upgrading the BIOS to 2.02 from Lenovo's website, only to be left with an unbootable notebook. Turns out their new BIOS somehow blacklists Kingston memory modules, and I had to get Lenovo to replace the motherboard in my X200 to get back to the old working BIOS that boots. http://forums.lenovo.com/lnv/board/message?board.id=Special_Interest_Utilities&message.id=3496#M3496 So bottom line: Don't upgrade your BIOS. Also, I looked at the ACPI DMAR tables from my x200, and AFAICT they looked reasonable. See attached (bhavesh-DMAR.dsl).
(In reply to comment #15) > FWIW I'm debugging the same e1000e "Tx Unit Hang" after the DMAR faults on my > Lenovo X200 with a vanilla kernel.org 2.6.28 kernel. > > I also tried upgrading the BIOS to 2.02 from Lenovo's website, only to be left > with an unbootable notebook. Turns out their new BIOS somehow blacklists > Kingston memory modules, and I had to get Lenovo to replace the motherboard in > my X200 to get back to the old working BIOS that boots. > > http://forums.lenovo.com/lnv/board/message?board.id=Special_Interest_Utilities&message.id=3496#M3496 > > So bottom line: Don't upgrade your BIOS. > > Also, I looked at the ACPI DMAR tables from my x200, and AFAICT they looked > reasonable. See attached (bhavesh-DMAR.dsl). Argh.... Thanks for the warning, but too late; I've already updated. Guessing the Lenovo supplied memory in my system is not blacklisted by the BIOS. Anyway, the newer (2.02) BIOS does NOT fix the problem.
Created attachment 330187 [details] Serial capture of failed boot I work with Bhavesh and am seeing the same issues, albeit on different hardware. I've got a Dell Precision T5400 with VT-d enabled (via BIOS). I built 2.6.29-rc2 with CONFIG_DMAR=y, rebooted into the kernel, but failed to completely boot. During boot I'm seeing DMAR faults, in my case, because "Present bit in root entry is clear".
(In reply to comment #17) > Created an attachment (id=330187) [details] > Serial capture of failed boot > > I work with Bhavesh Thanks to both of you for investigating this. > and am seeing the same issues, albeit on different > hardware. I've got a Dell Precision T5400 with VT-d enabled (via BIOS). I built > 2.6.29-rc2 with CONFIG_DMAR=y, rebooted into the kernel, but failed to > completely boot. During boot I'm seeing DMAR faults, in my case, because > "Present bit in root entry is clear". Yes, I see this in your log: [ 180.160014] DMAR:[DMA Read] Request device [06:00.0] fault addr fff82000 [ 180.160017] DMAR:[fault reason 01] Present bit in root entry is clear I assume everything is working fine if you disable VT-d in the BIOS or boot with intel_iommu=off ?
Okay, looking at Tom's logs it seems his DMAR faults are coming mostly from an iwlagn wireless card but also one each from an i915 gfx card, an e1000e nic and an ahci sata controller. Adar's log shows 6 faults from fusion mpt sas host adaptor.
(In reply to comment #17) > I built 2.6.29-rc2 with CONFIG_DMAR=y, rebooted into the kernel Adar, since you're building a kernel you could try using Joerg Roedel's DMA-API debugging code: http://git.kernel.org/?p=linux/kernel/git/joro/linux-2.6-iommu.git
(In reply to comment #20) > (In reply to comment #17) > > > I built 2.6.29-rc2 with CONFIG_DMAR=y, rebooted into the kernel > > Adar, since you're building a kernel you could try using Joerg Roedel's DMA-API > debugging code: > > http://git.kernel.org/?p=linux/kernel/git/joro/linux-2.6-iommu.git ... on the dma-api/debug branch: http://git.kernel.org/?p=linux/kernel/git/joro/linux-2.6-iommu.git;h=dma-api/debug
Created attachment 330301 [details] Serial capture of failed dma-api/debug boot (In reply to comment #21) > ... on the dma-api/debug branch: > > http://git.kernel.org/?p=linux/kernel/git/joro/linux-2.6-iommu.git;h=dma-api/debug Mark, I was able to boot 2.6.29-rc2 after passing intel_iommu=off on the command line. Additionally, I cloned the dma-api/debug branch, built it, and tried to boot off of it. I got the same kind of failure (which disappeared if I passed intel-iommu=off on the command line). I was expecting to see more logging output (for DMA requests) but didn't get any; maybe my .config is messed up. I'm attaching the serial console output of the dma-api/debug boot, as well as my .config for the kernel.
Created attachment 330302 [details] kernel config for dma-api/debug kernel
(In reply to comment #22) > I was able to boot 2.6.29-rc2 after passing intel_iommu=off on the command > line. That's good to know > > Additionally, I cloned the dma-api/debug branch, built it, and tried to boot > off of it. I got the same kind of failure (which disappeared if I passed > intel-iommu=off on the command line). I was expecting to see more logging > output (for DMA requests) but didn't get any; maybe my .config is messed up. Well, you have CONFIG_DMA_API_DEBUG enabled. I think this just means that this isn't a driver issue that DMA_API_DEBUG detects.
Another report here: https://lists.linux-foundation.org/pipermail/iommu/2009-January/001032.html That means this bug has been seen on: - Lenovo x200 laptop - Dell Precision t5400 workstation - Dell Latitude e6400 laptop
That report is pretty much the same as Adar and mine, and Oleg is also our teammate. Also, I wouldn't have expected CONFIG_DMA_API_DEBUG to turn up anything. And given that this bug (wherever it is) is affecting pretty much all flavours of PCI(e) devices behind Intel VT-d DMARs, I strongly suspect this bug is lurking in the Intel IOMMU driver (linux-2.6/drivers/pci/intel-iommu.[ch])
See also: http://bugzilla.kernel.org/show_bug.cgi?id=12578
I have a Dell E6400 and would like to help test any fixes. I have very little experience with kernel work but could compile and use one if pointed to compilation instructions. Bob
Bhavesh, it appears Adar has a seperate issue from Tom's. Although this has kind of become the "ZOMG MY IOMMU IS BROKEN" omnibus bug. :) Could you please post your dmesg from your X200? It looks to me like Adar is using the T5400 workstation, whereas you and Tom are seeing a seperate issue using the X200 laptop? cheers, Kyle
Tom, could you try booting with pci=noacpi? I suspect this might have something to do with PCI hotplug, and sadly, since we build ACPI Hotplug into the kernel, this is the smallest hammer we have to turn it off. I'll spin a kernel scratchbuild with acpi hotplug off just in case there's side effects to this that make it untestable. regards, Kyle
Created attachment 330990 [details] Photo of screen hanging on boot with "pci=noacpi intel_iommu=on" Won't boot with "pci=noacpi intel_iommu=on". This is with kernel-2.6.29-0.78.rc3.git5.fc11.x86_64. Hangs very early, just after grub and before prompting for encrypted partition passphrase. Can't be sure, but I think I saw some IOMMU/DMAR messages scroll by. What can I do next?
(In reply to comment #29) > Bhavesh, it appears Adar has a seperate issue from Tom's. Although this has > kind of become the "ZOMG MY IOMMU IS BROKEN" omnibus bug. :) Could you please > post your dmesg from your X200? Sure, here's a dmesg from an unmodified 2.6.29-rc3 boot. I've been debugging this issue quite a bit myself too by lacing the Intel IOMMU driver with printks, so I had to go fishing in /var/log/ for this one from an unmodified kernel. > It looks to me like Adar is using the T5400 workstation, whereas you and Tom > are seeing a seperate issue using the X200 laptop? That's correct. And BTW, I'm almost 100% convinced now that this is not a broken BIOS ACPI DMAR table issue. But I'm working on it... Thanks - Bhavesh
Created attachment 331033 [details] dmesg from unmodified 2.6.29-rc3 boot with Intel IOMMU enabled. dmesg from unmodified 2.6.29-rc3 boot with Intel IOMMU enabled.
Thanks Bhavesh, I concur that it's very unlikely to be a broken ACPI table (not ruling out broken BIOS though. :) The reason for this is because you're able to mount root and swap in the traces, which means that /some/ DMA is working, it's just failing after the fact. Could you try this patch and see if that resolves your issue? Speaking with Dave Woodhouse last night we became suspicious if maybe the io page table updates weren't correctly getting flushed from the cache. http://kyle.fedorapeople.org/always-clflush-ioptes.diff regards, Kyle
That's so interesting, because I've always been suspicious about exactly the same root cause (not clflushing the PTE entries), and that's one of the places I had laced with printks (in domain_page_mapping before calling __iommu_flush_cache(..pte..)). I looked at the iommu->ecap and the "coherent" bit was not set so __iommu_flush_cache should have been clflushing the PTE. I'll take a look at David's patch.
Okay there's no point applying David's patch that I just looked at, because of what I said: none of my 4 IOMMU's on my laptop advertise Coherency in their ecap register, so it *had* to be clfushing the PTEs in my case.
Okay, just to make you happy ;) I tried with a similar patch in concept (since I had been debugging with a 2.6.28 kernel): linux-2.6.28/include/linux/intel-iommu.h: static inline void __iommu_flush_cache( struct intel_iommu *iommu, void *addr, int size) { #if 0 if (!ecap_coherent(iommu->ecap)) #endif clflush_cache_range(addr, size); } Still the DMA Read fault from the IOMMU, and Tx Unit Hangs for the e1000e.
Thanks for testing, I'll poke a bit more. :/
(In reply to comment #36) > Okay there's no point applying David's patch that I just looked at, because of > what I said: none of my 4 IOMMU's on my laptop advertise Coherency in their > ecap register, so it *had* to be clfushing the PTEs in my case. Mostly true. But domain_flush_cache() won't flush the cache unless domain->iommu_coherency is set to zero. And there's a race condition in domain_update_iommu_coherency() where it would set that to 1 briefly, then decide that it should be set back to zero again. It's fairly unlikely that it'll bite, because I think it needs PCI hotplug to actually trigger it. But it's theoretically possible, and it would be nice to eliminate that possibility completely. My other debug patch will dump a list of the valid DMA mappings when it gets a DMA page fault. If the faulting address is _in_ that list, we know it's a DMAR bug (which we suspect anyway). If the faulting address _isn't_ in that list, it looks like a driver bug.
Created attachment 331065 [details] /var/log/messages booting 0.85 with 'intel_iommu=on'; lots of DMAR type errors. Not sure it's helpful, but I noticed a new kernel in koji. I booted kernel-2.6.29-0.85.rc3.git7.fc11.x86_64 twice, the first time without 'intel_iommu=on', the second with it. [System is Thinkpad X200, ....] First boot: without 'intel_iommu=on'. Feb 5 14:50:53 tlondon kernel: BUG: MAX_LOCKDEP_ENTRIES too low! Feb 5 14:50:53 tlondon kernel: turning off the locking correctness validator. and ------------[ cut here ]------------ WARNING: at lib/dma-debug.c:448 check_unmap+0x2e7/0x417() (Not tainted) Hardware name: 74585FU e1000e 0000:00:19.0: DMA-API: device driver frees DMA memory with wrong function [device address=0x000000011d06a000] [size=204 bytes] [mapped as page] [unmapped as single] Modules linked in: i915 drm i2c_algo_bit rfcomm sco bridge stp llc bnep l2cap sunrpc nf_conntrack_netbios_ns ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq freq_table fuse dm_multipath kvm_intel kvm uinput snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device arc4 ecb snd_pcm_oss snd_mixer_oss iwlagn snd_pcm iwlcore snd_timer lib80211 snd mac80211 thinkpad_acpi rfkill video i2c_i801 soundcore btusb output wmi iTCO_wdt usblp i2c_core e1000e hwmon cfg80211 iTCO_vendor_support bluetooth snd_page_alloc pcspkr sha256_generic cbc aes_x86_64 aes_generic dm_crypt [last unloaded: microcode] Pid: 0, comm: swapper Not tainted 2.6.29-0.85.rc3.git7.fc11.x86_64 #1 Call Trace: <IRQ> [<ffffffff8104a671>] warn_slowpath+0xb9/0xfe [<ffffffff8106ece3>] ? print_lock_contention_bug+0x1e/0x110 [<ffffffff8106ece3>] ? print_lock_contention_bug+0x1e/0x110 [<ffffffff8119fd81>] ? check_unmap+0x79/0x417 [<ffffffff8119ffef>] check_unmap+0x2e7/0x417 [<ffffffff810d6c6a>] ? kmem_cache_free+0xc7/0xdf [<ffffffff811a02ae>] debug_dma_unmap_page+0x48/0x4a [<ffffffffa0074327>] pci_unmap_page+0x6a/0x73 [e1000e] [<ffffffffa0074359>] e1000_put_txbuf+0x29/0x4b [e1000e] [<ffffffffa00744ef>] e1000_clean_tx_irq+0xc8/0x2ce [e1000e] [<ffffffffa0077e75>] ? e1000_clean+0x66/0x244 [e1000e] [<ffffffffa0077e81>] e1000_clean+0x72/0x244 [e1000e] [<ffffffff812e81c6>] ? net_rx_action+0x1a4/0x1ed [<ffffffff812e80d6>] net_rx_action+0xb4/0x1ed [<ffffffff812e81c6>] ? net_rx_action+0x1a4/0x1ed [<ffffffff8104fd55>] __do_softirq+0x94/0x16f [<ffffffff8101272c>] call_softirq+0x1c/0x30 [<ffffffff81013849>] do_softirq+0x4d/0xb4 [<ffffffff8104f9a3>] irq_exit+0x4e/0x88 [<ffffffff81013b60>] do_IRQ+0x130/0x154 [<ffffffff81011e13>] ret_from_intr+0x0/0x2e <EOI> <4>---[ end trace 3a7ddba377ef6e75 ]--- wlan1: authenticate with AP 00:19:77:00:4f:f After this, the system appears to "run normally". The second boot with 'intel_iommu=on' produced the log that is attached here. Please let me know how else to help here.....
(In reply to comment #39) > My other debug patch will dump a list of the valid DMA mappings when it gets a > DMA page fault. If the faulting address is _in_ that list, we know it's a DMAR > bug (which we suspect anyway). If the faulting address _isn't_ in that list, it > looks like a driver bug. Can you be more specific about which "other debug patch" you're referring to? Perhaps another pointer to it? I'll be glad to try it out because I'm interested in exactly that information too. And BTW, I tried a quick hack to add a 2us delay after clflushing the PTE cache line and it seems to help in my case: --- orig/linux-2.6.28/include/linux/intel-iommu.h 2008-12-24 15:26:37.000000000 -0800 +++ linux-2.6.28/include/linux/intel-iommu.h 2009-02-05 12:45:45.000000000 -0800 @@ -317,6 +317,7 @@ static inline void __iommu_flush_cache( { if (!ecap_coherent(iommu->ecap)) clflush_cache_range(addr, size); + udelay(2); } extern struct dmar_drhd_unit * dmar_find_matched_drhd_unit(struct pci_dev *dev);
(In reply to comment #41) > And BTW, I tried a quick hack to add a 2us delay after clflushing the PTE cache > line and it seems to help in my case: I too applied Bhavesh's udelay patch and can confirm a successful boot into Ubuntu 8.10 (using the patched kernel), with intel_iommu=on.
My other debugging attempts are in the kernel.org bug at http://bugzilla.kernel.org/show_bug.cgi?id=12578 They're targeted at the failure mode reported there, which is slightly different to your 'Present bit in root entry is clear' failure. I'm very interested (and somewhat perturbed) by the fact that udelay() helps. When you say 'seems to help'.... what exactly happens? Do _all_ the symptoms go away?
Can you show the contents of /proc/cpuinfo... particularly the clflush_size line?
'seems to help' == No more "DMA Read" faults from the IOMMU. Multiple boots. /proc/cpuinfo: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz stepping : 6 cpu MHz : 800.000 cache size : 3072 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority bogomips : 4787.82 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz stepping : 6 cpu MHz : 800.000 cache size : 3072 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority bogomips : 4787.97 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:
Created attachment 331069 [details] /proc/cpuinfo on Dell Precision T5400 Here is my situation: I've got a Dell Precision T5400 workstation. I installed Ubuntu 8.10 on it. I pulled 2.6.29-rc2 and built it using the Ubuntu 8.10 config with a few changes (like CONFIG_DMAR=y). I've never been able to boot into that newly built kernel; at boot time, my root filesystem can't be found and I see "Present bit in root entry is clear" errors in my log. See comment 17 for the full output. Bhavesh inserted that udelay() patch into the iommu code of his 2.6.28 kernel tree. With that kernel installed, I'm able to boot into Ubuntu successfully, all my devices appear to be detected, I can start X, send network packets, and perform disk I/O. In short, everything appears to work normally on the system. And I've confirmed in dmesg that the iommu units in my system have been detected. I'm attaching the output of /proc/cpuinfo.
I need to consult with internal folks. I suspect that we may need to flush it harder -- look at the chipset_flush functions in drivers/char/agp/intel-agp.c to see the kind of fun hoops we might need to jump through.
Please could you test with the patch at http://bugzilla.kernel.org/show_bug.cgi?id=12578#c19
(In reply to comment #48) > Please could you test with the patch at > http://bugzilla.kernel.org/show_bug.cgi?id=12578#c19 Presumably you'd like people to test with rwbf_quirk=2 and rwbf_quirk=1 as well as the default rwbf_quirk=3?
If rwbf=3 doesn't work then you could try =2. I don't think that'll make much difference though. A more useful test, if this patch doesn't fix it, would be to force the workaround in _all_ cases, and not just when !cap_rwbf().
So, I think the summary here is that Tom's lenovo x200 should be fixed by dwmw2's patch. The dell e6400 has the same chipset and should similarly be fixed. When that gets into Fedora, we should close this bug. Please file separate bugs for issues seen on other platforms (or indeed different issues seen on these two platforms)
Okay, I've filed the iwlan issue as bug #485139
Any easy way to know which build this patch will get in to? I see a build in koji now, but the changelog is not clear to me......
As I understand it, the iommu patch is waiting on confirmation from Intel as an errata. The patch that Dave is building a kernel in koji for right now is for the iwlagn issue, not the broken iommu issue. regards, Kyle
2.6.29-0.115.rc4.git7.fc11 is building now in koji, which should have this fixed. http://koji.fedoraproject.org/koji/taskinfo?taskID=1125914
Created attachment 331890 [details] dmesg output booting 0.115 with 'intel_iommu=on' Downloaded kernel-2.6.29-0.115.rc4.git7.fc11.x86_64, and booted with 'intel_iommu=on'. Looks good! I only get 1 DMAR fault message: Feb 13 17:25:52 tlondon kernel: IOMMU: Setting identity map for device 0000:00:02.1 [0xbd30f000 - 0xbd3c6000] Feb 13 17:25:52 tlondon kernel: IOMMU: Setting identity map for device 0000:00:02.1 [0xbd6ff000 - 0xbd700000] Feb 13 17:25:52 tlondon kernel: IOMMU: Setting identity map for device 0000:00:02.1 [0x100000000 - 0x13c000000] Feb 13 17:25:52 tlondon kernel: IOMMU: Prepare 0-16M unity mapping for LPC Feb 13 17:25:52 tlondon kernel: IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000] Feb 13 17:25:52 tlondon kernel: DMAR:[DMA Write] Request device [00:02.0] fault addr 4f6577000 Feb 13 17:25:52 tlondon kernel: DMAR:[fault reason 05] PTE Write access is not set Feb 13 17:25:52 tlondon kernel: PCI-DMA: Intel(R) Virtualization Technology for Directed I/O Feb 13 17:25:52 tlondon kernel: hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0 From all appearances, iwlagn, e1000e, and the rest seem to be working without issue. Posting output of dmesg for completeness. Thanks!
Interesting, can you just paste in the lspci -vvnn for 00:02.0? cheers, Kyle
Sure: 00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller [8086:2a42] (rev 07) (prog-if 00 [VGA controller]) Subsystem: Lenovo Device [17aa:20e4] Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 34 Region 0: Memory at f2000000 (64-bit, non-prefetchable) [size=4M] Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M] Region 4: I/O ports at 1800 [size=8] Expansion ROM at <unassigned> [disabled] Capabilities: [90] MSI: Mask- 64bit- Count=1/1 Enable+ Address: fee0300c Data: 41e9 Capabilities: [d0] Power Management version 3 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- That what you're looking for?
Yup, ok, cool, that's one we've already seen. cheers, Kyle
I have hit probably the same issue with Fedora 13. It is reported as bug #572550. Should I reopen this bug instead?