Description of problem:

[ 30.626164] ------------[ cut here ]------------
[ 30.631058] WARNING: at lib/dma-debug.c:911 check_sync+0xca/0x46c()
[ 30.637589] Hardware name: X8DTH-i/6/iF/6F
[ 30.641954] mlx4_core 0000:86:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000fe181000] [size=4096 bytes]
[ 30.656829] Modules linked in: ses enclosure ghes serio_raw hed i2c_i801 joydev i2c_core mpt2sas iTCO_wdt i7core_edac iTCO_vendor_support scsi_transport_sas ioatdma raid_class edac_core mlx4_core(+) ib_ipoib rdma_ucm rdma_cm iw_cm ib_addr ib_ucm ib_cm ib_sa ib_uverbs ib_umad ib_mad ib_core igb dca ipmi_devintf ipmi_si ipmi_msghandler
[ 30.689819] Pid: 846, comm: work_for_cpu Not tainted 3.0.1-5.fc16.x86_64 #1
[ 30.697049] Call Trace:
[ 30.699768] [<ffffffff81058ebc>] warn_slowpath_common+0x83/0x9b
[ 30.706044] [<ffffffff81058f77>] warn_slowpath_fmt+0x46/0x48
[ 30.712061] [<ffffffff81253b41>] check_sync+0xca/0x46c
[ 30.717557] [<ffffffff8108aefc>] ? mark_held_locks+0x4b/0x6d
[ 30.723576] [<ffffffff8100e9fd>] ? paravirt_read_tsc+0x9/0xd
[ 30.729589] [<ffffffff8100eec7>] ? native_sched_clock+0x34/0x36
[ 30.735862] [<ffffffff812541d8>] debug_dma_sync_single_for_cpu+0x42/0x44
[ 30.742920] [<ffffffff814dbd6d>] ? __mutex_unlock_slowpath+0x112/0x122
[ 30.749796] [<ffffffff8108b029>] ? trace_hardirqs_on_caller+0x10b/0x12f
[ 30.756765] [<ffffffff814dbd75>] ? __mutex_unlock_slowpath+0x11a/0x122
[ 30.763645] [<ffffffff814dbd8b>] ? mutex_unlock+0xe/0x10
[ 30.769319] [<ffffffffa00da0b8>] dma_sync_single_for_cpu.constprop.11+0x5d/0x66 [mlx4_core]
[ 30.778243] [<ffffffffa00da181>] mlx4_write_mtt+0xc0/0x133 [mlx4_core]
[ 30.785135] [<ffffffffa00d360d>] mlx4_create_eq+0x305/0x45c [mlx4_core]
[ 30.792106] [<ffffffffa00d3a26>] ? mlx4_init_eq_table+0x146/0x4ac [mlx4_core]
[ 30.799844] [<ffffffff8112a0da>] ? __kmalloc+0xfa/0x10c
[ 30.805439] [<ffffffffa00d3a6f>] mlx4_init_eq_table+0x18f/0x4ac [mlx4_core]
[ 30.812755] [<ffffffffa00dd192>] mlx4_setup_hca+0x11a/0x410 [mlx4_core]
[ 30.819729] [<ffffffffa00d796c>] ? kzalloc.constprop.3+0x13/0x15 [mlx4_core]
[ 30.827128] [<ffffffffa00d8118>] __mlx4_init_one+0x7aa/0x7bb [mlx4_core]
[ 30.834189] [<ffffffff8106f9d8>] ? move_linked_works+0x6e/0x6e
[ 30.840374] [<ffffffffa00dd4c5>] mlx4_init_one+0x3d/0x42 [mlx4_core]
[ 30.847083] [<ffffffff8125dca6>] local_pci_probe+0x44/0x75
[ 30.852921] [<ffffffff8106f9ee>] do_work_for_cpu+0x16/0x28
[ 30.858765] [<ffffffff81075e5d>] kthread+0xa8/0xb0
[ 30.863914] [<ffffffff814e50a4>] kernel_thread_helper+0x4/0x10
[ 30.870096] [<ffffffff814dd754>] ? retint_restore_args+0x13/0x13
[ 30.876459] [<ffffffff81075db5>] ? __init_kthread_worker+0x5a/0x5a
[ 30.882994] [<ffffffff814e50a0>] ? gs_change+0x13/0x13
[ 30.888487] ---[ end trace ede39044efbe156f ]---

Version-Release number of selected component (if applicable):
kernel-3.0.1-5.fc16.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Boot kernel with Mellanox InfiniBand controller

Additional info:
Motherboard is configured with Optimal Defaults

The same hardware has been giving intermittent issues with the following errors:

[ 19.360850] mlx4_core 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
[ 19.360855] mpt2sas 0000:04:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.

Mostly when the machine is rebooted, these problems go away.
I think the "vpd r/w failed" errors are due to a BIOS bug which is exposed by udev loading modules together. I don't think it makes a difference to whether this warning is emitted.
It seems this has happened before: http://copilotco.com/mail-archives/ofa.2009/msg04432.html
This still happens with the latest 3.2.3 debug kernel rpm.

WARNING: at lib/dma-debug.c:966 check_sync+0x2a8/0x530()
mlx4_core 0000:02:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000017839c1000] [size=4096 bytes]
[mass update] kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository. Please retest with this update.
Looks fixed. By the way, according to Supermicro, booting with pcie_aspm=off or pcie_aspm=performance apparently fixes the VPD issue.
Saw this again on a machine with an mlx4 card with older firmware, running the 3.3.0-4 debug kernel.
False alarm maybe. I see now it was 3.2.7-debug on this machine, not 3.3.0-4. Will retest.
Retested with 3.3.0-4.fc16.x86_64.debug on this machine. Looks fixed.
Argh. I wasn't looking properly. Finally, 3.3.0-4.fc16.x86_64.debug still has this.

[ 40.693654] ------------[ cut here ]------------
[ 40.695188] WARNING: at lib/dma-debug.c:966 check_sync+0x2a8/0x530()
[ 40.697642] Hardware name: SUN BLADE X6270 SERVER MODULE
[ 40.706231] mlx4_core 0000:0d:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x000000102c981000] [size=4096 bytes]
[ 40.730140] Modules linked in: ib_ipoib ib_cm ib_addr ib_sa ib_uverbs ib_umad ib_mad ib_core ipmi_poweroff ipmi_watchdog ipmi_devintf i2c_i801 mptsas(+) iTCO_wdt mptscsih igb i2c_core joydev microcode iTCO_vendor_support i7core_edac mlx4_core(+) mptbase ioatdma dca edac_core scsi_transport_sas ipmi_si ipmi_msghandler
[ 40.787764] Pid: 257, comm: modprobe Not tainted 3.3.0-4.fc16.x86_64.debug #1
[ 40.790549] Call Trace:
[ 40.806210] [<ffffffff81061f6f>] warn_slowpath_common+0x7f/0xc0
[ 40.808336] [<ffffffff81062066>] warn_slowpath_fmt+0x46/0x50
[ 40.827621] [<ffffffff8133ff88>] check_sync+0x2a8/0x530
[ 40.829762] [<ffffffff81340492>] debug_dma_sync_single_for_cpu+0x42/0x50
[ 40.847063] [<ffffffff8133c0fc>] ? is_swiotlb_buffer+0x3c/0x50
[ 40.849194] [<ffffffff8133c918>] ? swiotlb_sync_single+0x38/0x80
[ 40.868855] [<ffffffff8133ca5c>] ? swiotlb_sync_single_for_cpu+0xc/0x10
[ 40.905296] [<ffffffffa0095d3a>] __mlx4_write_mtt+0xea/0x1e0 [mlx4_core]
[ 40.909217] [<ffffffffa0095f5c>] mlx4_write_mtt+0x12c/0x170 [mlx4_core]
[ 40.925092] [<ffffffffa008a9bd>] mlx4_create_eq+0x4ad/0x6e0 [mlx4_core]
[ 40.926982] [<ffffffffa008b24f>] mlx4_init_eq_table+0x1ff/0x6b0 [mlx4_core]
[ 40.947112] [<ffffffffa00916d7>] mlx4_setup_hca+0x167/0x530 [mlx4_core]
[ 40.964978] [<ffffffff811a26ec>] ? kfree+0x28c/0x2a0
[ 40.966813] [<ffffffffa0092342>] __mlx4_init_one+0x8a2/0xca0 [mlx4_core]
[ 40.985583] [<ffffffffa009f5df>] mlx4_init_one+0x3d/0x42 [mlx4_core]
[ 40.987441] [<ffffffff813498dc>] local_pci_probe+0x5c/0xd0
[ 41.005550] [<ffffffff8134b1d9>] pci_device_probe+0x109/0x130
[ 41.008078] [<ffffffff8141216c>] driver_probe_device+0x9c/0x300
[ 41.025620] [<ffffffff8141247b>] __driver_attach+0xab/0xb0
[ 41.027444] [<ffffffff814123d0>] ? driver_probe_device+0x300/0x300
[ 41.046544] [<ffffffff814104fe>] bus_for_each_dev+0x5e/0x90
[ 41.048703] [<ffffffff81411d6e>] driver_attach+0x1e/0x20
[ 41.065678] [<ffffffff81411960>] bus_add_driver+0x1c0/0x2b0
[ 41.067490] [<ffffffffa00b005b>] ? mlx4_catas_init+0x5b/0x5b [mlx4_core]
[ 41.086974] [<ffffffff814129f6>] driver_register+0x76/0x140
[ 41.104564] [<ffffffff813325a8>] ? __raw_spin_lock_init+0x38/0x70
[ 41.106723] [<ffffffffa00b005b>] ? mlx4_catas_init+0x5b/0x5b [mlx4_core]
[ 41.125441] [<ffffffff8134ae96>] __pci_register_driver+0x66/0xe0
[ 41.127858] [<ffffffffa00b005b>] ? mlx4_catas_init+0x5b/0x5b [mlx4_core]
[ 41.145308] [<ffffffffa00b0107>] mlx4_init+0xac/0xfa5 [mlx4_core]
[ 41.147163] [<ffffffff8100203f>] do_one_initcall+0x3f/0x170
[ 41.166884] [<ffffffff810dbb82>] sys_init_module+0xc82/0x21f0
[ 41.169044] [<ffffffff816abb29>] system_call_fastpath+0x16/0x1b
[ 41.186223] ---[ end trace 2a085dfdc60385a8 ]---
Are you still seeing this on a 3.4 or 3.5 debug kernel? I'm guessing probably, but it would be good to know.
I'll try to boot a machine with the latest debug kernel over the weekend.
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version. Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.

If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug; a fresh bug referencing this bug in the comment is sufficient.)
With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.