Bug 732279

Summary: mlx4_core 0000:86:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000fe181000] [size=4096 bytes]
Product: [Fedora] Fedora
Reporter: Albert Strasheim <fullung>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 16
CC: gansalmon, itamar, jforbes, jonathan, kernel-maint, madhu.chinakonda
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-11-14 15:23:48 UTC

Description Albert Strasheim 2011-08-21 15:19:14 UTC
Description of problem:

[   30.626164] ------------[ cut here ]------------
[   30.631058] WARNING: at lib/dma-debug.c:911 check_sync+0xca/0x46c()
[   30.637589] Hardware name: X8DTH-i/6/iF/6F
[   30.641954] mlx4_core 0000:86:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000fe181000] [size=4096 bytes]
[   30.656829] Modules linked in: ses enclosure ghes serio_raw hed i2c_i801 joydev i2c_core mpt2sas iTCO_wdt i7core_edac iTCO_vendor_support scsi_transport_sas ioatdma raid_class edac_core mlx4_core(+) ib_ipoib rdma_ucm rdma_cm iw_cm ib_addr ib_ucm ib_cm ib_sa ib_uverbs ib_umad ib_mad ib_core igb dca ipmi_devintf ipmi_si ipmi_msghandler
[   30.689819] Pid: 846, comm: work_for_cpu Not tainted 3.0.1-5.fc16.x86_64 #1
[   30.697049] Call Trace:
[   30.699768]  [<ffffffff81058ebc>] warn_slowpath_common+0x83/0x9b
[   30.706044]  [<ffffffff81058f77>] warn_slowpath_fmt+0x46/0x48
[   30.712061]  [<ffffffff81253b41>] check_sync+0xca/0x46c
[   30.717557]  [<ffffffff8108aefc>] ? mark_held_locks+0x4b/0x6d
[   30.723576]  [<ffffffff8100e9fd>] ? paravirt_read_tsc+0x9/0xd
[   30.729589]  [<ffffffff8100eec7>] ? native_sched_clock+0x34/0x36
[   30.735862]  [<ffffffff812541d8>] debug_dma_sync_single_for_cpu+0x42/0x44
[   30.742920]  [<ffffffff814dbd6d>] ? __mutex_unlock_slowpath+0x112/0x122
[   30.749796]  [<ffffffff8108b029>] ? trace_hardirqs_on_caller+0x10b/0x12f
[   30.756765]  [<ffffffff814dbd75>] ? __mutex_unlock_slowpath+0x11a/0x122
[   30.763645]  [<ffffffff814dbd8b>] ? mutex_unlock+0xe/0x10
[   30.769319]  [<ffffffffa00da0b8>] dma_sync_single_for_cpu.constprop.11+0x5d/0x66 [mlx4_core]
[   30.778243]  [<ffffffffa00da181>] mlx4_write_mtt+0xc0/0x133 [mlx4_core]
[   30.785135]  [<ffffffffa00d360d>] mlx4_create_eq+0x305/0x45c [mlx4_core]
[   30.792106]  [<ffffffffa00d3a26>] ? mlx4_init_eq_table+0x146/0x4ac [mlx4_core]
[   30.799844]  [<ffffffff8112a0da>] ? __kmalloc+0xfa/0x10c
[   30.805439]  [<ffffffffa00d3a6f>] mlx4_init_eq_table+0x18f/0x4ac [mlx4_core]
[   30.812755]  [<ffffffffa00dd192>] mlx4_setup_hca+0x11a/0x410 [mlx4_core]
[   30.819729]  [<ffffffffa00d796c>] ? kzalloc.constprop.3+0x13/0x15 [mlx4_core]
[   30.827128]  [<ffffffffa00d8118>] __mlx4_init_one+0x7aa/0x7bb [mlx4_core]
[   30.834189]  [<ffffffff8106f9d8>] ? move_linked_works+0x6e/0x6e
[   30.840374]  [<ffffffffa00dd4c5>] mlx4_init_one+0x3d/0x42 [mlx4_core]
[   30.847083]  [<ffffffff8125dca6>] local_pci_probe+0x44/0x75
[   30.852921]  [<ffffffff8106f9ee>] do_work_for_cpu+0x16/0x28
[   30.858765]  [<ffffffff81075e5d>] kthread+0xa8/0xb0
[   30.863914]  [<ffffffff814e50a4>] kernel_thread_helper+0x4/0x10
[   30.870096]  [<ffffffff814dd754>] ? retint_restore_args+0x13/0x13
[   30.876459]  [<ffffffff81075db5>] ? __init_kthread_worker+0x5a/0x5a
[   30.882994]  [<ffffffff814e50a0>] ? gs_change+0x13/0x13
[   30.888487] ---[ end trace ede39044efbe156f ]---
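
For readers unfamiliar with this warning: it comes from the kernel's DMA-API debugging code (lib/dma-debug.c), which records the DMA mappings a driver sets up and complains when a sync call refers to a device address it has no record of. The Python below is only a toy model of that bookkeeping, not kernel code. The class name, the method names and the "mapping must cover the synced range" rule are assumptions made for illustration; the kernel's real matching rules are more involved.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class DmaDebugModel:
    """Toy model of dma-debug bookkeeping (hypothetical, not kernel code)."""

    # (device, start address) -> size in bytes of each recorded mapping
    mappings: Dict[Tuple[str, int], int] = field(default_factory=dict)

    def map(self, dev: str, addr: int, size: int) -> None:
        """Record a mapping, as a map or alloc call would."""
        self.mappings[(dev, addr)] = size

    def unmap(self, dev: str, addr: int) -> None:
        """Forget a mapping, as an unmap or free call would."""
        self.mappings.pop((dev, addr), None)

    def check_sync(self, dev: str, addr: int, size: int) -> Optional[str]:
        """Return a warning string if no recorded mapping covers the range."""
        for (mdev, start), msize in self.mappings.items():
            if mdev == dev and start <= addr and addr + size <= start + msize:
                return None
        return (f"{dev}: DMA-API: device driver tries to sync DMA memory "
                f"it has not allocated [device address=0x{addr:016x}] "
                f"[size={size} bytes]")
```

In this model, syncing a 4096-byte range that was never passed to map() produces a message of the same shape as the one in the trace above. It says nothing about why mlx4_core ends up in that situation.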

Version-Release number of selected component (if applicable):

kernel-3.0.1-5.fc16.x86_64

How reproducible:

Always

Steps to Reproduce:
1. Boot kernel with Mellanox InfiniBand controller
  
Additional info:

Motherboard is configured with Optimal Defaults

The same hardware has been giving intermittent issues with the following errors:

[   19.360850] mlx4_core 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

[   19.360855] mpt2sas 0000:04:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

These problems mostly go away when the machine is rebooted.

Comment 1 Albert Strasheim 2011-08-23 08:35:29 UTC
I think the "vpd r/w failed" errors are due to a BIOS bug which is exposed by udev loading modules together. I don't think it makes a difference to whether this warning is emitted.

Comment 2 Albert Strasheim 2011-09-01 18:26:51 UTC
It seems this has happened before:

http://copilotco.com/mail-archives/ofa.2009/msg04432.html

Comment 3 Albert Strasheim 2012-02-28 09:06:03 UTC
This still happens with the latest 3.2.3 debug kernel rpm.

WARNING: at lib/dma-debug.c:966 check_sync+0x2a8/0x530()
mlx4_core 0000:02:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000017839c1000] [size=4096 bytes]
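
To compare the warning across machines and kernels, the device, address and size can be pulled out of the message line. This is a minimal sketch; the regular expression is inferred from the lines quoted in this report only, and the function name is made up for illustration.

```python
import re

# Pattern inferred from the warning lines quoted in this bug report.
DMA_API_RE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"DMA-API: (?P<message>.*?) "
    r"\[device address=(?P<addr>0x[0-9a-f]+)\] "
    r"\[size=(?P<size>\d+) bytes\]"
)


def parse_dma_api_warning(line):
    """Return the fields of a DMA-API warning line, or None if it is not one."""
    m = DMA_API_RE.search(line)
    if m is None:
        return None
    return {
        "driver": m["driver"],        # e.g. mlx4_core
        "bdf": m["bdf"],              # PCI domain:bus:device.function
        "message": m["message"],
        "addr": int(m["addr"], 16),   # device (DMA) address
        "size": int(m["size"]),       # length of the synced range in bytes
    }
```

The function works with or without the leading dmesg timestamp, since it searches for the pattern anywhere in the line.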

Comment 4 Dave Jones 2012-03-22 17:11:33 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 5 Dave Jones 2012-03-22 17:14:05 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 6 Dave Jones 2012-03-22 17:23:20 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 7 Albert Strasheim 2012-03-23 16:45:20 UTC
Looks fixed.

By the way, apparently pcie_aspm=off or pcie_aspm=performance fixes the VPD issue, according to Supermicro.
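
To confirm which ASPM setting a machine actually booted with, the kernel command line can be inspected. A small sketch follows; the function name is hypothetical. As far as I know, the upstream kernel documents pcie_aspm=off and pcie_aspm=force, with the policy selected separately through pcie_aspm.policy=, so the sketch reports both keys and makes no judgement about which values are valid.

```python
def aspm_settings(cmdline):
    """Return the ASPM-related parameters found on a kernel command line."""
    settings = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only key=value tokens for the two ASPM parameters are of interest.
        if sep and key in ("pcie_aspm", "pcie_aspm.policy"):
            settings[key] = value
    return settings
```

On a running system the input would be the contents of /proc/cmdline, for example aspm_settings(open("/proc/cmdline").read()).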

Comment 8 Albert Strasheim 2012-03-26 10:44:37 UTC
Saw this again on a machine with an mlx4 card with older firmware, running the 3.3.0-4 debug kernel.

Comment 9 Albert Strasheim 2012-03-26 10:48:08 UTC
False alarm maybe. I see now it was 3.2.7-debug on this machine, not 3.3.0-4. Will retest.

Comment 10 Albert Strasheim 2012-03-27 10:29:53 UTC
Retested with 3.3.0-4.fc16.x86_64.debug on this machine. Looks fixed.

Comment 11 Albert Strasheim 2012-03-27 10:32:23 UTC
Argh. I wasn't looking properly. It turns out 3.3.0-4.fc16.x86_64.debug still has this.

[   40.693654] ------------[ cut here ]------------
[   40.695188] WARNING: at lib/dma-debug.c:966 check_sync+0x2a8/0x530()
[   40.697642] Hardware name: SUN BLADE X6270 SERVER MODULE
[   40.706231] mlx4_core 0000:0d:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x000000102c981000] [size=4096 bytes]
[   40.730140] Modules linked in: ib_ipoib ib_cm ib_addr ib_sa ib_uverbs ib_umad ib_mad ib_core ipmi_poweroff ipmi_watchdog ipmi_devintf i2c_i801 mptsas(+) iTCO_wdt mptscsih igb i2c_core joydev microcode iTCO_vendor_support i7core_edac mlx4_core(+) mptbase ioatdma dca edac_core scsi_transport_sas ipmi_si ipmi_msghandler
[   40.787764] Pid: 257, comm: modprobe Not tainted 3.3.0-4.fc16.x86_64.debug #1
[   40.790549] Call Trace:
[   40.806210]  [<ffffffff81061f6f>] warn_slowpath_common+0x7f/0xc0
[   40.808336]  [<ffffffff81062066>] warn_slowpath_fmt+0x46/0x50
[   40.827621]  [<ffffffff8133ff88>] check_sync+0x2a8/0x530
[   40.829762]  [<ffffffff81340492>] debug_dma_sync_single_for_cpu+0x42/0x50
[   40.847063]  [<ffffffff8133c0fc>] ? is_swiotlb_buffer+0x3c/0x50
[   40.849194]  [<ffffffff8133c918>] ? swiotlb_sync_single+0x38/0x80
[   40.868855]  [<ffffffff8133ca5c>] ? swiotlb_sync_single_for_cpu+0xc/0x10
[   40.905296]  [<ffffffffa0095d3a>] __mlx4_write_mtt+0xea/0x1e0 [mlx4_core]
[   40.909217]  [<ffffffffa0095f5c>] mlx4_write_mtt+0x12c/0x170 [mlx4_core]
[   40.925092]  [<ffffffffa008a9bd>] mlx4_create_eq+0x4ad/0x6e0 [mlx4_core]
[   40.926982]  [<ffffffffa008b24f>] mlx4_init_eq_table+0x1ff/0x6b0 [mlx4_core]
[   40.947112]  [<ffffffffa00916d7>] mlx4_setup_hca+0x167/0x530 [mlx4_core]
[   40.964978]  [<ffffffff811a26ec>] ? kfree+0x28c/0x2a0
[   40.966813]  [<ffffffffa0092342>] __mlx4_init_one+0x8a2/0xca0 [mlx4_core]
[   40.985583]  [<ffffffffa009f5df>] mlx4_init_one+0x3d/0x42 [mlx4_core]
[   40.987441]  [<ffffffff813498dc>] local_pci_probe+0x5c/0xd0
[   41.005550]  [<ffffffff8134b1d9>] pci_device_probe+0x109/0x130
[   41.008078]  [<ffffffff8141216c>] driver_probe_device+0x9c/0x300
[   41.025620]  [<ffffffff8141247b>] __driver_attach+0xab/0xb0
[   41.027444]  [<ffffffff814123d0>] ? driver_probe_device+0x300/0x300
[   41.046544]  [<ffffffff814104fe>] bus_for_each_dev+0x5e/0x90
[   41.048703]  [<ffffffff81411d6e>] driver_attach+0x1e/0x20
[   41.065678]  [<ffffffff81411960>] bus_add_driver+0x1c0/0x2b0
[   41.067490]  [<ffffffffa00b005b>] ? mlx4_catas_init+0x5b/0x5b [mlx4_core]
[   41.086974]  [<ffffffff814129f6>] driver_register+0x76/0x140
[   41.104564]  [<ffffffff813325a8>] ? __raw_spin_lock_init+0x38/0x70
[   41.106723]  [<ffffffffa00b005b>] ? mlx4_catas_init+0x5b/0x5b [mlx4_core]
[   41.125441]  [<ffffffff8134ae96>] __pci_register_driver+0x66/0xe0
[   41.127858]  [<ffffffffa00b005b>] ? mlx4_catas_init+0x5b/0x5b [mlx4_core]
[   41.145308]  [<ffffffffa00b0107>] mlx4_init+0xac/0xfa5 [mlx4_core]
[   41.147163]  [<ffffffff8100203f>] do_one_initcall+0x3f/0x170
[   41.166884]  [<ffffffff810dbb82>] sys_init_module+0xc82/0x21f0
[   41.169044]  [<ffffffff816abb29>] system_call_fastpath+0x16/0x1b
[   41.186223] ---[ end trace 2a085dfdc60385a8 ]---
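
When comparing this trace with the 3.0.1 one, it helps to reduce each to the frames that belong to mlx4_core and to drop the entries marked with "?", which are addresses found on the stack that are not confirmed to be part of the call chain. A rough sketch, with the frame pattern inferred from the traces in this report and a made-up function name:

```python
import re

# Frame format inferred from the call traces quoted in this bug report.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<unreliable>\? )?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def module_frames(trace, module):
    """Return the reliable frames of `module`, innermost first."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.search(line)
        # Skip non-frame lines, "?" frames, and frames from other modules.
        if m and not m["unreliable"] and m["module"] == module:
            frames.append(m["symbol"])
    return frames
```

Both traces in this report go through mlx4_init_one, __mlx4_init_one, mlx4_setup_hca, mlx4_init_eq_table, mlx4_create_eq and mlx4_write_mtt before the sync call that triggers the warning.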

Comment 12 Josh Boyer 2012-09-06 15:29:06 UTC
Are you still seeing this on a 3.4 or 3.5 debug kernel?  I'm guessing probably, but it would be good to know.

Comment 13 Albert Strasheim 2012-09-06 18:40:51 UTC
I'll try to boot a machine with the latest debug kernel over the weekend.

Comment 14 Dave Jones 2012-10-23 15:34:11 UTC
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem.
(Please don't clone this bug; a fresh bug referencing this bug in the comment is sufficient.)

Comment 15 Justin M. Forbes 2012-11-14 15:23:48 UTC
With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.