732279 – mlx4_core 0000:86:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000fe181000] [size=4096 bytes]

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	16
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-08-21 15:19 UTC by Albert Strasheim
Modified:	2012-11-14 15:23 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-11-14 15:23:48 UTC
Type:	---
Embargoed:
Dependent Products:

Description Albert Strasheim 2011-08-21 15:19:14 UTC

Description of problem:

[   30.626164] ------------[ cut here ]------------
[   30.631058] WARNING: at lib/dma-debug.c:911 check_sync+0xca/0x46c()
[   30.637589] Hardware name: X8DTH-i/6/iF/6F
[   30.641954] mlx4_core 0000:86:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000fe181000] [size=4096 bytes]
[   30.656829] Modules linked in: ses enclosure ghes serio_raw hed i2c_i801 joydev i2c_core mpt2sas iTCO_wdt i7core_edac iTCO_vendor_support scsi_transport_sas ioatdma raid_class edac_core mlx4_core(+) ib_ipoib rdma_ucm rdma_cm iw_cm ib_addr ib_ucm ib_cm ib_sa ib_uverbs ib_umad ib_mad ib_core igb dca ipmi_devintf ipmi_si ipmi_msghandler
[   30.689819] Pid: 846, comm: work_for_cpu Not tainted 3.0.1-5.fc16.x86_64 #1
[   30.697049] Call Trace:
[   30.699768]  [<ffffffff81058ebc>] warn_slowpath_common+0x83/0x9b
[   30.706044]  [<ffffffff81058f77>] warn_slowpath_fmt+0x46/0x48
[   30.712061]  [<ffffffff81253b41>] check_sync+0xca/0x46c
[   30.717557]  [<ffffffff8108aefc>] ? mark_held_locks+0x4b/0x6d
[   30.723576]  [<ffffffff8100e9fd>] ? paravirt_read_tsc+0x9/0xd
[   30.729589]  [<ffffffff8100eec7>] ? native_sched_clock+0x34/0x36
[   30.735862]  [<ffffffff812541d8>] debug_dma_sync_single_for_cpu+0x42/0x44
[   30.742920]  [<ffffffff814dbd6d>] ? __mutex_unlock_slowpath+0x112/0x122
[   30.749796]  [<ffffffff8108b029>] ? trace_hardirqs_on_caller+0x10b/0x12f
[   30.756765]  [<ffffffff814dbd75>] ? __mutex_unlock_slowpath+0x11a/0x122
[   30.763645]  [<ffffffff814dbd8b>] ? mutex_unlock+0xe/0x10
[   30.769319]  [<ffffffffa00da0b8>] dma_sync_single_for_cpu.constprop.11+0x5d/0x66 [mlx4_core]
[   30.778243]  [<ffffffffa00da181>] mlx4_write_mtt+0xc0/0x133 [mlx4_core]
[   30.785135]  [<ffffffffa00d360d>] mlx4_create_eq+0x305/0x45c [mlx4_core]
[   30.792106]  [<ffffffffa00d3a26>] ? mlx4_init_eq_table+0x146/0x4ac [mlx4_core]
[   30.799844]  [<ffffffff8112a0da>] ? __kmalloc+0xfa/0x10c
[   30.805439]  [<ffffffffa00d3a6f>] mlx4_init_eq_table+0x18f/0x4ac [mlx4_core]
[   30.812755]  [<ffffffffa00dd192>] mlx4_setup_hca+0x11a/0x410 [mlx4_core]
[   30.819729]  [<ffffffffa00d796c>] ? kzalloc.constprop.3+0x13/0x15 [mlx4_core]
[   30.827128]  [<ffffffffa00d8118>] __mlx4_init_one+0x7aa/0x7bb [mlx4_core]
[   30.834189]  [<ffffffff8106f9d8>] ? move_linked_works+0x6e/0x6e
[   30.840374]  [<ffffffffa00dd4c5>] mlx4_init_one+0x3d/0x42 [mlx4_core]
[   30.847083]  [<ffffffff8125dca6>] local_pci_probe+0x44/0x75
[   30.852921]  [<ffffffff8106f9ee>] do_work_for_cpu+0x16/0x28
[   30.858765]  [<ffffffff81075e5d>] kthread+0xa8/0xb0
[   30.863914]  [<ffffffff814e50a4>] kernel_thread_helper+0x4/0x10
[   30.870096]  [<ffffffff814dd754>] ? retint_restore_args+0x13/0x13
[   30.876459]  [<ffffffff81075db5>] ? __init_kthread_worker+0x5a/0x5a
[   30.882994]  [<ffffffff814e50a0>] ? gs_change+0x13/0x13
[   30.888487] ---[ end trace ede39044efbe156f ]---
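
For comparing this warning across machines and kernel versions, the device, address and size can be pulled out of the DMA-API line with a small script. This is only a triage sketch: the regular expression is written against the line format seen in this report, and the names `DMA_WARNING` and `parse_dma_warning` are made up for illustration and are not part of any kernel tooling.

```python
import re

# Matches the dma-debug warning line as it appears in this report, e.g.
# "mlx4_core 0000:86:00.0: DMA-API: ... [device address=0x...] [size=4096 bytes]".
# The format is assumed from the log above, not from a kernel specification.
DMA_WARNING = re.compile(
    r"(?P<driver>\S+) (?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"DMA-API: (?P<message>.*?) "
    r"\[device address=0x(?P<address>[0-9a-f]+)\] "
    r"\[size=(?P<size>\d+) bytes\]"
)


def parse_dma_warning(line):
    """Return the fields of a DMA-API warning line, or None if it does not match."""
    match = DMA_WARNING.search(line)
    if match is None:
        return None
    return {
        "driver": match.group("driver"),
        "device": match.group("device"),
        "message": match.group("message"),
        "address": int(match.group("address"), 16),  # device (bus) address
        "size": int(match.group("size")),            # length of the sync, in bytes
    }
```

Run on the warning line above, this gives driver `mlx4_core`, device `0000:86:00.0`, address `0xfe181000` and size 4096. Ordinary call trace lines return `None`.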

Version-Release number of selected component (if applicable):

kernel-3.0.1-5.fc16.x86_64

How reproducible:

Always

Steps to Reproduce:
1. Boot kernel with Mellanox InfiniBand controller
  
Additional info:

Motherboard is configured with Optimal Defaults

The same hardware has been giving intermittent issues, with the following errors:

[   19.360850] mlx4_core 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

[   19.360855] mpt2sas 0000:04:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

These problems mostly go away when the machine is rebooted.

Comment 1 Albert Strasheim 2011-08-23 08:35:29 UTC

I think the "vpd r/w failed" errors are due to a BIOS bug which is exposed by udev loading modules together. I don't think they affect whether this warning is emitted.

Comment 2 Albert Strasheim 2011-09-01 18:26:51 UTC

It seems this has happened before:

http://copilotco.com/mail-archives/ofa.2009/msg04432.html

Comment 3 Albert Strasheim 2012-02-28 09:06:03 UTC

This still happens with the latest 3.2.3 debug kernel rpm.

WARNING: at lib/dma-debug.c:966 check_sync+0x2a8/0x530()
mlx4_core 0000:02:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000017839c1000] [size=4096 bytes]

Comment 4 Dave Jones 2012-03-22 17:11:33 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 5 Dave Jones 2012-03-22 17:14:05 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 6 Dave Jones 2012-03-22 17:23:20 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 7 Albert Strasheim 2012-03-23 16:45:20 UTC

Looks fixed.

By the way, apparently pcie_aspm=off or pcie_aspm=performance fixes the VPD issue, according to Supermicro.

Comment 8 Albert Strasheim 2012-03-26 10:44:37 UTC

Saw this again with 3.3.0-4 debug on a machine whose mlx4 has older firmware.

Comment 9 Albert Strasheim 2012-03-26 10:48:08 UTC

False alarm maybe. I see now it was 3.2.7-debug on this machine, not 3.3.0-4. Will retest.

Comment 10 Albert Strasheim 2012-03-27 10:29:53 UTC

Retested with 3.3.0-4.fc16.x86_64.debug on this machine. Looks fixed.

Comment 11 Albert Strasheim 2012-03-27 10:32:23 UTC

Argh. I wasn't looking properly. After all, 3.3.0-4.fc16.x86_64.debug still has this.

[   40.693654] ------------[ cut here ]------------
[   40.695188] WARNING: at lib/dma-debug.c:966 check_sync+0x2a8/0x530()
[   40.697642] Hardware name: SUN BLADE X6270 SERVER MODULE
[   40.706231] mlx4_core 0000:0d:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x000000102c981000] [size=4096 bytes]
[   40.730140] Modules linked in: ib_ipoib ib_cm ib_addr ib_sa ib_uverbs ib_umad ib_mad ib_core ipmi_poweroff ipmi_watchdog ipmi_devintf i2c_i801 mptsas(+) iTCO_wdt mptscsih igb i2c_core joydev microcode iTCO_vendor_support i7core_edac mlx4_core(+) mptbase ioatdma dca edac_core scsi_transport_sas ipmi_si ipmi_msghandler
[   40.787764] Pid: 257, comm: modprobe Not tainted 3.3.0-4.fc16.x86_64.debug #1
[   40.790549] Call Trace:
[   40.806210]  [<ffffffff81061f6f>] warn_slowpath_common+0x7f/0xc0
[   40.808336]  [<ffffffff81062066>] warn_slowpath_fmt+0x46/0x50
[   40.827621]  [<ffffffff8133ff88>] check_sync+0x2a8/0x530
[   40.829762]  [<ffffffff81340492>] debug_dma_sync_single_for_cpu+0x42/0x50
[   40.847063]  [<ffffffff8133c0fc>] ? is_swiotlb_buffer+0x3c/0x50
[   40.849194]  [<ffffffff8133c918>] ? swiotlb_sync_single+0x38/0x80
[   40.868855]  [<ffffffff8133ca5c>] ? swiotlb_sync_single_for_cpu+0xc/0x10
[   40.905296]  [<ffffffffa0095d3a>] __mlx4_write_mtt+0xea/0x1e0 [mlx4_core]
[   40.909217]  [<ffffffffa0095f5c>] mlx4_write_mtt+0x12c/0x170 [mlx4_core]
[   40.925092]  [<ffffffffa008a9bd>] mlx4_create_eq+0x4ad/0x6e0 [mlx4_core]
[   40.926982]  [<ffffffffa008b24f>] mlx4_init_eq_table+0x1ff/0x6b0 [mlx4_core]
[   40.947112]  [<ffffffffa00916d7>] mlx4_setup_hca+0x167/0x530 [mlx4_core]
[   40.964978]  [<ffffffff811a26ec>] ? kfree+0x28c/0x2a0
[   40.966813]  [<ffffffffa0092342>] __mlx4_init_one+0x8a2/0xca0 [mlx4_core]
[   40.985583]  [<ffffffffa009f5df>] mlx4_init_one+0x3d/0x42 [mlx4_core]
[   40.987441]  [<ffffffff813498dc>] local_pci_probe+0x5c/0xd0
[   41.005550]  [<ffffffff8134b1d9>] pci_device_probe+0x109/0x130
[   41.008078]  [<ffffffff8141216c>] driver_probe_device+0x9c/0x300
[   41.025620]  [<ffffffff8141247b>] __driver_attach+0xab/0xb0
[   41.027444]  [<ffffffff814123d0>] ? driver_probe_device+0x300/0x300
[   41.046544]  [<ffffffff814104fe>] bus_for_each_dev+0x5e/0x90
[   41.048703]  [<ffffffff81411d6e>] driver_attach+0x1e/0x20
[   41.065678]  [<ffffffff81411960>] bus_add_driver+0x1c0/0x2b0
[   41.067490]  [<ffffffffa00b005b>] ? mlx4_catas_init+0x5b/0x5b [mlx4_core]
[   41.086974]  [<ffffffff814129f6>] driver_register+0x76/0x140
[   41.104564]  [<ffffffff813325a8>] ? __raw_spin_lock_init+0x38/0x70
[   41.106723]  [<ffffffffa00b005b>] ? mlx4_catas_init+0x5b/0x5b [mlx4_core]
[   41.125441]  [<ffffffff8134ae96>] __pci_register_driver+0x66/0xe0
[   41.127858]  [<ffffffffa00b005b>] ? mlx4_catas_init+0x5b/0x5b [mlx4_core]
[   41.145308]  [<ffffffffa00b0107>] mlx4_init+0xac/0xfa5 [mlx4_core]
[   41.147163]  [<ffffffff8100203f>] do_one_initcall+0x3f/0x170
[   41.166884]  [<ffffffff810dbb82>] sys_init_module+0xc82/0x21f0
[   41.169044]  [<ffffffff816abb29>] system_call_fastpath+0x16/0x1b
[   41.186223] ---[ end trace 2a085dfdc60385a8 ]---
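
To compare this trace with the one in the description, it helps to reduce each to the driver's own call path. The sketch below is a hypothetical helper, not an existing tool. It assumes the frame format shown in these logs and skips frames marked with `?`, which the kernel prints for addresses found on the stack that are not part of the reliably unwound call chain.

```python
import re

# One call trace frame as printed in this report, e.g.
# "[<ffffffffa0095f5c>] mlx4_write_mtt+0x12c/0x170 [mlx4_core]".
# The "? " marker and the trailing "[module]" are both optional.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\] (?P<uncertain>\? )?(?P<symbol>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def module_frames(trace_lines, module):
    """Return the reliable frame symbols that belong to `module`, innermost first."""
    symbols = []
    for line in trace_lines:
        match = FRAME.search(line)
        if match is None or match.group("uncertain"):
            continue  # not a frame line, or a '?' frame
        if match.group("module") == module:
            symbols.append(match.group("symbol"))
    return symbols
```

For the trace above, `module_frames(lines, "mlx4_core")` gives the path from `__mlx4_write_mtt` out to `mlx4_init`. Both traces in this report reach the sync through `mlx4_create_eq` calling `mlx4_write_mtt`.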

Comment 12 Josh Boyer 2012-09-06 15:29:06 UTC

Are you still seeing this on a 3.4 or 3.5 debug kernel?  I'm guessing probably, but it would be good to know.

Comment 13 Albert Strasheim 2012-09-06 18:40:51 UTC

I'll try to boot a machine with the latest debug kernel over the weekend.

Comment 14 Dave Jones 2012-10-23 15:34:11 UTC

# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug; a fresh bug referencing this one in a comment is sufficient.)

Comment 15 Justin M. Forbes 2012-11-14 15:23:48 UTC

With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.
