Bug 1530957
Summary: In guest with device assignment, dpdk's testpmd fails to boot up and shows "DMA remapping" errors

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | dpdk |
| Version | 7.5 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Pei Zhang <pezhang> |
| Assignee | Kevin Traynor <ktraynor> |
| QA Contact | Pei Zhang <pezhang> |
| CC | ailan, atragler, chayang, jhsiao, jinzhao, juzhang, ktraynor, lmiksik, maxime.coquelin, michen, tredaelli, virt-maint, yfu |
| Target Milestone | rc |
| Keywords | Extras, Regression |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | dpdk-17.11-7.el7 |
| Type | Bug |
| Last Closed | 2018-04-10 23:59:23 UTC |
Description — Pei Zhang, 2018-01-04 10:11:13 UTC
Comment 3 — Peter Xu:

It's because dpdk in the guest wants to set up this mapping for the 10G card:

    7f6b80000000 (IOVA) -> 7f6b80000000 (vaddr)

However, the IOVA is beyond the range of 39 bits, which is the maximum supported guest address width (GAW) of the current VT-d emulation. So it is expected that we hit this issue; we would hit the same issue even on real hardware whose VT-d GAW is 39 bits. From this point of view, it's not a bug but by design.

On one hand, of course we can boost the GAW of the emulated VT-d to 48 bits to avoid this kind of error (AFAIK current Linux kernel virtual addresses use at most 48 bits, so that would be enough for now). We have bz1513841 for that, and the upstream work is still under review.

However, as Pei mentioned, it is a new change on the DPDK side that triggered this error. I dug a bit and found this:

    commit 815c7deaed2d9e325968e82cb599984088a5c55a
    Author: Santosh Shukla <santosh.shukla>
    Date:   Fri Oct 6 16:33:40 2017 +0530

        pci: get IOMMU class on Linux

        Get iommu class of PCI device on the bus and returns preferred iova
        mapping mode for that bus.

        Patch also introduces RTE_PCI_DRV_IOVA_AS_VA drv flag.
        Flag used when driver needs to operate in iova=va mode.

        Algorithm for iova scheme selection for PCI bus:
        0. If no device bound then return with RTE_IOVA_DC mapping mode,
           else goto 1).
        1. Look for device attached to vfio kdrv and has .drv_flag set
           to RTE_PCI_DRV_IOVA_AS_VA.
        2. Look for any device attached to UIO class of driver.
        3. Check for vfio-noiommu mode enabled.

        If 2) & 3) is false and 1) is true then select
        mapping scheme as RTE_IOVA_VA. Otherwise use default
        mapping scheme (RTE_IOVA_PA).

        Signed-off-by: Santosh Shukla <santosh.shukla>
        Signed-off-by: Jerin Jacob <jerin.jacob>
        Reviewed-by: Maxime Coquelin <maxime.coquelin>
        Reviewed-by: Anatoly Burakov <anatoly.burakov>
        Acked-by: Hemant Agrawal <hemant.agrawal>
        Tested-by: Hemant Agrawal <hemant.agrawal>

So I have two questions for DPDK:

(1) Have we switched from PA mode to VA mode by default recently? Why?

Another question out of my own curiosity: why doesn't DPDK have its own IOVA allocation algorithm? (I assume it wouldn't be too slow, since it's using hugepages.)

(2) Should DPDK provide a way to specify this IOMMU mode?

For this bug, if the user could specify PA mode then DPDK would work. However, it seems to me that we don't allow the user to do this now, and the user is forced to use VA mode. Do we need a tunable for this?

Maxime, what do you think?

Thanks,
Peter

Comment 4 — Maxime Coquelin (in reply to comment 3):

Hi Peter,

> (1) Have we switched from PA mode to VA mode by default recently? Why?

Cavium has a memory allocation IP block which works with virtual addresses; using PAs in their case caused a performance hit.

Their initial series changed the default to VA mode for all devices. I suggested that it might not be a good idea to change the default for other devices, because it could cause some problems. For example, if two devices sharing the same IOMMU group are used by different processes, both processes could use the same VA for different pages.

Santosh implemented my suggestion, but it seems that Jianfeng from Intel did a patch to advertise that Intel NICs support VA mode:

    commit f37dfab21c988d2d0ecb3c82be4ba9738c7e51c7
    Author: Jianfeng Tan <jianfeng.tan>
    Date:   Wed Oct 11 10:33:48 2017 +0000

        drivers/net: enable IOVA mode for Intel PMDs

        If we want to enable IOVA mode, introduced by
        commit 93878cf0255e ("eal: introduce helper API for IOVA mode"),
        we need PMDs (for PCI devices) to expose this flag.

        Signed-off-by: Jianfeng Tan <jianfeng.tan>
        Acked-by: Anatoly Burakov <anatoly.burakov>
        Reviewed-by: Santosh Shukla <santosh.shukla>

> Another question out of my own curiosity: why doesn't DPDK have its own IOVA allocation algorithm?

I think using PA mode by default should be enough, except for the Cavium PMD. But maybe you see another advantage to having an IOVA allocator algorithm in DPDK?

> (2) Should DPDK provide a way to specify this IOMMU mode? [...] Do we need a tunable for this?

I think a tunable is a good idea: by default keep PA mode for all but the Cavium PMD, and add a command-line option to force VA mode.

It seems moving to VA by default causes another issue: http://dpdk.org/dev/patchwork/patch/31071/

I need to dig a bit more to understand how/if they fixed the KNI issue.

Cheers,
Maxime

Comment 5 — Peter Xu (in reply to comment 4):

> But maybe you see another advantage to having an IOVA allocator algorithm in DPDK?

No, it's just a question of mine: as long as we are using VFIO and the IOMMU, DPDK should be able to work even without knowing the PAs. At least, if DPDK allocated IOVAs itself starting from zero and contiguously, this bug would never happen until someone used more than 1<<39 bytes of memory in a single DPDK program. I don't know whether that counts as "an advantage" though. :)

> I need to dig a bit more to understand how/if they fixed the KNI issue.

Sure. Then, do you want me to move this bug's component to dpdk for better tracking? After all, VT-d already has a bz for the GAW extension.

Thanks,
Peter

Comment 6 — Maxime Coquelin (in reply to comment 5):

> At least, if DPDK allocated IOVAs itself starting from zero and contiguously, this bug would never happen [...]

Thinking about it again, that would be a good idea, as it would also address the problem Jianfeng solved by using VA mode. The goal there was to be able to support 4K pages, for which DPDK doesn't know the PA.

> Then, do you want me to move this bug's component to dpdk for better tracking?

Yes, please.
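The IOVA-mode selection algorithm quoted above can be sketched as follows. This is a hedged, illustrative Python model, not DPDK's actual code (the real logic is C inside DPDK's PCI bus driver); the device-record fields, driver names, and flag value below are hypothetical stand-ins.

```python
# Illustrative model of the IOVA-mode selection algorithm from
# commit 815c7deaed2d ("pci: get IOMMU class on Linux").
# The dict fields ("kdrv", "drv_flags") and the flag value are
# hypothetical stand-ins, not DPDK's real definitions.

RTE_IOVA_DC, RTE_IOVA_PA, RTE_IOVA_VA = "DC", "PA", "VA"
RTE_PCI_DRV_IOVA_AS_VA = 0x0040  # illustrative flag value

def select_iova_mode(devices, vfio_noiommu=False):
    # Step 0: no device bound -> "don't care" mapping mode.
    if not devices:
        return RTE_IOVA_DC
    # Step 1: any vfio-bound device whose driver advertises IOVA-as-VA?
    va_capable = any(d["kdrv"] == "vfio" and
                     d["drv_flags"] & RTE_PCI_DRV_IOVA_AS_VA
                     for d in devices)
    # Step 2: any device bound to a UIO-class driver (no IOMMU in use)?
    uio_bound = any(d["kdrv"] in ("igb_uio", "uio_pci_generic")
                    for d in devices)
    # VA mode only if step 1 holds and neither step 2 nor step 3 does.
    if va_capable and not uio_bound and not vfio_noiommu:
        return RTE_IOVA_VA
    return RTE_IOVA_PA  # default mapping scheme
```

With an ixgbe PF bound to vfio-pci, as in this report, step 1 holds after the Intel PMDs started advertising the flag, so VA mode is selected — which is what exposes the 39-bit GAW limit in the guest.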
I agree it should be fixed in DPDK.

Thanks,
Maxime

Pei Zhang:

Thank you Peter, Maxime. Per the discussion in comments 3 ~ 6, moving this bug to the 'dpdk' component.

Upstream patch posted: http://dpdk.org/ml/archives/stable/2018-January/004109.html

Created attachment 1379091 [details]
Patch - Forbid VA mode if IOMMU supports only 39-bit GAW
Hi Pei,
Please find in attachment a v17.11 backport of the patch posted upstream,
in case you'd like to test it in advance.
Regards,
Maxime
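The failing arithmetic, and a pre-flight check in the spirit of the patch above, can be sketched in a few lines. This is a hedged illustration, not a copy of the patch (the real fix is C in DPDK's VFIO support): the helper names are hypothetical, and the MGAW decoding assumes the Intel VT-d capability-register layout (MGAW field in bits 21:16, supported width = MGAW + 1).

```python
# Hedged sketch: why this report's identity mapping fails under a
# 39-bit GAW, and what a width check along the lines of the attached
# patch amounts to. Helper names are hypothetical; the MGAW decoding
# assumes the Intel VT-d CAP register layout (bits 21:16, width+1).

def vtd_mgaw(cap_reg: int) -> int:
    """Max guest address width from a VT-d capability register value."""
    return ((cap_reg >> 16) & 0x3F) + 1

def iova_fits(iova: int, gaw_bits: int) -> bool:
    """True if the IOVA is addressable within gaw_bits of address width."""
    return iova < (1 << gaw_bits)

GUEST_GAW = 39                 # emulated VT-d GAW (see comment 3)
iova = 0x7f6b80000000          # the failing IOVA == VA identity mapping

assert iova.bit_length() == 47          # the mapping needs 47 bits
assert not iova_fits(iova, GUEST_GAW)   # so it cannot fit in 39
assert iova_fits(iova, 48)              # a 48-bit GAW (bz1513841) would do
```

If such a check fails for VA mode, falling back to PA mode (as the patch's title suggests) keeps testpmd working on a 39-bit emulated VT-d.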
Comment 12 — Pei Zhang (in reply to comment 11):

Hi Maxime,

This patch works well. I applied the patch to dpdk-17.11.tar.xz:

(1) The guest's dpdk testpmd can start up with assigned ixgbe NICs (PF).
(2) The guest's dpdk testpmd can receive packets.
(3) Rebooting/shutting down the guest works well, with no errors.

So your patch fixes this issue. Thanks.

Best Regards,
Pei

Pei Zhang (follow-up to comment 12):

With ixgbe VFs, it also works very well.

Best Regards,
Pei

Comment 15 — Pei Zhang:

Update:

Versions:
- 3.10.0-841.el7.x86_64
- kernel-3.10.0-837.el7.x86_64
- qemu-kvm-rhev-2.10.0-18.el7.x86_64
- dpdk-17.11-7.el7.x86_64

Steps: same as in the Description. Every step works well. (Note: we are not using q35 multifunction.)

So this bug has been fixed.

QE hit a new q35 multifunction issue. It should not be this bug, so we filed a new bug to track it: Bug 1540964 - Booting guest with q35 multifuction, vIOMMU and device assignment, then dpdk's testpmd will show "VFIO group is not viable!"

Best Regards,
Pei

Based on comment 15, moving this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1065