Bug 1662291

Summary:	Boot guest with device assignment+vIOMMU, qemu prompts "VFIO_UNMAP_DMA: -22" error when rebooting guest which has "intel_iommu=on" in kernel line
Product:	Red Hat Enterprise Linux 8	Reporter:	Pei Zhang <pezhang>
Component:	kernel	Assignee:	Alex Williamson <alex.williamson>
kernel sub component:	KVM	QA Contact:	Pei Zhang <pezhang>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	high
Priority:	high	CC:	alex.williamson, chayang, jinzhao, juzhang, jwboyer, knoel, peterx, rbalakri, siliu, virt-bugs, virt-maint, yfu
Version:	8.0	Flags:	rule-engine: mirror+
Target Milestone:	rc
Target Release:	8.0
Hardware:	Unspecified
OS:	Unspecified
Fixed In Version:	kernel-4.18.0-61.el8
Last Closed:	2019-06-14 01:40:53 UTC	Type:	Bug

Description Pei Zhang 2018-12-27 10:45:48 UTC

Description of problem:
Boot a guest with device assignment and vIOMMU, and add "intel_iommu=on" to the guest kernel command line. The qemu terminal then prints the following errors after every guest reboot:

qemu-kvm: VFIO_UNMAP_DMA: -22
qemu-kvm: vfio_dma_unmap(0x561f059948f0, 0xfef00000, 0xffffffff01100000) = -22 (Invalid argument)

After several reboots, it also prints the following errors:
qemu-kvm: vtd_iova_to_slpte: detected slpte permission error (iova=0xffd9ce00, level=0x2, slpte=0x0, write=1)
qemu-kvm: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)



Version-Release number of selected component (if applicable):
4.18.0-57.el8.x86_64
qemu-kvm-3.1.0-2.module+el8+2606+2c716ad7.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Boot qemu with device assignment+vIOMMU. For the full command line, see [1].

2. Add intel_iommu=on to the guest kernel command line
# cat /proc/cmdline 
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-57.el8.x86_64 ... intel_iommu=on

3. Reboot guest by system_reset

(qemu) system_reset 
(qemu) qemu-kvm: VFIO_UNMAP_DMA: -22
qemu-kvm: vfio_dma_unmap(0x55b3957db760, 0xfef00000, 0xffffffff01100000) = -22 (Invalid argument)
qemu-kvm: VFIO_UNMAP_DMA: -22
qemu-kvm: vfio_dma_unmap(0x55b395505d60, 0xfef00000, 0xffffffff01100000) = -22 (Invalid argument)

After several reboots, it printed additional errors:
(qemu) system_reset 
(qemu) qemu-kvm: VFIO_UNMAP_DMA: -22
qemu-kvm: vfio_dma_unmap(0x55b3957db760, 0xfef00000, 0xffffffff01100000) = -22 (Invalid argument)
qemu-kvm: VFIO_UNMAP_DMA: -22
qemu-kvm: vfio_dma_unmap(0x55b395505d60, 0xfef00000, 0xffffffff01100000) = -22 (Invalid argument)
qemu-kvm: vtd_iova_to_slpte: detected slpte permission error (iova=0xffdbc000, level=0x2, slpte=0x0, write=1)
qemu-kvm: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)



Actual results:
qemu prints errors when rebooting a guest whose kernel command line contains 'intel_iommu=on'.


Expected results:
qemu should not print any errors.


Additional info:


Reference:
[1]
/usr/libexec/qemu-kvm -name rhel8.0 \
-M q35,kernel-irqchip=split \
-cpu Skylake-Server -m 8G \
-device intel-iommu,intremap=true,caching-mode=true \
-smp 4,sockets=1,cores=4,threads=1 \
-device pcie-root-port,id=root.1,chassis=1 \
-device pcie-root-port,id=root.2,chassis=2 \
-device pcie-root-port,id=root.3,chassis=3 \
-device pcie-root-port,id=root.4,chassis=4 \
-blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/home/rhel8.0.qcow2,node-name=my_file \
-blockdev driver=qcow2,node-name=my,file=my_file \
-device virtio-blk-pci,drive=my,id=virtio-blk0,bus=root.1 \
-netdev tap,id=hostnet0,vhost=on \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=18:66:da:5f:dd:02,bus=root.2,iommu_platform=on,ats=on \
-vnc :2 \
-monitor stdio \
-device vfio-pci,host=0000:3b:00.0,bus=root.3 \
-device vfio-pci,host=0000:3b:00.1,bus=root.4 \

Comment 4 Alex Williamson 2019-01-07 02:28:10 UTC

Note that 0xfef00000 is immediately after the ioapic region and the size parameter (0xffffffff01100000) looks really suspicious, probably getting an -EINVAL because this wraps around the address space.

Comment 7 Peter Xu 2019-01-07 09:19:12 UTC

I'm looking into the VFIO_UNMAP_DMA issue first.

I reproduced it locally even with upstream QEMU.  The error happened at:

#0  vfio_dma_unmap (container=0x5563dda36280, iova=4277141504, size=18446744069432410112) at /home/peterx/git/qemu/hw/vfio/common.c:224
#1  0x00005563db0244f5 in vfio_listener_region_del (listener=0x5563dda36290, section=0x7ffdd3da2f50) at /home/peterx/git/qemu/hw/vfio/common.c:678
#2  0x00005563dafa3d26 in address_space_update_topology_pass (as=0x5563dd901900, old_view=0x5563dc4c3560, new_view=0x5563dc245850, adding=false) at /home/peterx/git/qemu/memory.c:885
#3  0x00005563dafa434a in address_space_set_flatview (as=0x5563dd901900) at /home/peterx/git/qemu/memory.c:986
#4  0x00005563dafa450a in memory_region_transaction_commit () at /home/peterx/git/qemu/memory.c:1039
#5  0x00005563dafa7f4a in memory_region_set_enabled (mr=0x5563dd901960, enabled=false) at /home/peterx/git/qemu/memory.c:2383
#6  0x00005563db06a477 in vtd_switch_address_space (as=0x5563dd9018f0) at /home/peterx/git/qemu/hw/i386/intel_iommu.c:1177
#7  0x00005563db06a513 in vtd_switch_address_space_all (s=0x5563dd6632f0) at /home/peterx/git/qemu/hw/i386/intel_iommu.c:1200
#8  0x00005563db06e84d in vtd_address_space_refresh_all (s=0x5563dd6632f0) at /home/peterx/git/qemu/hw/i386/intel_iommu.c:3077
#9  0x00005563db06f2dc in vtd_reset (dev=0x5563dd6632f0) at /home/peterx/git/qemu/hw/i386/intel_iommu.c:3256
#10 0x00005563db1ac6c7 in device_reset (dev=0x5563dd6632f0) at /home/peterx/git/qemu/hw/core/qdev.c:1081

It's during a system reset, and here we're trying to unmap all the potentially leftover IO page mappings in the range from 0xfef00000 to 2^64-1 (UINT64_MAX).  So here iova+size==2^64.
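
As a cross-check, the decimal arguments in frame #0 can be converted back to the hex values in the qemu error message. This is only an arithmetic sketch in Python that emulates unsigned 64-bit addition; the variable names are illustrative and nothing is taken from the QEMU or kernel source.

```python
U64_MASK = (1 << 64) - 1  # emulate unsigned 64-bit arithmetic

# Decimal arguments from frame #0 of the backtrace above.
iova = 4277141504
size = 18446744069432410112

iova_hex = hex(iova)
size_hex = hex(size)
wrapped_end = (iova + size) & U64_MASK    # what a u64 "iova + size" yields
last_byte = (iova + size - 1) & U64_MASK  # last address covered by the range

print(iova_hex, size_hex)
print(hex(wrapped_end), hex(last_byte))
```

The sum wraps to 0 in 64 bits, while the last byte of the range is still a valid address (UINT64_MAX).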

The unmap failure is most likely caused by this overflow in the parameters passed to VFIO_IOMMU_UNMAP_DMA.

A quick way to fix this is to shrink the VT-d IOMMU memory region size from UINT64_MAX to something smaller like 2^63, which should still be big enough and avoids the overflow.  However, I noticed that this problem did not exist in the past, so I dug a bit more.
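
A rough sketch of why a smaller region would sidestep the wrap. It assumes the unmapped range starts at 0xfef00000 (from the log) and runs to the end of the region; `unmap_args` is a hypothetical helper, and the 2^63 end is just the example value mentioned above.

```python
U64_MASK = (1 << 64) - 1  # emulate unsigned 64-bit arithmetic
START = 0xfef00000        # start of the range being unmapped (from the log)

def unmap_args(region_end):
    """Return (size, u64 value of START + size) for unmapping [START, region_end)."""
    size = region_end - START
    return size, (START + size) & U64_MASK

size_full, end_full = unmap_args(1 << 64)  # range ends at the top of the 64-bit space
size_half, end_half = unmap_args(1 << 63)  # hypothetical smaller region

print(hex(size_full), hex(end_full))
print(hex(size_half), hex(end_half))
```

With the range capped at 2^63, iova+size stays well inside 64 bits, so no overflow check can trigger on it.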

I noticed that the check rejecting this request was introduced recently, in a commit meant to silence an overflow warning:

    commit 71a7d3d78e3ca51ea688ae88c389867d948377cd
    Author: Dan Carpenter <dan.carpenter>
    Date:   Fri Oct 20 11:41:56 2017 -0600

    vfio/type1: silence integer overflow warning

Now, disregarding the warning itself, I'm not very sure about this change: IIUC it will never allow userspace to free the last page (IOVA=2^64-4096) if it is somehow mapped via VFIO_IOMMU_MAP_DMA.  AFAIU that page can only be unmapped with parameters {iova=2^64-4096, size=4096}, but this will be rejected by the extra overflow check.
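
The concern can be illustrated with a toy model. The check below is an assumption reconstructed from this discussion (reject when the u64 sum iova + size ends up below iova); it is not copied from the kernel source, and `overflow_check_rejects` is a hypothetical name.

```python
U64_MASK = (1 << 64) - 1  # emulate unsigned 64-bit arithmetic
PAGE_SIZE = 4096

def overflow_check_rejects(iova, size):
    """Assumed form of the check: the u64 sum iova + size wrapped below iova."""
    return ((iova + size) & U64_MASK) < iova

last_page = (1 << 64) - PAGE_SIZE  # IOVA of the very last 4K page

rejected_last = overflow_check_rejects(last_page, PAGE_SIZE)
rejected_prev = overflow_check_rejects(last_page - PAGE_SIZE, PAGE_SIZE)

print(rejected_last, rejected_prev)
```

Under this model only the request that touches the final page is refused; the page just below it unmaps normally.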

I also tested reverting 71a7d3d78e3ca51ea688ae88c389867d948377cd in the kernel tree and reloading the modules; the VFIO_UNMAP_DMA error then goes away too.

Alex, any insight?  For now I think it would be good to fix this particular problem on the kernel side, which seems cleaner, but I'd like to know what you think.

Thanks,

Comment 8 Alex Williamson 2019-01-07 21:04:15 UTC

(In reply to Peter Xu from comment #7)
> Alex, any insight?  For now I think it would be good to fix this particular
> problem on the kernel side, which seems cleaner, but I'd like to know what
> you think.

Thanks for the analysis, Peter!  I'll kick this back to me.  It seems the bug in 71a7d3d78e3c is that "unmap->iova + unmap->size" should be "unmap->iova + unmap->size - 1", ie. iova + size wraps to 0x0 currently, right?  Gack, that's nasty that we cannot unmap the last page, but we'll need to do something in upstream QEMU too, at least until we can consider this bug deprecated in the host kernel.  I think that would mean that if we get an -EINVAL error and test that the end boundary is the end of the address space, we have to assume that the last page was never mapped anyway and subtract a page from the size.  On Intel systems we have a limited IOVA space, so this will always be safe, and since we're emulating Intel VT-d in the guest, this should also always be safe.
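
A minimal sketch of both ideas, under stated assumptions: `old_kernel_unmap` and `new_kernel_unmap` are hypothetical stand-ins that model only the overflow check (with and without the "- 1"), and `unmap_with_fallback` is a hypothetical userspace-side retry. None of this is the actual kernel or QEMU code.

```python
U64_MASK = (1 << 64) - 1  # emulate unsigned 64-bit arithmetic
PAGE_SIZE = 4096
EINVAL = 22

def old_kernel_unmap(iova, size):
    """Stand-in for a host kernel with the off-by-one check (assumed form)."""
    if ((iova + size) & U64_MASK) < iova:
        return -EINVAL
    return 0

def new_kernel_unmap(iova, size):
    """Stand-in with the corrected check: compare the last byte, not the end."""
    # Zero-size requests are rejected here too, to keep the model simple.
    if size == 0 or ((iova + size - 1) & U64_MASK) < iova:
        return -EINVAL
    return 0

def unmap_with_fallback(kernel_unmap, iova, size):
    """Retry without the last page when -EINVAL hits a range ending at 2^64."""
    ret = kernel_unmap(iova, size)
    if ret == -EINVAL and iova + size == (1 << 64) and size > PAGE_SIZE:
        # Assume the very last page was never mapped, as suggested above.
        ret = kernel_unmap(iova, size - PAGE_SIZE)
    return ret

IOVA, SIZE = 0xfef00000, 0xffffffff01100000  # values from the error message

direct_old = old_kernel_unmap(IOVA, SIZE)
direct_new = new_kernel_unmap(IOVA, SIZE)
fallback_old = unmap_with_fallback(old_kernel_unmap, IOVA, SIZE)

print(direct_old, direct_new, fallback_old)
```

The fallback only shrinks the request when the range ends exactly at the top of the address space, so any other -EINVAL is still reported to the caller.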

Comment 9 Peter Xu 2019-01-08 03:29:33 UTC

(In reply to Alex Williamson from comment #8)
> It seems the
> bug in 71a7d3d78e3c is that "unmap->iova + unmap->size" should be
> "unmap->iova + unmap->size - 1", ie. iova + size wraps to 0x0 currently,
> right?

AFAICT, yes.

> Gack, that's nasty that we cannot unmap the last page, but we'll
> need to do something in upstream QEMU too, at least until we can consider
> this bug deprecated in the host kernel.  I think that would mean that if we
> get an -EINVAL error and test that the end boundary is the end of the
> address space, we have to assume that the last page was never mapped anyway
> and subtract a page from the size.  On Intel systems we have a limited IOVA
> space, so this will always be safe, and since we're emulating Intel VT-d
> in the guest, this should also always be safe.

Indeed, we need a workaround in QEMU too for old kernels (I just noticed that the check was introduced in 2017, so it actually covers 4.15+).  I think I can take care of the QEMU counterpart together with debugging the rest of the error messages in bug 1662270.  Just let me know your preference.

Thanks!

Comment 15 Herton R. Krzesinski 2019-01-16 16:10:13 UTC

Patch(es) available on kernel-4.18.0-61.el8

Comment 18 Pei Zhang 2019-01-17 07:32:00 UTC

==Verification==

Versions:
4.18.0-61.el8.x86_64
qemu-kvm-3.1.0-4.module+el8+2681+819ab34d.x86_64

Steps:
1. Boot qemu with device assignment+vIOMMU.

2. Add intel_iommu=on to guest kernel line

3. Reboot guest by system_reset

(qemu) system_reset

4. Reboot the guest from inside the guest

# reboot

5. Shut down the guest with system_powerdown

(qemu) system_powerdown

6. Shut down the guest from inside the guest

# shutdown -h now


After step 3 and step 6, qemu printed only the message below and no other errors:
(qemu) qemu-kvm: vtd_interrupt_remap_msi: MSI address low 32 bit invalid: 0x0
(Bug 1662270 is tracking this issue)

After step 4 and step 5, there were no errors in qemu, the guest, or the host.


So this bug has been fixed. Moving to 'VERIFIED'.