Bug 1662291
| Summary: | Boot guest with device assignment+vIOMMU, qemu prompts "VFIO_UNMAP_DMA: -22" error when rebooting guest which has "intel_iommu=on" in kernel line | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Pei Zhang <pezhang> |
| Component: | kernel | Assignee: | Alex Williamson <alex.williamson> |
| kernel sub component: | KVM | QA Contact: | Pei Zhang <pezhang> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | alex.williamson, chayang, jinzhao, juzhang, jwboyer, knoel, peterx, rbalakri, siliu, virt-bugs, virt-maint, yfu |
| Version: | 8.0 | | |
| Target Milestone: | rc | | |
| Target Release: | 8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-4.18.0-61.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-14 01:40:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pei Zhang
2018-12-27 10:45:48 UTC
Note that 0xfef00000 is immediately after the ioapic region and the size parameter (0xffffffff01100000) looks really suspicious, probably getting an -EINVAL because this wraps around the address space.

I'm firstly looking into the VFIO_UNMAP_DMA issue. I reproduced it locally even with upstream QEMU. The error happened at:

#0 vfio_dma_unmap (container=0x5563dda36280, iova=4277141504, size=18446744069432410112) at /home/peterx/git/qemu/hw/vfio/common.c:224
#1 0x00005563db0244f5 in vfio_listener_region_del (listener=0x5563dda36290, section=0x7ffdd3da2f50) at /home/peterx/git/qemu/hw/vfio/common.c:678
#2 0x00005563dafa3d26 in address_space_update_topology_pass (as=0x5563dd901900, old_view=0x5563dc4c3560, new_view=0x5563dc245850, adding=false) at /home/peterx/git/qemu/memory.c:885
#3 0x00005563dafa434a in address_space_set_flatview (as=0x5563dd901900) at /home/peterx/git/qemu/memory.c:986
#4 0x00005563dafa450a in memory_region_transaction_commit () at /home/peterx/git/qemu/memory.c:1039
#5 0x00005563dafa7f4a in memory_region_set_enabled (mr=0x5563dd901960, enabled=false) at /home/peterx/git/qemu/memory.c:2383
#6 0x00005563db06a477 in vtd_switch_address_space (as=0x5563dd9018f0) at /home/peterx/git/qemu/hw/i386/intel_iommu.c:1177
#7 0x00005563db06a513 in vtd_switch_address_space_all (s=0x5563dd6632f0) at /home/peterx/git/qemu/hw/i386/intel_iommu.c:1200
#8 0x00005563db06e84d in vtd_address_space_refresh_all (s=0x5563dd6632f0) at /home/peterx/git/qemu/hw/i386/intel_iommu.c:3077
#9 0x00005563db06f2dc in vtd_reset (dev=0x5563dd6632f0) at /home/peterx/git/qemu/hw/i386/intel_iommu.c:3256
#10 0x00005563db1ac6c7 in device_reset (dev=0x5563dd6632f0) at /home/peterx/git/qemu/hw/core/qdev.c:1081

It's during a system reset, and here we're trying to unmap all the potentially leftover IO page mappings in range 0xfef00000-2^64-1 (which is UINT64_MAX). So we can see that here iova+size==2^64. The unmap failure should be caused by overflow of parameters for VFIO_IOMMU_UNMAP_DMA.
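To make the arithmetic concrete, here is a small sketch that redoes the sum from frame #0 of the backtrace. Python integers do not wrap, so the 64-bit behaviour is modelled with an explicit mask; the helper name `add_u64` is made up for this illustration and is not taken from QEMU or the kernel.

```python
# Python integers are unbounded, so model C uint64_t wraparound with a mask.
U64_MASK = (1 << 64) - 1


def add_u64(a: int, b: int) -> int:
    """Add two values as a 64-bit unsigned addition would, modulo 2**64."""
    return (a + b) & U64_MASK


# Values copied from frame #0 (vfio_dma_unmap) of the backtrace.
iova = 4277141504             # 0xfef00000
size = 18446744069432410112   # 0xffffffff01100000

exact_end = iova + size            # mathematically exact end of the range
wrapped_end = add_u64(iova, size)  # what a 64-bit variable would hold

print(hex(iova), hex(size))
print("iova + size == 2**64:", exact_end == 1 << 64)
print("64-bit result of iova + size:", wrapped_end)
```

The exact sum is 2^64, one past UINT64_MAX, so the same addition done in 64 bits yields 0, which is smaller than iova.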
A quick idea to fix this problem is that we shrink the VT-d IOMMU memory region size from UINT64_MAX to something smaller like 2^63, which should also be big enough and would also avoid this overflow issue. However I noticed that this problem should not have existed in the past, so I dug a bit more. I noticed that the above check was introduced recently, where we wanted to silence an overflow warning:

commit 71a7d3d78e3ca51ea688ae88c389867d948377cd
Author: Dan Carpenter <dan.carpenter>
Date: Fri Oct 20 11:41:56 2017 -0600

    vfio/type1: silence integer overflow warning

Now, disregarding the warning itself, I'm not very sure about this change, since IIUC this change will potentially never allow userspace to free the last page if it is really mapped somehow via VFIO_IOMMU_MAP_DMA (which is IOVA=2^64-4096). AFAIU we can only unmap that page with parameters {iova=2^64-4096, size=4096}, but this will be rejected by this extra overflow check.

I also tested reverting 71a7d3d78e3ca51ea688ae88c389867d948377cd in the kernel tree and reprobing the modules, and then the VFIO_UNMAP_DMA error goes away too.

Alex, any insight? For now I would think it would be good to fix this solo problem from the kernel side, which seems cleaner, but I'd like to know what you think about it.

Thanks,

(In reply to Peter Xu from comment #7)
> Alex, any insight? For now I would think it be good to fix this solo
> problem from kernel side which seems cleaner, but I'd like to know how you
> think about it.

Thanks for the analysis, Peter! I'll kick this back to me. It seems the bug in 71a7d3d78e3c is that "unmap->iova + unmap->size" should be "unmap->iova + unmap->size - 1", ie. iova + size wraps to 0x0 currently, right? Gack, that's nasty that we cannot unmap the last page, but we'll need to do something in upstream QEMU too, at least until we can consider this bug deprecated in the host kernel.
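The off-by-one can be illustrated with a rough model of the two expressions quoted above. This is only a sketch, not the kernel source: the comparison against iova, the function names, and the 4096-byte page size are assumptions made for the illustration, and size is assumed to be non-zero.

```python
U64_MASK = (1 << 64) - 1
PAGE_SIZE = 4096  # assumed page size, matching the {iova=2^64-4096, size=4096} example


def rejected_exclusive_end(iova: int, size: int) -> bool:
    """Model of a check on "iova + size": reject when the 64-bit sum wraps below iova."""
    return ((iova + size) & U64_MASK) < iova


def rejected_inclusive_end(iova: int, size: int) -> bool:
    """Model of a check on "iova + size - 1": compare the last byte of the range."""
    return ((iova + size - 1) & U64_MASK) < iova


last_page = (1 << 64) - PAGE_SIZE

cases = {
    "reset unmap from the backtrace": (0xfef00000, 0xffffffff01100000),
    "last page only": (last_page, PAGE_SIZE),
    "range that really overflows": (last_page, 2 * PAGE_SIZE),
}

for name, (iova, size) in cases.items():
    print(name, rejected_exclusive_end(iova, size), rejected_inclusive_end(iova, size))
```

In this model the exclusive-end form rejects all three ranges, including the two that end exactly at the top of the address space, while the inclusive-end form rejects only the range that actually overflows.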
I think that would mean that if we get an -EINVAL error and test that the end boundary is the end of the address space, we have to assume that the last page was never mapped anyway and subtract a page from the size. On Intel systems we have a limited IOVA space, so this will always be safe, and since we're emulating Intel VT-d in the guest, this should also always be safe.

(In reply to Alex Williamson from comment #8)
> It seems the
> bug in 71a7d3d78e3c is that "unmap->iova + unmap->size" should be
> "unmap->iova + unmap->size - 1", ie. iova + size wraps to 0x0 currently,
> right?

AFAICT, yes.

> Gack, that's nasty that we cannot unmap the last page, but we'll
> need to do something in upstream QEMU too, at least until we can consider
> this bug deprecated in the host kernel. I think that would mean that if we
> get an -EINVAL error and test that the end boundary is the end of the
> address space, we have to assume that the last page was never mapped anyway
> and subtract a page from the size. On Intel systems we have a limited IOVA
> space, so this will always be the safe, and since we're emulating Intel VT-d
> in the guest, this should also always be safe.

Indeed we need a workaround in QEMU too for old kernels (I just noticed that it was introduced in 2017, so it actually covers 4.15+). I think I can take care of the QEMU counterpart together with debugging the rest of the error messages in bug 1662270. Just let me know your preference.

Thanks!

Patch(es) available on kernel-4.18.0-61.el8

==Verification==

Versions:
4.18.0-61.el8.x86_64
qemu-kvm-3.1.0-4.module+el8+2681+819ab34d.x86_64

Steps:
1. Boot qemu with device assignment+vIOMMU.
2. Add intel_iommu=on to guest kernel line
3. Reboot guest by system_reset
(qemu) system_reset
4. Reboot guest in guest
# reboot
5. Shutdown guest by system_powerdown
(qemu) system_powerdown
6. Shutdown guest in guest
# shutdown -h now

After step 3 and step 6, qemu only prompted the info below and no other error info:
(qemu) qemu-kvm: vtd_interrupt_remap_msi: MSI address low 32 bit invalid: 0x0
(Bug 1662270 is tracking this issue)

After step 4 and step 5, there are no errors in qemu, guest or host.

So this bug has been fixed very well. Move to 'VERIFIED'.
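For readers following the QEMU-side discussion, the workaround Alex describes in comment #8 (retry with one page less when -EINVAL comes back for a range ending at the top of the address space) could look roughly like the sketch below. It is not the actual QEMU patch: `unmap_with_workaround`, `fake_old_kernel_unmap`, and the 4096-byte page size are hypothetical stand-ins, and the fake kernel only models the overflow check discussed in this bug.

```python
import errno

U64_MASK = (1 << 64) - 1
PAGE_SIZE = 4096  # assumed page size for this sketch


def fake_old_kernel_unmap(iova: int, size: int) -> int:
    """Hypothetical stand-in for a host kernel that rejects ranges whose end wraps."""
    if ((iova + size) & U64_MASK) < iova:
        return -errno.EINVAL
    return 0


def unmap_with_workaround(do_unmap, iova: int, size: int, page_size: int = PAGE_SIZE) -> int:
    """Try do_unmap(iova, size); on -EINVAL for a range ending at 2**64, retry one page shorter.

    do_unmap is any callable returning 0 on success or a negative errno value.
    The retry assumes the last page of the address space was never mapped.
    """
    ret = do_unmap(iova, size)
    ends_at_top = size != 0 and ((iova + size) & U64_MASK) == 0
    if ret == -errno.EINVAL and ends_at_top and size > page_size:
        ret = do_unmap(iova, size - page_size)
    return ret


# Parameters from the failing unmap in the backtrace.
direct = fake_old_kernel_unmap(0xfef00000, 0xffffffff01100000)
result = unmap_with_workaround(fake_old_kernel_unmap, 0xfef00000, 0xffffffff01100000)
print("direct call:", direct)
print("with workaround:", result)
```

The retry is only attempted when the range ends exactly at the top of the address space, so an -EINVAL for any other reason is passed back to the caller unchanged.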