Bug 1511802

Summary: Regression in QEMU handling for sub-page MMIO BARs for vfio-pci devices

Product: Red Hat Enterprise Linux 7
Component: qemu-kvm
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Sergio Lopez <slopezpa>
Assignee: Alex Williamson <alex.williamson>
QA Contact: Yanan Fu <yfu>
CC: alex.williamson, chayang, jen, jherrman, juzhang, knoel, michen, mkalinin, mtessun, qzhang, rbalakri, slopezpa, virt-maint, yfu
Keywords: Regression, TestOnly, ZStream
Target Milestone: rc
Target Release: ---
Fixed In Version: qemu-kvm-1.5.3-146.el7
Doc Type: Bug Fix
Doc Text:
    Previously, the memory mapping of processor access to base address registers (BARs) smaller than the system page size was incorrect. As a consequence, devices assigned to a guest virtual machine using the vfio-pci driver did not work, which in some cases prevented the guest from booting. This update fixes the QEMU emulator to map the affected devices properly, and the described problem no longer occurs.
Cloned To: 1515110 (view as bug list)
Last Closed: 2018-04-10 14:38:15 UTC
Type: Bug
Bug Depends On: 1494181
Bug Blocks: 1515110
This bug is already fixed in qemu-kvm for RHEL-7.5 by bug 1494181, which includes commit db0da029a185 ('vfio: Generalize region support'). That commit inadvertently fixed this issue upstream by not mmap'ing sub-page regions:

    if (!vbasedev->no_mmap && size &&
        region->flags & VFIO_REGION_INFO_FLAG_MMAP) {

becomes:

    if (!vbasedev->no_mmap &&
        region->flags & VFIO_REGION_INFO_FLAG_MMAP &&
        !(region->size & ~qemu_real_host_page_mask)) {

The sub-page mmap is thus excluded as it was prior to the kernel change. Also note that the issue is with the mmap, not with the DMA map. Prior to the kernel change, or with the above fix, the region is not DMA mapped and the VM works well. DMA mapping only enables the possibility of peer-to-peer DMA between devices; the vCPU has full access to the region. I believe it's the sizing of the region which resolves the problem in the proposed fix, which also happens to enable it to be DMA mapped, but the DMA mapping is not actually the fix.

Also note that qemu-kvm-rhev-2.6.0-28.el7_3.z already contains db0da029a185 (the fix above), and additionally qemu-kvm-rhev-2.9.0-16.el7_4.z contains 95251725e335, which includes the basis for the proposed patch, allowing sub-page mmaps. Unless there is a specific need to support sub-page mmaps (the performance difference for a USB device is likely immeasurable) or the user is unable or unwilling to move to qemu-kvm-rhev, I would suggest that the existing qemu-kvm fix is sufficient. I've also verified that current RHEL-7.5 builds of qemu-kvm do work with assignment of Teradici USB functions.

Bug 1494181 is however not a good z-stream candidate, and if z-stream is requested here I would suggest a very limited version of db0da029a185, including only a backport of the change noted above.

Hi Alex,

(In reply to Alex Williamson from comment #2)
> The sub-page mmap is thus excluded as it was prior to the kernel change.
> Also note that the issue is with the mmap, not with the DMA map. [...]
> I believe it's the sizing of the region which resolves the problem in the
> proposed fix, which also happens to enable it to be DMA mapped, but the
> DMA mapping is not actually the fix.

Hum... that doesn't match my observations:

- Unpatched QEMU, device _without_ VFIO_REGION_INFO_FLAG_MMAP:
  1. The BAR is not mmap'ed nor DMA mapped.
  2. A zero-sized MR is installed, which allows QEMU to fall back to the
     underlying MR, which is configured with "vfio_bar_ops" MemoryRegionOps.
  * The Guest can access the BAR thanks to 2.

- Unpatched QEMU, device _with_ VFIO_REGION_INFO_FLAG_MMAP and BAR smaller
  than 4096:
  1. The BAR is successfully mmap'ed (I verified that it's possible to
     access the device's BAR by reading from the mapped region on the Host).
  2. A BAR-sized MR is installed, preventing QEMU from falling back to
     "vfio_bar_ops".
  3. A DMA map will _not_ be requested in vfio_listener_region_add, as
     explained in this BZ description.
  * As a consequence of 2 and 3, the BAR is not accessible from the Guest.

Sergio.

(In reply to Sergio Lopez from comment #4)
> Hum... that doesn't match my observations: [...]
> * As a consequence of 2 and 3, the BAR is not accessible from the Guest.

The point is that 3 is actually irrelevant. That step is missing in the first example and the device still works. The problem is at step 2 in this latter example: the MemoryRegion is BAR sized, and with the patch applied it's expanded out to a full page size (the BAR in question is 1K).

If we take a simple example QEMU instance:

    /usr/libexec/qemu-kvm -m 2G -net none -vga none -nographic -serial none \
        -device vfio-pci,host=8:00.1 -monitor stdio

Where 8:00.1 is:

    08:00.1 USB controller: Teradici Corp. Device 2200 (prog-if 20 [EHCI])
            Subsystem: Teradici Corp. Device 2200
            Physical Slot: 5
            Flags: fast devsel, IRQ 16, NUMA node 0
            Memory at ef908000 (64-bit, non-prefetchable) [size=1K]
            Capabilities: [80] Power Management version 3
            Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit+
            Capabilities: [c0] Express Endpoint, MSI 00
            Kernel driver in use: vfio-pci
            Kernel modules: snd_hda_intel

In the monitor we can see the mappings of the device:

    (qemu) info mtree
    ...
    00000000febff000-00000000febff3ff (prio 1, RW): VFIO 0000:08:00.1 BAR 0
      00000000febff000-00000000febff3ff (prio 0, RW): VFIO 0000:08:00.1 BAR 0 mmap

So let's try to read from the device:

    (qemu) xp /32x 0xfebff000
    00000000febff000: 0xf000ff53 0xf000ff53 0xf000e2c3 0xf000ff53
    00000000febff010: 0xf000ff53 0xf000ff54 0xf000ff53 0xf000ff53
    00000000febff020: 0xf000fea5 0xf000e987 0xf000ef4b 0xf000ef4b
    00000000febff030: 0xf000ef4b 0xf000ef4b 0xf000ef57 0xf000ef4b
    00000000febff040: 0xf000f065 0xf000f84d 0xf000f841 0xf000e3fe
    00000000febff050: 0xc200e739 0xf000f059 0xf000e82e 0xf000efd2
    00000000febff060: 0x0000d75c 0xf0000000 0xf000fe6e 0xf000ff53
    00000000febff070: 0xf000ff53 0xf000ff53 0xf0007774 0xf000ff53

Is that good or bad? Let's disable the mmap for this BAR and re-check:

    (qemu) info mtree
    ...
    00000000febff000-00000000febff3ff (prio 1, RW): VFIO 0000:08:00.1 BAR 0
      00000000febff000-00000000febfefff (prio 0, RW): VFIO 0000:08:00.1 BAR 0 mmap

Here we see the zero size mmap (appears -1 sized), but reading from the device:

    (qemu) xp /32x 0xfebff000
    00000000febff000: 0x01000014 0x00001414 0x00000086 0x00000000
    00000000febff010: 0x00000000 0x00080b00 0x00001008 0x00000000
    00000000febff020: 0x00000df8 0x00000000 0x7fffd000 0x7fffcf00
    00000000febff030: 0x00000000 0x00000000 0x00000000 0x00000000
    00000000febff040: 0x00000000 0x00000000 0x00000000 0x00000000
    00000000febff050: 0x00000000 0x00000001 0x00001000 0x00001000
    00000000febff060: 0x00001000 0x00001000 0x00000000 0x00000000
    00000000febff070: 0x00000000 0x00000000 0x00000000 0x00000000

We know this latter VM boots, so we can infer that we're correctly reading from the device here. Testing again with qemu-kvm-rhev, we get:

    (qemu) info mtree
    ...
    00000000febff000-00000000febfffff (prio 0, i/o): 0000:08:00.1 BAR 0
      00000000febff000-00000000febfffff (prio 0, ramd): 0000:08:00.1 BAR 0 mmaps[0]

So we can see that the BAR is expanded from 1K to 4K and is fully covered by an mmap, and now:

    (qemu) xp /32x 0xfebff000
    00000000febff000: 0x01000014 0x01000014 0x00000086 0x00000086
    00000000febff010: 0x00000000 0x00000000 0x00001008 0x00001008
    00000000febff020: 0x00002adb 0x00002adb 0x7ffdf000 0x7ffdf000
    00000000febff030: 0x00000000 0x00000000 0x00000000 0x00000000
    00000000febff040: 0x00000000 0x00000000 0x00000000 0x00000000
    00000000febff050: 0x00000000 0x00000000 0x00001000 0x00001000
    00000000febff060: 0x00001000 0x00001000 0x00000000 0x00000000
    00000000febff070: 0x00000000 0x00000000 0x00000000 0x00000000

The data changes since we are reading from the device, but the format matches the working case, and of course this case also works.

These cases are the CPU reading from the device. DMA mapping only comes into play for accesses initiated by the device. So bumping the BAR up to 4k does allow the DMA mapping, but that doesn't play a role in the examples above for CPU reads from the device, which is where SeaBIOS is failing. DMA mapping device MMIO space is only relevant for peer-to-peer between devices, which is almost never used. DMA mapping of VM RAM is the key factor that is unchanged in any of these cases.

(In reply to Alex Williamson from comment #6)
> These cases are the CPU reading from the device. [...] DMA mapping of VM
> RAM is the key factor that is unchanged in any of these cases.
As expected from the VFIO maintainer, you're definitely right ;-) I did a quick test with the patch but commenting out the VFIO_IOMMU_MAP_DMA ioctl call, and the guest is still able to access the BAR.

But I still couldn't understand why using a size smaller than a page for the mmap and the MR resulted in the guest not being able to reach the BAR, so I kept digging. It turns out that, in qemu-kvm 1.5.3, there's a bug in the exec.c subpage subsystem when dealing with an MR where mr->ram == true. These are my observations:

1. Patched QEMU + BAR size == 4096 + VFIO_REGION_INFO_FLAG_MMAP:
   - Guest reads from BAR: address_space_rw -> qemu_get_ram_ptr
   - Works as expected

2. Patched QEMU + BAR size == 1024 + !(VFIO_REGION_INFO_FLAG_MMAP):
   - Guest reads from BAR: address_space_rw -> io_mem_read -> subpage_read
     -> io_mem_read -> vfio_bar_read
   - Works as expected

3. Patched QEMU + BAR size == 1024 + VFIO_REGION_INFO_FLAG_MMAP:
   - Guest reads from BAR: address_space_rw -> io_mem_read -> subpage_read
     -> io_mem_read -> subpage_ram_read -> qemu_get_ram_ptr
   - Reads from a bogus address

The behavior in (3) happens because:

exec.c:
1621 static uint64_t subpage_read(void *opaque, hwaddr addr,
1622                              unsigned len)
1623 {
(...)
1632     section = &mmio->as->dispatch->map.sections[mmio->sub_section[idx]];
         // addr == 0x0
         // mmio->base == 0xfebff000
1633     addr += mmio->base;
         // section->offset_within_address_space == 0xfebff000
1634     addr -= section->offset_within_address_space;
         // section->offset_within_region == 0
1635     addr += section->offset_within_region;
         // addr == 0x0
1636     return io_mem_read(section->mr, addr, len);
1637 }

exec.c:
1664 static uint64_t subpage_ram_read(void *opaque, hwaddr addr,
1665                                  unsigned size)
1666 {
1667     ram_addr_t raddr = addr;
         // addr == raddr == 0x0
1668     void *ptr = qemu_get_ram_ptr(raddr);
1669     switch (size) {
1670     case 1: return ldub_p(ptr);
1671     case 2: return lduw_p(ptr);
1672     case 4: return ldl_p(ptr);
1673     default: abort();
1674     }
1675 }

So subpage_read() assumes the next mr->ops->read will deal with a relative "addr", while subpage_ram_read() needs an absolute one, resulting in qemu_get_ram_ptr() obtaining a pointer to a different location. And that's the reason why SeaBIOS acted weird, hanging or rebooting when USB EHCI support was enabled: each write it did was hitting an arbitrary location (IDT?).

I've stopped investigating the problem with subpage_ram_read(), because this function isn't even present in qemu-kvm-rhev, but perhaps it'd be worth keeping this in mind in case something weird happens in the future (I was about to suggest an abort() in subpage_register() if mr->ram == true, but that would probably be overkill).

Reproduced this bz successfully, and verified it is ok with qemu-kvm-1.5.3-149.el7.

Test steps:
1. Unbind one EHCI device from the host, and bind it to vfio-pci.

    # lspci | grep "USB Enhanced"
    00:1a.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2 (rev 05)
    00:1d.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #1 (rev 05)

Choose "00:1a.0".
    # readlink /sys/bus/pci/devices/0000\:00\:1a.0/iommu_group
    ../../../kernel/iommu_groups/10
    # ls /sys/kernel/iommu_groups/10/devices/
    0000:00:1a.0

IOMMU group 10 only has one device, so it is ok for testing. Unbind it from the host and bind it to vfio-pci.

    # ll /sys/bus/pci/devices/0000\:00\:1a.0/driver
    lrwxrwxrwx. 1 root root 0 Nov 29 04:19 /sys/bus/pci/devices/0000:00:1a.0/driver -> ../../../bus/pci/drivers/vfio-pci

2. Boot the VM with this passthrough device:

    -device vfio-pci,host=00:1a.0,id=ehci \

3. Connect to the VM and check its status.

----------------Reproduce--------------
Test version:
qemu: qemu-kvm-1.5.3-145.el7
kernel: kernel-3.10.0-797.el7.x86_64

Test result:
Guest stuck at "Press ESC for boot menu" (I added "-boot menu=on" to the command line); without "-boot menu=on", the guest got stuck at "iPXE (http://ipxe.org)...". When I boot without the passthrough EHCI device, the guest works well.

----------------Verification--------------
Test version:
qemu: qemu-kvm-1.5.3-149.el7
kernel: kernel-3.10.0-797.el7.x86_64

Test result:
Guest can boot up normally; rebooted, shut down and hot-plugged/unplugged the passthrough device several times, no problem.

Move to VERIFIED accordingly.

Jiri, DMA is relevant to memory reads and writes initiated by the device. That's not the problem here. The problem was with processor access to memory-mapped BARs, specifically BARs which are smaller than the system page size. The mapping of processor accesses to these regions was incorrect.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0816
Created attachment 1350352 [details]
Patch backporting vfio_sub_page_bar_update_mapping

Kernels from the 3.10.0-693 series increase the number of VFIO_REGION_INFO_FLAG_MMAP capable devices. This has triggered a dormant issue in qemu 1.5, where BARs smaller than the host's page size get mmap'ed but not dma_map'ed, making them unreachable from the guest side.

The BAR is mmap'ed here:

hw/misc/vfio.c:
2573 static int vfio_mmap_bar(VFIOBAR *bar, MemoryRegion *mem, MemoryRegion *submem,
2574                          void **map, size_t size, off_t offset,
2575                          const char *name)
2576 {
2577     int ret = 0;
2578
2579     if (VFIO_ALLOW_MMAP && size && bar->flags & VFIO_REGION_INFO_FLAG_MMAP) {
2580         int prot = 0;
2581
2582         if (bar->flags & VFIO_REGION_INFO_FLAG_READ) {
2583             prot |= PROT_READ;
2584         }
2585
2586         if (bar->flags & VFIO_REGION_INFO_FLAG_WRITE) {
2587             prot |= PROT_WRITE;
2588         }
2589
2590         *map = mmap(NULL, size, prot, MAP_SHARED,
2591                     bar->fd, bar->fd_offset + offset);
2592         if (*map == MAP_FAILED) {
2593             *map = NULL;
2594             ret = -errno;
2595             goto empty_region;
2596         }
2597
2598         memory_region_init_ram_ptr(submem, name, size, *map);
2599         memory_region_set_skip_dump(submem);
2600     } else {
2601 empty_region:
2602         /* Create a zero sized sub-region to make cleanup easy. */
2603         memory_region_init(submem, name, 0);
2604     }
2605
2606     memory_region_add_subregion(mem, offset, submem);
2607
2608     return ret;
2609 }

But the region will be omitted here:

2257 static void vfio_listener_region_add(MemoryListener *listener,
(...)
2278
2279     iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
         // end is rounded down for page alignment
2280     end = (section->offset_within_address_space + section->size) &
2281           TARGET_PAGE_MASK;
         // if the section is smaller than TARGET_PAGE_SIZE, end is rounded
         // down to iova, and the function returns here
2283     if (iova >= end) {
2284         return;
2285     }

This can be fixed by backporting vfio_sub_page_bar_update_mapping from qemu 2.x. I'm attaching a patch with this approach.
Alternatively, a more conservative option could be to reject sub-page sizes in vfio_mmap_bar, falling back to the non-mmap'ed mode.

This has been reproduced with qemu-kvm-1.5.3-141.el7_4.2, passing through this USB EHCI device:

    00:1a.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2 (rev 05)

I suspect any USB EHCI device can trigger the issue. On the guest, the problem manifests itself as SeaBIOS reading bogus data from the controller during EHCI initialization and rebooting in a loop.