Bug 1511802

Summary: Regression in QEMU handling for sub-page MMIO BARs for vfio-pci devices

Product: Red Hat Enterprise Linux 7
Component: qemu-kvm
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Sergio Lopez <slopezpa>
Assignee: Alex Williamson <alex.williamson>
QA Contact: Yanan Fu <yfu>
CC: alex.williamson, chayang, jen, jherrman, juzhang, knoel, michen, mkalinin, mtessun, qzhang, rbalakri, slopezpa, virt-maint, yfu
Keywords: Regression, TestOnly, ZStream
Target Milestone: rc
Target Release: ---
Fixed In Version: qemu-kvm-1.5.3-146.el7
Doc Type: Bug Fix
Doc Text:
    Previously, the memory mapping of processor access to base address registers (BARs) smaller than the system page size was incorrect. As a consequence, devices assigned to a guest virtual machine using the vfio-pci driver did not work, which in some cases prevented the guest from booting. This update fixes the QEMU emulator to map the affected devices properly, and the described problem no longer occurs.
Cloned To: 1515110 (view as bug list)
Last Closed: 2018-04-10 14:38:15 UTC
Type: Bug
Bug Depends On: 1494181
Bug Blocks: 1515110
This bug is already fixed in qemu-kvm for RHEL-7.5 by bug 1494181, which includes commit db0da029a185 ('vfio: Generalize region support'). That commit inadvertently fixed this issue upstream by not mmap'ing sub-page regions:

    if (!vbasedev->no_mmap && size &&
        region->flags & VFIO_REGION_INFO_FLAG_MMAP) {

becomes:

    if (!vbasedev->no_mmap &&
        region->flags & VFIO_REGION_INFO_FLAG_MMAP &&
        !(region->size & ~qemu_real_host_page_mask)) {

The sub-page mmap is thus excluded as it was prior to the kernel change. Also note that the issue is with the mmap, not with the DMA map. Prior to the kernel change, or with the above fix, the region is not DMA mapped and the VM works well. DMA mapping only enables the possibility of peer-to-peer DMA between devices; the vCPU has full access to the region. I believe it's the sizing of the region which resolves the problem in the proposed fix, which also happens to enable it to be DMA mapped, but the DMA mapping is not actually the fix.

Also note that qemu-kvm-rhev-2.6.0-28.el7_3.z already contains db0da029a185 (the fix above), and additionally qemu-kvm-rhev-2.9.0-16.el7_4.z contains 95251725e335, which includes the basis for the proposed patch, allowing sub-page mmaps. Unless there is a specific need to support sub-page mmaps (the performance difference for a USB device is likely immeasurable) or the user is unable or unwilling to move to qemu-kvm-rhev, I would suggest that the existing qemu-kvm fix is sufficient. I've also verified that current RHEL-7.5 builds of qemu-kvm do work with assignment of Teradici USB functions.

Bug 1494181 is however not a good z-stream candidate, and if z-stream is requested here I would suggest a very limited version of db0da029a185, including only a backport of the change noted above.

Hi Alex,

(In reply to Alex Williamson from comment #2)
> The sub-page mmap is thus excluded as it was prior to the kernel change.
> Also note that the issue is with the mmap, not with the DMA map. [...]
> I believe it's the sizing of the region which resolves the problem in the
> proposed fix, which also happens to enable it to be DMA mapped, but the
> DMA mapping is not actually the fix.

Hum... that doesn't match my observations:

- Unpatched QEMU, device _without_ VFIO_REGION_INFO_FLAG_MMAP:
  1. The BAR is not mmap'ed nor DMA mapped.
  2. A zero-sized MR is installed, which allows QEMU to fall back to the
     underlying MR, which is configured with "vfio_bar_ops" MemoryRegionOps.
  * The Guest can access the BAR thanks to 2.

- Unpatched QEMU, device _with_ VFIO_REGION_INFO_FLAG_MMAP and BAR smaller
  than 4096:
  1. The BAR is successfully mmap'ed (I verified that it's possible to
     access the device's BAR by reading from the mapped region on the Host).
  2. A BAR-sized MR is installed, preventing QEMU from falling back to
     "vfio_bar_ops".
  3. A DMA map will _not_ be requested in vfio_listener_region_add, as
     explained in this BZ description.
  * As a consequence of 2 and 3, the BAR is not accessible from the Guest.

Sergio.

(In reply to Sergio Lopez from comment #4)
> Hum... that doesn't match my observations: [...]
> * As a consequence of 2 and 3, the BAR is not accessible from the Guest.

The point is that 3 is actually irrelevant. That step is missing in the first example and the device still works. The problem is at step 2 in this latter example: the MemoryRegion is BAR sized, and with the patch applied it's expanded out to a full page size (the BAR in question is 1K).

If we take a simple example QEMU instance:

    /usr/libexec/qemu-kvm -m 2G -net none -vga none -nographic -serial none \
        -device vfio-pci,host=8:00.1 -monitor stdio

Where 8:00.1 is:

    08:00.1 USB controller: Teradici Corp. Device 2200 (prog-if 20 [EHCI])
            Subsystem: Teradici Corp. Device 2200
            Physical Slot: 5
            Flags: fast devsel, IRQ 16, NUMA node 0
            Memory at ef908000 (64-bit, non-prefetchable) [size=1K]
            Capabilities: [80] Power Management version 3
            Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit+
            Capabilities: [c0] Express Endpoint, MSI 00
            Kernel driver in use: vfio-pci
            Kernel modules: snd_hda_intel

In the monitor we can see the mappings of the device:

    (qemu) info mtree
    ...
    00000000febff000-00000000febff3ff (prio 1, RW): VFIO 0000:08:00.1 BAR 0
      00000000febff000-00000000febff3ff (prio 0, RW): VFIO 0000:08:00.1 BAR 0 mmap

So let's try to read from the device:

    (qemu) xp /32x 0xfebff000
    00000000febff000: 0xf000ff53 0xf000ff53 0xf000e2c3 0xf000ff53
    00000000febff010: 0xf000ff53 0xf000ff54 0xf000ff53 0xf000ff53
    00000000febff020: 0xf000fea5 0xf000e987 0xf000ef4b 0xf000ef4b
    00000000febff030: 0xf000ef4b 0xf000ef4b 0xf000ef57 0xf000ef4b
    00000000febff040: 0xf000f065 0xf000f84d 0xf000f841 0xf000e3fe
    00000000febff050: 0xc200e739 0xf000f059 0xf000e82e 0xf000efd2
    00000000febff060: 0x0000d75c 0xf0000000 0xf000fe6e 0xf000ff53
    00000000febff070: 0xf000ff53 0xf000ff53 0xf0007774 0xf000ff53

Is that good or bad? Let's disable the mmap for this BAR and re-check:

    (qemu) info mtree
    ...
    00000000febff000-00000000febff3ff (prio 1, RW): VFIO 0000:08:00.1 BAR 0
      00000000febff000-00000000febfefff (prio 0, RW): VFIO 0000:08:00.1 BAR 0 mmap

Here we see the zero size mmap (appears -1 sized), but reading from the device:

    (qemu) xp /32x 0xfebff000
    00000000febff000: 0x01000014 0x00001414 0x00000086 0x00000000
    00000000febff010: 0x00000000 0x00080b00 0x00001008 0x00000000
    00000000febff020: 0x00000df8 0x00000000 0x7fffd000 0x7fffcf00
    00000000febff030: 0x00000000 0x00000000 0x00000000 0x00000000
    00000000febff040: 0x00000000 0x00000000 0x00000000 0x00000000
    00000000febff050: 0x00000000 0x00000001 0x00001000 0x00001000
    00000000febff060: 0x00001000 0x00001000 0x00000000 0x00000000
    00000000febff070: 0x00000000 0x00000000 0x00000000 0x00000000

We know this latter VM boots, so we can infer that we're correctly reading from the device here. Testing again with qemu-kvm-rhev, we get:

    (qemu) info mtree
    ...
    00000000febff000-00000000febfffff (prio 0, i/o): 0000:08:00.1 BAR 0
      00000000febff000-00000000febfffff (prio 0, ramd): 0000:08:00.1 BAR 0 mmaps[0]

So we can see that the BAR is expanded from 1K to 4K and is fully covered by an mmap, and now:

    (qemu) xp /32x 0xfebff000
    00000000febff000: 0x01000014 0x01000014 0x00000086 0x00000086
    00000000febff010: 0x00000000 0x00000000 0x00001008 0x00001008
    00000000febff020: 0x00002adb 0x00002adb 0x7ffdf000 0x7ffdf000
    00000000febff030: 0x00000000 0x00000000 0x00000000 0x00000000
    00000000febff040: 0x00000000 0x00000000 0x00000000 0x00000000
    00000000febff050: 0x00000000 0x00000000 0x00001000 0x00001000
    00000000febff060: 0x00001000 0x00001000 0x00000000 0x00000000
    00000000febff070: 0x00000000 0x00000000 0x00000000 0x00000000

The data changes since we are reading from the device, but the format matches the working case, and of course this case also works.

These cases are the CPU reading from the device. DMA mapping only comes into play for accesses initiated by the device. So bumping the BAR up to 4k does allow the DMA mapping, but that doesn't play a role in the examples above for CPU reads from the device, which is where SeaBIOS is failing. DMA mapping device MMIO space is only relevant for peer-to-peer between devices, which is almost never used. DMA mapping of VM RAM is the key factor that is unchanged in any of these cases.

(In reply to Alex Williamson from comment #6)
> These cases are the CPU reading from the device. [...] DMA mapping of VM
> RAM is the key factor that is unchanged in any of these cases.
As expected from the VFIO maintainer, you're definitely right ;-) I did a quick test with the patch but commenting out the VFIO_IOMMU_MAP_DMA ioctl call, and the guest is still able to access the BAR.

But I still couldn't understand why using a size smaller than a page for the mmap and the MR resulted in the guest not being able to reach the BAR, so I kept digging. It turns out that, in qemu-kvm 1.5.3, there's a bug in the exec.c subpage subsystem when dealing with an MR where mr->ram == true. These are my observations:

1. Patched QEMU + BAR size == 4096 + VFIO_REGION_INFO_FLAG_MMAP:
   - Guest reads from BAR: address_space_rw -> qemu_get_ram_ptr
   - Works as expected

2. Patched QEMU + BAR size == 1024 + !(VFIO_REGION_INFO_FLAG_MMAP):
   - Guest reads from BAR: address_space_rw -> io_mem_read -> subpage_read
     -> io_mem_read -> vfio_bar_read
   - Works as expected

3. Patched QEMU + BAR size == 1024 + VFIO_REGION_INFO_FLAG_MMAP:
   - Guest reads from BAR: address_space_rw -> io_mem_read -> subpage_read
     -> io_mem_read -> subpage_ram_read -> qemu_get_ram_ptr
   - Reads from a bogus address

The behavior in (3) happens because:

exec.c:
1621 static uint64_t subpage_read(void *opaque, hwaddr addr,
1622                              unsigned len)
1623 {
(...)
1632     section = &mmio->as->dispatch->map.sections[mmio->sub_section[idx]];
         // addr == 0x0
         // mmio->base == 0xfebff000
1633     addr += mmio->base;
         // section->offset_within_address_space == 0xfebff000
1634     addr -= section->offset_within_address_space;
         // section->offset_within_region == 0
1635     addr += section->offset_within_region;
         // addr == 0x0
1636     return io_mem_read(section->mr, addr, len);
1637 }

exec.c:
1664 static uint64_t subpage_ram_read(void *opaque, hwaddr addr,
1665                                  unsigned size)
1666 {
1667     ram_addr_t raddr = addr;
         // addr == raddr == 0x0
1668     void *ptr = qemu_get_ram_ptr(raddr);
1669     switch (size) {
1670     case 1: return ldub_p(ptr);
1671     case 2: return lduw_p(ptr);
1672     case 4: return ldl_p(ptr);
1673     default: abort();
1674     }
1675 }

So subpage_read() assumes the next mr->ops->read will deal with a relative "addr", while subpage_ram_read() needs an absolute one, resulting in qemu_get_ram_ptr() obtaining a pointer to a different location. And that's the reason why SeaBIOS acted weird, hanging or rebooting when USB EHCI support was enabled: each write it did was hitting an arbitrary location (IDT?).

I've stopped investigating the problem with subpage_ram_read(), because this function isn't even present in qemu-kvm-rhev, but perhaps it'd be worth keeping this in mind in case something weird happens in the future (I was about to suggest an abort() in subpage_register() if mr->ram == true, but that would probably be overkill).

Reproduced this bz successfully, and verified it is ok with qemu-kvm-1.5.3-149.el7.

Test steps:
1. Unbind one EHCI device from the host, and bind it to vfio-pci.

    # lspci | grep "USB Enhanced"
    00:1a.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2 (rev 05)
    00:1d.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #1 (rev 05)

Choose "00:1a.0".
    # readlink /sys/bus/pci/devices/0000\:00\:1a.0/iommu_group
    ../../../kernel/iommu_groups/10
    # ls /sys/kernel/iommu_groups/10/devices/
    0000:00:1a.0

IOMMU group 10 only has one device, so it is ok for testing. Unbind it from the host and bind it to vfio-pci.

    # ll /sys/bus/pci/devices/0000\:00\:1a.0/driver
    lrwxrwxrwx. 1 root root 0 Nov 29 04:19 /sys/bus/pci/devices/0000:00:1a.0/driver -> ../../../bus/pci/drivers/vfio-pci

2. Boot the VM with this passthrough device:

    -device vfio-pci,host=00:1a.0,id=ehci \

3. Connect to the VM and check its status.

----------------Reproduce--------------
Test version:
qemu: qemu-kvm-1.5.3-145.el7
kernel: kernel-3.10.0-797.el7.x86_64

Test result:
Guest stuck at "Press ESC for boot menu" (I added "-boot menu=on" to the command line); without "-boot menu=on", the guest got stuck at "iPXE (http://ipxe.org)...". When I boot without the passthrough EHCI device, the guest works well.

----------------Verification--------------
Test version:
qemu: qemu-kvm-1.5.3-149.el7
kernel: kernel-3.10.0-797.el7.x86_64

Test result:
Guest can boot up normally; rebooted, shut down and hot-plugged/unplugged the passthrough device several times, no problem.

Move to VERIFIED accordingly.

Jiri, DMA is relevant to memory reads and writes initiated by the device. That's not the problem here. The problem was with processor access to memory-mapped BARs, specifically BARs which are smaller than the system page size. The mapping of processor accesses to these regions was incorrect.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0816
Created attachment 1350352 [details]
Patch backporting vfio_sub_page_bar_update_mapping

Kernels from the 3.10.0-693 series increase the number of VFIO_REGION_INFO_FLAG_MMAP capable devices. This has triggered a dormant issue in qemu 1.5, where BARs smaller than the host's page size get mmap'ed but not dma_map'ed, making them unreachable from the guest side.

The BAR is mmap'ed here:

hw/misc/vfio.c:
2573 static int vfio_mmap_bar(VFIOBAR *bar, MemoryRegion *mem, MemoryRegion *submem,
2574                          void **map, size_t size, off_t offset,
2575                          const char *name)
2576 {
2577     int ret = 0;
2578
2579     if (VFIO_ALLOW_MMAP && size && bar->flags & VFIO_REGION_INFO_FLAG_MMAP) {
2580         int prot = 0;
2581
2582         if (bar->flags & VFIO_REGION_INFO_FLAG_READ) {
2583             prot |= PROT_READ;
2584         }
2585
2586         if (bar->flags & VFIO_REGION_INFO_FLAG_WRITE) {
2587             prot |= PROT_WRITE;
2588         }
2589
2590         *map = mmap(NULL, size, prot, MAP_SHARED,
2591                     bar->fd, bar->fd_offset + offset);
2592         if (*map == MAP_FAILED) {
2593             *map = NULL;
2594             ret = -errno;
2595             goto empty_region;
2596         }
2597
2598         memory_region_init_ram_ptr(submem, name, size, *map);
2599         memory_region_set_skip_dump(submem);
2600     } else {
2601 empty_region:
2602         /* Create a zero sized sub-region to make cleanup easy. */
2603         memory_region_init(submem, name, 0);
2604     }
2605
2606     memory_region_add_subregion(mem, offset, submem);
2607
2608     return ret;
2609 }

But the region will be omitted here:

2257 static void vfio_listener_region_add(MemoryListener *listener,
(...)
2278
2279     iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
         // end is rounded down for page alignment
2280     end = (section->offset_within_address_space + section->size) &
2281           TARGET_PAGE_MASK;
         // if the section is smaller than TARGET_PAGE_SIZE, end is rounded
         // down to iova, and the function returns here
2283     if (iova >= end) {
2284         return;
2285     }

This can be fixed by backporting vfio_sub_page_bar_update_mapping from qemu 2.x. I'm attaching a patch with this approach.
Alternatively, a more conservative option could be to reject sub-page sizes in vfio_mmap_bar, falling back to the non-mmap'ed mode.

This has been reproduced with qemu-kvm-1.5.3-141.el7_4.2, passing through this USB EHCI device:

    00:1a.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2 (rev 05)

I suspect any USB EHCI device can trigger the issue. On the guest, the problem manifests itself as SeaBIOS reading bogus data from the controller during EHCI initialization and rebooting in a loop.