Description of problem:
Memory ballooning zaps pages from QEMU's address space regardless of whether those pages are mapped through the IOMMU and pinned via vfio. This has two consequences: 1) the pages are not actually released to the host as intended, because the vfio pin still holds a reference to each page; and, more importantly, 2) when the guest memory balloon is deflated, a new page is allocated to replace each ballooned-out page. The new page introduces a new host virtual to physical mapping for the guest physical address, yet the original GPA to HPA mapping still resides in the IOMMU, referencing the still-pinned original page. At that point the vCPU and the assigned device reference different host physical pages for the same guest physical address.
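The divergence described above can be illustrated with a minimal simulation (this is not QEMU code; the class names, addresses, and page sizes are invented for demonstration):

```python
class Host:
    """Hands out fresh host physical pages; tracks vfio pin refcounts."""
    def __init__(self):
        self.next_hpa = 0x1000
        self.pinned = {}                  # HPA -> pin refcount

    def alloc_page(self):
        hpa = self.next_hpa
        self.next_hpa += 0x1000
        return hpa

class VM:
    def __init__(self, host):
        self.host = host
        self.cpu_map = {}                 # GPA -> HPA backing QEMU's VA (vCPU view)
        self.iommu_map = {}               # GPA -> HPA programmed through vfio

    def map_and_pin(self, gpa):
        hpa = self.host.alloc_page()
        self.cpu_map[gpa] = hpa
        self.iommu_map[gpa] = hpa         # vfio pins the page for DMA
        self.host.pinned[hpa] = self.host.pinned.get(hpa, 0) + 1

    def balloon_inflate(self, gpa):
        # The balloon zaps the page from QEMU's address space only; the
        # vfio pin still holds a reference, so the host cannot reclaim it.
        del self.cpu_map[gpa]

    def balloon_deflate(self, gpa):
        # When the guest touches the GPA again, a *new* host page backs it,
        # but the IOMMU entry still points at the old pinned page.
        self.cpu_map[gpa] = self.host.alloc_page()

host = Host()
vm = VM(host)
vm.map_and_pin(0x100000)
vm.balloon_inflate(0x100000)
vm.balloon_deflate(0x100000)
print(vm.cpu_map[0x100000] == vm.iommu_map[0x100000])  # False: vCPU and device disagree
```

After the deflate, DMA through the assigned device lands in the stale pinned page while the vCPU reads and writes the replacement page, which is exactly the corruption scenario this bug describes.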
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Assign a physical device to a VM with vfio
2. Inflate memory balloon to the furthest extent possible, ensuring as many pages as possible are "released" back to the host
3. Deflate the memory balloon
4. Perform DMA activity using the assigned device
For the time being, memory hotplug is a better solution for creating dynamic VM densities with device assignment. The immediate-term solution currently upstream is a balloon inhibitor that prevents ballooned pages from being zapped from the QEMU virtual address space when the guest balloon inflates. This does not solve 1) above (ballooned pages are still not released to the host as normally expected when using the balloon), but it does solve 2): since the page is never removed from QEMU's virtual address space, the GPA to HPA mappings remain consistent when the guest balloon is deflated.
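The upstream fix ("balloon: Allow multiple inhibit users") turns the inhibitor into a counter so that independent users, such as a vfio container and postcopy migration, can each hold it. A minimal sketch of that counting behavior (illustrative only; the class and method names are invented, not QEMU's API):

```python
import threading

class BalloonInhibitor:
    """Counting inhibitor: ballooning stays blocked while any user holds it."""
    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def inhibit(self, state):
        # state=True takes an inhibit reference, state=False drops one.
        with self._lock:
            self._count += 1 if state else -1
            assert self._count >= 0, "unbalanced inhibit/uninhibit"

    def is_inhibited(self):
        with self._lock:
            return self._count > 0

inh = BalloonInhibitor()
inh.inhibit(True)            # e.g. a vfio group attaches to a container
inh.inhibit(True)            # e.g. postcopy migration also inhibits
inh.inhibit(False)           # one user releases its reference
print(inh.is_inhibited())    # True: another inhibitor is still outstanding
```

A simple boolean would break here: the first user to release would re-enable ballooning while the other user still depends on stable mappings, which is why the counter is needed.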
Also note that in the case of assigned device hot-add, where none was previously attached, the mapping of the VM through the IOMMU will force any previously ballooned out pages to be remapped into QEMU's virtual address space. This makes previously ballooned pages ineffective, but forces the consistency of assigned device and vCPU mappings.
Long-term solutions that enable effective ballooning in combination with device assignment are difficult to realize. The IOMMU API currently used by vfio does not provide atomic IOMMU page table updates, making it impossible to support zapping a single PTE from a previous mapping. In fact, within the IOMMU API, an unmap can return an arbitrary size if the IOMMU driver chose to use a superpage mapping for the original request. As a result, the VFIO API does not allow unmaps with finer granularity than the original mapping. These problems need to be resolved before we can actually "zap" a page out of the VM address space. When repopulating a page in the IOMMU mapping, we can always map a single page, but there is currently no mechanism to inform the IOMMU to do this. MMU notifiers can be used to tear down the mapping of a page, but re-mapping currently relies on a page fault in the processor; with an assigned device, there is no page faulting and no guarantee that the device won't access the page before the processor does.
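The granularity constraint above can be sketched as a toy mapping tracker (illustrative only; the names are invented and this is not the vfio type1 implementation):

```python
class DmaMapTracker:
    """Toy model of the VFIO rule: an unmap must cover a whole original
    mapping, never a sub-range of one."""
    def __init__(self):
        self.maps = {}                # iova -> size of the original mapping

    def dma_map(self, iova, size):
        self.maps[iova] = size

    def dma_unmap(self, iova, size):
        # Reject unmaps that would split an existing mapping: the IOMMU
        # driver may have satisfied the original request with a superpage,
        # so tearing down part of it is not safe.
        if self.maps.get(iova) != size:
            raise ValueError("unmap granularity finer than original mapping")
        del self.maps[iova]

t = DmaMapTracker()
t.dma_map(0x0, 0x200000)          # e.g. backed by a 2MB superpage
try:
    t.dma_unmap(0x0, 0x1000)      # zapping a single 4KB page is rejected
except ValueError as e:
    print("rejected:", e)
t.dma_unmap(0x0, 0x200000)        # unmapping the whole original range works
```

This is why ballooning, which wants to drop individual 4KB pages, cannot be expressed through the current vfio unmap path.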
For now, balloon inhibiting resolves the worst of the incompatibility, though it does not provide effective balloon behavior. Again, memory hotplug is a more effective means of creating variable-density VMs when device assignment is involved.
Relevant upstream patches:
154304cd6e99 postcopy: Synchronize usage of the balloon inhibitor
8709b3954d41 vfio/pci: Fix failure to close file descriptor on error
a1c0f886496c vfio/pci: Handle subsystem realpath() returning NULL
238e91728503 vfio/ccw/pci: Allow devices to opt-in for ballooning
c65ee433153b vfio: Inhibit ballooning based on group attachment to a container
f59489423ab7 kvm: Use inhibit to prevent ballooning without synchronous mmu
01ccbec7bdf6 balloon: Allow multiple inhibit users
These are all included in QEMU 3.1.
Fix included in qemu-kvm-rhev-2.12.0-20.el7
Same steps as https://bugzilla.redhat.com/show_bug.cgi?id=1650272#c4, reproduced with rhel8 guest against qemu-kvm-rhev-2.12.0-19.el7.
Verified against qemu-kvm-rhev-2.12.0-20.el7; both rhel8 and rhel7 guests work well without DMA errors.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.