Bug 2013752
| Field | Value |
|---|---|
| Summary | VM with vGPU and vCPU config of 1 socket 16 cores fails to start |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | ovirt-engine |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.4.8 |
| Target Milestone | ovirt-4.4.9-1 |
| Fixed In Version | ovirt-engine-4.4.9.4 |
| Reporter | amashah |
| Assignee | Milan Zamazal <mzamazal> |
| QA Contact | Nisim Simsolo <nsimsolo> |
| CC | ahadas, ddacosta, dfodor, gilboad, mavital, michal.skrivanek, mtessun, nsimsolo, pdwyer, swachira |
| Keywords | Regression |
| Flags | mavital: needinfo+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Doc Text | Previously, certain CPU topologies would cause virtual machines with vGPU to fail. The current release fixes this issue. |
| Type | Bug |
| oVirt Team | Virt |
| Last Closed | 2021-11-16 13:54:29 UTC |
**Description** (reported by amashah, 2021-10-13 16:14:14 UTC)

---

According to https://wiki.qemu.org/Features/VT-d, `caching-mode="on"` should indeed be set on the `<iommu>` `<driver>` element when vfio-pci devices are present. I could reproduce the problem, and the VM started for me once I added the caching-mode option. I don't know why the error occurs only with some CPU topologies; most likely it is a matter of luck. I'll prepare a patch to add the option.

---

Makes sense; that would also explain why it happens with a 4.4.8 engine and a 4.4.5 host, as the iommu device was added as part of the fix for bz 1946231.

---

**Comment 10** (Nisim Simsolo)

Verified with:

- ovirt-engine-4.4.9.4-0.1.el8ev
- qemu-kvm-6.0.0-33.module+el8.5.0+13041+05be2dc6.x86_64
- libvirt-daemon-7.6.0-6.module+el8.5.0+13051+7ddbe958.x86_64
- vdsm-4.40.90.4-1.el8ev.x86_64
- Nvidia driver version: 460.73.02

Verification scenario:
1. Reproduce the issue (try to run a VM with 1 socket and 16 cores per socket).
2. Upgrade ovirt-engine and the RHV host.
3. Run the VM again and verify it is running with an Nvidia vGPU instance.

---

(In reply to Nisim Simsolo from comment #10)

Looks good to me, thanks.

---

**Comment 12** (Gilboa Davara)

Not sure if it's fixed or not, but I'm still seeing it on vdsm-4.40.90.4-1.el8.x86_64.

---

(In reply to Gilboa Davara from comment #12)

The fix is on the engine side; what's the version of ovirt-engine?

---

**Comment 14** (Gilboa Davara)

```
$ rpm -q ovirt-engine
ovirt-engine-4.4.9.4-1.el8.noarch
```

---

(In reply to Gilboa Davara from comment #14)

OK, interesting. Can you please provide engine.log?

---

Created attachment 1841255 [details]: Engine log
Please note that I attempted to run the VM in 3 different ways:
- Q35/BIOS with all pass-through devices (GPU, audio, USB): Memory allocation failure (NUMA?).
- Q35/BIOS with one audio device: IOMMU caching mode error.
- i440FX with all pass-through devices (GPU, audio, USB): Works out of the box, including nVidia GPU driver.
- Gilboa
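The fix discussed in this thread amounts to enabling caching mode on the emulated vIOMMU in the VM's libvirt domain XML. A minimal sketch of the relevant fragment, under the assumption of an Intel vIOMMU and an Nvidia vGPU mediated device (element and attribute names follow libvirt's domain XML format; the mdev UUID is a placeholder, not taken from this bug report):

```xml
<domain type='kvm'>
  <!-- ... other domain elements omitted ... -->
  <devices>
    <!-- Emulated Intel vIOMMU; caching_mode='on' is the option the
         engine-side fix adds when vfio-pci devices are present. -->
    <iommu model='intel'>
      <driver intremap='on' caching_mode='on'/>
    </iommu>
    <!-- vGPU mediated device, exposed to the guest via vfio-pci.
         The UUID below is a placeholder. -->
    <hostdev mode='subsystem' type='mdev' model='vfio-pci' display='off'>
      <source>
        <address uuid='00000000-0000-0000-0000-000000000000'/>
      </source>
    </hostdev>
  </devices>
</domain>
```

Note that per the libvirt documentation, `intremap='on'` additionally requires the split irqchip, i.e. `<ioapic driver='qemu'/>` under `<features>`.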
---

I could reproduce the problem with a passthrough audio device. QEMU apparently requires the caching mode for any vfio-pci device, so we should enable it whenever any host device is present.

---

**Comment 18** (Gilboa Davara)

While I do agree that having sane defaults is preferable, I would suggest you expose the iommu and vfio flags in the UI instead. Out of 5 machines (4 Intel, one AMD) in 3 different oVirt clusters that export GPU/audio/USB pass-through devices, only one machine requires caching mode; the other 4 simply work as advertised. I wonder if enabling it by default won't break existing setups.

---

(In reply to Gilboa Davara from comment #18)

> While I do agree that having sane defaults is preferable, I would suggest
> you expose the iommu and vfio flags to the UI instead.

This would add complexity that would most likely be useful only as a workaround when something gets broken.

> Out of 5 machines (4 Intel, one AMD) in 3 different oVirt clusters that
> export GPU/Audio/USB pass-through devices only one machine requires caching
> mode. 4 others simply work as advertised.

The problem is currently known to exhibit only if the VM's maximum number of vCPUs is >= 256. Depending on the cluster level, the VM CPU topology, and the VM firmware type, the limit may or may not be reached. There can also be different QEMU versions on different hosts, and the problem may perhaps occur only on certain hardware.

> I wonder if enabling it by default won't break existing setups.

The QEMU documentation says:

```
caching-mode=on|off (default: off)

This enables caching mode for the VT-d emulated device. When
caching-mode is enabled, each guest DMA buffer mapping will generate an
IOTLB invalidation from the guest IOMMU driver to the vIOMMU device in
a synchronous way. It is required for "-device vfio-pci" to work with
the VT-d device, because host assigned devices requires to setup the
DMA mapping on the host before guest DMA starts.
```
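At the QEMU level, the documented requirement corresponds to an invocation along these lines. This is a hedged sketch, not a command line taken from this bug report; the host PCI address, memory size, and topology values are illustrative:

```shell
# Illustrative QEMU invocation with an emulated VT-d device and a
# vfio-pci host device. intremap=on requires the split irqchip on
# Q35 machine types.
qemu-system-x86_64 \
  -machine q35,accel=kvm,kernel-irqchip=split \
  -device intel-iommu,intremap=on,caching-mode=on \
  -device vfio-pci,host=0000:3b:00.0 \
  -m 4096 -smp sockets=1,cores=16,threads=1
# remaining options (disks, display, etc.) omitted
```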
Which means enabling the flag is non-optional with vfio-pci, and it works without the flag only due to some tolerance or coincidence.

---

Many thanks for the detailed response. If you need QA services to test the fix, please let me know.

---

Since it's too late to handle the additional problem in 4.4.9, I opened a new bug for PCI host devices: BZ 2023313.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (0-day RHV Manager (ovirt-engine) [ovirt-4.4.9]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4699
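As an aside on the ">= 256 vCPUs" threshold mentioned in the thread: what matters is the VM's *maximum* vCPU count, which oVirt sizes for CPU hotplug from the topology, not the number of vCPUs currently online. A simplified Python sketch of that arithmetic; the hotplug socket ceiling of 16 is an assumption for illustration, not the engine's exact configuration:

```python
# Why "1 socket, 16 cores" can hit the 256 max-vCPU threshold:
# the engine allows hotplugging additional sockets, so the maximum
# vCPU count is max_sockets * cores_per_socket * threads_per_core.
# MAX_SOCKETS here is a hypothetical ceiling for illustration.

MAX_SOCKETS = 16

def max_vcpus(cores_per_socket: int, threads_per_core: int = 1,
              max_sockets: int = MAX_SOCKETS) -> int:
    """Maximum vCPUs the VM could reach via CPU hotplug."""
    return max_sockets * cores_per_socket * threads_per_core

def in_affected_region(cores_per_socket: int, threads_per_core: int = 1) -> bool:
    """True when the topology reaches the >= 256 vCPU region where the
    missing caching-mode option was observed to break vGPU VMs."""
    return max_vcpus(cores_per_socket, threads_per_core) >= 256

if __name__ == "__main__":
    # 1 socket x 16 cores now, but hotpluggable to 16 sockets: 16 * 16 = 256
    print(max_vcpus(16))             # 256
    print(in_affected_region(16))    # True
    # A smaller topology stays under the threshold: 16 * 4 = 64
    print(in_affected_region(4))     # False
```

This would explain why the failure looked topology-dependent: two VMs with the same current vCPU count can land on opposite sides of the threshold once the hotplug ceiling is factored in.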