Bug 2014030
Summary: | Guest can not start with nvme disk and hostdev interface together | |
---|---|---|---
Product: | Red Hat Enterprise Linux 9 | Reporter: | yalzhang <yalzhang>
Component: | libvirt | Assignee: | Michal Privoznik <mprivozn>
libvirt sub component: | General | QA Contact: | yalzhang <yalzhang>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | medium | CC: | alex.williamson, coli, jdenemar, jinzhao, jsuchane, juzhang, lmen, mprivozn, stefanha, virt-maint, xuzhang, yanghliu
Version: | 9.0 | Keywords: | AutomationTriaged, Triaged, Upstream
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | libvirt-9.3.0-2.el9 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-11-07 08:30:47 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: | 9.3.0
Embargoed: | | |
Description
yalzhang@redhat.com
2021-10-14 10:21:50 UTC
Looks like qemu crashed, please attach the qemu log file of the VM. Thanks.

This smells like a bug in QEMU that I've encountered a while ago. It was fixed upstream:

https://gitlab.com/qemu-project/qemu/-/commit/15a730e7a3aaac180df72cd5730e0617bcf44a5a

I wonder whether there's something that libvirt can do, but from the commit and its message it does not look like so. Let me switch over to QEMU for further investigation. Philippe, can you investigate?

(In reply to Michal Privoznik from comment #2)
> This smells like a bug in QEMU that I've encountered a while ago. It was
> fixed upstream:
>
> https://gitlab.com/qemu-project/qemu/-/commit/15a730e7a3aaac180df72cd5730e0617bcf44a5a
>
> I wonder whether there's something that libvirt can do, but from the commit
> and its message it does not look like so. Let me switch over to QEMU for
> further investigation.

AFAIK this doesn't look like the same issue. The symptoms suggest libvirt isn't providing enough locked memory for both an assigned device and an NVMe device. Unlike all hostdev devices, the nvme-vfio driver always makes use of a separate container for devices, not just when a vIOMMU is present (perhaps even a separate container per NVMe device). AIUI the NVMe driver is better able to accommodate bumping into the locked memory limit, which would explain why instantiating or hot-adding the NVMe device last works, while instantiating or hot-adding the hostdev device last encounters hard failures, because it cannot simply use less locked memory. I would guess there's a workaround to set the hard_limit to a reasonable value to support both devices until libvirt can account for the locked memory requirements of various combinations of these devices.

I did a quick test on my current rhel9 test env first:

Test env:
5.14.0-7.el9.x86_64
qemu-kvm-6.1.0-5.el9.x86_64
libvirt-7.6.0-2.el9.x86_64

Test scenarios:
(1) start a VM with a hostdev interface VF and a hostdev NVMe device: the VM can be started successfully.
(2) start a VM with a hostdev interface VF only: the VM can be started successfully.
(3) start a VM with *a hostdev interface VF and an NVMe userspace driver device*: *this problem can be reproduced*.

Workaround:
> I would guess there's a workaround to set the hard_limit to a reasonable value to support both devices until libvirt can account for the locked memory requirements of various combinations of these devices.

This workaround works for me. The VM mentioned in test scenario (3) can be started after adding the following "hard_limit" XML to the VM:

<memtune>
  <hard_limit unit='KiB'>16777216</hard_limit>   <--- this value can be changed with different host configurations
</memtune>

Alright, so the bug is clearly in libvirt. @alex.williamson here's how libvirt calculates the memlock amount:

https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_domain.c#L9253

Basically, it takes <memory/> from the domain XML and adds 1GiB. Can we make a better estimate when vfio and nvme are at play? Unfortunately, I don't have a machine with a spare NVMe to test.
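For illustration, the heuristic described in the previous comment can be modeled roughly as follows. This is a minimal sketch; the function name and signature are invented for the example, and it is not libvirt's actual qemuDomainGetMemLockLimitBytes() code.

/* Minimal sketch of the pre-fix heuristic: guest memory plus a 1 GiB
 * "fudge factor" whenever any VFIO-backed device is present.
 * Illustrative only; not the real libvirt implementation. */
#include <stdbool.h>
#include <stdio.h>

static unsigned long long
memlock_limit_bytes_old(unsigned long long mem_kib, bool needs_vfio)
{
    if (!needs_vfio)
        return 0;                                 /* no explicit limit set */
    return (mem_kib + 1024ULL * 1024) * 1024;     /* <memory/> + 1 GiB */
}

int main(void)
{
    /* A 2 GiB guest with one <hostdev/> comes out at 3221225472 bytes,
     * the MEMLOCK value that shows up in the test output further below. */
    printf("%llu\n", memlock_limit_bytes_old(2097152ULL, true));
    return 0;
}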
Punting to Philippe, AIUI the vfio-nvme driver is better able to handle running into the locked memory limit and releasing memory, but is its locked memory usage bounded somehow until it hits that limit? Does each vfio-nvme device create a separate vfio container?

Michal, does qemuDomainNeedsVFIO() trigger for vfio-nvme devices? If we're setting a limit of (VM + 1GB) for a vfio-nvme device alone and that device consumes anything more than the "fudge" portion of the 1GB "fudge factor", then there won't be enough locked memory limit remaining for the assigned device. If vfio-nvme creates a container per device then libvirt would need to add the upper bound of locked memory per nvme device to the total locked memory.

(In reply to Michal Privoznik from comment #11)
> Alright, so the bug is clearly in libvirt. @alex.williamson here's how
> libvirt calculates the memlock amount:
>
> https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_domain.c#L9253
>
> Basically, it takes <memory/> from the domain XML and adds 1GiB. Can we
> make a better estimate when vfio and nvme are at play? Unfortunately, I
> don't have a machine with a spare NVMe to test.

In addition to guest RAM, block/nvme.c also DMA maps the NVMe submission/completion queues and QEMU-internal I/O buffers that are not in guest RAM (i.e. bounce buffers that are sometimes used). You can look at the DMA allocator in util/vfio-helpers.c to see how mappings are made. It would be interesting to dump QEMUVFIOState at the time when the failure occurs in order to understand what has been mapped.

(In reply to Stefan Hajnoczi from comment #13)
> (In reply to Michal Privoznik from comment #11)
> > Alright, so the bug is clearly in libvirt. @alex.williamson here's how
> > libvirt calculates the memlock amount:
> >
> > https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_domain.c#L9253
> >
> > Basically, it takes <memory/> from the domain XML and adds 1GiB. Can we
> > make a better estimate when vfio and nvme are at play? Unfortunately, I
> > don't have a machine with a spare NVMe to test.
>
> In addition to guest RAM, block/nvme.c also DMA maps the NVMe
> submission/completion queues and QEMU-internal I/O buffers that are not in
> guest RAM (i.e. bounce buffers that are sometimes used). You can look at the
> DMA allocator in util/vfio-helpers.c to see how mappings are made. It would
> be interesting to dump QEMUVFIOState at the time when the failure occurs in
> order to understand what has been mapped.

A bit more about what direction I'm thinking in:

Most of the I/O buffers should be in guest RAM. We want that code path to be fast, so Philippe and I discussed the idea of permanently mapping guest RAM in block/nvme.c in the past. Today I/O buffers are temporarily mapped. Temporary mappings accumulate until DMA mapping space is exhausted or VFIO refuses to map more; then the mapping space is cleared and the temporary mappings start accumulating again.

If most DMA is to/from guest RAM we could switch to a model that permanently maps guest RAM (like vfio-pci, maybe based on work-in-progress code where Philippe is unifying vfio-pci and vfio-helpers). It is still necessary to handle QEMU-internal I/O buffers (i.e. bounce buffers), but this rare situation could be handled via a mapping that doesn't grow large. Either it could be a dedicated bounce buffer in block/nvme.c that is permanently mapped (parallel requests need to wait their turn to use the single bounce buffer), or we could dynamically DMA map the I/O buffer but only one request at any time can do this (requests need to wait their turn). As long as most requests use guest RAM there won't be a measurable performance effect of capping DMA mappings in this way. This approach would set a much tighter limit on DMA mappings created by block/nvme.c.
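The accumulate-then-reset behaviour described above can be pictured with a self-contained toy model. Everything here (the 64 MiB budget, the function names, the 1 MiB request size) is invented for illustration; this is not the real util/vfio-helpers.c code, only a sketch of why temporary bounce-buffer mappings grow until they hit the locked-memory limit and are then dropped in bulk.

/* Toy model of temporary DMA mappings: they pile up until the (pretend)
 * locked-memory budget is exhausted, then all temporary mappings are
 * released and accumulation starts over. Numbers and names are made up. */
#include <stdio.h>

#define LOCKED_BUDGET (64u << 20)    /* pretend memlock budget: 64 MiB */

static unsigned locked;              /* bytes currently "pinned" */
static unsigned temp_locked;         /* portion owned by temporary mappings */

static int map_temporary(unsigned size)
{
    if (locked + size > LOCKED_BUDGET) {
        locked -= temp_locked;       /* budget hit: clear temporary mappings */
        temp_locked = 0;
        printf("budget exhausted, temporary mappings cleared\n");
        if (locked + size > LOCKED_BUDGET)
            return -1;               /* still does not fit: fail the request */
    }
    locked += size;
    temp_locked += size;
    return 0;
}

int main(void)
{
    /* Simulate a stream of 1 MiB bounce-buffer mappings. */
    for (int i = 0; i < 200; i++)
        if (map_temporary(1u << 20) < 0)
            printf("request %d failed\n", i);
    printf("locked at exit: %u MiB\n", locked >> 20);
    return 0;
}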
(In reply to Alex Williamson from comment #12)
> Punting to Philippe, AIUI the vfio-nvme driver is better able to handle
> running into the locked memory limit and releasing memory, but is its locked
> memory usage bounded somehow until it hits that limit? Does each vfio-nvme
> device create a separate vfio container?
>
> Michal, does qemuDomainNeedsVFIO() trigger for vfio-nvme devices? If we're
> setting a limit of (VM + 1GB) for a vfio-nvme device alone and that device
> consumes anything more than the "fudge" portion of the 1GB "fudge factor",
> then there won't be enough locked memory limit remaining for the assigned
> device. If vfio-nvme creates a container per device then libvirt would need
> to add the upper bound of locked memory per nvme device to the total locked
> memory.

Phil is no longer with Red Hat, so moving this needinfo to Stefan. Stefan, can you help?

Thanks,
-Klaus

(In reply to Klaus Heinrich Kiwi from comment #17)
> (In reply to Alex Williamson from comment #12)
> > Punting to Philippe, AIUI the vfio-nvme driver is better able to handle
> > running into the locked memory limit and releasing memory, but is its
> > locked memory usage bounded somehow until it hits that limit? Does each
> > vfio-nvme device create a separate vfio container?
> >
> > Michal, does qemuDomainNeedsVFIO() trigger for vfio-nvme devices? If we're
> > setting a limit of (VM + 1GB) for a vfio-nvme device alone and that device
> > consumes anything more than the "fudge" portion of the 1GB "fudge factor",
> > then there won't be enough locked memory limit remaining for the assigned
> > device. If vfio-nvme creates a container per device then libvirt would
> > need to add the upper bound of locked memory per nvme device to the total
> > locked memory.
>
> Phil is no longer with Red Hat, so moving this needinfo to Stefan. Stefan,
> can you help?

I don't see a quick fix in QEMU. We might be able to use less locked memory in QEMU, but ultimately libvirt needs to set a limit that is large enough.

(In reply to Alex Williamson from comment #12)
> Punting to Philippe, AIUI the vfio-nvme driver is better able to handle
> running into the locked memory limit and releasing memory, but is its locked
> memory usage bounded somehow until it hits that limit? Does each vfio-nvme
> device create a separate vfio container?

Yes, util/vfio-helpers.c always creates a new container.

yalzhang, can you please try to reproduce with libvirt-8.10.0 or newer? There were some changes in how libvirt calculates the limit in that release. Unfortunately, I don't have a box with a spare NVMe disk to test this.

<rant>
Also, this should have never been implemented in libvirt. I remember from my Theoretical Informatics class that computing how much tape a Turing Machine is going to consume is equivalent to determining whether it'll halt for a given input (the halting problem). We can guess, but there will always be a counter example.
</rant>

It still fails with the packages below:

# rpm -q libvirt qemu-kvm
libvirt-9.0.0-10.el9_2.x86_64
qemu-kvm-7.2.0-14.el9_2.x86_64

1. Start with nvme and hostdev device, fail:

# virsh dumpxml rhel
...
  <disk type='nvme' device='disk'>
    <driver name='qemu' type='raw'/>
    <source type='pci' managed='yes' namespace='1'>
      <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
    </source>
    <target dev='vdh' bus='virtio'/>
    <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
  </disk>
...
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x19' slot='0x00' function='0x1'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
  </hostdev>
...

# virsh start rhel
error: Failed to start domain 'rhel'
error: internal error: qemu unexpectedly closed the monitor: 2023-04-13T10:01:39.437525Z qemu-kvm: -device {"driver":"vfio-pci","host":"0000:19:00.1","id":"hostdev0","bus":"pci.7","addr":"0x0"}: VFIO_MAP_DMA failed: Cannot allocate memory
2023-04-13T10:01:39.437648Z qemu-kvm: -device {"driver":"vfio-pci","host":"0000:19:00.1","id":"hostdev0","bus":"pci.7","addr":"0x0"}: vfio 0000:19:00.1: failed to setup container for group 29: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x5602ed044590, 0xc0000, 0x7ff40000, 0x7f4ff3ec0000) = -2 (No such file or directory)

2. Start with hostdev device, hotplug nvme, succeed.

3. Start with nvme device, hotplug the hostdev device, fail:

# virsh attach-device rhel hostdev.xml
error: Failed to attach device from hostdev.xml
error: internal error: unable to execute QEMU command 'device_add': vfio 0000:19:00.1: failed to setup container for group 29: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x561ae2ce0790, 0x100000, 0x7ef00000, 0x7f5c4bf00000) = -12 (Cannot allocate memory)

Patches posted on the list:
https://listman.redhat.com/archives/libvir-list/2023-April/239372.html

Merged upstream as:

commit 5670c50ffb3cd999f0bee779bbaa8f7dc7b6e0e0
Author:     Michal Prívozník <mprivozn>
AuthorDate: Wed Apr 12 17:15:08 2023 +0200
Commit:     Michal Prívozník <mprivozn>
CommitDate: Thu Apr 20 08:37:22 2023 +0200

    qemu_domain: Increase memlock limit for NVMe disks

    When starting QEMU, or when hotplugging a PCI device QEMU might lock
    some memory. How much? Well, that's an undecidable problem. But
    despite that, we try to guess. And it more or less works, until
    there's a counter example. This time, it's a guest with both
    <hostdev/> and an NVMe <disk/>. I've started a simple guest with
    4GiB of memory:

      # virsh dominfo fedora
      Max memory:     4194304 KiB
      Used memory:    4194304 KiB

    And here are the amounts of memory that QEMU tried to lock,
    obtained via:

      grep VmLck /proc/$(pgrep qemu-kvm)/status

    1) with just one <hostdev/>
       VmLck:     4194308 kB

    2) with just one NVMe <disk/>
       VmLck:     4328544 kB

    3) with one <hostdev/> and one NVMe <disk/>
       VmLck:     8522852 kB

    Now, what's surprising is case 2) where the locked memory exceeds
    the VM memory. It almost resembles VDPA. Therefore, treat it as
    such. Unfortunately, I don't have a box with two or more spare
    NVMe-s so I can't tell for sure. But setting limit too tight means
    QEMU refuses to start.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2014030
    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: Martin Kletzander <mkletzan>

v9.2.0-263-g5670c50ffb
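A rough model of the calculation after this commit, consistent with the commit message above and with the MEMLOCK values measured in the verification below: each user-space NVMe disk (and each vDPA device) adds guest memory once more on top of the usual "<memory/> + 1 GiB" base needed for VFIO hostdevs. The function below is an illustration written for this report, not a copy of the code in qemu_domain.c.

/* Approximation of the post-fix memlock limit. Illustrative only;
 * not copied from libvirt's qemu_domain.c. */
#include <stdio.h>

static unsigned long long
memlock_limit_kib(unsigned long long mem_kib,
                  unsigned int nhostdev,    /* VFIO <hostdev/> devices   */
                  unsigned int nnvme,       /* <disk type='nvme'/> disks */
                  unsigned int nvdpa)       /* vhost-vdpa interfaces     */
{
    unsigned int factor = nnvme + nvdpa;    /* each maps guest RAM on its own */

    if (nhostdev > 0)
        factor++;                           /* hostdevs share one container */

    if (factor == 0)
        return 0;                           /* no explicit limit needed */

    return factor * mem_kib + 1024 * 1024;  /* plus the 1 GiB fudge factor */
}

int main(void)
{
    /* 2 GiB guest with one NVMe disk and one hostdev:
     * 2 * 2 GiB + 1 GiB = 5242880 KiB = 5368709120 bytes,
     * matching the MEMLOCK limit in the verification below. */
    printf("%llu KiB\n", memlock_limit_kib(2097152ULL, 1, 1, 0));
    return 0;
}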
Test with the packages below:
libvirt-9.2.0-2.el9_rc.79fdab25f6.x86_64
qemu-kvm-8.0.0-1.el9.x86_64

1. Start vm with both nvme and hostdev device, succeed.
After vm start, check the memory:

# virsh dominfo rhel | grep memory
Max memory:     2097152 KiB
Used memory:    2097152 KiB
# prlimit -p `pidof qemu-kvm` | grep mem
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2. Start vm with nvme, then hotplug hostdev device, succeed.
1) Start with nvme, check the memory:

# prlimit -p `pidof qemu-kvm` | grep mem
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2) Hotplug the hostdev interface and check the locked memory; no changes:

# virsh attach-device rhel hostdev.xml
Device attached successfully
# prlimit -p `pidof qemu-kvm` | grep mem
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

3. Start vm with hostdev device, hotplug nvme device.
1) Start vm with hostdev device:

# prlimit -p `pidof qemu-kvm` | grep mem
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2) Hotplug nvme and check the memory; no changes.

Hi Michal, I have checked the 3 scenarios below:
S1: start vm with nvme disk and hostdev interface together
S2: start vm with hostdev interface, then hotplug nvme disk
S3: start vm with nvme disk, then hotplug hostdev interface

In S1 & S3, the total memlock limit is $(current memory) * 2 + 1 G; while for S2, after the hotplug, it's $(current memory) + 1 G. Is this expected?

Test with the packages below:
libvirt-9.3.0-1.el9.x86_64
qemu-kvm-8.0.0-1.el9.x86_64

S1. Start vm with nvme disk and hostdev interface together:
1) start vm with 2G memory:

# virsh start rhel
Domain 'rhel' started
# virsh dominfo rhel | grep emory
Max memory:     2097152 KiB
Used memory:    2097152 KiB

2) check the locked memory limit and the current locked memory:

# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     4201956 kB

3) hotunplug the hostdev interface, and check that the locked memory limit does not decrease while the current locked memory decreases:

# virsh dumpxml rhel --xpath //hostdev > hostdev.xml
# virsh detach-device rhel hostdev.xml
Device detached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2113188 kB

4) hotunplug the nvme disk, check the values again:

# virsh detach-device rhel disk_nvme.xml
Device detached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     0 kB

S2. Start vm with hostdev interface, then hotplug nvme disk:
1) start vm with 2G memory; the memlock limit is 1G + current memory, which is as expected:

# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2) hotplug the nvme disk; the memlock limit does not change:

# virsh attach-device rhel disk_nvme.xml
Device attached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2091972 kB
S3. Start vm with nvme disk, then hotplug hostdev interface:
1) start vm set with 2G memory:

# virsh start rhel
Domain 'rhel' started
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2113188 kB

2) hotplug the hostdev device:

# virsh attach-device rhel hostdev_interface.xml
Device attached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     4201956 kB

(In reply to yalzhang from comment #28)
> Hi Michal, I have checked the 3 scenarios below:
> S1: start vm with nvme disk and hostdev interface together
> S2: start vm with hostdev interface, then hotplug nvme disk
> S3: start vm with nvme disk, then hotplug hostdev interface
>
> In S1 & S3, the total memlock limit is $(current memory) * 2 + 1 G;
> while for S2, after the hotplug, it's $(current memory) + 1 G. Is this
> expected?

No. I'll post a patch for that. We're probably not accounting for the hotplugged NVMe disk when calculating the new limit.

Patches posted on the list:
https://listman.redhat.com/archives/libvir-list/2023-May/239832.html

Test on the latest libvirt with the scenarios in comment 28; the result is as expected.

# rpm -q libvirt qemu-kvm
libvirt-9.3.0-2.el9.x86_64
qemu-kvm-8.0.0-3.el9.x86_64

For all 3 scenarios, the result is as expected. Now with 1 nvme and 1 hostdev, the locked memory limit is (current memory) * 2 + 1G.

For S2, start vm with hostdev interface, then hotplug nvme disk:
1) start vm with 2G memory; the memlock limit is 1G + current memory, which is as expected:

# virsh dominfo rhel | grep memory
Max memory:     2097152 KiB
Used memory:    2097152 KiB
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2) hotplug the nvme disk; the memlock limit now increases as expected:

# virsh attach-device rhel hostdev_nvme.xml
Device attached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     4201844 kB

Set to VERIFIED per the above comments.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: libvirt security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6409