Bug 2014030
Summary: | Guest can not start with nvme disk and hostdev interface together | |
---|---|---|---
Product: | Red Hat Enterprise Linux 9 | Reporter: | yalzhang <yalzhang>
Component: | libvirt | Assignee: | Michal Privoznik <mprivozn>
libvirt sub component: | General | QA Contact: | yalzhang <yalzhang>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | medium | CC: | alex.williamson, coli, jdenemar, jinzhao, jsuchane, juzhang, lmen, mprivozn, stefanha, virt-maint, xuzhang, yanghliu
Version: | 9.0 | Keywords: | AutomationTriaged, Triaged, Upstream
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | libvirt-9.3.0-2.el9 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-11-07 08:30:47 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: | 9.3.0
Embargoed: | | |
Description
yalzhang@redhat.com
2021-10-14 10:21:50 UTC
Looks like qemu crashed, please attach the qemu log file of the VM. Thanks.

This smells like a bug in QEMU that I've encountered a while ago. It was fixed upstream:

https://gitlab.com/qemu-project/qemu/-/commit/15a730e7a3aaac180df72cd5730e0617bcf44a5a

I wonder whether there's something that libvirt can do, but from the commit and its message it does not look like so. Let me switch over to QEMU for further investigation. Philippe, can you investigate?

(In reply to Michal Privoznik from comment #2)
> This smells like a bug in QEMU that I've encountered a while ago. It was
> fixed upstream:
>
> https://gitlab.com/qemu-project/qemu/-/commit/15a730e7a3aaac180df72cd5730e0617bcf44a5a
>
> I wonder whether there's something that libvirt can do, but from the commit
> and its message it does not look like so. Let me switch over to QEMU for
> further investigation.

AFAIK this doesn't look like the same issue. The symptoms suggest libvirt isn't providing enough locked memory for both an assigned device and an NVMe device. Unlike all hostdev devices, the nvme-vfio driver always makes use of a separate container for devices, not just when a vIOMMU is present (perhaps even a separate container per NVMe device). AIUI the NVMe driver is better able to accommodate bumping into the locked memory limit, which would explain why instantiating or hot-adding the NVMe device last works, while instantiating or hot-adding the hostdev device last encounters hard failures, because it cannot simply use less locked memory. I would guess there's a workaround to set the hard_limit to a reasonable value to support both devices until libvirt can account for the locked memory requirements of various combinations of these devices.

I did a quick test on my current rhel9 test env first:

Test env:
5.14.0-7.el9.x86_64
qemu-kvm-6.1.0-5.el9.x86_64
libvirt-7.6.0-2.el9.x86_64

Test scenarios:
(1) start a VM with a hostdev interface VF and a hostdev NVMe device: the VM can be started successfully.
(2) start a VM with a hostdev interface VF only: the VM can be started successfully.
(3) start a VM with *a hostdev interface VF and an NVMe userspace driver device*: *this problem can be reproduced*.

Workaround:
> I would guess there's a workaround to set the hard_limit to a reasonable value to support both devices until libvirt can account for the locked memory requirements of various combinations of these devices.

This workaround works for me. The VM mentioned in test scenario (3) can be started after adding the following "hard_limit" XML to the VM:

<memtune>
  <hard_limit unit='KiB'>16777216</hard_limit>   <--- this value can be changed with different host configurations
</memtune>

Alright, so the bug is clearly in libvirt. @alex.williamson here's how libvirt calculates the memlock amount:

https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_domain.c#L9253

Basically, it takes <memory/> from the domain XML and adds 1GiB. Can we make a better estimate when vfio and nvme are at play? Unfortunately, I don't have a machine with a spare NVMe to test.
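For illustration, the heuristic described in the previous comment can be modeled roughly as follows. This is a minimal sketch; the function name and signature are invented for the example, and it is not libvirt's actual qemuDomainGetMemLockLimitBytes() code.

/* Minimal sketch of the pre-fix heuristic: guest memory plus a 1 GiB
 * "fudge factor" whenever any VFIO-backed device is present.
 * Illustrative only; not the real libvirt implementation. */
#include <stdbool.h>
#include <stdio.h>

static unsigned long long
memlock_limit_bytes_old(unsigned long long mem_kib, bool needs_vfio)
{
    if (!needs_vfio)
        return 0;                                 /* no explicit limit set */
    return (mem_kib + 1024ULL * 1024) * 1024;     /* <memory/> + 1 GiB */
}

int main(void)
{
    /* A 2 GiB guest with one <hostdev/> comes out at 3221225472 bytes,
     * the MEMLOCK value that shows up in the test output further below. */
    printf("%llu\n", memlock_limit_bytes_old(2097152ULL, true));
    return 0;
}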
Punting to Philippe, AIUI the vfio-nvme driver is better able to handle running into the locked memory limit and releasing memory, but is its locked memory usage bounded somehow until it hits that limit? Does each vfio-nvme device create a separate vfio container?

Michal, does qemuDomainNeedsVFIO() trigger for vfio-nvme devices? If we're setting a limit of (VM + 1GB) for a vfio-nvme device alone and that device consumes anything more than the "fudge" portion of the 1GB "fudge factor", then there won't be enough locked memory limit remaining for the assigned device. If vfio-nvme creates a container per device then libvirt would need to add the upper bound of locked memory per nvme device to the total locked memory.

(In reply to Michal Privoznik from comment #11)
> Alright, so the bug is clearly in libvirt. @alex.williamson here's how
> libvirt calculates the memlock amount:
>
> https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_domain.c#L9253
>
> Basically, it takes <memory/> from the domain XML and adds 1GiB. Can we
> make a better estimate when vfio and nvme are at play? Unfortunately, I
> don't have a machine with a spare NVMe to test.

In addition to guest RAM, block/nvme.c also DMA maps the NVMe submission/completion queues and QEMU-internal I/O buffers that are not in guest RAM (i.e. bounce buffers that are sometimes used). You can look at the DMA allocator in util/vfio-helpers.c to see how mappings are made. It would be interesting to dump QEMUVFIOState at the time when the failure occurs in order to understand what has been mapped.

(In reply to Stefan Hajnoczi from comment #13)
> (In reply to Michal Privoznik from comment #11)
> > Alright, so the bug is clearly in libvirt. @alex.williamson here's how
> > libvirt calculates the memlock amount:
> >
> > https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_domain.c#L9253
> >
> > Basically, it takes <memory/> from the domain XML and adds 1GiB. Can we
> > make a better estimate when vfio and nvme are at play? Unfortunately, I
> > don't have a machine with a spare NVMe to test.
>
> In addition to guest RAM, block/nvme.c also DMA maps the NVMe
> submission/completion queues and QEMU-internal I/O buffers that are not in
> guest RAM (i.e. bounce buffers that are sometimes used). You can look at the
> DMA allocator in util/vfio-helpers.c to see how mappings are made. It would
> be interesting to dump QEMUVFIOState at the time when the failure occurs in
> order to understand what has been mapped.

A bit more about what direction I'm thinking in:

Most of the I/O buffers should be in guest RAM. We want that code path to be fast, so Philippe and I discussed the idea of permanently mapping guest RAM in block/nvme.c in the past. Today I/O buffers are temporarily mapped. Temporary mappings accumulate until DMA mapping space is exhausted or VFIO refuses to map more; then the mapping space is cleared and the temporary mappings start accumulating again.

If most DMA is to/from guest RAM we could switch to a model that permanently maps guest RAM (like vfio-pci, maybe based on work-in-progress code where Philippe is unifying vfio-pci and vfio-helpers). It is still necessary to handle QEMU-internal I/O buffers (i.e. bounce buffers), but this rare situation could be handled via a mapping that doesn't grow large. Either it could be a dedicated bounce buffer in block/nvme.c that is permanently mapped (parallel requests need to wait their turn to use the single bounce buffer), or we could dynamically DMA map the I/O buffer but only one request at any time can do this (requests need to wait their turn). As long as most requests use guest RAM there won't be a measurable performance effect of capping DMA mappings in this way. This approach would set a much tighter limit on DMA mappings created by block/nvme.c.
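The accumulate-then-reset behaviour described above can be pictured with a self-contained toy model. Everything here (the 64 MiB budget, the function names, the 1 MiB request size) is invented for illustration; this is not the real util/vfio-helpers.c code, only a sketch of why temporary bounce-buffer mappings grow until they hit the locked-memory limit and are then dropped in bulk.

/* Toy model of temporary DMA mappings: they pile up until the (pretend)
 * locked-memory budget is exhausted, then all temporary mappings are
 * released and accumulation starts over. Numbers and names are made up. */
#include <stdio.h>

#define LOCKED_BUDGET (64u << 20)    /* pretend memlock budget: 64 MiB */

static unsigned locked;              /* bytes currently "pinned" */
static unsigned temp_locked;         /* portion owned by temporary mappings */

static int map_temporary(unsigned size)
{
    if (locked + size > LOCKED_BUDGET) {
        locked -= temp_locked;       /* budget hit: clear temporary mappings */
        temp_locked = 0;
        printf("budget exhausted, temporary mappings cleared\n");
        if (locked + size > LOCKED_BUDGET)
            return -1;               /* still does not fit: fail the request */
    }
    locked += size;
    temp_locked += size;
    return 0;
}

int main(void)
{
    /* Simulate a stream of 1 MiB bounce-buffer mappings. */
    for (int i = 0; i < 200; i++)
        if (map_temporary(1u << 20) < 0)
            printf("request %d failed\n", i);
    printf("locked at exit: %u MiB\n", locked >> 20);
    return 0;
}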
(In reply to Alex Williamson from comment #12)
> Punting to Philippe, AIUI the vfio-nvme driver is better able to handle
> running into the locked memory limit and releasing memory, but is its locked
> memory usage bounded somehow until it hits that limit? Does each vfio-nvme
> device create a separate vfio container?
>
> Michal, does qemuDomainNeedsVFIO() trigger for vfio-nvme devices? If we're
> setting a limit of (VM + 1GB) for a vfio-nvme device alone and that device
> consumes anything more than the "fudge" portion of the 1GB "fudge factor",
> then there won't be enough locked memory limit remaining for the assigned
> device. If vfio-nvme creates a container per device then libvirt would need
> to add the upper bound of locked memory per nvme device to the total locked
> memory.

Phil is no longer with Red Hat, so moving this needinfo to Stefan. Stefan, can you help?

Thanks,
-Klaus

(In reply to Klaus Heinrich Kiwi from comment #17)
> (In reply to Alex Williamson from comment #12)
> > Punting to Philippe, AIUI the vfio-nvme driver is better able to handle
> > running into the locked memory limit and releasing memory, but is its
> > locked memory usage bounded somehow until it hits that limit? Does each
> > vfio-nvme device create a separate vfio container?
> >
> > Michal, does qemuDomainNeedsVFIO() trigger for vfio-nvme devices? If we're
> > setting a limit of (VM + 1GB) for a vfio-nvme device alone and that device
> > consumes anything more than the "fudge" portion of the 1GB "fudge factor",
> > then there won't be enough locked memory limit remaining for the assigned
> > device. If vfio-nvme creates a container per device then libvirt would
> > need to add the upper bound of locked memory per nvme device to the total
> > locked memory.
>
> Phil is no longer with Red Hat, so moving this needinfo to Stefan. Stefan,
> can you help?

I don't see a quick fix in QEMU. We might be able to use less locked memory in QEMU, but ultimately libvirt needs to set a limit that is large enough.

(In reply to Alex Williamson from comment #12)
> Punting to Philippe, AIUI the vfio-nvme driver is better able to handle
> running into the locked memory limit and releasing memory, but is its locked
> memory usage bounded somehow until it hits that limit? Does each vfio-nvme
> device create a separate vfio container?

Yes, util/vfio-helpers.c always creates a new container.

yalzhang, can you please try to reproduce with libvirt-8.10.0 or newer? There were some changes in how libvirt calculates the limit in that release. Unfortunately, I don't have a box with a spare NVMe disk to test this.

<rant>
Also, this should have never been implemented in libvirt. I remember from my Theoretical Informatics class that computing how much tape a Turing Machine is going to consume is equivalent to determining whether it'll halt for a given input (the halting problem). We can guess, but there will always be a counter example.
</rant>

It still fails with the packages below:

# rpm -q libvirt qemu-kvm
libvirt-9.0.0-10.el9_2.x86_64
qemu-kvm-7.2.0-14.el9_2.x86_64

1. Start with nvme and hostdev device, fail:

# virsh dumpxml rhel
...
  <disk type='nvme' device='disk'>
    <driver name='qemu' type='raw'/>
    <source type='pci' managed='yes' namespace='1'>
      <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
    </source>
    <target dev='vdh' bus='virtio'/>
    <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
  </disk>
...
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x19' slot='0x00' function='0x1'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
  </hostdev>
...

# virsh start rhel
error: Failed to start domain 'rhel'
error: internal error: qemu unexpectedly closed the monitor: 2023-04-13T10:01:39.437525Z qemu-kvm: -device {"driver":"vfio-pci","host":"0000:19:00.1","id":"hostdev0","bus":"pci.7","addr":"0x0"}: VFIO_MAP_DMA failed: Cannot allocate memory
2023-04-13T10:01:39.437648Z qemu-kvm: -device {"driver":"vfio-pci","host":"0000:19:00.1","id":"hostdev0","bus":"pci.7","addr":"0x0"}: vfio 0000:19:00.1: failed to setup container for group 29: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x5602ed044590, 0xc0000, 0x7ff40000, 0x7f4ff3ec0000) = -2 (No such file or directory)

2. Start with hostdev device, hotplug nvme, succeed.

3. Start with nvme device, hotplug the hostdev device, fail:

# virsh attach-device rhel hostdev.xml
error: Failed to attach device from hostdev.xml
error: internal error: unable to execute QEMU command 'device_add': vfio 0000:19:00.1: failed to setup container for group 29: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x561ae2ce0790, 0x100000, 0x7ef00000, 0x7f5c4bf00000) = -12 (Cannot allocate memory)

Patches posted on the list:
https://listman.redhat.com/archives/libvir-list/2023-April/239372.html

Merged upstream as:

commit 5670c50ffb3cd999f0bee779bbaa8f7dc7b6e0e0
Author:     Michal Prívozník <mprivozn>
AuthorDate: Wed Apr 12 17:15:08 2023 +0200
Commit:     Michal Prívozník <mprivozn>
CommitDate: Thu Apr 20 08:37:22 2023 +0200

    qemu_domain: Increase memlock limit for NVMe disks

    When starting QEMU, or when hotplugging a PCI device QEMU might lock
    some memory. How much? Well, that's an undecidable problem. But
    despite that, we try to guess. And it more or less works, until
    there's a counter example. This time, it's a guest with both
    <hostdev/> and an NVMe <disk/>. I've started a simple guest with
    4GiB of memory:

      # virsh dominfo fedora
      Max memory:     4194304 KiB
      Used memory:    4194304 KiB

    And here are the amounts of memory that QEMU tried to lock,
    obtained via:

      grep VmLck /proc/$(pgrep qemu-kvm)/status

    1) with just one <hostdev/>
       VmLck:     4194308 kB

    2) with just one NVMe <disk/>
       VmLck:     4328544 kB

    3) with one <hostdev/> and one NVMe <disk/>
       VmLck:     8522852 kB

    Now, what's surprising is case 2) where the locked memory exceeds
    the VM memory. It almost resembles VDPA. Therefore, treat it as
    such. Unfortunately, I don't have a box with two or more spare
    NVMe-s so I can't tell for sure. But setting limit too tight means
    QEMU refuses to start.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2014030
    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: Martin Kletzander <mkletzan>

v9.2.0-263-g5670c50ffb
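A rough model of the calculation after this commit, consistent with the commit message above and with the MEMLOCK values measured in the verification below: each user-space NVMe disk (and each vDPA device) adds guest memory once more on top of the usual "<memory/> + 1 GiB" base needed for VFIO hostdevs. The function below is an illustration written for this report, not a copy of the code in qemu_domain.c.

/* Approximation of the post-fix memlock limit. Illustrative only;
 * not copied from libvirt's qemu_domain.c. */
#include <stdio.h>

static unsigned long long
memlock_limit_kib(unsigned long long mem_kib,
                  unsigned int nhostdev,    /* VFIO <hostdev/> devices   */
                  unsigned int nnvme,       /* <disk type='nvme'/> disks */
                  unsigned int nvdpa)       /* vhost-vdpa interfaces     */
{
    unsigned int factor = nnvme + nvdpa;    /* each maps guest RAM on its own */

    if (nhostdev > 0)
        factor++;                           /* hostdevs share one container */

    if (factor == 0)
        return 0;                           /* no explicit limit needed */

    return factor * mem_kib + 1024 * 1024;  /* plus the 1 GiB fudge factor */
}

int main(void)
{
    /* 2 GiB guest with one NVMe disk and one hostdev:
     * 2 * 2 GiB + 1 GiB = 5242880 KiB = 5368709120 bytes,
     * matching the MEMLOCK limit in the verification below. */
    printf("%llu KiB\n", memlock_limit_kib(2097152ULL, 1, 1, 0));
    return 0;
}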
Test with the packages below:
libvirt-9.2.0-2.el9_rc.79fdab25f6.x86_64
qemu-kvm-8.0.0-1.el9.x86_64

1. Start vm with both nvme and hostdev device, succeed.
After vm start, check the memory:

# virsh dominfo rhel | grep memory
Max memory:     2097152 KiB
Used memory:    2097152 KiB
# prlimit -p `pidof qemu-kvm` | grep mem
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2. Start vm with nvme, then hotplug hostdev device, succeed.
1) Start with nvme, check the memory:

# prlimit -p `pidof qemu-kvm` | grep mem
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2) Hotplug the hostdev interface and check the locked memory; no changes:

# virsh attach-device rhel hostdev.xml
Device attached successfully
# prlimit -p `pidof qemu-kvm` | grep mem
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

3. Start vm with hostdev device, hotplug nvme device.
1) Start vm with hostdev device:

# prlimit -p `pidof qemu-kvm` | grep mem
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2) Hotplug nvme and check the memory; no changes.

Hi Michal, I have checked the 3 scenarios below:
S1: start vm with nvme disk and hostdev interface together
S2: start vm with hostdev interface, then hotplug nvme disk
S3: start vm with nvme disk, then hotplug hostdev interface

In S1 & S3, the total memlock limit is $(current memory) * 2 + 1 G; while for S2, after the hotplug, it's $(current memory) + 1 G. Is this expected?

Test with the packages below:
libvirt-9.3.0-1.el9.x86_64
qemu-kvm-8.0.0-1.el9.x86_64

S1. Start vm with nvme disk and hostdev interface together:
1) start vm with 2G memory:

# virsh start rhel
Domain 'rhel' started
# virsh dominfo rhel | grep emory
Max memory:     2097152 KiB
Used memory:    2097152 KiB

2) check the locked memory limit and the current locked memory:

# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     4201956 kB

3) hotunplug the hostdev interface, and check that the locked memory limit does not decrease while the current locked memory decreases:

# virsh dumpxml rhel --xpath //hostdev > hostdev.xml
# virsh detach-device rhel hostdev.xml
Device detached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2113188 kB

4) hotunplug the nvme disk, check the values again:

# virsh detach-device rhel disk_nvme.xml
Device detached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     0 kB

S2. Start vm with hostdev interface, then hotplug nvme disk:
1) start vm with 2G memory; the memlock limit is 1G + current memory, which is as expected:

# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2) hotplug the nvme disk; the memlock limit does not change:

# virsh attach-device rhel disk_nvme.xml
Device attached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2091972 kB
S3. Start vm with nvme disk, then hotplug hostdev interface:
1) start vm set with 2G memory:

# virsh start rhel
Domain 'rhel' started
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2113188 kB

2) hotplug the hostdev device:

# virsh attach-device rhel hostdev_interface.xml
Device attached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     4201956 kB

(In reply to yalzhang from comment #28)
> Hi Michal, I have checked the 3 scenarios below:
> S1: start vm with nvme disk and hostdev interface together
> S2: start vm with hostdev interface, then hotplug nvme disk
> S3: start vm with nvme disk, then hotplug hostdev interface
>
> In S1 & S3, the total memlock limit is $(current memory) * 2 + 1 G;
> while for S2, after the hotplug, it's $(current memory) + 1 G. Is this
> expected?

No. I'll post a patch for that. We're probably not accounting for the hotplugged NVMe disk when calculating the new limit.

Patches posted on the list:
https://listman.redhat.com/archives/libvir-list/2023-May/239832.html

Test on the latest libvirt with the scenarios in comment 28; the result is as expected.

# rpm -q libvirt qemu-kvm
libvirt-9.3.0-2.el9.x86_64
qemu-kvm-8.0.0-3.el9.x86_64

For all 3 scenarios, the result is as expected. Now with 1 nvme and 1 hostdev, the locked memory limit is (current memory) * 2 + 1G.

For S2, start vm with hostdev interface, then hotplug nvme disk:
1) start vm with 2G memory; the memlock limit is 1G + current memory, which is as expected:

# virsh dominfo rhel | grep memory
Max memory:     2097152 KiB
Used memory:    2097152 KiB
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 3221225472 3221225472 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     2088656 kB

2) hotplug the nvme disk; the memlock limit now increases as expected:

# virsh attach-device rhel hostdev_nvme.xml
Device attached successfully
# prlimit -p `pidof qemu-kvm` | grep -i memlock
MEMLOCK    max locked-in-memory address space 5368709120 5368709120 bytes
# grep VmLck /proc/$(pgrep qemu-kvm)/status
VmLck:     4201844 kB

Set to VERIFIED per the above comments.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: libvirt security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6409