Bug 2151869 - libvirt kills virtual machine on restart when 2M and 1G hugepages are mounted
Summary: libvirt kills virtual machine on restart when 2M and 1G hugepages are mounted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
: 9.0
Assignee: Michal Privoznik
QA Contact: liang cong
URL:
Whiteboard:
Depends On: 2123196 2155189
Blocks: 2132176 2132177 2132178 2152083 2152084
 
Reported: 2022-12-08 11:53 UTC by Jaroslav Suchanek
Modified: 2023-05-09 08:13 UTC (History)
CC List: 14 users

Fixed In Version: libvirt-9.0.0-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2123196
: 2152083 2152084 2155189 (view as bug list)
Environment:
Last Closed: 2023-05-09 07:27:43 UTC
Type: Bug
Target Upstream Version: 9.0.0
Embargoed:
lcong: needinfo+




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-141662 0 None None None 2022-12-08 11:56:56 UTC
Red Hat Product Errata RHBA-2023:2171 0 None None None 2023-05-09 07:29:09 UTC

Description Jaroslav Suchanek 2022-12-08 11:53:48 UTC
This has been fixed in rhel-8.6.z (bug 2132177) but was not backported to rhel-9.0.z yet.

+++ This bug was initially created as a clone of Bug #2123196 +++

Description of problem:

A running VM is killed by libvirt in qemuProcessReconnect() when libvirt is restarted under the following conditions:
* VM running with 1G hugepages
* Host has both 1G and 2M hugetlbfs mounted

Within qemuProcessReconnect(), the kill happens because qemuProcessBuildDestroyMemoryPaths() returns -1, so qemuProcessReconnect() jumps (goto) to the error label at the end of the function and kills the VM. See:

(gdb) b qemuProcessBuildDestroyMemoryPaths
(gdb) r
(gdb) bt
#0  qemuProcessBuildDestroyMemoryPaths (driver=0x55555597ac50, vm=0x5555559e0790, mem=0x0, build=true) at ../../src/qemu/qemu_process.c:3864
#1  0x00007fffad70d887 in qemuProcessReconnect (opaque=<optimized out>) at ../../src/qemu/qemu_process.c:8056
#2  0x00007ffff74abb8a in virThreadHelper (data=<optimized out>) at ../../src/util/virthread.c:196
#3  0x00007ffff383814a in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007ffff3567dc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) finish
Run till exit from #0  qemuProcessBuildDestroyMemoryPaths (driver=0x55555597ac50, vm=0x5555559e0790, mem=0x0, build=true) at ../../src/qemu/qemu_process.c:3866
[Detaching after fork from child process 1457]
2022-08-30 10:54:30.573+0000: 1447: info : libvirt version: 6.0.0, package: 35.2.module+el8.4.0+14226+d39fa4ab (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2022-02-14-11:50:49, )
2022-08-30 10:54:30.573+0000: 1447: info : hostname: rhel84.lab.toca.local
2022-08-30 10:54:30.573+0000: 1447: error : virProcessRunInFork:1161 : internal error: child reported (status=125): unable to set security context 'system_u:object_r:svirt_image_t:s0:c439,c812' on '/dev/hugepages2M/libvirt/qemu/1-testcom1': No such file or directory
0x00007fffad70d887 in qemuProcessReconnect (opaque=<optimized out>) at ../../src/qemu/qemu_process.c:8056
8056        if (qemuProcessBuildDestroyMemoryPaths(driver, obj, NULL, true) < 0)
Value returned is $1 = -1

Due to the -1, it takes 'goto error' and ends up in qemuProcessStop(), killing the VM.

8056        if (qemuProcessBuildDestroyMemoryPaths(driver, obj, NULL, true) < 0)
8057            goto error;

(gdb) n
2842        return dom->def->id != -1;
(gdb)
8188            if (!priv->mon && tryMonReconn &&
(gdb)
8191            else if (priv->mon)
(gdb)
8200            qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);        <-------- VM is killed

The error above sounds related, but I found that it also happens on 6.6.0 (8.3AV), which does not kill the VM.
Just FYI, the error stack is as follows:

#0  virProcessRunInFork (cb=0x7ffff748e170 <virProcessNamespaceHelper>, opaque=0x7fffa4886630) at ../../src/util/virprocess.c:1129
#1  0x00007ffff748fce4 in virProcessRunInMountNamespace (pid=pid@entry=1395, cb=cb@entry=0x7ffff7567b10 <virSecuritySELinuxTransactionRun>,
    opaque=opaque@entry=0x7fff98004240) at ../../src/util/virprocess.c:1083
#2  0x00007ffff7567e6a in virSecuritySELinuxTransactionCommit (mgr=<optimized out>, pid=1395, lock=<optimized out>) at ../../src/security/security_selinux.c:1172
#3  0x00007ffff755f347 in virSecurityManagerTransactionCommit (mgr=0x55555599c520, pid=pid@entry=1395, lock=lock@entry=true)
    at ../../src/security/security_manager.c:299
#4  0x00007ffff755b276 in virSecurityStackTransactionCommit (mgr=<optimized out>, pid=1395, lock=<optimized out>) at ../../src/security/security_stack.c:174
#5  0x00007ffff755f347 in virSecurityManagerTransactionCommit (mgr=0x555555980ac0, pid=pid@entry=1395, lock=<optimized out>)
    at ../../src/security/security_manager.c:299
#6  0x00007fffad77a9bf in qemuSecurityDomainSetPathLabel (driver=driver@entry=0x55555597ace0, vm=vm@entry=0x5555559e0700,
    path=path@entry=0x7fff98004270 "/dev/hugepages2M/libvirt/qemu/1-testcom1", allowSubtree=allowSubtree@entry=true) at ../../src/qemu/qemu_security.c:599
#7  0x00007fffad702084 in qemuProcessBuildDestroyMemoryPathsImpl (driver=0x55555597ace0, vm=0x5555559e0700,
    path=0x7fff98004270 "/dev/hugepages2M/libvirt/qemu/1-testcom1", build=<optimized out>) at ../../src/qemu/qemu_process.c:3848
#8  0x00007fffad704871 in qemuProcessBuildDestroyMemoryPaths (driver=0x55555597ace0, vm=0x5555559e0700, mem=<optimized out>, build=true)
    at ../../src/qemu/qemu_process.c:3884
#9  0x00007fffad70d887 in qemuProcessReconnect (opaque=<optimized out>) at ../../src/qemu/qemu_process.c:8056
#10 0x00007ffff74abb8a in virThreadHelper (data=<optimized out>) at ../../src/util/virthread.c:196
#11 0x00007ffff383814a in start_thread (arg=<optimized out>) at pthread_create.c:479
#12 0x00007ffff3567dc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

However, on 6.6.0 the function does not return -1 even with that error; it returns 0. So it does not 'goto error' and kill the VM, but continues the reconnect. See:

2022-08-30 11:18:17.739+0000: 1489: info : libvirt version: 6.6.0, package: 13.2.module+el8.3.1+10483+85317cf0 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2021-03-25-11:36:48, )
2022-08-30 11:18:17.739+0000: 1489: info : hostname: rhel84.lab.toca.local
2022-08-30 11:18:17.739+0000: 1489: warning : virSecurityDACSetOwnership:839 : Unable to restore label on '/dev/hugepages2M/libvirt/qemu/1-testcom1'. XATTRs might have been left in inconsistent state.
2022-08-30 11:18:17.739+0000: 1485: error : virProcessRunInFork:1254 : internal error: child reported (status=125): unable to stat: /dev/hugepages2M/libvirt/qemu/1-testcom1: No such file or directory
2022-08-30 11:18:17.739+0000: 1485: error : virProcessRunInFork:1256 : unable to stat: /dev/hugepages2M/libvirt/qemu/1-testcom1: No such file or directory
[Detaching after fork from child process 1490]
0x00007fffade9b48d in qemuProcessReconnect (opaque=<optimized out>) at ../../src/qemu/qemu_process.c:8217
8217        if (qemuProcessBuildDestroyMemoryPaths(driver, obj, NULL, true) < 0)
Value returned is $1 = 0

So it looks like it's a problem that was already fixed between 6.0.0 and 6.6.0 (8.3AV), but I could not find the patch that fixes it.

Version-Release number of selected component (if applicable):
6.0.0-35.2.module+el8.4.0+14226+d39fa4ab

How reproducible:
100%

Steps to Reproduce:
1. Set up a VM to use 1G hugepages

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-305.57.1.el8_4.x86_64 root=/dev/mapper/rhel-root ro resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=2

  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB'/>
    </hugepages>
  </memoryBacking>

2. Create systemd mount for 2M hugepages

# cat /usr/lib/systemd/system/dev-hugepages2M.mount
[Unit]
Description=Huge Pages File System
Documentation=https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
Documentation=http://www.freedesktop.org/wiki/Software/systemd/APIFileSystems
DefaultDependencies=no
Before=sysinit.target
ConditionPathExists=/sys/kernel/mm/hugepages
ConditionCapability=CAP_SYS_ADMIN
[Mount]
What=hugetlbfs
Where=/dev/hugepages2M
Type=hugetlbfs
Options=pagesize=2M

3. Reboot to reserve the 1G pages at boot time and for systemd to pick up the mount unit (or reserve the pages dynamically and reload systemd, as sketched below)
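
For reference, a minimal sketch of the dynamic alternative mentioned in this step (the sysfs path and the page count of 2 match values used elsewhere in this report; "reload systemd" is read here as reloading the unit files so the new mount unit is known before step 5). Runtime allocation of 1G pages can fail on a fragmented host, which is why the boot-time reservation is the primary suggestion:

# echo 2 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# systemctl daemon-reload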

4. Start the VM and stop libvirt

# virsh start testvm && systemctl stop libvirtd
Domain testvm started

5. Activate the 2M mount:
# systemctl start dev-hugepages2M.mount

6. Do virsh list (this will start libvirt via socket activation)

# virsh -r list --all
 Id   Name       State
--------------------------
 1    testvm     running

7. But the VM is killed on libvirt start; run virsh list again and it will already be gone

# virsh -r list --all
 Id   Name       State
---------------------------
 -    testvm     shut off

Actual results:
* VM dead on libvirt restart

Expected results:
* VM running

--- Additional comment from Michal Privoznik on 2022-09-23 16:43:02 CEST ---

Merged upstream:

0377177c78 qemu_process.c: Propagate hugetlbfs mounts on reconnect
5853d70718 qemu_namespace: Introduce qemuDomainNamespaceSetupPath()
46b03819ae qemu_namespace: Fix a corner case in qemuDomainGetPreservedMounts()
687374959e qemu_namespace: Tolerate missing ACLs when creating a path in namespace

v8.7.0-134-g0377177c78

Comment 5 Michal Privoznik 2022-12-09 08:58:22 UTC
Merged upstream:

0377177c78 qemu_process.c: Propagate hugetlbfs mounts on reconnect
5853d70718 qemu_namespace: Introduce qemuDomainNamespaceSetupPath()
46b03819ae qemu_namespace: Fix a corner case in qemuDomainGetPreservedMounts()
687374959e qemu_namespace: Tolerate missing ACLs when creating a path in namespace

v8.7.0-134-g0377177c78

And one follow up:

3478cca80e (tag: v8.8.0-rc2) qemuProcessReconnect: Don't build memory paths
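
To check whether a given libvirt tree already carries these commits, something like the following should work (a sketch; it assumes you are inside a local clone of upstream libvirt and uses the commit hashes listed above):

# git tag --contains 0377177c78 | sort -V | head -1
# git tag --contains 3478cca80e | sort -V | head -1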

Comment 6 liang cong 2022-12-16 02:31:38 UTC
Hi Michal, I found an issue when preverifying on upstream build libvirt v8.10.0-126-g8908615ef3. Since the current upstream build is two months newer than your commit, I am not sure whether other patches affect the code; please help to clarify, thanks.

Verify steps:
1. Prepare huge page memory:
# echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# echo 2048 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

2. Define a guest with below memorybacking xml.
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB'/>
    </hugepages>
  </memoryBacking>

3. Start the VM and stop virtqemud
# virsh start vm1 && systemctl stop virtqemud
Domain 'vm1' started

Warning: Stopping virtqemud.service, but it can still be activated by:
  virtqemud-admin.socket
  virtqemud.socket
  virtqemud-ro.socket

4. Mount 1G hugepage path
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G


5. Do virsh list and guest still in running state.

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

6. Prepare memory device hotplug xml like below:
# cat dimm1G.xml 
<memory model='dimm'>
    <source>
      <pagesize unit='KiB'>1048576</pagesize>
      <nodemask>0-1</nodemask>
    </source>
    <target>
      <size unit='KiB'>1048576</size>
      <node>0</node>
    </target>
  </memory>


7. Hotplug dimm memory device:
# virsh attach-device vm1 dimm1G.xml 
Device attached successfully

8. Prepare memory device with 2M hugepage source hotplug xml like below:
# cat dimm2M.xml 
<memory model='dimm'>
    <source>
      <pagesize unit='KiB'>2048</pagesize>
      <nodemask>0-1</nodemask>
    </source>
    <target>
      <size unit='KiB'>1048576</size>
      <node>0</node>
    </target>
  </memory>

9. Hotplug dimm memory device:
# virsh attach-device vm1 dimm1G.xml 
error: Failed to attach device from dimm1G.xml
error: internal error: unable to execute QEMU command 'object-add': can't open backing store /dev/hugepages1G/libvirt/qemu/1-vm1 for guest RAM: No such file or directory

Comment 8 liang cong 2022-12-19 06:17:00 UTC
Hi Michal, I found an issue when preverifying on scratch build:
libvirt-8.10.0-3.el9_rc.655748269d.x86_64

Verify steps:
1. Prepare huge page memory:
# echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# echo 3072 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

2. Define a guest with below memorybacking xml.
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB'/>
    </hugepages>
  </memoryBacking>

3. Start the VM and stop virtqemud
# virsh start vm1 && systemctl stop virtqemud
Domain 'vm1' started

Warning: Stopping virtqemud.service, but it can still be activated by:
  virtqemud-admin.socket
  virtqemud.socket
  virtqemud-ro.socket

4. Mount 1G hugepage path
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G


5. Do virsh list and guest still in running state.

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

6. Prepare memory device hotplug xml like below:
# cat dimm1G.xml 
<memory model='dimm'>
    <source>
      <pagesize unit='KiB'>1048576</pagesize>
      <nodemask>0-1</nodemask>
    </source>
    <target>
      <size unit='KiB'>1048576</size>
      <node>0</node>
    </target>
  </memory>

7. Hotplug dimm memory device:
# virsh attach-device vm1 dimm1G.xml 
Device attached successfully

8. Prepare memory device with 2M hugepage source hotplug xml like below:
# cat dimm2M.xml 
<memory model='dimm'>
    <source>
      <pagesize unit='KiB'>2048</pagesize>
      <nodemask>0-1</nodemask>
    </source>
    <target>
      <size unit='KiB'>1048576</size>
      <node>0</node>
    </target>
  </memory>

9. Hotplug dimm memory device:
# virsh attach-device vm1 dimm2M.xml 
Device attached successfully

10. Shut off guest vm
# virsh destroy vm1
Domain 'vm1' destroyed

11. Restart virtqemud
# systemctl restart virtqemud

12. Start guest vm again
# virsh start vm1
error: Failed to start domain 'vm1'
error: internal error: Process exited prior to exec: libvirt: QEMU Driver error : failed to umount devfs on /dev: Device or resource busy


Additional info: I didn't find this issue during preverification of bug#2152083 and bug#2152084

Comment 9 Michal Privoznik 2022-12-19 12:04:32 UTC
> Additional info: I didn't find this issue during preverification of
> bug#2152083 and bug#2152084

Yeah, that is older RHEL (9.0.0 and 9.1.0). Similarly, this worked for RHEL-8 (bug 2123196). I wonder what has changed. Let me debug. Meanwhile - what's your kernel version and output of 'mount' please?

Comment 10 liang cong 2022-12-20 01:42:37 UTC
(In reply to Michal Privoznik from comment #9)
> > Additional info: I didn't find this issue during preverification of
> > bug#2152083 and bug#2152084
> 
> Yeah, that is older RHEL (9.0.0 and 9.1.0). Similarly, this worked for
> RHEL-8 (bug 2123196). I wonder what has changed. Let me debug. Meanwhile -
> what's your kernel version and output of 'mount' please?

kernel version: 5.14.0-200.el9.x86_64

I retested on build libvirt-8.10.0-3.el9_rc.655748269d.x86_64, qemu-kvm-7.1.0-6.el9.x86_64; the mount info is below:

1. Prepare huge page memory:
# echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# echo 3072 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

2. Define a guest with below memorybacking xml.
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB'/>
    </hugepages>
  </memoryBacking>

3. Start the VM and stop virtqemud
# virsh start vm1 && systemctl stop virtqemud
Domain 'vm1' started

Warning: Stopping virtqemud.service, but it can still be activated by:
  virtqemud-admin.socket
  virtqemud.socket
  virtqemud-ro.socket

4. Mount 1G hugepage path
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G


5. Do virsh list and guest still in running state.

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

6. check mount info:
# mount | grep dev
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=4096k,nr_inodes=1048576,mode=755,inode64)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,size=13059680k,nr_inodes=819200,mode=755,inode64)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime,seclabel)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
/dev/mapper/rhel_dell--per740xd--16-root on / type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime,seclabel)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime,seclabel)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel,pagesize=2M)
none on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
/dev/mapper/rhel_dell--per740xd--16-home on /home type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/sda1 on /boot type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,seclabel,size=6529836k,nr_inodes=1632459,mode=700,inode64)
hugetlbfs on /dev/hugepages1G type hugetlbfs (rw,relatime,seclabel,pagesize=1024M)

7. Check mount namespace of qemu
# cat /proc/`pidof qemu-kvm`/mountinfo | grep dev
436 435 253:0 / / rw,relatime master:1 - xfs /dev/mapper/rhel_dell--per740xd--16-root rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
437 436 0:22 / /sys rw,nosuid,nodev,noexec,relatime master:2 - sysfs sysfs rw,seclabel
438 437 0:6 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:3 - securityfs securityfs rw
439 437 0:26 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime master:4 - cgroup2 cgroup2 rw,seclabel,nsdelegate,memory_recursiveprot
440 437 0:27 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:5 - pstore pstore rw,seclabel
441 437 0:28 / /sys/fs/bpf rw,nosuid,nodev,noexec,relatime master:6 - bpf bpf rw,mode=700
443 437 0:7 / /sys/kernel/debug rw,nosuid,nodev,noexec,relatime master:14 - debugfs debugfs rw,seclabel
444 437 0:12 / /sys/kernel/tracing rw,nosuid,nodev,noexec,relatime master:17 - tracefs tracefs rw,seclabel
445 437 0:32 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime master:18 - fusectl fusectl rw
446 437 0:33 / /sys/kernel/config rw,nosuid,nodev,noexec,relatime master:19 - configfs configfs rw
448 460 0:23 / /dev/shm rw,nosuid,nodev master:9 - tmpfs tmpfs rw,seclabel,inode64
449 460 0:24 / /dev/pts rw,nosuid,noexec,relatime master:10 - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=000
450 460 0:19 / /dev/mqueue rw,nosuid,nodev,noexec,relatime master:15 - mqueue mqueue rw,seclabel
451 460 0:31 / /dev/hugepages rw,relatime master:16 - hugetlbfs hugetlbfs rw,seclabel,pagesize=2M
452 436 0:25 / /run rw,nosuid,nodev master:11 - tmpfs tmpfs rw,seclabel,size=13059680k,nr_inodes=819200,mode=755,inode64
453 452 0:34 / /run/credentials/systemd-sysusers.service ro,nosuid,nodev,noexec,relatime master:20 - ramfs none rw,mode=700
454 452 0:39 / /run/user/0 rw,nosuid,nodev,relatime master:192 - tmpfs tmpfs rw,seclabel,size=6529836k,nr_inodes=1632459,mode=700,inode64
455 436 0:21 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
457 436 253:2 / /home rw,relatime master:39 - xfs /dev/mapper/rhel_dell--per740xd--16-home rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
458 436 8:1 / /boot rw,relatime master:45 - xfs /dev/sda1 rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
460 436 0:41 / /dev rw,nosuid,relatime - tmpfs devfs rw,seclabel,size=64k,mode=755,inode64


8. Prepare memory device hotplug xml like below:
# cat dimm1G.xml 
<memory model='dimm'>
    <source>
      <pagesize unit='KiB'>1048576</pagesize>
      <nodemask>0-1</nodemask>
    </source>
    <target>
      <size unit='KiB'>1048576</size>
      <node>0</node>
    </target>
  </memory>

9. Hotplug dimm memory device:
# virsh attach-device vm1 dimm1G.xml 
error: Failed to attach device from dimm1G.xml
error: internal error: unable to execute QEMU command 'object-add': can't open backing store /dev/hugepages1G/libvirt/qemu/1-vm1 for guest RAM: No such file or directory


It seems the mount is not propagated to the qemu process. I will try on the latest rhel9.2 to see if anything changes.

Comment 11 liang cong 2022-12-20 07:44:52 UTC
I tested on build:
# rpm -q libvirt qemu-kvm
libvirt-8.10.0-3.el9_rc.655748269d.x86_64
qemu-kvm-7.1.0-6.el9.x86_64

kernel: 5.14.0-212.el9.x86_64

I still get the same issue as in comment#10.

Comment 12 Michal Privoznik 2022-12-20 10:24:28 UTC
Indeed. There's something terribly broken (in the kernel, perhaps?):

kernel-5.14.0-212.el9.x86_64
systemd-252-2.el9.x86_64

1) just to make sure mount events are being propagated:

# mount --make-rshared /

2) check hugetlbfs mounts:

# mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel,pagesize=2M)

3) in another terminal, unshare mount namespace:

# unshare -m

4) verify that the hugetlbfs is mounted (from inside the NS):

# mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel,pagesize=2M)

5) now, mount another hugetlbfs (from the parent NS, NOT the one just unshared):

# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G

6) check whether the mount got propagated (from the NS) - note that the new 1G mount does NOT show up:

# mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel,pagesize=2M)


Therefore, this is completely independent of libvirt and should be reported against the kernel for further investigation. Then we can revisit our bug.
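
As a complementary check of the propagation flags that step 1 adjusts, findmnt (from util-linux) can print them directly; a sketch:

# findmnt -o TARGET,PROPAGATION /
# findmnt -o TARGET,PROPAGATION /dev/hugepages

One caveat with the reproducer above: util-linux's unshare sets mount propagation to private in the new namespace by default, so "unshare -m --propagation slave" (or shared) may be needed for the test to say anything about propagation from the parent namespace.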

Comment 15 Michal Privoznik 2023-01-05 12:02:24 UTC
Alright, after thorough investigation I've merged two commits that fix the problem:

4a91324b61 qemu_namespace: Fix detection of nested mount points
379c0ce4bf qemu_namespace: Umount the original /dev before replacing it with tmpfs

v8.10.0-174-g4a91324b61

Comment 16 liang cong 2023-01-09 07:22:22 UTC
Preverified on upstream build libvirt v8.10.0-176-g6cd2b4e101,
# rpm -q qemu-kvm
qemu-kvm-7.2.0-1.fc38.x86_64


Verify steps:
1. Prepare huge page memory:
# echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# echo 3072 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

2. Define a guest with below memorybacking xml.
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB'/>
    </hugepages>
  </memoryBacking>

3. Start the VM and stop virtqemud
# virsh start vm1 && systemctl stop virtqemud
Domain 'vm1' started

Warning: Stopping virtqemud.service, but it can still be activated by:
  virtqemud-ro.socket
  virtqemud-admin.socket
  virtqemud.socket


4. Mount 1G hugepage path
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G


5. Do virsh list and guest still in running state.

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

6. Prepare memory device hotplug xml like below:
# cat dimm1G.xml 
<memory model='dimm'>
    <source>
      <pagesize unit='KiB'>1048576</pagesize>
      <nodemask>0-1</nodemask>
    </source>
    <target>
      <size unit='KiB'>1048576</size>
      <node>0</node>
    </target>
  </memory>


7. Hotplug dimm memory device:
# virsh attach-device vm1 dimm1G.xml 
Device attached successfully

8. Prepare memory device with 2M hugepage source hotplug xml like below:
# cat dimm2M.xml 
<memory model='dimm'>
    <source>
      <pagesize unit='KiB'>2048</pagesize>
      <nodemask>0-1</nodemask>
    </source>
    <target>
      <size unit='KiB'>1048576</size>
      <node>0</node>
    </target>
  </memory>

9. Hotplug dimm memory device:
# virsh attach-device vm1 dimm2M.xml 
Device attached successfully


10. Shutoff vm
# virsh destroy vm1
Domain vm1 destroyed


11. Restart virtqemud
# systemctl restart virtqemud

12. Start vm
# virsh start vm1
Domain 'vm1' started


Also check the below scenarios:
Steps:
1. memory backing 2M guest vm start -> stop virtqemud -> mount 1G path -> start virtqemud -> hotplug 1G dimm -> restart vm -> restart virtqemud -> hotplug 1G dimm
2. mount 1G path -> memory backing 2M guest vm start -> restart virtqemud -> hotplug 1G dimm -> restart virtqemud -> restart vm -> hotplug 1G dimm

Tested with these settings: remember_owner=1 or 0, memfd memory backing, default memory backing, 1G hugepage memory backing, 1G hugepage path as /mnt/hugepages1G

Comment 17 liang cong 2023-01-09 07:26:41 UTC
(In reply to Michal Privoznik from comment #15)
> Alright, after thorough investigation I've merged two commits that fix the
> problem:
> 
> 4a91324b61 qemu_namespace: Fix detection of nested mount points
> 379c0ce4bf qemu_namespace: Umount the original /dev before replacing it with
> tmpfs
> 
> v8.10.0-174-g4a91324b61

Hi Michal,
Shall we backport these 2 patches to the rhel9.1, rhel9.0 or rhel8.6, rhel8.7, rhel8.8 builds?
And could you help clarify why I did not see the issue fixed by these 2 patches on the rhel9.1 and rhel9.0 zstream builds?


Thx a lot.

Comment 18 Michal Privoznik 2023-01-17 09:23:37 UTC
(In reply to liang cong from comment #17)
> Hi michal, 
> Shall we backport these 2 patches to rhel9.1, rhel9.0 or rhel8.6, rhel8.7,
> rhel8.8 builds?

I'd rather avoid that. This was a very marginal scenario to begin with, and the bug that those two commits fix is just a portion of that scenario. I think it's safe to assume nobody will run into that situation.

> And could you help to clarify why I did not see the issue fixed by these 2
> patches on rhel9.1 and rhel9.0 zstream build?

Because of how things work under the hood. What the first patch fixes is the following: previously, libvirt would fail to see that /dev/hugepages and /dev/hugepages1G are two different dirs, because it used a plain prefix comparison. And yes, the former is a prefix of the latter. If you used different dirs, e.g. /dev/hugepages and /dev/myhugepages1G, then everything would work even without those two patches.
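
As an illustration of that comparison (a sketch only, not the actual libvirt code): a plain prefix test cannot tell the two mount points apart, while additionally requiring a path separator (or an exact match) after the prefix can:

# parent=/dev/hugepages; path=/dev/hugepages1G
# case $path in "$parent"*) echo nested;; *) echo separate;; esac
nested            <-------- prefix-only comparison treats the 1G mount as nested
# case $path in "$parent"/*|"$parent") echo nested;; *) echo separate;; esac
separate          <-------- requiring a '/' (or exact match) after the prefix tells them apart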

Having said all of this, I think we can agree that this is a very rare situation (and we even document that users should set up their mount points upfront). Therefore, I think we don't need any more backports.

Comment 22 liang cong 2023-02-02 06:54:13 UTC
Verified on build libvirt-9.0.0-3.el9.x86_64

Verify steps:
1. Prepare huge page memory:
# echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# echo 3072 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

2. Define a guest with below memorybacking xml.
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB'/>
    </hugepages>
  </memoryBacking>

3. Start the VM and stop virtqemud
# virsh start vm1 && systemctl stop virtqemud
Domain 'vm1' started

Warning: Stopping virtqemud.service, but it can still be activated by:
  virtqemud-admin.socket
  virtqemud.socket
  virtqemud-ro.socket


4. Mount 1G hugepage path
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G


5. Do virsh list and guest still in running state.

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

# virsh -r list --all
 Id   Name   State
----------------------
 1    vm1    running

6. Prepare memory device hotplug xml like below:
# cat dimm1G.xml 
<memory model='dimm'>
    <source>
      <pagesize unit='KiB'>1048576</pagesize>
      <nodemask>0-1</nodemask>
    </source>
    <target>
      <size unit='KiB'>1048576</size>
      <node>0</node>
    </target>
  </memory>


7. Hotplug dimm memory device:
# virsh attach-device vm1 dimm1G.xml 
Device attached successfully

8. Prepare memory device with 2M hugepage source hotplug xml like below:
# cat dimm2M.xml 
<memory model='dimm'>
    <source>
      <pagesize unit='KiB'>2048</pagesize>
      <nodemask>0-1</nodemask>
    </source>
    <target>
      <size unit='KiB'>1048576</size>
      <node>0</node>
    </target>
  </memory>

9. Hotplug dimm memory device:
# virsh attach-device vm1 dimm2M.xml 
Device attached successfully


10. Shutoff vm
# virsh destroy vm1
Domain vm1 destroyed


11. Restart virtqemud
# systemctl restart virtqemud

12. Start vm
# virsh start vm1
Domain 'vm1' started


Also check the below scenarios:
Steps:
1. memory backing 2M guest vm start -> stop virtqemud -> mount 1G path -> start virtqemud -> hotplug 1G dimm -> restart vm -> restart virtqemud -> hotplug 1G dimm
2. mount 1G path -> memory backing 2M guest vm start -> restart virtqemud -> hotplug 1G dimm -> restart virtqemud -> restart vm -> hotplug 1G dimm

Tested with these settings: remember_owner=1 or 0, memfd memory backing, default memory backing, 1G hugepage memory backing, 1G hugepage path as /mnt/hugepages1G

Comment 24 errata-xmlrpc 2023-05-09 07:27:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (libvirt bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2171

