Bug 2123196
| Summary: | libvirt kills virtual machine on restart when 2M and 1G hugepages are mounted | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
| Status: | CLOSED ERRATA | QA Contact: | liang cong <lcong> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 8.4 | CC: | ailan, duclee, dzheng, haizhao, jdenemar, jsuchane, lmen, mprivozn, virt-maint, yafu, yalzhang |
| Target Milestone: | rc | Keywords: | Triaged, Upstream, ZStream |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-8.0.0-11.module+el8.8.0+16835+1d966b61 | Doc Type: | Bug Fix |
| Doc Text: | Cause: When libvirt is restarted after a hugetlbfs was mounted and a guest is running, libvirt tries to create a guest-specific path in the new hugetlbfs mount point. Because of a bug in the namespace code this fails, which results in the guest being killed by libvirt. Consequence: The guest is killed on libvirtd restart. Fix: Twofold. Firstly, the namespace code was fixed so that creating this guest-specific path now succeeds. Secondly, the creation is postponed until really needed (memory hotplug). Result: Guests can now survive libvirtd restart. | Story Points: | --- |
| Clone Of: | | | |
| : | 2132176 2132177 2132178 2151869 (view as bug list) | Environment: | |
| Last Closed: | 2023-05-16 08:16:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | 8.8.0 |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2132176, 2132177, 2132178, 2151869 | | |
Description
Germano Veit Michel
2022-09-01 00:56:08 UTC
This is a tricky one. With newer libvirt it basically works due to an accident. Let me explain what is going on and why new libvirt "works".

There were plenty of problems on a layer between udev and libvirt. The former (udev) is responsible for creating /dev/* nodes on device hotplug/hotunplug/config change, and the latter (libvirt) needs to set correct seclabels (DAC + SELinux) on those nodes so that unprivileged QEMU can access them. This friction was 'producing more heat' the moment users/sysadmins started to write their own udev rules which clashed with libvirt-set seclabels, effectively cutting a running QEMU process off. Here at libvirt we decided to create a private mount namespace for each QEMU and replace the system /dev with a private, libvirt-managed one. And to enhance security, only those nodes that QEMU needs are exposed there. Moreover, they are dynamically created on device hotplug to / hotunplug from QEMU (again, by libvirt). Because of the enhanced security, this feature is automatically turned on, but it can be overridden in qemu.conf (namespaces=[]).

Now, libvirt has to be careful to replace just the devtmpfs mounted on /dev. The rest of the mount table has to be kept verbatim (e.g. because of disks which may be on NFS or basically anything else). And you may already see the problem. When the guest is started for the first time, its mount table contains just the 1GiB hugetlbfs mount point. Then they stop libvirtd and mount another hugetlbfs. When libvirtd is started again it wants to create the per-guest path in that new mount point as well, but:

1) it's doing so in the host namespace (i.e. the namespace where libvirtd is running), while it should have done so in QEMU's namespace! But even if it wanted to do that, it couldn't, because

2) the mount event of the other hugetlbfs (2MiB) is not propagated into QEMU's namespace.

The reason for 2) is that mount event propagation is done outside of libvirt's control (in the kernel), and thus a sysadmin might mangle QEMU's private /dev by remounting /dev in the top-level namespace. The 1) is clearly a bug, but trying to create the path in QEMU's namespace is equally wrong. The ideal solution here would be to have the sysadmin set up the mount table upfront and start VMs only afterwards. There's no harm in having a hugetlbfs mounted with an empty pool (= no hugepages reserved).

Meanwhile, let me investigate why we need to call qemuProcessBuildDestroyMemoryPaths() from qemuProcessReconnect() in the first place. I mean, we are just reconnecting to a previously started QEMU process, no new paths need to be created. It made sense when I introduced the per-domain location back in 2016 (https://gitlab.com/libvirt/libvirt/-/commit/f55afd83b13) so that the paths are created on libvirtd upgrade, but that's not the case anymore. This won't solve the other issue at hand - even if libvirtd restarted successfully, they still wouldn't be able to use that new 2MiB hugetlbfs, because it doesn't exist in QEMU's namespace.

And as promised, there's not a single patch that 'fixed' this behaviour upstream. It's my seclabel remembering work that has a flaw and simply ignores nonexistent paths. If you'd set remember_owner=0 in qemu.conf then you'd get the same behaviour no matter the version.

Note to QE: it doesn't really matter which hugetlbfs comes first. I can even reproduce without a single 1GiB HP allocated. Just start a guest with 2MiB HPs, then mount hugetlbfs pagesize=1GB and restart libvirtd.

Thanks for the explanation.
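For reference, the knobs mentioned above live in /etc/libvirt/qemu.conf, and a domain's private /dev can be inspected from the host. A rough sketch, assuming a single running QEMU process (the binary name varies: qemu-system-x86_64 on Fedora, qemu-kvm on RHEL):

# grep -E 'namespaces|remember_owner' /etc/libvirt/qemu.conf
# nsenter -t "$(pidof qemu-system-x86_64)" -m -- ls /dev
# nsenter -t "$(pidof qemu-system-x86_64)" -m -- findmnt -t hugetlbfs

The grep shows whether the defaults (private mount namespace enabled, seclabel remembering enabled) have been overridden; the nsenter commands show which nodes and hugetlbfs mounts actually exist in the guest's private namespace.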
That build=true when called from reconnect looks a bit suspicious to me, but this is the first time I read this code...

(In reply to Germano Veit Michel from comment #4)
> That build=true when called from reconnect looks a bit suspicious to me, but
> this is the first time I read this code...

Ignore this, it would try to find it and delete it anyway.

After thinking about this more, I think this is a misconfiguration problem. Here's why: when libvirt creates the private namespace it also marks it as 'slave' (equivalent to 'mount --make-rslave'). The reason is, we indeed need mounts/umounts to propagate into the namespace (I know I said otherwise in the previous comment). Just do a thought experiment - imagine a domain that is already running, then an NFS is mounted and a disk hotplug is attempted. We need the NFS mount to propagate, otherwise QEMU wouldn't see the disk on NFS. And maybe my memory is failing me, but IIRC the / used to be shared.

Also, there's one problem with hugetlbfs placement: if it's directly under /dev (like in the description - /dev/hugepages2M) then this won't get propagated, because the path doesn't exist at the time when the QEMU process is started, and even if it did exist, QEMU is not configured to access it, thus libvirt doesn't create the path in the namespace, and as a result the later mount is not propagated, because the mount point does not exist in the child namespace. Using any other location (e.g. /hugepages2M) works just fine.

Having said all of this, I believe that when the following steps are taken the bug goes away:
1) Run 'mount --make-rshared' before starting any guest,
2) change the location of hugepages in the systemd unit file so that it's not under /dev.

Now, there is a real bug still: libvirt tries to create a domain private path in all hugetlbfs mount points even when not needed. I wanted to suggest using the memfd backend but hit just this bug. I'll post a patch shortly. But the advantage of memfd is that it doesn't need any hugetlbfs mount points (a minimal sketch of such a configuration is shown below). I'll also post another patch that documents the aforementioned reasoning.

reproduced on:
libvirt-6.0.0-35.2.module+el8.4.0+14226+d39fa4ab.x86_64
qemu-kvm-4.2.0-49.module+el8.4.0+16539+22b18146.9.x86_64
with the same steps as the description.

test on:
libvirt-6.0.0-35.2.module+el8.4.0+14226+d39fa4ab.x86_64
qemu-kvm-4.2.0-49.module+el8.4.0+16539+22b18146.9.x86_64
before "4. Start the VM and stop libvirt", run the command "mount --make-rshared -t hugetlbfs -o pagesize=2M hugetlbfs /dev/hugepages2M"; this works around the problem.

test on:
libvirt-6.0.0-35.2.module+el8.4.0+14226+d39fa4ab.x86_64
qemu-kvm-4.2.0-49.module+el8.4.0+16539+22b18146.9.x86_64
run the command "mount -t hugetlbfs -o pagesize=2M hugetlbfs /mnt/hugepages2M" to mount at /mnt/hugepages2M, not under /dev; this also works around the problem.

Some follow up patches: https://listman.redhat.com/archives/libvir-list/2022-September/234290.html

test on:
libvirt-6.0.0-35.2.module+el8.4.0+14226+d39fa4ab.x86_64
qemu-kvm-4.2.0-49.module+el8.4.0+16539+22b18146.9.x86_64
before "4. Start the VM and stop libvirt", use the common mount command "mount -t hugetlbfs -o pagesize=2M hugetlbfs /dev/hugepages2M"; this also works around the problem.

Merged upstream:

0377177c78 qemu_process.c: Propagate hugetlbfs mounts on reconnect
5853d70718 qemu_namespace: Introduce qemuDomainNamespaceSetupPath()
46b03819ae qemu_namespace: Fix a corner case in qemuDomainGetPreservedMounts()
687374959e qemu_namespace: Tolerate missing ACLs when creating a path in namespace

v8.7.0-134-g0377177c78
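For illustration, a minimal sketch of the memfd-backed configuration suggested above (the 2M page size is taken from the reproducer; hugepages are optional with memfd, and exact support depends on the QEMU version):

<memoryBacking>
  <source type='memfd'/>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
</memoryBacking>

With memfd, QEMU allocates guest RAM via memfd_create() (hugepage-backed if requested), so no hugetlbfs mount point needs to exist or be propagated into the domain's private namespace.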
Tested on:
libvirt-8.5.0-6.el9.x86_64
qemu-kvm-7.0.0-13.el9.x86_64
Scenario 1:
1. Define a guest with below memorybacking xml.
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
</memoryBacking>
2. Start the VM and stop libvirt
# virsh start vm1 && systemctl stop virtqemud
Domain 'vm1' started
Warning: Stopping virtqemud.service, but it can still be activated by:
virtqemud-admin.socket
virtqemud-ro.socket
virtqemud.socket
3. Mount 1G hugepage path
mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G
4. Start guest and stop virtqemud
# virsh start vm1 && systemctl stop virtqemud
Domain 'vm1' started
Warning: Stopping virtqemud.service, but it can still be activated by:
virtqemud-admin.socket
virtqemud-ro.socket
virtqemud.socket
5. Run virsh list; the guest is still in the running state.
# virsh -r list --all
Id Name State
----------------------
8 vm1 running
# virsh -r list --all
Id Name State
----------------------
8 vm1 running
6. Check the libvirt log; the errors below are found:
40' on '/dev/hugepages1G/libvirt/qemu/8-vm1': No such file or directory
2022-09-26 01:54:08.988+0000: 56052: info : virSecuritySELinuxSetFileconImpl:1252 : Setting SELinux context on '/dev/hugepages1G/libvirt/qemu/8-vm1' to 'system_u:object_r:svirt_image_t:s0:c378,c740'
2022-09-26 01:54:08.988+0000: 56052: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f448801d970
2022-09-26 01:54:08.992+0000: 56052: error : virProcessRunInFork:1361 : internal error: child reported (status=125): unable to stat: /dev/hugepages1G/libvirt/qemu/8-vm1: No such file or directory
2022-09-26 01:54:08.992+0000: 56052: error : virProcessRunInFork:1365 : unable to stat: /dev/hugepages1G/libvirt/qemu/8-vm1: No such file or directory
2022-09-26 01:54:08.992+0000: 56052: info : virSecurityDACSetOwnership:789 : Setting DAC user and group on '/dev/hugepages1G/libvirt/qemu/8-vm1' to '107:107'
Scenario 2:
1. Define a guest with below memorybacking xml.
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
</memoryBacking>
2. Start guest
# virsh start vm1
Domain 'vm1' started
3. Prepare 1G hugepage sized dimm xml like below:
# cat dimm.xml
<memory model='dimm'>
  <source>
    <pagesize unit='KiB'>1048576</pagesize>
    <nodemask>0-1</nodemask>
  </source>
  <target>
    <size unit='KiB'>1048576</size>
    <node>1</node>
  </target>
</memory>
4. Mount 1G hugepage path
mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G
5. Attach memory device with dimm xml from step 3
# virsh attach-device vm1 dimm.xml
error: Failed to attach device from dimm.xml
error: internal error: unable to execute QEMU command 'object-add': can't open backing store /dev/hugepages1G/libvirt/qemu/9-vm1 for guest RAM: No such file or directory
Hi Michal, I tested with the versions below:
libvirt v8.7.0-138-gfa2a7f888c
qemu-kvm-7.1.0-3.fc38.x86_64
Test steps:
1. Define a guest with below memorybacking xml.
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
</memoryBacking>
2. Start the VM and stop libvirt
# virsh start vm1 && systemctl stop virtqemud
Domain 'vm1' started
Warning: Stopping virtqemud.service, but it can still be activated by:
virtqemud-admin.socket
virtqemud-ro.socket
virtqemud.socket
3. Mount 1G hugepage path
mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G
4. Run virsh list; the guest is still in the running state.
# virsh -r list --all
Id Name State
----------------------
1 vm1 running
# virsh -r list --all
Id Name State
----------------------
1 vm1 running
# virsh -r list --all
Id Name State
----------------------
1 vm1 running
5. Check QEMU's mount namespace for /dev paths:
# cat /proc/`pidof qemu-system-x86_64`/mountinfo | grep ' /dev'
600 599 0:29 /root / rw,relatime master:1 - btrfs /dev/vda5 rw,seclabel,compress=zstd:1,space_cache,subvolid=256,subvol=/root
612 600 0:5 / /dev rw,nosuid master:8 - devtmpfs devtmpfs rw,seclabel,size=4096k,nr_inodes=1048576,mode=755,inode64
613 627 0:22 / /dev/shm rw,nosuid,nodev master:9 - tmpfs tmpfs rw,seclabel,inode64
614 627 0:23 / /dev/pts rw,nosuid,noexec,relatime master:10 - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=000
615 627 0:33 / /dev/hugepages rw,relatime master:14 - hugetlbfs hugetlbfs rw,seclabel,pagesize=2M
616 627 0:18 / /dev/mqueue rw,nosuid,nodev,noexec,relatime master:15 - mqueue mqueue rw,seclabel
624 600 0:29 /home /home rw,relatime master:45 - btrfs /dev/vda5 rw,seclabel,compress=zstd:1,space_cache,subvolid=258,subvol=/home
625 600 252:2 / /boot rw,relatime master:47 - ext4 /dev/vda2 rw,seclabel
626 625 252:3 / /boot/efi rw,relatime master:49 - vfat /dev/vda3 rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro
627 612 0:52 / /dev rw,nosuid,relatime - tmpfs devfs rw,seclabel,size=64k,mode=755,inode64
629 612 0:54 / /dev/hugepages1G rw,relatime master:326 - hugetlbfs hugetlbfs rw,seclabel,pagesize=1024M
665 627 0:54 /libvirt/qemu/1-vm1 /dev/hugepages1G/libvirt/qemu/1-vm1 rw,relatime master:326 - hugetlbfs hugetlbfs rw,seclabel,pagesize=1024M
I find there is one extra mount point for the 1G hugepages:
665 627 0:54 /libvirt/qemu/1-vm1 /dev/hugepages1G/libvirt/qemu/1-vm1 rw,relatime master:326 - hugetlbfs hugetlbfs rw,seclabel,pagesize=1024M
Is this what the patch is intended to do to fix this issue?
And why doesn't this mount point follow the same rule as the previous ones, such as /dev/hugepages?
(In reply to liang cong from comment #21)
> Hi Michal, I tested with the versions below:
> libvirt v8.7.0-138-gfa2a7f888c
> qemu-kvm-7.1.0-3.fc38.x86_64
>
> Test steps:
> 1. Define a guest with below memorybacking xml.
> <memoryBacking>
>   <hugepages>
>     <page size='2048' unit='KiB'/>
>   </hugepages>
> </memoryBacking>
>
> 2. Start the VM and stop libvirt
>
> # virsh start vm1 && systemctl stop virtqemud
> Domain 'vm1' started
>
> Warning: Stopping virtqemud.service, but it can still be activated by:
> virtqemud-admin.socket
> virtqemud-ro.socket
> virtqemud.socket
>
> 3. Mount 1G hugepage path
> mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G
>
> 4. Run virsh list; the guest is still in the running state.
>
> # virsh -r list --all
> Id Name State
> ----------------------
> 1 vm1 running
>
> # virsh -r list --all
> Id Name State
> ----------------------
> 1 vm1 running
>
> # virsh -r list --all
> Id Name State
> ----------------------
> 1 vm1 running
>
> 5. Check QEMU's mount namespace for /dev paths:
> # cat /proc/`pidof qemu-system-x86_64`/mountinfo | grep ' /dev'
> 600 599 0:29 /root / rw,relatime master:1 - btrfs /dev/vda5 rw,seclabel,compress=zstd:1,space_cache,subvolid=256,subvol=/root
> 612 600 0:5 / /dev rw,nosuid master:8 - devtmpfs devtmpfs rw,seclabel,size=4096k,nr_inodes=1048576,mode=755,inode64

You can see that the original devtmpfs is still mounted here, under /dev.

> 613 627 0:22 / /dev/shm rw,nosuid,nodev master:9 - tmpfs tmpfs rw,seclabel,inode64
> 614 627 0:23 / /dev/pts rw,nosuid,noexec,relatime master:10 - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=000
> 615 627 0:33 / /dev/hugepages rw,relatime master:14 - hugetlbfs hugetlbfs rw,seclabel,pagesize=2M
> 616 627 0:18 / /dev/mqueue rw,nosuid,nodev,noexec,relatime master:15 - mqueue mqueue rw,seclabel
> 624 600 0:29 /home /home rw,relatime master:45 - btrfs /dev/vda5 rw,seclabel,compress=zstd:1,space_cache,subvolid=258,subvol=/home
> 625 600 252:2 / /boot rw,relatime master:47 - ext4 /dev/vda2 rw,seclabel
> 626 625 252:3 / /boot/efi rw,relatime master:49 - vfat /dev/vda3 rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro
> 627 612 0:52 / /dev rw,nosuid,relatime - tmpfs devfs rw,seclabel,size=64k,mode=755,inode64

It's just that libvirt mounts a tmpfs over it; therefore, the original /dev is still kind of in the namespace, except not accessible to processes because of this tmpfs.

> 629 612 0:54 / /dev/hugepages1G rw,relatime master:326 - hugetlbfs hugetlbfs rw,seclabel,pagesize=1024M

Now, when systemd mounts /dev/hugepages1G this is propagated into that lower devtmpfs, as this line shows. However, it's not accessible to QEMU or any other process running inside the namespace, because there's still a tmpfs mounted on top of the original devtmpfs.

> 665 627 0:54 /libvirt/qemu/1-vm1 /dev/hugepages1G/libvirt/qemu/1-vm1 rw,relatime master:326 - hugetlbfs hugetlbfs rw,seclabel,pagesize=1024M

And this is the result of my fix - libvirt bind mounts the hugetlbfs into the namespace. This is a submount of the tmpfs and hence accessible to QEMU. While libvirt could umount /dev/hugepages1G, it's not necessary because it's not accessible to anything, and it's also just cosmetics. What libvirt could do (but then again, just cosmetics, it has no effect on QEMU) is to bind mount /dev/hugepages1G instead of the domain's private path /dev/hugepages1G/libvirt/qemu/1-vm1. But that would not bring anything new and would require a non-trivial amount of code rewrite.

I hope this clears up your concerns.
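For reference, whether that per-domain bind mount is actually reachable from inside the guest's mount namespace can be checked directly. A rough sketch, reusing the domain path from the transcript above and assuming a single QEMU process (the binary is qemu-system-x86_64 on Fedora, qemu-kvm on RHEL):

# nsenter -t "$(pidof qemu-system-x86_64)" -m -- findmnt /dev/hugepages1G/libvirt/qemu/1-vm1

If the bind mount created by the fix is in place, findmnt prints a hugetlbfs entry for that path, matching mountinfo line 665 above; without it, findmnt prints nothing because the path is not a mount point inside the namespace.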
Hi Michal, I found this error when testing this issue:
libvirt and qemu versions:
libvirt v8.7.0-138-gfa2a7f888c
qemu-kvm-7.1.0-3.fc38.x86_64
Test steps:
1. Define a guest with below memorybacking xml.
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
</memoryBacking>
2. Mount 1G hugepage path
mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G
3. Start the VM
# virsh start vm1
Domain 'vm1' started
4. Prepare memory device hotplug xml like below:
# cat dimm1G.xml
<memory model='dimm'>
  <source>
    <pagesize unit='KiB'>1048576</pagesize>
    <nodemask>0-1</nodemask>
  </source>
  <target>
    <size unit='KiB'>1048576</size>
    <node>1</node>
  </target>
</memory>
5. Hotplug dimm memory device:
# virsh attach-device vm1 dimm1G.xml
error: Failed to attach device from dimm1G.xml
error: internal error: unable to execute QEMU command 'object-add': can't open backing store /dev/hugepages1G/libvirt/qemu/2-vm1 for guest RAM: Permission denied
BTW, if the 1G hugepage path is mounted and then a guest with 1G hugepage memory backing is started, the same error occurs.
(In reply to liang cong from comment #23)
> Hi Michal, I found this error when testing this issue:

Hm. I can't reproduce. With these steps I get:

virsh # attach-device ble dimm.xml
error: Failed to attach device from dimm.xml
error: internal error: Unable to find any usable hugetlbfs mount for 1048576 KiB

because the hugetlbfs was mounted only after libvirtd/virtqemud was started. But after I restart the daemon I get:

virsh # attach-device ble dimm.xml
error: Reconnected to the hypervisor
Device attached successfully

Maybe I'm missing something? Also, can you please see whether there is an error message in audit.log that would correspond to this error?

BTW: I'm contemplating removing this code from the reconnect phase completely, because both the domain startup and the domain hotplug code now handle creating that domain private path (/dev/hugepages1G/...). I haven't posted a patch yet, because I want to test it first, but once I'm done I'll post it.

Alright, it passed my testing, so I've posted it onto the list: https://listman.redhat.com/archives/libvir-list/2022-September/234543.html
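In other words, when a new hugetlbfs is mounted on a host where the daemon is already running, the working sequence is roughly the following (commands taken from the earlier comments; virtqemud applies to the modular daemons, libvirtd to the monolithic one):

# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G
# systemctl restart virtqemud        (or: systemctl restart libvirtd)
# virsh attach-device vm1 dimm1G.xml

Restarting the daemon makes libvirt re-read the host's hugetlbfs mounts, so the subsequent hotplug can find a usable mount for the 1048576 KiB page size.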
Found an issue on:
# rpm -q libvirt qemu-kvm
libvirt-8.0.0-11.module+el8.8.0+16835+1d966b61.x86_64
qemu-kvm-6.2.0-21.module+el8.8.0+16781+9f4724c2.x86_64
1. Define a guest with below memorybacking xml.
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
</memoryBacking>
2. Mount 1G hugepage path
mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G
3. Start vm
# virsh start vm1
Domain vm1 started
4. Prepare memory device hotplug xml like below:
# cat dimm1G.xml
<memory model='dimm'>
  <source>
    <pagesize unit='KiB'>1048576</pagesize>
    <nodemask>0-1</nodemask>
  </source>
  <target>
    <size unit='KiB'>1048576</size>
    <node>0</node>
  </target>
</memory>
5. Attach the dimm memory device defined in step 4.
# virsh attach-device vm1 dimm1G.xml
Device attached successfully
6. Shutoff the vm
# virsh destroy vm1
Domain 'vm1' destroyed
7. umount and mount 1G hugepage path again
# umount /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G
8. start the vm
# virsh start vm1
Domain 'vm1' started
9. Attach the dimm memory device defined in step 4 again
# virsh attach-device vm1 dimm1G.xml
error: Failed to attach device from dimm1G.xml
error: internal error: unable to execute QEMU command 'object-add': can't open backing store /dev/hugepages1G/libvirt/qemu/3-vm1 for guest RAM: Permission denied
@mprivozn could you help to check this issue?
More info about comment#34: if libvirtd is restarted between step 7 and step 8, then the issue is gone. But I found one extra issue: if libvirtd is restarted after step 8, then no matter how many times libvirtd is restarted, an error like 'can't open backing store /dev/hugepages1G/libvirt/qemu/3-vm1 for guest RAM: Permission denied' always occurs.

Verified on build:
rpm -q libvirt qemu-kvm
libvirt-8.0.0-11.module+el8.8.0+16835+1d966b61.x86_64
qemu-kvm-6.2.0-22.module+el8.8.0+16816+1d3555ec.x86_64
Verify steps:
1. Define a guest with below memorybacking xml.
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
</memoryBacking>
2. Start the VM and stop libvirt
# virsh start vm1 && systemctl stop libvirtd
Domain vm1 started
Warning: Stopping libvirtd.service, but it can still be activated by:
libvirtd.socket
libvirtd-ro.socket
libvirtd-admin.socket
3. Mount 1G hugepage path
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G
4. Run virsh list; the guest is still in the running state.
# virsh -r list --all
Id Name State
----------------------
3 vm1 running
# virsh -r list --all
Id Name State
----------------------
3 vm1 running
5. Prepare memory device hotplug xml like below:
# cat dimm1G.xml
<memory model='dimm'>
  <source>
    <pagesize unit='KiB'>1048576</pagesize>
    <nodemask>0-1</nodemask>
  </source>
  <target>
    <size unit='KiB'>1048576</size>
    <node>0</node>
  </target>
</memory>
6. Hotplug dimm memory device:
# virsh attach-device vm1 dimm1G.xml
Device attached successfully
7. Prepare memory device with 2M hugepage source hotplug xml like below:
# cat dimm2M.xml
<memory model='dimm'>
  <source>
    <pagesize unit='KiB'>2048</pagesize>
    <nodemask>0-1</nodemask>
  </source>
  <target>
    <size unit='KiB'>1048576</size>
    <node>0</node>
  </target>
</memory>
8. Hotplug dimm memory device:
# virsh attach-device vm1 dimm2M.xml
Device attached successfully
9. Shutoff vm
# virsh destroy vm1
Domain vm1 destroyed
10. Restart libvirtd
# systemctl restart libvirtd
11. Start vm
# virsh start vm1
Domain 'vm1' started
Also checked the scenarios below:
Steps:
1. memory backing 2M guest vm start -> stop libvirt -> mount 1G path -> start libvirt -> hotplug 1G dimm -> restart vm -> restart libvirtd -> hotplug 1G dimm
2. mount 1G path -> memory backing 2M guest vm start -> restart libvirtd -> hotplug 1G dimm -> restart libvirtd -> restart vm -> hotplug 1G dimm
Tested with these settings: remember_owner=1 or 0, memfd memory backing, default memory backing, 1G hugepage memory backing, 1G hugepage path as /mnt/hugepages1G
Additional info:
1. Restart libvirt after mounting the hugepage path.
2. Umounting and re-mounting the hugepage path may cause another issue, bug#2134009.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2757