Bug 2158704

Summary: RFE: Prefer /dev/userfaultfd over userfaultfd(2) syscall
Product: Red Hat Enterprise Linux 9
Reporter: Michal Privoznik <mprivozn>
Component: qemu-kvm
Assignee: Peter Xu <peterx>
qemu-kvm sub component: Live Migration
QA Contact: Li Xiaohui <xiaohli>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: alexander.lougovski, chayang, coli, jinzhao, juzhang, leobras, nilal, peterx, quintela, virt-maint
Version: 9.2
Keywords: FutureFeature, Triaged
Target Milestone: rc
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Linux
Fixed In Version: qemu-kvm-7.2.0-9.el9
Clones: 2158705 2158706
Last Closed: 2023-05-09 07:23:43 UTC
Type: Feature Request
Bug Depends On: 2158706
Bug Blocks: 2158705

Description Michal Privoznik 2023-01-06 08:10:20 UTC
Description of problem:
So far, postcopy migration has used the userfaultfd(2) syscall. This has a couple of drawbacks (summarized nicely in kernel commit [1]). To resolve them, the kernel introduced the /dev/userfaultfd device, and this is a request to switch to it.

Please note that some environments in which QEMU runs may disallow the userfaultfd(2) syscall because it is viewed as too powerful. For instance, KubeVirt [2].


1: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d5de004e009add27db76c5cdc9f1f7f7dc087e7

2: https://issues.redhat.com/browse/OCPBUGS-5031
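
The device-first approach requested here can be sketched roughly as follows. This is a minimal illustration in Python, not QEMU's actual code: the helper name `uffd_from_device` is hypothetical, and the request number is derived from the kernel's `USERFAULTFD_IOC_NEW` definition, `_IO(0xAA, 0x00)`. A caller would fall back to the userfaultfd(2) syscall when the helper returns None.

```python
import fcntl
import os

# _IO(0xAA, 0x00) from <linux/userfaultfd.h>: direction and size bits are
# zero, so the request number is simply (type << 8) | nr.
USERFAULTFD_IOC_NEW = (0xAA << 8) | 0x00


def uffd_from_device(flags=os.O_CLOEXEC | os.O_NONBLOCK,
                     dev_path="/dev/userfaultfd"):
    """Return a new userfaultfd obtained via the device node, or None.

    Hypothetical helper: None means the device route is unavailable
    (node missing, no permission, older kernel) and the caller may try
    the userfaultfd(2) syscall instead.
    """
    try:
        dev_fd = os.open(dev_path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None
    try:
        # With an integer argument, fcntl.ioctl() returns the ioctl's
        # return value, which here is the new userfaultfd descriptor.
        return fcntl.ioctl(dev_fd, USERFAULTFD_IOC_NEW, flags)
    except OSError:
        return None
    finally:
        os.close(dev_fd)
```

Access to the feature is then governed by ordinary file permissions on the device node rather than by a syscall filter.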

Comment 2 John Ferlan 2023-01-09 20:31:40 UTC
I know Nitesh is taking over Live Migration shortly, but we need to consider this sooner rather than later, especially if OpenShift goes through with the plan to alter the default seccomp profile. There is a "work-around" of sorts in the plan for KubeVirt (https://github.com/kubevirt/kubevirt/pull/8917).

Comment 6 Li Xiaohui 2023-02-14 02:33:02 UTC
Hi Peter,
Regarding verification of this bug, I think running the postcopy tests is enough. What do you think?

Comment 7 Peter Xu 2023-02-14 14:09:56 UTC
Xiaohui,

Thanks for raising this question. Yes, that should be enough.

To make sure you're using the new /dev/userfaultfd descriptor, you can first disable the userfaultfd syscall for QEMU like this:

[NOTE: this does not disable the userfaultfd syscall entirely, only unprivileged use of userfaultfd for kernel-mode faults; that is already enough to stop QEMU from using the syscall, because QEMU needs a privileged uffd to handle kernel faults]
# echo 0 > /proc/sys/vm/unprivileged_userfaultfd

With the above, booting the dest QEMU with postcopy enabled should already fail, like this:

[note: run this without root privileges, otherwise it won't fail]
$ ./qemu-system-x86_64 -incoming defer -global migration.x-postcopy-ram=on
qemu-system-x86_64: postcopy_ram_supported_by_host: userfaultfd not available: Operation not permitted
qemu-system-x86_64: Postcopy is not supported

Or, if you enable postcopy via QMP, I think the QMP command to enable postcopy should simply fail.
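
For the QMP route, the capability is toggled with `migrate-set-capabilities`. The snippet below is only a sketch that builds the JSON message; the helper name is made up for illustration, and sending the message over the QMP socket (after the usual `qmp_capabilities` handshake) is left out. With userfaultfd unavailable, the expectation stated above is that QEMU replies with an error rather than a success return.

```python
import json


def postcopy_capability_command(enable=True):
    """Build the QMP message that sets the postcopy-ram capability."""
    return json.dumps({
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [
                {"capability": "postcopy-ram", "state": bool(enable)},
            ],
        },
    })
```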

Then, with the new kernel, make sure /dev/userfaultfd exists and has the right permissions:

# chmod 0666 /dev/userfaultfd

One should be able to start dest QEMU successfully, like:

[note: again run this without root privileges, for comparison with the above]
$ ./qemu-system-x86_64 -incoming defer -global migration.x-postcopy-ram=on
qemu-system-x86_64: postcopy_ram_supported_by_host: userfaultfd not available: Operation not permitted
qemu-system-x86_64: Postcopy is not supported

With that, the simplest round of postcopy would suffice.
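
Before running the steps above, the two preconditions (sysctl value and device access) could be checked with something like the following. This is a rough sketch with hypothetical helper names; note that `os.access` evaluates the real UID/GID and is only a hint, since the authoritative test is still whether QEMU itself can open the node.

```python
import os


def parse_sysctl(text):
    """Parse the content of a numeric sysctl file; None if malformed."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_unprivileged_userfaultfd(
        path="/proc/sys/vm/unprivileged_userfaultfd"):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        with open(path) as f:
            return parse_sysctl(f.read())
    except OSError:
        return None


def can_open_uffd_device(path="/dev/userfaultfd"):
    """True if the current user may open the device node read-write."""
    return os.access(path, os.R_OK | os.W_OK)
```

For the test described here, the sysctl should read 0 and the device check should return True for the unprivileged user that runs QEMU.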

Thanks.

Comment 8 Peter Xu 2023-02-14 14:12:31 UTC
(In reply to Peter Xu from comment #7)
> One should be able to start dest QEMU successfully, like:
> 
> [note: here we don't need root privilege too to compare with above]
> $ ./qemu-system-x86_64 -incoming defer -global migration.x-postcopy-ram=on
> qemu-system-x86_64: postcopy_ram_supported_by_host: userfaultfd not
> available: Operation not permitted
> qemu-system-x86_64: Postcopy is not supported

Sorry, that was a copy-paste error. It should just succeed and continue here.

Comment 9 Li Xiaohui 2023-02-16 11:07:20 UTC
(In reply to Peter Xu from comment #7)
> Xiaohui,
> 
> Thanks for raising this question.  Yes that should be enough.
> 
> To make sure you're using the new /dev/userfaultfd descriptor, you can do
> this to disable the userfaultfd syscall first for qemu:
> 
> [NOTE: this will not disable the whole userfaultfd syscall, but only the
> unprivileged kernel userfaultfd, which will stop QEMU from using it already
> because qemu will need that privileged uffd for handle kernel faults]
> # echo 0 > /proc/sys/vm/unprivileged_userfaultfd
> 
> With above, we should already fail to boot the dest QEMU with postcopy
> enabled, like this:
> 
> [note: here we don't need root privilege or it won't fail]
> $ ./qemu-system-x86_64 -incoming defer -global migration.x-postcopy-ram=on
> qemu-system-x86_64: postcopy_ram_supported_by_host: userfaultfd not
> available: Operation not permitted
> qemu-system-x86_64: Postcopy is not supported
> 
> Or if you enable postcopy via QMP I think that should just fail the QMP
> command to enable postcopy.
> 
> Then, with the new kernel and have /dev/userfaultfd being there with the
> right permissions:

Do we still need to keep the userfaultfd syscall disabled here?

> 
> # chmod 0666 /dev/userfaultfd

I have verified the related kernel bug 2158706 on kernel-5.14.0-270.el9.x86_64. In that bug, I can see that the default permissions aren't 0666:
https://bugzilla.redhat.com/show_bug.cgi?id=2158706#c16

[root@dell-per7525-25 bz2158706]# ls -lt /dev/userfaultfd 
crw-------. 1 root root 10, 126 Feb 15 08:01 /dev/userfaultfd

So must we give 0666 permissions to /dev/userfaultfd for postcopy migration? If not, will postcopy fail to start?
If so, why don't we keep 0666 as the default for /dev/userfaultfd?

> 
> One should be able to start dest QEMU successfully, like:
> 
> [note: here we don't need root privilege too to compare with above]
> $ ./qemu-system-x86_64 -incoming defer -global migration.x-postcopy-ram=on
> 
> It should just succeed and continue here
> 
> With that, a simplest round of postcopy would suffice.
> 
> Thanks.

Thank you for providing the test steps.

Comment 10 Peter Xu 2023-02-16 14:50:30 UTC
(In reply to Li Xiaohui from comment #9)
> So we must give 666 permissions to /dev/userfaultfd for postcopy migration?

Not really. Here I just wanted to make sure we have permission to access the new devfile so that we can test it.

> If not, will fail to start postcopy? 

Yes.

> If so, why don't we keep 666 as the default for /dev/userfaultfd?

The permission here isn't important to me: it should be managed by system admins in the future, whatever the default values are (not only permissions, but also owner, group, etc.). E.g., in production QEMU can be put into a group that always has permission to access /dev/userfaultfd; the permission can then be 0660, which stops arbitrary processes from using kernel traps freely while still letting QEMU pass.

So IMHO we don't need to worry about the default values here (which I think should follow the system-wide defaults for any devfile node), only about whether it works for us as long as the permissions are valid.
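
To illustrate the group-based setup described above, here is a simplified model of the classic owner/group/other check. It is a sketch only: the function name and the numeric IDs in the usage note are invented for the example, and it deliberately ignores root (CAP_DAC_OVERRIDE), ACLs and SELinux, any of which can change the result on a real host.

```python
import stat


def may_open_rw(mode, owner_uid, owner_gid, uid, gids):
    """Return True if (uid, gids) may open a node read-write.

    Exactly one permission class applies: owner, then group, then other.
    """
    if uid == owner_uid:
        needed = stat.S_IRUSR | stat.S_IWUSR
    elif owner_gid in gids:
        needed = stat.S_IRGRP | stat.S_IWGRP
    else:
        needed = stat.S_IROTH | stat.S_IWOTH
    return (mode & needed) == needed
```

With a node owned by root and a hypothetical group 107 at mode 0660, `may_open_rw(0o660, 0, 107, 1000, {107})` is True for a process in that group, while `may_open_rw(0o660, 0, 107, 1000, {1000})` is False for everyone else.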

Thanks.
Peter

Comment 12 Yanan Fu 2023-02-20 12:45:56 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 15 Li Xiaohui 2023-02-23 03:40:46 UTC
1 ) kernel-5.14.0-270.el9.x86_64 && qemu-kvm-7.2.0-8.el9.x86_64, qemu user
[qemu@dell-per7525-26 /]$ cat /proc/sys/vm/unprivileged_userfaultfd 
0
[qemu@dell-per7525-26 /]$ /usr/libexec/qemu-kvm -cpu EPYC-Milan -monitor stdio -machine q35 -incoming defer 
(qemu) migrate_set_capability postcopy-ram on
postcopy_ram_supported_by_host: userfaultfd not available: Operation not permitted
Error: Postcopy is not supported


2 ) kernel-5.14.0-270.el9.x86_64 && qemu-kvm-7.2.0-10.el9.x86_64, qemu user
[root@dell-per7525-26 qemu-kvm-latest]# cat /proc/sys/vm/unprivileged_userfaultfd
0
[root@dell-per7525-26 qemu-kvm-latest]# ls -lt /dev/userfaultfd 
crw-rw-rw-. 1 root root 10, 126 Feb 22 08:56 /dev/userfaultfd
[qemu@dell-per7525-26 /]$ /usr/libexec/qemu-kvm -cpu EPYC-Milan -monitor stdio -machine q35 -incoming defer
(qemu) migrate_set_capability postcopy-ram on
(qemu) info migrate_capabilities 
...
postcopy-ram: on
...
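
A check like the one in step 2 could be automated by parsing the monitor output. The following is a small sketch that assumes the `name: on|off` line format shown above; the function name is hypothetical, and lines that do not match that format (such as the elided `...` lines) are skipped.

```python
def parse_capabilities(text):
    """Map capability names to booleans from 'name: on|off' lines."""
    caps = {}
    for line in text.splitlines():
        name, sep, state = line.partition(":")
        state = state.strip()
        if sep and state in ("on", "off"):
            caps[name.strip()] = (state == "on")
    return caps
```

A test harness would then check `parse_capabilities(output).get("postcopy-ram")` for True.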

3 ) kernel-5.14.0-270.el9.x86_64 && qemu-kvm-7.2.0-10.el9.x86_64, root user. Ran all postcopy cases and the tier 1 test loop; all pass.
[root@dell-per7525-26 ~]# ls -lt /dev/userfaultfd 
crw-------. 1 root root 10, 126 Feb 22 08:56 /dev/userfaultfd
[root@dell-per7525-25 ipa]# python3 Start2Run.py --test_requirement=VIRT_49060_x86_q35_blockdev --src_host_ip=10.73.2.80 --dst_host_ip=10.73.2.82 --share_images_dir=/mnt/xiaohli --sys_image_name=rhel920-64-virtio-scsi.qcow2 --guest_os_type=linux --firmware=ovmf --cpu_model=EPYC-Milan,x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,vaes=on,vpclmulqdq=on,spec-ctrl=on,stibp=on,arch-capabilities=on,ssbd=on,cmp-legacy=on,virt-ssbd=on,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,erms=off,fsrm=off
========================= Test Requirement: VIRT-49060-X86-Q35-BLOCKDEV(Migration - x86) =========================
--> Running case(1/11): BASE-TEST-POSTCOPY-Migration basic precopy test without setting downtime and speed (4 min 36 sec)--- PASS.
--> Running case(2/11): VIRT-49062-[postcopy] Migration finishes only with postcopy under high stress (rhel only) (14 min 33 sec)--- PASS.
--> Running case(3/11): VIRT-58670-[postcopy] Cancel migration during the precopy phase (1 min 16 sec)--- PASS.
--> Running case(4/11): VIRT-58672-[postcopy] Source should recovers when fail the destination during the precopy phase (1 min 16 sec)--- PASS.
--> Running case(5/11): VIRT-85702-[postcopy] Post-copy migration with XBZRLE compression (2 min 56 sec)--- PASS.
--> Running case(6/11): VIRT-86251-[postcopy] live migration post-copy support file-backed memory (3 min 24 sec)--- PASS.
--> Running case(7/11): VIRT-93722-[postcopy]Postcopy migration with Numa pinned and Hugepage pinned guest--file backend (3 min 40 sec)--- PASS.
--> Running case(8/11): VIRT-294886-[migration] Postcopy migration recover after migrate-pause (2 min 36 sec)--- PASS.
--> Running case(9/11): RHEL-150076-[postcopy] Set postcopy migration speed(max-postcopy-bandwidth) (4 min 40 sec)--- PASS.
--> Running case(10/11): RHEL-186017-[postcopy] Basic postcopy migration (3 min 12 sec)--- PASS.
--> Running case(11/11): RHEL-189930-[postcopy] Post-copy migration with enabling auto-converge (3 min 32 sec)--- PASS.

[root@dell-per7525-25 ipa]# python3 Start2Run.py --test_requirement=tier1_q35_blockdev --src_host_ip=10.73.2.80 --dst_host_ip=10.73.2.82 --share_images_dir=/mnt/xiaohli --sys_image_name=rhel920-64-virtio-scsi.qcow2 --guest_os_type=linux --firmware=ovmf --cpu_model=EPYC-Milan,x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,vaes=on,vpclmulqdq=on,spec-ctrl=on,stibp=on,arch-capabilities=on,ssbd=on,cmp-legacy=on,virt-ssbd=on,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,erms=off,fsrm=off
========================= Test Requirement: TIER1-Q35-BLOCKDEV(Migration - x86) =========================
--> Running case(1/10): RHEL-178709-[migration] Basic migration test (4 min 44 sec)--- PASS.
--> Running case(2/10): VIRT-10022-[migration] Migrate guest via a compressed file (4 min 24 sec)--- PASS.
--> Running case(3/10): VIRT-10061-[migration] Cancel a migration process with "migration_cancel" command (7 min 16 sec)--- PASS.
--> Running case(4/10): VIRT-10067-[migration] Set migration downtime (3 min 4 sec)--- PASS.
--> Running case(5/10): RHEL-186017-[postcopy] Basic postcopy migration (2 min 40 sec)--- PASS.
--> Running case(6/10): VIRT-10081-[migration][page delta compression] Check live migration statistics for xbzrle specific options (3 min 40 sec)--- PASS.
--> Running case(7/10): VIRT-48421-[auto converge] Live migration with auto converge- dynamic cpu throttling (3 min 4 sec)--- PASS.
--> Running case(8/10): VIRT-85868-[TLS]TLS encryption migration via ipv4 addr(3 min 0 sec)--- PASS.
--> Running case(9/10): VIRT-109869-[Multiple-fds] Live migration with multifd on (10 min 44 sec)--- PASS.
--> Running case(10/10): VIRT-296185-[zero copy] Zero copy migration (1 min 52 sec)--- PASS.
**********************************************************************************************

Comment 16 Li Xiaohui 2023-02-23 03:57:37 UTC
Per Comment 15 above, marking this bug as verified.


BTW, I think we don't need to add extra test cases for this bug's change. Continuing to test the postcopy feature is enough. Peter, what do you think?

Comment 17 Peter Xu 2023-02-23 15:19:46 UTC
(In reply to Li Xiaohui from comment #16)
> BTW, I think we don't need to add extra cases for this bug's change. Keeping
> test postcopy feature is enough. Peter, what do you think?

Agreed.

Comment 21 errata-xmlrpc 2023-05-09 07:23:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2162