Description of problem: As you know, Kubevirt runs libvirt in a container. By default, this container does not have the SYS_NICE capability. Originally, kubevirt added this capability only to virt-launcher containers running a domain with vCPU pinning (<vcpupin>). However, a problem appears when kubevirt attempts to start a VM without vCPU pinning after a VM with vCPU pinning: the latter fails with "cannot set CPU affinity on process X: Operation not permitted". Because of this, kubevirt is forced to add the SYS_NICE capability to all virt-launcher containers. Explicitly setting the cpuset (<vcpu placement='static' cpuset='0,3'>1</vcpu>) does not help in this case either. I understand libvirt always calls sched_setaffinity() even when there is no pinning. I wonder whether this can be avoided in our case?
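For background, a minimal sketch of the syscall behaviour behind that error text, assuming an unprivileged caller without CAP_SYS_NICE targeting a process owned by another user (PID 1 is only a stand-in); this is not the KubeVirt or libvirt code path, just an illustration of how sched_setaffinity() reports EPERM:

/* Minimal sketch, not KubeVirt/libvirt code: per sched_setaffinity(2),
 * changing the affinity of a process owned by another user requires
 * CAP_SYS_NICE; without it the call fails with EPERM, which libvirt
 * reports as "Operation not permitted". */
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);

    /* PID 1 stands in for a process this user is not allowed to modify. */
    if (sched_setaffinity(1, sizeof(mask), &mask) < 0)
        fprintf(stderr, "cannot set CPU affinity on process 1: %s\n",
                strerror(errno));

    return 0;
}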
So the reason for this is that libvirt does not have to be running on all CPUs and can be restricted to a subset, but that should not affect any qemu processes it spawns. I will have a look at whether we can query the affinity first and change it only if needed, in a reliable way. But as far as I remember, the kernel did not fail when `sched_setaffinity()` was called without the permission if the request resulted in no effective change.
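A minimal sketch of that "query the affinity first, change it only if needed" idea; set_affinity_if_needed is a hypothetical helper, not libvirt's actual implementation:

/* Sketch of the "query first, change only if needed" idea from this comment. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

int set_affinity_if_needed(pid_t pid, const cpu_set_t *wanted)
{
    cpu_set_t current;

    if (sched_getaffinity(pid, sizeof(current), &current) < 0)
        return -1;

    /* Nothing would change: skip the call that might need extra privileges. */
    if (CPU_EQUAL(&current, wanted))
        return 0;

    return sched_setaffinity(pid, sizeof(*wanted), wanted);
}

As the next comment notes, skipping the call when the current affinity already matches is what lets scenario 3 work without CAP_SYS_NICE.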
(In reply to Martin Kletzander from comment #1)
> So the reason for this is that libvirt does not have to be running on all
> cpus and can be restricted to a subset, but it should not affect any qemu
> processes that are spawned. I will have a look at whether we can query the
> affinity first and change it only if needed in a reliable way. But as far
> as I remember, kernel did not fail when `sched_setaffinity()` was called
> without the permission in case the request resulted in no effective change.

If there is no explicit affinity in the guest XML, then I think it is reasonable to ignore the failure of sched_setaffinity and just let QEMU inherit libvirtd's current affinity. This would not affect any currently working scenarios, and will let libvirtd "do the right thing" when run inside a container with restricted affinity.

Consider a host with 8 CPUs; we have the following possible scenarios:

1. Bare metal, libvirtd has affinity of 8 CPUs -> QEMU should get 8 CPUs
2. Bare metal, libvirtd has affinity of 2 CPUs -> QEMU should get 8 CPUs
3. Container has affinity of 8 CPUs, libvirtd has affinity of 8 CPUs -> QEMU should get 8 CPUs
4. Container has affinity of 8 CPUs, libvirtd has affinity of 2 CPUs -> QEMU should get 8 CPUs
5. Container has affinity of 4 CPUs, libvirtd has affinity of 4 CPUs -> QEMU should get 4 CPUs
6. Container has affinity of 4 CPUs, libvirtd has affinity of 2 CPUs -> QEMU should get 4 CPUs

Scenarios 1 & 2 always work unless systemd restricted libvirtd privs. IIRC scenario 3 works because we check the current affinity first and skip the sched_setaffinity call, avoiding the SYS_NICE issue. Scenario 4 works only if CAP_SYS_NICE is available. Scenarios 5 & 6 work only if CAP_SYS_NICE is present *AND* the cgroups cpuset is not set on the container.

If we blindly ignore the sched_setaffinity failure, then scenarios 4, 5 and 6 should all work, but with the caveat in cases 4 and 6 that QEMU will only get 2 CPUs instead of the possible 8 and 4 respectively. This is still better than failing.

Ergo, I think we can blindly ignore the setaffinity failure, but *ONLY* ignore it when there was no affinity specified in the XML config. If the user specified affinity explicitly, we must report an error if it can't be honoured.
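To make the proposed behaviour concrete, here is a minimal sketch; apply_vcpu_affinity is a hypothetical helper, not libvirt's real code, showing the policy of tolerating a sched_setaffinity() failure only when no affinity was specified in the XML config:

/* Sketch of the policy proposed above; names and structure are illustrative. */
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

int apply_vcpu_affinity(pid_t pid, const cpu_set_t *mask, bool configured_in_xml)
{
    if (sched_setaffinity(pid, sizeof(*mask), mask) == 0)
        return 0;

    if (!configured_in_xml) {
        /* No <vcpupin>/cpuset in the guest XML: let QEMU inherit libvirtd's
         * current affinity instead of failing the VM start (scenarios 4-6). */
        fprintf(stderr, "ignoring CPU affinity failure: %s\n", strerror(errno));
        return 0;
    }

    /* Explicitly requested pinning that cannot be honoured is an error. */
    return -1;
}

The upstream fix quoted later in this bug (commit 3791f29b085c, "qemu: Do not error out when setting affinity failed") goes in this direction.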
Setting severity to high. However, we still need to check whether session mode is sufficient for us, which would then obsolete this bug.
(In reply to Fabian Deutsch from comment #3) Any news with the session mode under normal user?
(In reply to Martin Kletzander from comment #4)
> (In reply to Fabian Deutsch from comment #3)
> Any news with the session mode under normal user?

Hi Martin,

Unfortunately, not yet. It took us a while to find a solution for [1], which was a prerequisite for this work. We will start looking into switching to session mode and running libvirt as non-root again and will update soon.

[1] https://github.com/kubevirt/kubevirt/pull/3290
So I re-read comment #2 again and I trust Daniel's decision, even though, I must say, I am only half-convinced by it. Both the embedded driver and session mode would actually do what you are requesting and more. I'm not sure whether this is crippling libvirt or not, but as I said, I trust Dan's decision.

Patch posted here:
https://www.redhat.com/archives/libvir-list/2020-September/msg00264.html
Fixed upstream with commit v6.7.0-54-g3791f29b085c:

commit 3791f29b085c514b171f9d8fc702975f9df9733c
Author: Martin Kletzander <mkletzan>
Date:   Fri Sep 4 14:17:30 2020 +0200

    qemu: Do not error out when setting affinity failed
Reproduced this bug using gdb and taskset on libvirt-daemon-6.0.0-25.4.module+el8.2.1+8060+c0c58169.x86_64:

1. Disable the cpuset cgroup and manually start libvirtd:

# vim /etc/libvirt/qemu.conf
cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]
stdio_handler = "file"
# /usr/sbin/libvirtd

2. Make sure there is no vcpupin/emulatorpin in the guest XML.

3. Use taskset and gdb to create a similar environment:

# taskset -c -p 1-10 `pidof libvirtd`
pid 32902's current affinity list: 0-39
pid 32902's new affinity list: 1-10
# gdb -p `pidof libvirtd`
(gdb) b virProcessSetAffinity

4. Start the guest:

# virsh start vm1
(blocking)

5. In the gdb terminal, change the libvirtd permissions:

(gdb) handle SIG33 nostop
Signal        Stop      Print   Pass to program Description
SIG33         No        Yes     Yes             Real-time event 33
(gdb) call (int)setuid(33)
$1 = 0
(gdb) c

Result:

# virsh start vm1
error: Failed to start domain vm1
error: cannot set CPU affinity on process 33203: Operation not permitted

Failed to verify this bug on libvirt-daemon-6.6.0-7.module+el8.3.0+8424+5ea525c5.x86_64 because libvirtd crashed. With the same steps 1-5 as above, after "(gdb) c" libvirtd aborted:

Thread 6 "rpc-worker" received signal SIGABRT, Aborted.
0x00007f95417e87ff in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f95417e87ff in raise () from /lib64/libc.so.6
#1  0x00007f95417d2c35 in abort () from /lib64/libc.so.6
#2  0x00007f954182b987 in __libc_message () from /lib64/libc.so.6
#3  0x00007f9541832d8c in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f9541834afd in _int_free () from /lib64/libc.so.6
#5  0x00007f95457d4677 in virProcessSetAffinity (pid=<optimized out>, map=0x7f952803bea0, quiet=quiet@entry=true) at ../../src/util/virprocess.c:490
#6  0x00007f950896d968 in qemuProcessInitCpuAffinity (vm=0x5558d274d6a0) at ../../src/qemu/qemu_process.c:2585
#7  qemuProcessLaunch (conn=0x7f950c006660, driver=0x7f94d40fbc80, vm=0x5558d274d6a0, asyncJob=QEMU_ASYNC_JOB_START, incoming=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=17) at ../../src/qemu/qemu_process.c:6902
#8  0x00007f9508972775 in qemuProcessStart (conn=conn@entry=0x7f950c006660, driver=driver@entry=0x7f94d40fbc80, vm=vm@entry=0x5558d274d6a0, updatedCPU=updatedCPU@entry=0x0, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_START, migrateFrom=migrateFrom@entry=0x0, migrateFd=-1, migratePath=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=<optimized out>) at ../../src/qemu/qemu_process.c:7202
#9  0x00007f95089d83f3 in qemuDomainObjStart (conn=0x7f950c006660, driver=0x7f94d40fbc80, vm=0x5558d274d6a0, flags=<optimized out>, asyncJob=QEMU_ASYNC_JOB_START) at ../../src/qemu/qemu_driver.c:7531
#10 0x00007f95089d8a5f in qemuDomainCreateWithFlags (dom=0x7f950c008d00, flags=0) at ../../src/qemu/qemu_driver.c:7582
#11 0x00007f95459b0af7 in virDomainCreate (domain=domain@entry=0x7f950c008d00) at ../../src/libvirt-domain.c:6531
#12 0x00005558d1867e06 in remoteDispatchDomainCreate (server=0x5558d274d080, msg=0x5558d278fc60, args=<optimized out>, rerr=0x7f95393718f0, client=0x5558d27951b0) at ./remote/remote_daemon_dispatch_stubs.h:4894
#13 remoteDispatchDomainCreateHelper (server=0x5558d274d080, client=0x5558d27951b0, msg=0x5558d278fc60, rerr=0x7f95393718f0, args=<optimized out>, ret=0x0) at ./remote/remote_daemon_dispatch_stubs.h:4873
#14 0x00007f95458deb19 in virNetServerProgramDispatchCall (msg=0x5558d278fc60, client=0x5558d27951b0, server=0x5558d274d080, prog=0x5558d27a1810) at ../../src/rpc/virnetserverprogram.c:430
#15 virNetServerProgramDispatch (prog=0x5558d27a1810, server=server@entry=0x5558d274d080, client=0x5558d27951b0, msg=0x5558d278fc60) at ../../src/rpc/virnetserverprogram.c:302
#16 0x00007f95458e3d16 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x5558d274d080) at ../../src/rpc/virnetserver.c:137
#17 virNetServerHandleJob (jobOpaque=0x5558d271f3c0, opaque=0x5558d274d080) at ../../src/rpc/virnetserver.c:154
#18 0x00007f95457f334f in virThreadPoolWorker (opaque=<optimized out>) at ../../src/util/virthreadpool.c:163
#19 0x00007f95457f294b in virThreadHelper (data=<optimized out>) at ../../src/util/virthread.c:233
#20 0x00007f9541f9814a in start_thread () from /lib64/libpthread.so.0
#21 0x00007f95418adf23 in clone () from /lib64/libc.so.6

(gdb) f 5
#5  0x00007f95457d4677 in virProcessSetAffinity (pid=<optimized out>, map=0x7f952803bea0, quiet=quiet@entry=true) at ../../src/util/virprocess.c:490
490         CPU_FREE(mask);
(gdb) p *mask
$4 = {__bits = {140278598113984, 140278597945552, 0 <repeats 14 times>}}

Looks like a double free problem.
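For illustration, a simplified and hypothetical sketch of how such a double free can arise around CPU_ALLOC()/CPU_FREE(); this is not the actual virProcessSetAffinity() code:

/* Hypothetical sketch of a double-free pattern; not libvirt's real code. */
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>

int set_affinity_buggy(pid_t pid, int ncpus)
{
    size_t size = CPU_ALLOC_SIZE(ncpus);
    cpu_set_t *mask = CPU_ALLOC(ncpus);
    int ret = -1;

    if (!mask)
        return -1;

    CPU_ZERO_S(size, mask);
    CPU_SET_S(0, size, mask);

    if (sched_setaffinity(pid, size, mask) < 0) {
        CPU_FREE(mask);   /* BUG: mask is freed here in the error path ... */
        goto cleanup;     /* ... and then again below */
    }
    ret = 0;

 cleanup:
    CPU_FREE(mask);       /* second free of the same pointer on the error path */
    return ret;
}

Freeing the mask in a single cleanup path (or clearing the pointer right after the first CPU_FREE()) avoids the second free; the actual upstream fix is the commit quoted in the following comments.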
Hi Martin,

Could you please help check the issue in comment 13? And is it okay to track this crash in this bug?

Thanks in advance for your help!
Luyao
Yep, thanks for finding that out; I really wonder why it did not happen to me. The fix is here:

https://www.redhat.com/archives/libvir-list/2020-October/msg01439.html

Let's see how that goes.
Fixed upstream by v6.9.0-rc1-6-g1f807631f402:

commit 1f807631f402210d036ec4803e7adfefa222f786
Author: Martin Kletzander <mkletzan>
Date:   Tue Oct 27 13:48:38 2020 +0100

    util: Avoid double free in virProcessSetAffinity
(In reply to Martin Kletzander from comment #16)
> Yep, thanks for finding out, I really wonder what caused it to not happen to
> me. The fix is here:
>
> https://www.redhat.com/archives/libvir-list/2020-October/msg01439.html
>
> Let's see how that goes.

Thanks a lot for the fix; I will test your patch later when I have a test environment, and I am moving the bug status back to Assigned.
Thanks Jiri.
(In reply to Martin Kletzander from comment #17)
> Fixed upstream by v6.9.0-rc1-6-g1f807631f402:
>
> commit 1f807631f402210d036ec4803e7adfefa222f786
> Author: Martin Kletzander <mkletzan>
> Date:   Tue Oct 27 13:48:38 2020 +0100
>
>     util: Avoid double free in virProcessSetAffinity

I have rebuilt libvirt with this patch and retested with the same steps as in comment 13. The test result shows that the libvirtd crash issue has been fixed.
Verified this bug with libvirt-daemon-6.10.0-1.module+el8.4.0+8898+a84e86e1.x86_64, using the same steps as bug 1894409 comment 7.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0639