Bug 1819801 - let domain without cpu pinning run without the need of CAP_SYS_NICE
Summary: let domain without cpu pinning run without the need of CAP_SYS_NICE
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.3
Assignee: Martin Kletzander
QA Contact: Luyao Huang
URL:
Whiteboard:
Depends On:
Blocks: 1894409
 
Reported: 2020-04-01 15:25 UTC by Vladik Romanovsky
Modified: 2021-09-28 09:26 UTC
CC List: 15 users

Fixed In Version: libvirt-6.6.0-8.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 1894409
Environment:
Last Closed: 2021-02-22 15:39:38 UTC
Type: Feature Request
Target Upstream Version:
Embargoed:



Description Vladik Romanovsky 2020-04-01 15:25:08 UTC
Description of problem:

As you know, Kubevirt runs libvirt in a container. By default, this container doesn't have the SYS_NICE capability.

At first, kubevirt was adding this capability only to the virt-launcher container that was running a domain with vcpu pinning (<vcpupin>).

However, the problem appears when kubevirt attempts to start a VM without vcpu pinning after a VM with vcpu pinning. In that case, the newly started VM fails with
"cannot set CPU affinity on process X: Operation not permitted".

Due to this, kubevirt is forced to add the SYS_NICE capability to all virt-launcher containers.

Explicitly setting the cpuset (<vcpu placement='static' cpuset='0,3'>1</vcpu>)
also doesn't help in this case.

I understand libvirt is always calling sched_setaffinity() even when there is no pinning. I wonder whether this can be avoided in our case?
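
For reference, a minimal standalone sketch of the failing call (the helper name is made up and this is not libvirt's code): building a mask of all host CPUs and applying it to another process fails with EPERM when the caller lacks CAP_SYS_NICE and the target runs under a different UID.

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: widen a process's affinity to every online CPU,
 * roughly what a management daemon does when no pinning is configured. */
static int widen_affinity(pid_t pid)
{
    long ncpus = sysconf(_SC_NPROCESSORS_CONF);
    cpu_set_t mask;

    CPU_ZERO(&mask);
    for (long i = 0; i < ncpus && i < CPU_SETSIZE; i++)
        CPU_SET(i, &mask);

    if (sched_setaffinity(pid, sizeof(mask), &mask) < 0) {
        /* EPERM here without CAP_SYS_NICE when the target's UID differs */
        fprintf(stderr, "cannot set CPU affinity on process %d: %s\n",
                (int)pid, strerror(errno));
        return -1;
    }
    return 0;
}

int main(int argc, char **argv)
{
    pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : getpid();
    return widen_affinity(pid) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}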

Comment 1 Martin Kletzander 2020-04-09 19:33:28 UTC
So the reason for this is that libvirt does not have to be running on all CPUs and can be restricted to a subset, but it should not affect any qemu processes that are spawned.  I will have a look at whether we can query the affinity first and change it only if needed in a reliable way.  But as far as I remember, the kernel did not fail when `sched_setaffinity()` was called without the permission in case the request resulted in no effective change.
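
A rough sketch of that idea (hypothetical helper, not actual libvirt code): read the target's current affinity first and only call sched_setaffinity() when the requested mask actually differs, so an unprivileged caller never makes the call for a no-op change.

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Set pid's affinity to 'want' only if it differs from the current mask. */
static int set_affinity_if_needed(pid_t pid, const cpu_set_t *want)
{
    cpu_set_t cur;

    if (sched_getaffinity(pid, sizeof(cur), &cur) < 0)
        return -1;

    if (CPU_EQUAL(&cur, want))
        return 0;   /* no effective change requested, skip the privileged call */

    return sched_setaffinity(pid, sizeof(*want), want);
}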

Comment 2 Daniel Berrangé 2020-05-21 15:37:59 UTC
(In reply to Martin Kletzander from comment #1)
> So the reason for this is that libvirt does not have to be running on all
> CPUs and can be restricted to a subset, but it should not affect any qemu
> processes that are spawned.  I will have a look at whether we can query the
> affinity first and change it only if needed in a reliable way.  But as far
> as I remember, the kernel did not fail when `sched_setaffinity()` was called
> without the permission in case the request resulted in no effective change.

If there is no explicit affinity in the guest XML, then I think it is reasonable to ignore the failure of sched_setaffinity, and just let QEMU inherit libvirtd's current affinity.  This would not affect any currently working scenarios, and would let libvirtd "do the right thing" when run inside a container with restricted affinity.

Consider a host with 8 CPUs; we have the following possible scenarios:


 1 Bare metal

   libvirtd has affinity of 8 CPUs

   QEMU should get 8 CPUs


 2 Bare metal

   libvirtd has affinity of 2 CPUs

   QEMU should get 8 CPUs


 3 Container has affinity of 8 CPUs

   libvirtd has affinity of 8 CPUs

   QEMU should get 8 CPUs


 4 Container has affinity of 8 CPUs

   libvirtd has affinity of 2 CPUs

   QEMU should get 8 CPUs


 5 Container has affinity of 4 CPUs

   libvirtd has affinity of 4 CPUs

   QEMU should get 4 CPUs


 6 Container has affinity of 4 CPUs

   libvirtd has affinity of 2 CPUs

   QEMU should get 4 CPUs


Scenarios 1 & 2 always work unless systemd restricted libvirtd privs.


IIRC scenario 3 works because we check the current affinity first and skip the sched_setaffinity call, avoiding the SYS_NICE issue.

Scenario 4 works only if CAP_SYS_NICE is available.

Scenarios 5 & 6 work only if CAP_SYS_NICE is present *AND* the cgroups cpuset is not set on the container.


If we blindly ignore the sched_setaffinity failure, then scenarios 4, 5 and 6 should all work, but with the caveat in cases 4 and 6 that QEMU will only get 2 CPUs instead of the possible 8 and 4, respectively. This is still better than failing.

Ergo, I think we can blindly ignore the setaffinity failure, but *ONLY* ignore it when there was no affinity specified in the XML config.  If the user specified affinity explicitly, we must report an error if it can't be honoured.
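
A minimal sketch of that proposed behaviour (hypothetical helper and flag, not an actual libvirt patch): treat the failure as fatal only when the affinity was explicitly requested in the domain XML, otherwise carry on with whatever affinity QEMU inherited.

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

static int apply_affinity(pid_t pid, const cpu_set_t *mask,
                          bool requested_in_xml)
{
    if (sched_setaffinity(pid, sizeof(*mask), mask) == 0)
        return 0;

    if (requested_in_xml) {
        /* <vcpupin>/<emulatorpin>/cpuset was given and must be honoured */
        fprintf(stderr, "cannot set CPU affinity on process %d: %s\n",
                (int)pid, strerror(errno));
        return -1;
    }

    /* No pinning configured: best effort only.  In scenarios 4 and 6 above
     * QEMU ends up with the narrower inherited affinity, but still starts. */
    return 0;
}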

Comment 3 Fabian Deutsch 2020-05-28 13:50:04 UTC
Setting severity to high; however, we still need to check whether session mode is sufficient for us, which would then obsolete this bug.

Comment 4 Martin Kletzander 2020-08-25 07:25:33 UTC
(In reply to Fabian Deutsch from comment #3)
Any news with the session mode under normal user?

Comment 5 Vladik Romanovsky 2020-08-25 12:45:34 UTC
(In reply to Martin Kletzander from comment #4)
> (In reply to Fabian Deutsch from comment #3)
> Any news with the session mode under normal user?

Hi Martin,

Unfortunately, not yet.
It took us a while to find a solution for [1], which was a prerequisite for this work.
We will start looking into switching to session mode and running libvirt as non-root again, and will update soon.


[1]  https://github.com/kubevirt/kubevirt/pull/3290

Comment 6 Martin Kletzander 2020-09-04 12:25:01 UTC
So I re-read comment #2 and I trust Daniel's decision, even though, I must say, I am only half-convinced by it.  Both the embedded driver and session mode would actually do what you are requesting and more.  I'm not sure whether this is crippling libvirt or not, but as I said, I trust Dan's decision.

Patch posted here:

  https://www.redhat.com/archives/libvir-list/2020-September/msg00264.html

Comment 7 Martin Kletzander 2020-09-04 12:46:41 UTC
Fixed upstream with commit v6.7.0-54-g3791f29b085c:

commit 3791f29b085c514b171f9d8fc702975f9df9733c
Author: Martin Kletzander <mkletzan>
Date:   Fri Sep 4 14:17:30 2020 +0200

    qemu: Do not error out when setting affinity failed

Comment 13 Luyao Huang 2020-10-23 07:58:45 UTC
Reproduced this bug using gdb and taskset on libvirt-daemon-6.0.0-25.4.module+el8.2.1+8060+c0c58169.x86_64:

1. Disable the cpuset cgroup and manually start libvirtd:

# vim /etc/libvirt/qemu.conf

cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]
stdio_handler = "file"

# /usr/sbin/libvirtd

2. There is no vcpupin/emulatorpin in the guest XML.

3. Use taskset and gdb to create a similar environment:

# taskset -c -p 1-10 `pidof libvirtd`
pid 32902's current affinity list: 0-39
pid 32902's new affinity list: 1-10

# gdb -p `pidof libvirtd`

(gdb) b virProcessSetAffinity

4. Start the guest:
# virsh start vm1
(blocking)

5. In the gdb terminal, change the libvirtd permissions:

(gdb) handle SIG33 nostop
Signal        Stop	Print	Pass to program	Description
SIG33         No	Yes	Yes		Real-time event 33
(gdb) call (int)setuid(33)
$1 = 0
(gdb) c

result:

# virsh start vm1
error: Failed to start domain vm1
error: cannot set CPU affinity on process 33203: Operation not permitted


And failed to verify this bug on libvirt-daemon-6.6.0-7.module+el8.3.0+8424+5ea525c5.x86_64 since libvirtd crashed:

1. Disable the cpuset cgroup and manually start libvirtd:

# vim /etc/libvirt/qemu.conf

cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]
stdio_handler = "file"

# /usr/sbin/libvirtd

2. There is no vcpupin/emulatorpin in the guest XML.

3. Use taskset and gdb to create a similar environment:

# taskset -c -p 1-10 `pidof libvirtd`


# gdb -p `pidof libvirtd`

(gdb) b virProcessSetAffinity

4. Start the guest:
# virsh start vm1
(blocking)

5. In the gdb terminal, change the libvirtd permissions:

(gdb) handle SIG33 nostop
Signal        Stop	Print	Pass to program	Description
SIG33         No	Yes	Yes		Real-time event 33
(gdb) call (int)setuid(33)
$1 = 0
(gdb) c

Thread 6 "rpc-worker" received signal SIGABRT, Aborted.
0x00007f95417e87ff in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f95417e87ff in raise () from /lib64/libc.so.6
#1  0x00007f95417d2c35 in abort () from /lib64/libc.so.6
#2  0x00007f954182b987 in __libc_message () from /lib64/libc.so.6
#3  0x00007f9541832d8c in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f9541834afd in _int_free () from /lib64/libc.so.6
#5  0x00007f95457d4677 in virProcessSetAffinity (pid=<optimized out>, map=0x7f952803bea0, quiet=quiet@entry=true) at ../../src/util/virprocess.c:490
#6  0x00007f950896d968 in qemuProcessInitCpuAffinity (vm=0x5558d274d6a0) at ../../src/qemu/qemu_process.c:2585
#7  qemuProcessLaunch (conn=0x7f950c006660, driver=0x7f94d40fbc80, vm=0x5558d274d6a0, asyncJob=QEMU_ASYNC_JOB_START, incoming=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=17) at ../../src/qemu/qemu_process.c:6902
#8  0x00007f9508972775 in qemuProcessStart (conn=conn@entry=0x7f950c006660, driver=driver@entry=0x7f94d40fbc80, vm=vm@entry=0x5558d274d6a0, updatedCPU=updatedCPU@entry=0x0, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_START, 
    migrateFrom=migrateFrom@entry=0x0, migrateFd=-1, migratePath=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=<optimized out>) at ../../src/qemu/qemu_process.c:7202
#9  0x00007f95089d83f3 in qemuDomainObjStart (conn=0x7f950c006660, driver=0x7f94d40fbc80, vm=0x5558d274d6a0, flags=<optimized out>, asyncJob=QEMU_ASYNC_JOB_START) at ../../src/qemu/qemu_driver.c:7531
#10 0x00007f95089d8a5f in qemuDomainCreateWithFlags (dom=0x7f950c008d00, flags=0) at ../../src/qemu/qemu_driver.c:7582
#11 0x00007f95459b0af7 in virDomainCreate (domain=domain@entry=0x7f950c008d00) at ../../src/libvirt-domain.c:6531
#12 0x00005558d1867e06 in remoteDispatchDomainCreate (server=0x5558d274d080, msg=0x5558d278fc60, args=<optimized out>, rerr=0x7f95393718f0, client=0x5558d27951b0) at ./remote/remote_daemon_dispatch_stubs.h:4894
#13 remoteDispatchDomainCreateHelper (server=0x5558d274d080, client=0x5558d27951b0, msg=0x5558d278fc60, rerr=0x7f95393718f0, args=<optimized out>, ret=0x0) at ./remote/remote_daemon_dispatch_stubs.h:4873
#14 0x00007f95458deb19 in virNetServerProgramDispatchCall (msg=0x5558d278fc60, client=0x5558d27951b0, server=0x5558d274d080, prog=0x5558d27a1810) at ../../src/rpc/virnetserverprogram.c:430
#15 virNetServerProgramDispatch (prog=0x5558d27a1810, server=server@entry=0x5558d274d080, client=0x5558d27951b0, msg=0x5558d278fc60) at ../../src/rpc/virnetserverprogram.c:302
#16 0x00007f95458e3d16 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x5558d274d080) at ../../src/rpc/virnetserver.c:137
#17 virNetServerHandleJob (jobOpaque=0x5558d271f3c0, opaque=0x5558d274d080) at ../../src/rpc/virnetserver.c:154
#18 0x00007f95457f334f in virThreadPoolWorker (opaque=<optimized out>) at ../../src/util/virthreadpool.c:163
#19 0x00007f95457f294b in virThreadHelper (data=<optimized out>) at ../../src/util/virthread.c:233
#20 0x00007f9541f9814a in start_thread () from /lib64/libpthread.so.0
#21 0x00007f95418adf23 in clone () from /lib64/libc.so.6


(gdb) f 5
#5  0x00007f95457d4677 in virProcessSetAffinity (pid=<optimized out>, map=0x7f952803bea0, quiet=quiet@entry=true) at ../../src/util/virprocess.c:490
490	    CPU_FREE(mask);

(gdb) p *mask
$4 = {__bits = {140278598113984, 140278597945552, 0 <repeats 14 times>}}


Looks like a double free problem.
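
For context, a minimal standalone reproduction of the failure mode the backtrace suggests (a cpu_set_t allocated with CPU_ALLOC() being freed twice); this is only a guess at the pattern, not the actual libvirt code path:

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t *mask = CPU_ALLOC(1024);

    CPU_FREE(mask);     /* first free, e.g. on an error path */
    CPU_FREE(mask);     /* second free from a shared cleanup path: glibc
                           detects the double free and abort()s (SIGABRT) */
    return 0;
}

Freeing the mask in exactly one place (or NULLing the pointer after the first free) is the usual guard.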

Comment 15 Luyao Huang 2020-10-23 08:02:36 UTC
Hi Martin,

Could you please help check the issue in comment 13? And is it okay to track this crash in this bug? Thanks in advance for your help!

Luyao

Comment 16 Martin Kletzander 2020-10-27 13:10:03 UTC
Yep, thanks for finding this out; I really wonder why it did not happen to me.  The fix is here:

https://www.redhat.com/archives/libvir-list/2020-October/msg01439.html

Let's see how that goes.

Comment 17 Martin Kletzander 2020-10-27 15:39:56 UTC
Fixed upstream by v6.9.0-rc1-6-g1f807631f402:

commit 1f807631f402210d036ec4803e7adfefa222f786
Author: Martin Kletzander <mkletzan>
Date:   Tue Oct 27 13:48:38 2020 +0100

    util: Avoid double free in virProcessSetAffinity

Comment 20 Luyao Huang 2020-10-28 02:08:59 UTC
(In reply to Martin Kletzander from comment #16)
> Yep, thanks for finding out, I really wonder what caused it to not happen to
> me.  The fix is here:
> 
> https://www.redhat.com/archives/libvir-list/2020-October/msg01439.html
> 
> Let's see how that goes.

Thanks a lot for your fix and I will test your patch later when I have a test environment.

Moving the bug status back to ASSIGNED.

Comment 25 Danilo de Paula 2020-10-29 18:09:48 UTC
Thanks Jiri.

Comment 27 Luyao Huang 2020-11-04 06:38:53 UTC
(In reply to Martin Kletzander from comment #17)
> Fixed upstream by v6.9.0-rc1-6-g1f807631f402:
> 
> commit 1f807631f402210d036ec4803e7adfefa222f786
> Author: Martin Kletzander <mkletzan>
> Date:   Tue Oct 27 13:48:38 2020 +0100
> 
>     util: Avoid double free in virProcessSetAffinity

I have rebuilt libvirt with this patch and retested with the same steps as in comment 13.
The test result shows that the libvirtd crash has been fixed.

Comment 31 Luyao Huang 2020-12-16 08:15:11 UTC
Verified this bug with libvirt-daemon-6.10.0-1.module+el8.4.0+8898+a84e86e1.x86_64, using the same steps as in bug 1894409 comment 7.

Comment 33 errata-xmlrpc 2021-02-22 15:39:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0639

