Bug 1916346 - hot-unplug and then hotplug VFIO host device fails in non privileged container
Summary: hot-unplug and then hotplug VFIO host device fails in non privileged container
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.4
Assignee: Andrea Bolognani
QA Contact: yalzhang@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-14 15:27 UTC by omergi
Modified: 2023-03-14 14:45 UTC (History)
CC List: 21 users

Fixed In Version: libvirt-7.0.0-12.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-25 06:46:34 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments
Domain XML with VFIO host device (9.00 KB, text/plain)
2021-01-14 15:27 UTC, omergi
no flags Details
Domain XML after hot-unplug VFIO host device (8.69 KB, text/plain)
2021-01-14 15:28 UTC, omergi
no flags Details
VFIO host device XML (177 bytes, text/plain)
2021-01-14 15:29 UTC, omergi
no flags Details

Description omergi 2021-01-14 15:27:44 UTC
Created attachment 1747452 [details]
Domain XML with VFIO host device

Description of problem:
KubeVirt currently runs libvirt as root in a non-privileged container.
This container doesn't have the CAP_SYS_ADMIN or CAP_SYS_RESOURCE capabilities.

On a running KubeVirt VM, when we detach a VFIO host device (hot-unplug)
and try to attach it again (hot-plug), it fails with the following error:

From the pod that hosts the libvirt daemon:
# virsh list
 Id   Name            State
-------------------------------
 1    default_test3   running

# virsh detach-device 1 hostdev.xml 
Device detached successfully

# virsh attach-device 1 hostdev.xml --live
error: Failed to attach device from hostdev.xml
error: cannot limit locked memory of process 102 to 2074083328: Operation not permitted

# errors from the debug logs
cannot get locked memory limit of process 102: Operation not permitted","pos":"virProcessGetMaxMemLock:818","subcomponent":"libvirt","thread":"44"
cannot limit locked memory of process 102 to 2074083328: Operation not permitted","pos":"virProcessSetMaxMemLock:774","subcomponent":"libvirt","thread":"44"


We identified that the lack of the CAP_SYS_RESOURCE capability in the container that hosts the libvirt daemon process, combined with the fact that libvirtd and the QEMU processes run as different users/groups, causes this behavior.

Version-Release number of selected component (if applicable):
libvirt-daemon-driver-qemu-6.6.0-3


Steps to Reproduce:
1. Create a domain with an assigned VFIO device using KubeVirt (attachment 1747452 [details]).
For example:

kind: VirtualMachineInstance
spec:
  networks:
  - name: sriov3
    multus:
      networkName: sriov-network
  domain:
    devices:
      interfaces:
      - sriov: {}
        name: sriov3

The above configuration is converted to:
<domain type='kvm' id='1'>
...
  <devices>
  ...
    <hostdev mode='subsystem' type='pci' managed='no'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x03' function='0x1'/>
      </source>
      <alias name='ua-sriov-sriov3'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x01' function='0x0'/>
    </hostdev>
   ...
   </devices>
...
</domain>

2. Connect to the non-privileged container that runs the libvirt daemon
3. Detach a VFIO host device (attachment 1747455 [details])
    virsh detach-device 1 hostdev.xml 
4. Attach that device
    virsh attach-device 1 hostdev.xml --live

Actual results:
Hot-unplugging and then hot-plugging a VFIO host device to a domain in a non-privileged container fails.

Expected results:
Hot-unplugging and then hot-plugging a VFIO host device to a domain in a non-privileged container succeeds.

Additional info:

Comment 1 omergi 2021-01-14 15:28:44 UTC
Created attachment 1747454 [details]
Domain XML after hot-unplug VFIO host device

Comment 2 omergi 2021-01-14 15:29:28 UTC
Created attachment 1747455 [details]
VFIO host device XML

Comment 3 omergi 2021-01-15 09:53:05 UTC
I would like to suggest a workaround for this issue.
After reviewing [1] and [2], I noticed that we could make the
max memlock limit adjustment best-effort when it needs to
be decreased.

1. Fetching the current max memlock limit of the QEMU domain process [1]
Getting the current max memlock limit is done via the prlimit
syscall, which requires the calling process to have the
CAP_SYS_RESOURCE capability (correct me if I am wrong).
In case of a failure we could fall back to best-effort and read the
max memlock limit from the /proc/<PID>/limits file instead.


2. Setting the requested max memlock limit on the QEMU domain process [2]
In case setting the requested max memlock limit fails,
instead of failing immediately:
we could check whether the requested limit is lower than the current limit;
if it is, do not fail, otherwise fail.
  
[1] https://gitlab.com/libvirt/libvirt/-/blob/v6.6.0/src/util/virprocess.c#L785
[2] https://gitlab.com/libvirt/libvirt/-/blob/v6.6.0/src/qemu/qemu_domain.c#L8869
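
To illustrate the /proc-based fallback, here is a minimal, self-contained
sketch (hypothetical code with made-up names, not libvirt's implementation)
that reads the soft "Max locked memory" value for a given PID:

  /* Hypothetical fallback: read the soft "Max locked memory" value
   * from /proc/<PID>/limits when prlimit() is not permitted. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/resource.h>
  #include <sys/types.h>
  #include <unistd.h>

  static int
  get_max_memlock_from_proc(pid_t pid, unsigned long long *bytes)
  {
      char path[64];
      char line[256];
      FILE *fp;
      int ret = -1;

      snprintf(path, sizeof(path), "/proc/%lld/limits", (long long)pid);
      if (!(fp = fopen(path, "r")))
          return -1;

      while (fgets(line, sizeof(line), fp)) {
          char soft[32];

          /* Columns are: name, soft limit, hard limit, units */
          if (strncmp(line, "Max locked memory", 17) != 0)
              continue;

          if (sscanf(line + 17, "%31s", soft) != 1)
              break;

          if (strcmp(soft, "unlimited") == 0)
              *bytes = (unsigned long long)RLIM_INFINITY;
          else if (sscanf(soft, "%llu", bytes) != 1)
              break;

          ret = 0;
          break;
      }

      fclose(fp);
      return ret;
  }

  int main(void)
  {
      unsigned long long bytes = 0;

      if (get_max_memlock_from_proc(getpid(), &bytes) == 0)
          printf("Max locked memory (soft): %llu bytes\n", bytes);
      return 0;
  }

The idea would be for the getter to try prlimit first and only fall back
to something like the above when it fails with EPERM.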

Comment 4 Jiri Denemark 2021-01-15 10:08:16 UTC
Right, I think this is a reasonable thing to do. Failing to set the limit is
not an issue if the container already has a big enough limit.

Comment 10 Andrea Bolognani 2021-02-04 19:21:44 UTC
(In reply to omergi from comment #3)
> I would like to suggest a workaround for this issue 
> After reviewing [1] and [2], I noticed that we could make 
> max memlock limit adjustment to do best-effort when it should 
> be decreased.
> 
> 1. Fetching current qemu domain process max memlock limit [1] 
> Getting the current max memlock limit is done by using 
> prlimit syscall requires a process to have CAP_SYS_RESOURCE 
> capability (correct me if I am wrong)

This is the case according to prlimit(2), and indeed also how it
appears to work in practice:

  $ prlimit -p $$
  RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
  AS         address space limit                unlimited unlimited bytes
  ...

  $ prlimit -p 1
  prlimit: failed to get the AS resource limit: Operation not permitted

> In case of a failure we could do best-effort and get the 
> max memlock limits from /proc/<PID>/limits file instead.

Yeah, that should be viable:

  $ cat /proc/1/limits
  Limit                     Soft Limit           Hard Limit           Units
  Max cpu time              unlimited            unlimited            seconds
  ...

It feels quite silly, though: if I can look up the limits for a
random process by parsing a specific /proc file, why would I not be
allowed to obtain the same information via a system call?

> 2. Setting requested max memlock limit to qemu domain process [2] 
> In case setting the requested max memlock limit fails, 
> instead of failing immediately: 
> We could check if the requested limit is lower than the current limit,
> if true do not fail, in any other case fail.

After spending a bit of time looking into this, I'm afraid making it
work is not going to be as easy as one would hope :(

Basically this is how libvirt expects things to work:

  1) libvirtd is started with the default, usually very low, memory
     locking limit;

  2) when a VM that needs to lock additional memory is started,
     the limit for the QEMU process is increased accordingly;

  3) if the requirement is removed, for example because the last VFIO
     device that was assigned to the VM is hotunplugged, the limit is
     restored to the original, low value;

  4) if the requirement is later re-introduced, the limit is raised
     once more.

KubeVirt, on the other hand, is expecting things to work this way:

  1) KubeVirt raises the memory limit of the libvirtd process much
     higher than the default, and in fact higher than it expects
     libvirt to set it for the QEMU process;

  2) libvirt calculates the more accurate, lower limit and applies it
     to the QEMU process;

  3) further changes to the limit are performed without requiring
     elevated privileges or special capabilities such as
     CAP_SYS_RESOURCE.

While this summary fails to capture a lot of nuance - such as for
example the fact that, when the VM is first started, there is more
that we are allowed to do if we make sure it happens before exec() is
called - the basic problem is: if libvirtd is not given enough
privilege to raise the limit as needed, how can it cope with
hotplugging?

I can imagine a scenario where libvirt does not attempt to lower the
limit once the last VFIO device is hotunplugged from it, but that
doesn't address the scenario where a VM that initially had no VFIO
devices attached to it gets one hotplugged. In fact, based on my
understanding of the code I would expect this last scenario to fail
already - is that not the case? Am I missing something?

Comment 11 Edward Haas 2021-02-05 11:02:04 UTC
(In reply to Andrea Bolognani from comment #10)

> I can imagine a scenario where libvirt does not attempt to lower the
> limit once the last VFIO device is hotunplugged from it, but that
> doesn't address the scenario where a VM that initially had no VFIO
> devices attached to it gets one hotplugged. In fact, based on my
> understanding of the code I would expect this last scenario to fail
> already - is that not the case? Am I missing something?

I think you are right, but kubevirt can get out of this.

Kubevirt currently is able to increase libvirtd locked memory from outside the container
by identifying the libvirtd pid and increasing the locked memory on it.
Therefore, it can do the same for the qemu pid.

Furthermore, there is another scenario where this fix will help without
any special workaround from KubeVirt: on the source host,
the devices are detached before the migration, but in case of
a migration failure, it would be preferred to re-attach them
to the domain. This currently does not work, although
the locked memory has the appropriate (exact) size.

Originally I hoped that the migration process would create the
domain with the original locked memory, but that is indeed
incorrect and KubeVirt must intervene to raise it before
attempting to attach a VFIO device.
It would have been nicer not to do that, but unless we can
add a "fake" VFIO device or another parameter to trick libvirt
into thinking a VFIO device is to be taken into account, I see
no other way.

Comment 12 Petr Horáček 2021-02-10 14:11:11 UTC
Andrea, has Edward answered your concerns? Is there anything else we could help you clarify?

Comment 13 Laine Stump 2021-02-11 08:06:04 UTC
This is a bit of a tangent, but just FYI, I just posted libvirt patches to support the <teaming> element with a plain <hostdev> (so it doesn't need access to the PF).

This doesn't solve the problem described in this BZ, but it would permit a much simpler implementation of migration with a VFIO-assigned VF (in particular, it eliminates the need for hotplugging the device just to migrate).

Patches are here in case anyone wants to take a look:

https://listman.redhat.com/archives/libvir-list/2021-February/msg00651.html

Comment 14 Andrea Bolognani 2021-02-11 15:22:01 UTC
(In reply to Edward Haas from comment #11)
> (In reply to Andrea Bolognani from comment #10)
> > I can imagine a scenario where libvirt does not attempt to lower the
> > limit once the last VFIO device is hotunplugged from it, but that
> > doesn't address the scenario where a VM that initially had no VFIO
> > devices attached to it gets one hotplugged. In fact, based on my
> > understanding of the code I would expect this last scenario to fail
> > already - is that not the case? Am I missing something?
> 
> I think you are right, but kubevirt can get out of this.
> 
> Kubevirt currently is able to increase libvirtd locked memory from outside
> the container
> by identifying the libvirtd pid and increasing the locked memory on it.
> Therefore, it can do the same for the qemu pid.
> 
> Furthermore, there is another scenario where this fix will help without
> any special workaround from kubevirt: On the source host,
> the devices are detached before the migration, but in case of
> a migration failure, it would be preferred to re-attach them
> again to the domain. This currently does not work, although
> the locked memory has the appropriate (exact) size.
> 
> Originally I hoped that the migration process would create the
> domain with the original locked memory, but that is indeed
> incorrect and kubevirt must intervene to raise it before
> attempting to attach a VFIO device.
> It would have been nicer not to do that, but unless we can
> add a "fake" VFIO device or other parameter to trick libvirt
> in thinking a VFIO is to be taken into account, I see no other
> way.

Looking again at the initial report, I was confused as to how the
detach-device call could succeed, considering that, after hot-unplug,
libvirt will attempt to return the memory locking limit to its
original value, which it shouldn't be able to do.

What happens is, when qemuDomainAdjustMaxMemLock() is initially
called, it will ignore the failure in virProcessGetMaxMemLock(), and
that in turn causes the call made to virProcessSetMaxMemLock() at
hot-unplug time to become a no-op. Hence the lack of error.
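
As a self-contained toy model of that behaviour (heavily simplified and
hypothetical: get_max_memlock()/set_max_memlock() merely stand in for the
real virProcessGetMaxMemLock()/virProcessSetMaxMemLock()):

  #include <errno.h>
  #include <stdio.h>

  static unsigned long long original_memlock;    /* recorded "original" limit */

  /* Stand-in for virProcessGetMaxMemLock(): prlimit() on the QEMU pid
   * is denied in the unprivileged container, so this always fails. */
  static int get_max_memlock(unsigned long long *bytes)
  {
      (void)bytes;
      errno = EPERM;
      return -1;
  }

  /* Stand-in for virProcessSetMaxMemLock(): a requested limit of 0 is
   * treated as "nothing to do", any real value would need prlimit(). */
  static int set_max_memlock(unsigned long long bytes)
  {
      if (bytes == 0)
          return 0;
      errno = EPERM;
      return -1;
  }

  int main(void)
  {
      /* Initial adjustment: the read failure is ignored, so 0 ends up
       * recorded as the "original" limit. */
      if (get_max_memlock(&original_memlock) < 0)
          original_memlock = 0;

      /* Hot-unplug: restoring the recorded value of 0 is a no-op,
       * hence "Device detached successfully" with no error. */
      if (set_max_memlock(original_memlock) == 0)
          printf("hot-unplug: restore skipped, no error reported\n");

      /* Re-plug: the limit now has to be raised for real, which fails. */
      if (set_max_memlock(2074083328ULL) < 0)
          printf("hot-plug: cannot limit locked memory: Operation not permitted\n");

      return 0;
  }

Run as is, it prints the two outcomes seen in the initial report: a clean
detach followed by the "Operation not permitted" failure on re-attach.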


Anyway, going back to the proposed solution of treating a failure to
lower the memlock limit as a soft failure rather than a hard one: I
don't think we can do that in the general case. While in the context of KubeVirt
libvirtd and QEMU are running in a constrained environment, in a
classic virtualization scenario they aren't, and giving the QEMU
process the ability to lock a lot more memory than we know it needs
strikes me as a potential security issue.

At the same time, with the ability to call prlimit() removed from
libvirtd, there is not much it can do to either lower or raise the
memlock limit...

So I think the only viable way forward is to

  1) change libvirt so that, when CAP_SYS_RESOURCE is not available,
     it doesn't even attempt to change the memlock limit;

  2) change KubeVirt so that it consistently updates the memlock
     limit for the QEMU process as devices are hot(un)plugged.

Maybe instead of quietly not updating the memory limits when
CAP_SYS_RESOURCE is not available we might want to have a qemu.conf
setting that can be used to opt into this behavior? I don't have an
exceedingly strong opinion on the subject.

Does this proposal sound reasonable?


As sort of a side-question, can you recommend convenient ways to test
VFIO scenarios for KubeVirt? Ideally something that doesn't require
having actual VFIO-capable hardware handy, like assigning emulated
devices to a nested VM or something along those lines?

Comment 15 Edward Haas 2021-02-12 15:40:27 UTC
(In reply to Andrea Bolognani from comment #14)
> 
> Looking again at the initial report, I was confused as to how the
> detach-device call could succeed, considering that libvirt will after
> hot-unplug attempt to return the memory locking limit to its original
> value, which it shouldn't be able to do.
> 
> What happens is, when qemuDomainAdjustMaxMemLock() is initially
> called, it will ignore the failure in virProcessGetMaxMemLock(), and
> that in turn will result in the call made to
> virProcessSetMaxMemLock() at hot-unplug time to become a no-op. Hence
> the lack of error.

Yes, it will just silently fail and leave the memory as is.
 
> 
> Anyway, going back to the proposed solution of turning lowering the
> memlock limit from a hard failure to a soft one: I don't think we can
> do that in the general case. While in the context of KubeVirt
> libvirtd and QEMU are running in a constrained environment, in a
> classic virtualization scenario they aren't, and giving the QEMU
> process the ability to lock a lot more memory than we know it needs
> strikes me as a potential security issue.

I think the suggested logic prioritizes keeping the existing behavior
as is and only extends it to be "softer" when declaring an error.
For a classic setup, it will work as it does today; for setups with
the missing capability, it will fail only if the end result cannot
sustain the requirements (i.e. only on failure to increase the memlock).

And it should do that only if the relevant error is thrown, not on any
error.

> 
> At the same time, with the ability to call prlimit() removed from
> libvirtd, there is not much it can do to either lower or raise the
> memlock limit...
> 
> So I think the only viable way forward is to
> 
>   1) change libvirt so that, when CAP_SYS_RESOURCE is not available,
>      it doesn't even attempt to change the memlock limit;
> 
>   2) change KubeVirt so that it consistently updates the memlock
>      limit for the QEMU process as devices are hot(un)plugged.
> 
> Maybe instead of quietly not updating the memory limits when
> CAP_SYS_RESOURCE is not available we might want to have a qemu.conf
> setting that can be used to opt into this behavior? I don't have an
> exceedingly strong opinion on the subject.
> 
> Does this proposal sound reasonable?

From a client point of view, I would like to see hot-unplug
keep its behavior as is (which also makes sense for backward compatibility).
That means it should not lower the memlock if it cannot.

On hot-plug I would like libvirt not to fail if there is enough memlock
on the domain process, and to fail if there is not enough.

We have little say on how this is solved internally.
I can only recommend using "ask for forgiveness, not permission".
I do not know if there are other constraints on when the memlock is not
available, and to me it would seem nicer to solve it in a consistent manner.
By consistent, I mean that lowering the memlock is optional, raising it
is a must, and reading the current value can have an alternative path.
Be it with CAP_SYS_RESOURCE or without, this logic sounds safe to me.

Regarding the configuration option: I would prefer to see an option
to create a domain with a "dummy" VFIO device, or a field that tells
libvirt to always assume there is a VFIO device in the config.
This should also solve the problem in cases where we are interested in
detach/attach, be it locally or through a migration.
But this is only optional, and could even be an alternative to the
previous logic.

> 
> 
> As sort of a side-question, can you recommend convenient ways to test
> VFIO scenarios for KubeVirt? Ideally something that doesn't require
> having actual VFIO-capable hardware handy, like assigning emulated
> devices to a nested VM or something along those lines?

I think that passing a simple NIC to the VM should do the trick, no?
I do not know if it can be passed through two levels.

Comment 16 Andrea Bolognani 2021-02-12 17:39:06 UTC
(In reply to Edward Haas from comment #15)
> (In reply to Andrea Bolognani from comment #14)
> > Anyway, going back to the proposed solution of turning lowering the
> > memlock limit from a hard failure to a soft one: I don't think we can
> > do that in the general case. While in the context of KubeVirt
> > libvirtd and QEMU are running in a constrained environment, in a
> > classic virtualization scenario they aren't, and giving the QEMU
> > process the ability to lock a lot more memory than we know it needs
> > strikes me as a potential security issue.
> 
> I think the suggested logic prioritize keeping the existing behavior
> as is and only extending it to be "softer" when declaring an error.
> For a classic setup, it will work as today and for setups with
> the missing capability, to fail only in case the end result cannot
> sustain the requirements (i.e. only failure to increase the memlock).
> 
> And it should do that only if the relevant error is thrown, not on any
> error.

Well, deciding whether to call prlimit() based on the availability of
CAP_SYS_RESOURCE, or calling the function unconditionally and then
deciding whether to report an error on failure based on whether
-EPERM is returned, are just two slightly different shades of the
same bikeshed color, aren't they? :)

> > At the same time, with the ability to call prlimit() removed from
> > libvirtd, there is not much it can do to either lower or raise the
> > memlock limit...
> > 
> > So I think the only viable way forward is to
> > 
> >   1) change libvirt so that, when CAP_SYS_RESOURCE is not available,
> >      it doesn't even attempt to change the memlock limit;
> > 
> >   2) change KubeVirt so that it consistently updates the memlock
> >      limit for the QEMU process as devices are hot(un)plugged.
> > 
> > Maybe instead of quietly not updating the memory limits when
> > CAP_SYS_RESOURCE is not available we might want to have a qemu.conf
> > setting that can be used to opt into this behavior? I don't have an
> > exceedingly strong opinion on the subject.
> > 
> > Does this proposal sound reasonable?
> 
> From a client point of view, I would like to see hot-unplug
> keep its behavior as is (which also makes sense for backward compatibility).
> That means it should not lower the memlock if it cannot.
> 
> On hot-plug I would like libvirt not to fail if there is enough memlock
> on the domain process, and to fail if there is not enough.

This last part makes sense: if we're not calling prlimit() ourselves,
then we should very well make sure that the existing limit is at
least as high as we would have set it.

> On how this is solved internally we have little say here.
> I can only recommend here to use "ask for forgiveness and not permission".
> I do not know if there other constraints on when the memlock is not
> available,
> and to me, it would seem to be nicer to solve it in a consistent manner.
> By consistent, I mean that lowering the memlock is optional, raising it
> is a must and reading the current value can have an alternative path.
> Be it CAP_SYS_RESOURCE or not, this logic sounds safe to me.

The problem with applying this approach unconditionally, as I've
tried to explain earlier, is that in the general case failing to keep
the memlock limit as low as possible is a potential security risk,
and so it should be considered a hard failure rather than a soft one.
The fact that libvirt currently doesn't behave that way is a bug,
although one that users are quite unlikely to actually hit outside of
KubeVirt.

This same reason also makes me think that we should have a
configuration option for this after all, rather than basing the
behavior exclusively on whether CAP_SYS_RESOURCE is available: if
someone is running libvirt in a containerized environment but without
an external entity such as KubeVirt taking care of modifying the
memlock limit, then all failures should result in hard errors.

> Regarding the configuration option: I would prefer to see an option
> to create a domain with a "dummy" VFIO device or a field that tells
> libvirt to always assume there is a VFIO device in the config.
> This should also solve the problem in cases where we are interested to
> detach/attach, be it locally or through a migration.
> But this is only optional, and can even be an alternative to the prev
> logic.

Note that, while things are relatively simple for x86, they can be
much more nuanced on other architectures: see for example what
happens on ppc64[1], where such a binary option would not be
effective because having zero, one, two or more VFIO devices assigned
to the VM results in different memlock limits.

In addition to that, note that using mdev and NVMe devices might also
require a higher-than-normal memlock limit, so you can't just solve
everything with a dummy VFIO device.

> > As sort of a side-question, can you recommend convenient ways to test
> > VFIO scenarios for KubeVirt? Ideally something that doesn't require
> > having actual VFIO-capable hardware handy, like assigning emulated
> > devices to a nested VM or something along those lines?
> 
> I think that passing a simple NIC to the VM should do the trick, no?
> I do not know if it can be passed through two levels.

My laptop doesn't have any VFIO-capable devices in it, so that rules
that out.

Either way, I would assume the KubeVirt functest suite has some
tests that ensure VFIO functionality is not broken by code changes,
no? And that would run even on cloud machines which don't have
assignable PCI devices of their own?

I think pkg/virt-handler/device-manager/pci_device_test.go might be
close to what I'm looking for, but I was hoping you could just point
me straight to the interesting part so that I could avoid having to
navigate the uncharted territory first :)


[1] https://gitlab.com/libvirt/libvirt/-/blob/f28a652a32e0764e1ab528a004c33a49937a0c45/src/qemu/qemu_domain.c#L8935-9060

Comment 17 Andrea Bolognani 2021-03-05 19:21:12 UTC
Patches posted upstream:

  https://listman.redhat.com/archives/libvir-list/2021-March/msg00293.html

Sorry this took so long! I got kinda lost in the woods trying to make
sense of our internal APIs, and ended up spending a significant chunk
of time making them more reasonable.

Edward, if I provided you with a build of libvirt that includes this
new feature, would you be able to work on the KubeVirt changes and
validate that it does in fact solve your issue?

Comment 18 Edward Haas 2021-03-09 07:27:44 UTC
(In reply to Andrea Bolognani from comment #17)
> Edward, if I provided you with a build of libvirt that includes this
> new feature, would you be able to work on the KubeVirt changes and
> validate that it does in fact solve your issue?

Yes, we can validate this.
We will need it based on libvirt 6.6 as this is what we are currently deploying.

Comment 19 Andrea Bolognani 2021-03-10 10:53:13 UTC
(In reply to Edward Haas from comment #18)
> (In reply to Andrea Bolognani from comment #17)
> > Edward, if I provided you with a build of libvirt that includes this
> > new feature, would you be able to work on the KubeVirt changes and
> > validate that it does in fact solve your issue?
> 
> Yes, we can validate this.
> We will need it based on libvirt 6.6 as this is what we are currently
> deploying.

Great!

Please grab the branch at

  https://github.com/andreabolognani/kubevirt/tree/memlock-test

and let me know whether that improves things for you.

Comment 20 Andrea Bolognani 2021-03-15 16:50:53 UTC
(In reply to Andrea Bolognani from comment #19)
> (In reply to Edward Haas from comment #18)
> > Yes, we can validate this.
> > We will need it based on libvirt 6.6 as this is what we are currently
> > deploying.
>
> Great!
>
> Please grab the branch at
>
>   https://github.com/andreabolognani/kubevirt/tree/memlock-test
>
> and let me know whether that improves things for you.

Edward,

did you get a chance to validate the libvirt changes?

They have now been accepted upstream, but it would be great if I
could get a thumbs up from the KubeVirt side before actually merging
them, in the off chance that further tweaks are necessary.

Thanks!

Comment 21 omergi 2021-03-21 17:50:02 UTC
Hi Andrea

I can confirm the version on the 'memlock-test' branch works, verified by:
 creating a KubeVirt VM with an SR-IOV VF device
 connecting to the non-privileged container
 detaching the SR-IOV VF hostdev
 re-attaching the device
It finished successfully without the "Operation not permitted" errors.
I also checked the libvirtd and qemu-kvm processes' memlock rlimits,
and they don't change when detaching/attaching the device.


Looking forward to the release with those bits.
Thank you very much! :)

Comment 22 Andrea Bolognani 2021-03-22 15:15:06 UTC
(In reply to omergi from comment #21)
> I can confirm the version on the 'memlock-test' branch works, verified by:
>  creating a KubeVirt VM with an SR-IOV VF device
>  connecting to the non-privileged container
>  detaching the SR-IOV VF hostdev
>  re-attaching the device
> It finished successfully without the "Operation not permitted" errors.
> I also checked the libvirtd and qemu-kvm processes' memlock rlimits,
> and they don't change when detaching/attaching the device.

Thanks for testing! I have now pushed this as

  commit af41380672a1c1c7ce5531f761bf8834452ff58e
  Author: Andrea Bolognani <abologna>
  Date:   Tue Mar 9 11:40:21 2021 +0100

    qemu: Only raise memlock limit if necessary
    
    Attempting to set the memlock limit might fail if we're running
    in a containerized environment where CAP_SYS_RESOURCE is not
    available, and if the limit is already high enough there's no
    point in trying to raise it anyway.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1916346
    
    Signed-off-by: Andrea Bolognani <abologna>
    Reviewed-by: Michal Privoznik <mprivozn>

  v7.1.0-280-gaf41380672

and it will be included in the upcoming libvirt 7.2.0 release.
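
The gist of the change, shown here as a hypothetical sketch rather than
the actual diff (calling prlimit(2) directly instead of libvirt's
virProcessGet/SetMaxMemLock wrappers; a /proc/<PID>/limits read fallback
along the lines of comment 3 would cover the case where even reading the
limit is denied), is to skip the privileged call whenever the target
process already has a high enough limit:

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/resource.h>
  #include <sys/types.h>
  #include <unistd.h>

  static int
  adjust_max_memlock(pid_t pid, rlim_t required)
  {
      struct rlimit rlim;

      /* If the current limit can be read and is already high enough
       * (e.g. raised by KubeVirt from outside the container), there is
       * nothing to do and CAP_SYS_RESOURCE is never needed. */
      if (prlimit(pid, RLIMIT_MEMLOCK, NULL, &rlim) == 0 &&
          rlim.rlim_cur >= required)
          return 0;

      /* The limit genuinely needs raising; failing here remains a
       * hard error, as before. */
      rlim.rlim_cur = rlim.rlim_max = required;
      return prlimit(pid, RLIMIT_MEMLOCK, &rlim, NULL);
  }

  int main(void)
  {
      /* Demonstrate on the current process with a modest 64 MiB limit. */
      if (adjust_max_memlock(getpid(), 64ULL * 1024 * 1024) < 0)
          perror("adjust_max_memlock");
      return 0;
  }

With a check like this, a limit that was raised from outside the container
is simply left alone instead of triggering the prlimit() failure.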

Have you also tested the scenario where a VFIO device is hotplugged
into a VM that did *not* have one assigned to it when it was created?

This should work fine from the libvirt point of view, but I think
KubeVirt might need to be modified so that it pokes inside the
virt-launcher container, looks for the QEMU process and raises its
memlock limit the same way it would for the libvirtd process if a
VFIO device had been configured from the start.

Comment 25 Edward Haas 2021-03-23 05:45:21 UTC
> and it will be included in the upcoming libvirt 7.2.0 release.
> 
> Have you also tested the scenario where a VFIO device is hotplugged
> into a VM that did *not* have one assigned to it when it was created?

This is not a supported scenario in KubeVirt. But I think we actually tested
it as part of the migration flow which starts a domain on the target without
any VFIO (SR-IOV in our case) devices, then attaches them.
I think this is pretty much the same flow as you described except that
it is done on a migrated domain.

> 
> This should work fine from the libvirt point of view, but I think
> KubeVirt might need to be modified so that it pokes inside the
> virt-launcher container, looks for the QEMU process and raises its
> memlock limit the same way it would for the libvirtd process if a
> VFIO device had been configured from the start.

Yes, this is what was done by Or M. when testing.

Comment 27 Andrea Bolognani 2021-03-25 10:52:05 UTC
(In reply to Edward Haas from comment #25)
> > and it will be included in the upcoming libvirt 7.2.0 release.
> > 
> > Have you also tested the scenario where a VFIO device is hotplugged
> > into a VM that did *not* have one assigned to it when it was created?
> 
> This is not a supported scenario in KubeVirt. But I think we actually tested
> it as part of the migration flow which starts a domain on the target without
> any VFIO (SR-IOV in our case) devices, then attaches them.
> I think this is pretty much the same flow as you described except that
> it is done on a migrated domain.

I think the difference is that, in the migration case, KubeVirt knows
in advance that a VFIO device is going to be added to the VM, and so
it will raise the memlock limit for libvirtd on the target host
before the VM is started; in a pure hotplug scenario, the requirement
to increase the memlock limit would be introduced during the lifetime
of the VM, and so the limit would need to be increased for the qemu
process rather than the libvirtd process.

That said, if hotplug of VFIO devices is not supported in KubeVirt
then we don't need to change anything I guess :)

(In reply to Edward Haas from comment #26)
> [...] even with the current
> state, we have a workaround (an nasty one) that gives the CAP_SYS_RESOURCE capability
> to the container *if* SR-IOV live migration feature is enabled (it is feature-gated).

Was this workaround disabled when testing the libvirt changes? If
not, we might not have validated that the fix actually does what it's
intended to.

IIUC, most of

  https://github.com/kubevirt/kubevirt/pull/4874

would have to be reverted, and the feature gate (if retained) would
become a no-op.

Comment 28 omergi 2021-03-25 13:48:42 UTC
> I think the difference is that, in the migration case, KubeVirt knows
> in advance that a VFIO device is going to be added to the VM, and so
> it will raise the memlock limit for libvirtd on the target host
> before the VM is started; in a pure hotplug scenario, the requirement
> to increase the memlock limit would be introduced during the lifetime
> of the VM, and so the limit would need to be increased for the qemu
> process rather than the libvirtd process.
> 
> That said, if hotplug of VFIO devices is not supported in KubeVirt
> then we don't need to change anything I guess :)
> 

When I tested migration with your patch I had to make adjustments to KubeVirt
so that it changes the QEMU process memlock limit as well as the libvirtd process's.

> Was this workaround disabled when testing the libvirt changes? If
> not, we might not have validated that the fix actually does what it's
> intended to.
> 
> IIUC, most of
> 
>   https://github.com/kubevirt/kubevirt/pull/4874
> 
> would have to be reverted, and the feature gate (if retained) would
> become a no-op.

When I tested it I removed the feature-gate code (the PR you mentioned)
and verified that the VM pod runs without the SYS_RESOURCE capability.

Comment 29 Andrea Bolognani 2021-03-25 17:10:27 UTC
(In reply to omergi from comment #28)
> > I think the difference is that, in the migration case, KubeVirt knows
> > in advance that a VFIO device is going to be added to the VM, and so
> > it will raise the memlock limit for libvirtd on the target host
> > before the VM is started; in a pure hotplug scenario, the requirement
> > to increase the memlock limit would be introduced during the lifetime
> > of the VM, and so the limit would need to be increased for the qemu
> > process rather than the libvirtd process.
> > 
> > That said, if hotplug of VFIO devices is not supported in KubeVirt
> > then we don't need to change anything I guess :)
> 
> When I tested migration with your patch I had to do adjustments to Kubevirt 
> so it will change QEMU process memlock limit as well as libvirtd process.

Oh, great!

> > Was this workaround disabled when testing the libvirt changes? If
> > not, we might not have validated that the fix actually does what it's
> > intended to.
> > 
> > IIUC, most of
> > 
> >   https://github.com/kubevirt/kubevirt/pull/4874
> > 
> > would have to be reverted, and the feature gate (if retained) would
> > become a no-op.
> 
> When I tested it I removed the feature-gate code (the PR you mentioned)
> and verified the VM pod runs without SYS_RESOURCE capability.

Very cool! I assume you'll send in both these changes as a follow-up
PR once libvirt has been updated to contain the fix?

Comment 33 Michal Privoznik 2021-03-26 17:05:31 UTC
Setting UPSTREAM keyword based on comment 22.

Comment 36 yalzhang@redhat.com 2021-03-29 09:41:27 UTC
Hi Andrea, I have tried to reproduce it on libvirt-libs-7.0.0-11.module+el8.4.0+10505+3a8d753f.x86_64. The details are below; could you please check whether the configuration and steps are correct? Thank you!

1. Prepare a running VM with a hostdev device:
# virsh nodedev-detach pci_0000_82_10_7
Device pci_0000_82_10_7 detached

Start a VM with a hostdev device as below:
 <hostdev mode='subsystem' type='pci' managed='no'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

Stop libvirtd service:
# systemctl stop libvirtd

2. Modify qemu.conf to start the QEMU process as root:
# cat /etc/libvirt/qemu.conf
user = "root"
group = "root"

3. Create a normal user:
# useradd test
# passwd test

4. Add permissions on libvirt-related directories for the test user:
# chown -R test:test /etc/libvirt
# chown -R test:test /var/lib/libvirt
# chown -R test:test /var/cache/libvirt
# chown -R test:test /run/
# chown -R test:test /var/run

5. Start a container with the test user and then start the virtqemud process as the root user in the container:
$ unshare --map-root-user -U -- virtqemud --verbose


6. Open another terminal and set libvirt.conf to connect to virtqemud:
$ cat /etc/libvirt/libvirt.conf
uri_default = "qemu:///system"
remote_mode = "direct"

7. After virtqemud has started, open another terminal and check the device for the VM:
$ unshare --map-root-user -U -- virsh -c qemu:///system edit rhel
...
 <hostdev mode='subsystem' type='pci' managed='no'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>
…

8. Prepare the XML and do hot-unplug and hot-plug:
$ cat net.xml
<hostdev mode='subsystem' type='pci' managed='no'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </hostdev>

$ unshare --map-root-user -U -- virsh -c qemu:///system  detach-device rhel net.xml 
Device detached successfully

$ unshare --map-root-user -U -- virsh -c qemu:///system  attach-device rhel net.xml 
error: Failed to attach device from net.xml
error: cannot limit locked memory of process 32852 to 2147483648: Operation not permitted

Comment 37 Andrea Bolognani 2021-03-29 17:07:51 UTC
(In reply to yalzhang from comment #36)
> Hi Andrea, I have tried to reproduce it on
> libvirt-libs-7.0.0-11.module+el8.4.0+10505+3a8d753f.x86_64. The details are
> as below, could you please help to check if the configuration and steps are
> correct? Thank you!

Do I understand correctly that the steps above are intended to
reproduce the issue in the current official build of libvirt, so that
you can use that as a baseline when verifying that the backport did
successfully address the bug?

> 1. Prepare a running vm with hostdev device:
> # virsh nodedev-detach pci_0000_82_10_7
> Device pci_0000_82_10_7 detached
> 
> Start a vm with hostdev device as below:
>  <hostdev mode='subsystem' type='pci' managed='no'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
>       </source>
>       <address type='pci' domain='0x0000' bus='0x04' slot='0x00'
> function='0x0'/>
>     </hostdev>
> 
> Stop libvirtd service:
> # systemctl stop libvirtd
> 
> 2.Modify qemu.conf to start qemu process with root:
> # cat /etc/libvirt/qemu.conf
> user = "root"
> group = "root"

This doesn't match the configuration used by KubeVirt or RHEL, both
of which use qemu:qemu for the QEMU process.

Why are you changing this? Do things fail otherwise? The VM is
already running at this point, so changing the configuration file is
not going to have any effect as far as I can tell.

> 3.Create a normal user:
> # useradd test
> # passwd test
> 
> 4.Add permission of libvirt related dirs for user test:
> # chown -R test:test /etc/libvirt
> # chown -R test:test /var/lib/libvirt
> # chown -R test:test /var/cache/libvirt
> # chown -R test:test /run/
> # chown -R test:test /var/run

Hopefully you're doing this in a throwaway test environment, because
it's definitely not the kind of change one wants to make to a regular
machine :)

> 5.Start a container with test user and then start virtqemud process with
> root user in container:
> $ unshare --map-root-user -U -- virtqemud --verbose

May I ask why you're using virtqemud here?

While the expectation is that everything will work just fine when
using the modular daemons, the reality is that what's currently used
by KubeVirt, as well as the default setup for RHEL, is the monolithic
daemon, so it seems to me that it would make more sense to follow the
path of least resistance and use that one? Especially since that's
how you started the VM in the first place, so you're sort of mixing
and matching.

> 6.Open another terminal, set libvirt.conf to connect to virtqemud:
> $ cat /etc/libvirt/libvirt.conf
> uri_default = "qemu:///system"

This is unnecessary, as you're passing the URL to virsh every single
time.

> remote_mode = "direct"

virsh should be smart enough to connect to the standalone virtqemud
daemon if it can't connect to libvirtd, but you can also force that
without editing libvirt.conf by using

  qemu:///system?mode=direct

as the connection URI. However, see the point made above for using
virtqemud as opposed to libvirtd.

> 7.After virtqemud started, open another terminal and check the device for
> the vm:
> $ unshare --map-root-user -U -- virsh -c qemu:///system edit rhel
> ...
>  <hostdev mode='subsystem' type='pci' managed='no'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
>       </source>
>       <address type='pci' domain='0x0000' bus='0x04' slot='0x00'
> function='0x0'/>
>     </hostdev>
> …
> 
> 8.prepare the xml and do hot unplug and hotplug:
> $ cat net.xml
> <hostdev mode='subsystem' type='pci' managed='no'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
>       </source>
>       <alias name='hostdev0'/>
>       <address type='pci' domain='0x0000' bus='0x01' slot='0x00'
> function='0x0'/>
>     </hostdev>
> 
> $ unshare --map-root-user -U -- virsh -c qemu:///system  detach-device rhel
> net.xml 
> Device detached successfully
> 
> $ unshare --map-root-user -U -- virsh -c qemu:///system  attach-device rhel
> net.xml 
> error: Failed to attach device from net.xml
> error: cannot limit locked memory of process 32852 to 2147483648: Operation
> not permitted

This last part looks about right.


Taking a step back, what you're doing to reproduce the issue seems to
be more complicated than I would assume to be necessary. Is starting
the VM with the regular, fully-privileged libvirtd something that you
need to do in order to be able to use the VFIO device initially?

Another thing: since the whole issue is about libvirtd not having the
CAP_SYS_RESOURCE capability inside KubeVirt's virt-launcher
container, it might make sense to drop that capability explicitly
when (re-)launching the daemon, using something like

  $ unshare --map-root-user -U -- \
      capsh --drop=cap_sys_resource -- \
        libvirtd

When using KubeVirt, the set of capabilities is actually restricted
further, and looks like

  cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
  cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,
  cap_net_admin,cap_sys_chroot,cap_sys_nice,cap_mknod,
  cap_audit_write,cap_setfcap+eip

but I haven't managed to figure out the exact unshare/capsh
invocation that would result in the same set.

Comment 38 yalzhang@redhat.com 2021-03-30 09:51:24 UTC
Hi Andrea, thank you for your feedback about this. Yes, I tried to reproduce the bug in comment 36, and all the settings are there to make virtqemud run successfully in the namespace. The reason to use virtqemud instead of libvirtd is that libvirtd needs more permissions to be set up.
The method you have mentioned is very convenient to use, but I cannot run it successfully; could you provide some detailed configuration and preparation steps?

We have seen such bugs about KubeVirt before (Bug 1901685), but we do not have a KubeVirt environment. How to emulate the environment to modify and check a specific capability is a problem. Do you know how to check whether a specific capability is available or not, and whether it is expected to be when we start libvirtd in the namespace?

Here are my steps:
Start the libvirtd service, detach the nodedev device, then start the VM with the hostdev device, then stop the libvirtd service, switch to the test user, and try to run libvirtd without any configuration of the directories or config files:
[test@dell-per730-36 ~]$ unshare --map-root-user -U -- capsh --drop=cap_sys_resource  -- libvirtd
/usr/sbin/libvirtd: /usr/sbin/libvirtd: cannot execute binary file

[test@dell-per730-36 ~]$ unshare --map-root-user -U  -- libvirtd
2021-03-30 09:23:50.703+0000: 2704: info : libvirt version: 7.0.0, package: 11.module+el8.4.0+10505+3a8d753f (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2021-03-27-04:28:24, )
2021-03-30 09:23:50.703+0000: 2704: info : hostname: dell-per730-36.lab.eng.pek2.redhat.com
2021-03-30 09:23:50.703+0000: 2704: error : main:915 : Can't load config file: Failed to open file '/etc/libvirt/libvirtd.conf': Permission denied: /etc/libvirt/libvirtd.conf

2. Then I modified the permission:
[root@dell-per730-36 ~]# chown -R test:test /etc/libvirt/
[root@dell-per730-36 ~]# chown -R test:test /run
chown: changing ownership of '/run/user/0/gvfs': Function not implemented
[root@dell-per730-36 ~]# chown -R  test:test /var/lib/
[root@dell-per730-36 ~]# chown -R  test:test /var/cache/libvirt/

3. After these modifications, libvirtd can start, but with errors:
[test@dell-per730-36 ~]$ unshare --map-root-user -U  -- libvirtd
2021-03-30 09:27:11.504+0000: 2969: info : libvirt version: 7.0.0, package: 11.module+el8.4.0+10505+3a8d753f (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2021-03-27-04:28:24, )
2021-03-30 09:27:11.504+0000: 2969: info : hostname: dell-per730-36.lab.eng.pek2.redhat.com
2021-03-30 09:27:11.504+0000: 2969: error : virGDBusGetSystemBus:101 : internal error: Unable to get system bus connection: Exhausted all available authentication mechanisms (tried: EXTERNAL) (available: EXTERNAL)
2021-03-30 09:27:11.504+0000: 2969: warning : networkStateInitialize:796 : DBus not available, disabling firewalld support in bridge_network_driver: internal error: Unable to get system bus connection: Exhausted all available authentication mechanisms (tried: EXTERNAL) (available: EXTERNAL)
2021-03-30 09:27:11.522+0000: 2969: error : virCommandWait:2760 : internal error: Child process (dmidecode -q -t 0,1,2,3,4,11,17) unexpected exit status 1: /sys/firmware/dmi/tables/smbios_entry_point: Permission denied
/dev/mem: Permission denied
2021-03-30 09:27:11.589+0000: 2982: error : virCgroupSetValueRaw:535 : Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2d2\x2drhel.scope/libvirt/cpuset.mems': Permission denied
^C

4. But when I run with the option “-- capsh --drop=cap_sys_resource”, it still fails.
[test@dell-per730-36 ~]$ unshare --map-root-user -U  -- capsh --drop=cap_sys_resource  -- libvirtd
/usr/sbin/libvirtd: /usr/sbin/libvirtd: cannot execute binary file

5. So I tried to start it without the option:
$ unshare --map-root-user -U  -- libvirtd

On another terminal:
[test@dell-per730-36 ~]$ unshare --map-root-user -U -- virsh -c qemu:///system detach-device rhel net.xml 
Device detached successfully

[test@dell-per730-36 ~]$ unshare --map-root-user -U -- virsh -c qemu:///system attach-device rhel net.xml 
error: Failed to attach device from net.xml
error: cannot limit locked memory of process 2314 to 2147483648: Operation not permitted

Does this show that cap_sys_resource is dropped by default? And are these steps proper for testing and verifying the bug?

Comment 39 Andrea Bolognani 2021-03-30 12:45:31 UTC
(In reply to yalzhang from comment #38)
> Hi Andrea, thank you for your feedback about this. Yes, I tried to reproduce
> the bug in comment 36. And all the settings are to make the virtqemud run
> successfully in the namespace. The reason to use virtqemud instead of
> libvirtd is that libvirtd needs more permissions to set.
> The method you have mentioned is very convenient to use, but I can not run
> it successfully, could you provide some detailed configurations and
> preparations?

Sorry if I was not clear on this, but the unshare/capsh command line
I provided was not one that I have actually tested but rather my
guess at what sort of combination might do the trick! I wish I had
ready-made instructions to share with you, but the harsh truth is
that I simply don't :(

> We have seen such bugs about kubevirt before(Bug 1901685), but we do not
> have a kubevirt environment. How to emulate the environment to modify and
> check a specific permission is a problem. Do you know how to check if a
> specific permission is available or not, and if it is expected when we start
> libvirtd in the namespace?  

This is another question that I unfortunately don't have a very good
answer for :( What I usually do is start a VM in KubeVirt, then enter
the corresponding container via `kubectl exec` and poke around. To
find out what capabilities a certain process has, I use `getpcaps`.

> There are my steps:
> Start libvirtd service, detach the nodedev device, then start the vm with
> hostdev device, then stop the libvirtd service, switch to test user, try to
> run libvirtd without any configurations for the directories or config files:
> [test@dell-per730-36 ~]$ unshare --map-root-user -U -- capsh
> --drop=cap_sys_resource  -- libvirtd
> /usr/sbin/libvirtd: /usr/sbin/libvirtd: cannot execute binary file
> 
> [test@dell-per730-36 ~]$ unshare --map-root-user -U  -- libvirtd
> 2021-03-30 09:23:50.703+0000: 2704: info : libvirt version: 7.0.0, package:
> 11.module+el8.4.0+10505+3a8d753f (Red Hat, Inc.
> <http://bugzilla.redhat.com/bugzilla>, 2021-03-27-04:28:24, )
> 2021-03-30 09:23:50.703+0000: 2704: info : hostname:
> dell-per730-36.lab.eng.pek2.redhat.com
> 2021-03-30 09:23:50.703+0000: 2704: error : main:915 : Can't load config
> file: Failed to open file '/etc/libvirt/libvirtd.conf': Permission denied:
> /etc/libvirt/libvirtd.conf
> 
> 2. Then I modified the permission:
> [root@dell-per730-36 ~]# chown -R test:test /etc/libvirt/
> [root@dell-per730-36 ~]# chown -R test:test /run
> chown: changing ownership of '/run/user/0/gvfs': Function not implemented
> [root@dell-per730-36 ~]# chown -R  test:test /var/lib/
> [root@dell-per730-36 ~]# chown -R  test:test /var/cache/libvirt/
> 
> 3. After these modifications, the libvirtd can start but with error:
> [test@dell-per730-36 ~]$ unshare --map-root-user -U  -- libvirtd
> 2021-03-30 09:27:11.504+0000: 2969: info : libvirt version: 7.0.0, package:
> 11.module+el8.4.0+10505+3a8d753f (Red Hat, Inc.
> <http://bugzilla.redhat.com/bugzilla>, 2021-03-27-04:28:24, )
> 2021-03-30 09:27:11.504+0000: 2969: info : hostname:
> dell-per730-36.lab.eng.pek2.redhat.com
> 2021-03-30 09:27:11.504+0000: 2969: error : virGDBusGetSystemBus:101 :
> internal error: Unable to get system bus connection: Exhausted all available
> authentication mechanisms (tried: EXTERNAL) (available: EXTERNAL)
> 2021-03-30 09:27:11.504+0000: 2969: warning : networkStateInitialize:796 :
> DBus not available, disabling firewalld support in bridge_network_driver:
> internal error: Unable to get system bus connection: Exhausted all available
> authentication mechanisms (tried: EXTERNAL) (available: EXTERNAL)
> 2021-03-30 09:27:11.522+0000: 2969: error : virCommandWait:2760 : internal
> error: Child process (dmidecode -q -t 0,1,2,3,4,11,17) unexpected exit
> status 1: /sys/firmware/dmi/tables/smbios_entry_point: Permission denied
> /dev/mem: Permission denied
> 2021-03-30 09:27:11.589+0000: 2982: error : virCgroupSetValueRaw:535 :
> Unable to write to
> '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2d2\x2drhel.scope/libvirt/
> cpuset.mems': Permission denied
> ^C
> 
> 4. But when I run with the option “-- capsh --drop=cap_sys_resource”, it
> still fails.
> [test@dell-per730-36 ~]$ unshare --map-root-user -U  -- capsh
> --drop=cap_sys_resource  -- libvirtd
> /usr/sbin/libvirtd: /usr/sbin/libvirtd: cannot execute binary file
> 
> 5. So I tried to start it without the option:
> $ unshare --map-root-user -U  -- libvirtd
> 
> On another terminal:
> [test@dell-per730-36 ~]$ unshare --map-root-user -U -- virsh -c
> qemu:///system detach-device rhel net.xml 
> Device detached successfully
> 
> [test@dell-per730-36 ~]$ unshare --map-root-user -U -- virsh -c
> qemu:///system attach-device rhel net.xml 
> error: Failed to attach device from net.xml
> error: cannot limit locked memory of process 2314 to 2147483648: Operation
> not permitted
> 
> Does this show that the cap_sys_resource is dropped by default?

I think it just shows that, regardless of whether or not the
container runtime presents CAP_SYS_RESOURCE as available to the
process, the reality is that you're running libvirtd as an
unprivileged process and your user simply lacks the necessary
permissions to execute that operation.

> If these
> steps are proper for testing and verifying the bug?

I don't think they are, because the environment libvirt is running in
looks significantly different from the one it'd be running in as part
of KubeVirt.

Back when the base container for virt-launcher was still built
separately instead of along with the rest of KubeVirt, the README
file for the repository contained the following command line:

  $ docker run \
      --name libvirtd \
      --rm \
      --net=host \
      --pid=host \
      --ipc=host \
      --user=root \
      --privileged \
      --volume=/:/host:Z \
      --volume=/dev:/host-dev \
      --volume=/sys:/host-sys \
      --volume=/var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
      --volume=/var/run/libvirt:/var/run/libvirt \
      --tty=true \
      --detached=true \
      kubevirt/libvirt@sha256:1234567890abcdef...

I wonder if it could be adapted so that you could have a simple
Dockerfile along the lines of

  FROM registry.access.redhat.com/ubi8/ubi
  RUN dnf install -y libvirt-daemon-kvm

and then run libvirtd inside the container in a way that's reasonably
close to how KubeVirt would run it? That would help a lot: libvirt
developers when it comes to preparing fixes / features targeted at
KubeVirt, and libvirt QE when the time comes to validate them.

@rmohr, what do you think?

Comment 42 Petr Horáček 2021-04-01 09:28:39 UTC
@abologna, @yalzhang I agree that having a simple container to make this kind of verification easier should be the end goal.

In the meantime, is there another way for you to test this with libvirt alone? Would it help if we gave you access to a cluster that runs this patched version of libvirt, so you can poke around there?

Comment 43 yalzhang@redhat.com 2021-04-01 10:17:32 UTC
(In reply to Petr Horáček from comment #42)
> @abologna, @yalzhang I agree that having a simple
> container to make these kinds of verification easier should be the end goal.
> 
> For the meantime, is there another way for you to test this with libvirt
> alone? Would it help if we gave you access to a cluster that runs this
> patched version of libvirt, so you can poke around there?

Yes, it would be great if we could use the cluster to test the bug, as it's difficult for us to prepare such an environment in a short time. Or we may ask for @edwardh's help to verify the bug.

Comment 44 Edward Haas 2021-04-01 10:35:34 UTC
(In reply to yalzhang from comment #43)
> (In reply to Petr Horáček from comment #42)
> > @abologna, @yalzhang I agree that having a simple
> > container to make these kinds of verification easier should be the end goal.
> > 
> > For the meantime, is there another way for you to test this with libvirt
> > alone? Would it help if we gave you access to a cluster that runs this
> > patched version of libvirt, so you can poke around there?
> 
> Yes, it will be great that we can use the cluster to test the bug as it's
> difficult for us to prepare such a env. in a short time. Or we may ask
> @edwardh's help to verify the bug.

How is this different from what @omergi performed here: https://bugzilla.redhat.com/show_bug.cgi?id=1916346#c28 ?

Comment 47 Andrea Bolognani 2021-04-01 14:38:36 UTC
(In reply to Edward Haas from comment #44)
> (In reply to yalzhang from comment #43)
> > (In reply to Petr Horáček from comment #42)
> > > @abologna, @yalzhang I agree that having a simple
> > > container to make these kinds of verification easier should be the end goal.
> > > 
> > > For the meantime, is there another way for you to test this with libvirt
> > > alone? Would it help if we gave you access to a cluster that runs this
> > > patched version of libvirt, so you can poke around there?
> > 
> > Yes, it will be great that we can use the cluster to test the bug as it's
> > difficult for us to prepare such a env. in a short time. Or we may ask
> > @edwardh's help to verify the bug.
> 
> How is this different from what @omergi performed here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1916346#c28 ?

In that case the libvirt package used was based on 6.6.0, which is
the version used upstream at least until [kubevirt/kubevirt#5328][]
has been merged, with the memlock fix patches backported by
addressing a non-trivial amount of merge conflicts; the way the fix
will land in RHEL AV 8.4, on the other hand, is by performing a very
straightforward backport on top of the much newer libvirt 7.0.0.

Maybe we should approach this from a different angle: can we split QE
responsibilities between libvirt QE and OpenShift Virtualization QE?

The former would verify that, in a RHEL AV environment, the changes
don't cause any regressions, whereas the latter would make sure that
the fix actually addresses the issue KubeVirt faces, just like
@omergi did earlier (but with a proper scratch build this time).

This way, each QE person would be able to work in an environment
they're comfortable with, and moving the bug to the VERIFIED state
should not be unnecessarily delayed.

Does that make sense?


[kubevirt/kubevirt#5328]: https://github.com/kubevirt/kubevirt/pull/5328

Comment 48 Edward Haas 2021-04-01 14:55:52 UTC
(In reply to Andrea Bolognani from comment #47)
> (In reply to Edward Haas from comment #44)
> > How is this different from what @omergi performed here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1916346#c28 ?
> 
> In that case the libvirt package used was based on 6.6.0, which is
> the version used upstream at least until [kubevirt/kubevirt#5328][]
> has been merged, with the memlock fix patches backported by
> addressing a non-trivial amount of merge conflicts; the way the fix
> will land in RHEL AV 8.4, on the other hand, is by performing a very
> straightforward backport on top of the much newer libvirt 7.0.0.
> 
> Maybe we should approach this from a different angle: can we split QE
> responsibilities between libvirt QE and OpenShift Virtualization QE?
> 
> The former would verify that, in a RHEL AV environment, the changes
> don't cause any regressions, whereas the latter would make sure that
> the fix actually addresses the issue KubeVirt faces, just like
> @omergi did earlier (but with a proper scratch build this time).
> 
> This way, each QE person would be able to work in an environment
> they're comfortable with and getting the bug to VERIFIED state should
> not see unnecessary delays.
> 
> Does that make sense?

Aha, so that was libvirt 6.6 and we're now talking about 7.0.

It is pretty complicated for us to test this on RHEL.
@omergi can test this in the same manner as you requested last time;
that should be close enough to what we need, right?

Comment 49 Andrea Bolognani 2021-04-01 16:26:54 UTC
(In reply to Edward Haas from comment #48)
> (In reply to Andrea Bolognani from comment #47)
> > Maybe we should approach this from a different angle: can we split QE
> > responsibilities between libvirt QE and OpenShift Virtualization QE?
> > 
> > The former would verify that, in a RHEL AV environment, the changes
> > don't cause any regressions, whereas the latter would make sure that
> > the fix actually addresses the issue KubeVirt faces, just like
> > @omergi did earlier (but with a proper scratch build this time).
> > 
> > This way, each QE person would be able to work in an environment
> > they're comfortable with and getting the bug to VERIFIED state should
> > not see unnecessary delays.
> > 
> > Does that make sense?
> 
> Aha, so that was libvirt 6.6 and we now talking about 7.0.
> 
> It is pretty complicated to test this on RHEL.
> @omergi can test this in the same manner you requested it last time, that
> should
> be close enough to what we need. Right?

Yes, that's exactly what I had in mind! And @yalzhang can simply go
through the existing test scenarios for VFIO with libvirt on RHEL and
verify that the changes don't introduce any regression there.

I'm not familiar with the build process for OpenShift Virtualization,
so I wouldn't know where to start with creating a downstream branch
that can be used for testing, but I assume @omergi will know how to
do that when provided with a bunch of fresh libvirt RPMs.

Does that sound good?

The scratch build that I made a few days ago appears to be still
available. I'll repost the link below.

Comment 51 Andrea Bolognani 2021-04-01 17:35:22 UTC
(In reply to Andrea Bolognani from comment #14)
> As sort of a side-question, can you recommend convenient ways to test
> VFIO scenarios for KubeVirt? Ideally something that doesn't require
> having actual VFIO-capable hardware handy, like assigning emulated
> devices to a nested VM or something along those lines?

@omergi shared the following steps outside of bugzilla, and I'm
reporting them here for reference:

  export KUBEVIRT_PROVIDER=kind-k8s-sriov-1.17.0
  export KUBEVIRT_NUM_NODES=3
  export KUBEVIRT_E2E_FOCUS=SRIOV
  # Create containerized k8s cluster with one control-plane and two workers
  # with sriov-device-plugin and sriov-cni
  make cluster-up
  # Build and deploy Kubevirt operator
  make cluster-sync
  # Run Kubevirt SRIOV E2E tests including the migration test:
  E2E_FOCUS=SRIOV make functest

When I tried following them on my machine, I immediately got

  ./cluster-up/up.sh
  SR-IOV cluster can be only started with 2 nodes
  make: *** [Makefile:122: cluster-up] Error 1

which was easy to work around by setting KUBEVIRT_NUM_NODES to 2
instead of 3; even after I did that, however, 'make cluster-up'
failed with

  + SRIOV_NODE=sriov-worker
  + NODE_PFS=($(move_sriov_pfs_netns_to_node "$SRIOV_NODE" "$PF_COUNT_PER_NODE"))
  ++ move_sriov_pfs_netns_to_node sriov-worker 1
  ++ local -r node=sriov-worker
  ++ local -r pf_count_per_node=1
  +++ docker inspect -f '{{.State.Pid}}' sriov-worker
  ++ local -r pid=499048
  ++ pf_array=()
  ++ local pf_array
  ++ mkdir -p /var/run/netns/
  mkdir: cannot create directory ‘/var/run/netns/’: Permission denied
  ++ ln -sf /proc/499048/ns/net /var/run/netns/sriov-worker
  ln: failed to create symbolic link '/var/run/netns/sriov-worker': No such file or directory
  ++ sriov_pfs=($(find /sys/class/net/*/device/sriov_numvfs))
  +++ find '/sys/class/net/*/device/sriov_numvfs'
  find: ‘/sys/class/net/*/device/sriov_numvfs’: No such file or directory
  ++ local -r sriov_pfs
  ++ '[' 0 -eq 0 ']'
  ++ echo 'FATAL: Could not find available sriov PFs'
  FATAL: Could not find available sriov PFs
  ++ return 1
  make: *** [Makefile:122: cluster-up] Error 1

So I guess the machine this is run on needs to have at least one
SR-IOV-capable device after all? I'll see if I can get my hands on a
suitable machine and try again.

Comment 52 Edward Haas 2021-04-01 21:05:13 UTC
(In reply to Andrea Bolognani from comment #49)
> (In reply to Edward Haas from comment #48)
> > Aha, so that was libvirt 6.6 and we now talking about 7.0.
> > 
> > It is pretty complicated to test this on RHEL.
> > @omergi can test this in the same manner you requested it last time, that
> > should
> > be close enough to what we need. Right?
> 
> Yes, that's exactly what I had in mind! And @yalzhang can simply go
> through the existing test scenarios for VFIO with libvirt on RHEL and
> verify that the changes don't introduce any regression there.
> 
> I'm not familiar with the build process for OpenShift Virtualization,
> so I wouldn't know where to start with creating a downstream branch
> that can be used for testing, but I assume @omergi will know how to
> do that when provided with a bunch of fresh libvirt RPMs.
> 
> Does that sound good?

We are able to test this against KubeVirt; this is what @omergi has done previously.
I suggest we focus there, the same as we did when you provided the PR for libvirt 6.6.

Comment 53 yalzhang@redhat.com 2021-04-02 09:05:24 UTC
I have tested the scratch build from comment 50 with basic VFIO scenarios on RHEL; no regressions found so far. The scenarios include: 1) start a VM with a hostdev interface or hostdev device, then hotplug and unplug; 2) start a VM without any hostdev interface or device, then hotplug and unplug. I will also trigger one round of automated tests with it.
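
For reference, the hotplug/unplug cycle used in those scenarios boils down to something like the following sketch (the domain name "rhel8" and the hostdev.xml file here are placeholders, not taken from the actual test run):

  # Plug the VFIO host device into the running guest
  virsh attach-device rhel8 hostdev.xml --live
  # Confirm it shows up in the live domain XML
  virsh dumpxml rhel8 | grep hostdev -A 8
  # Unplug it again
  virsh detach-device rhel8 hostdev.xml --live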

Comment 54 Andrea Bolognani 2021-04-02 09:21:47 UTC
(In reply to Edward Haas from comment #52)
> (In reply to Andrea Bolognani from comment #49)
> > I'm not familiar with the build process for OpenShift Virtualization,
> > so I wouldn't know where to start with creating a downstream branch
> > that can be used for testing, but I assume @omergi will know how to
> > do that when provided with a bunch of fresh libvirt RPMs.
>
> We are able to test this against kubevirt, this is what @omergi has done
> previously.
> I suggest we focus there, the same as we did when you provided the PR for
> libvirt 6.6

Gotcha! I have prepared a KubeVirt branch that consumes a libvirt
build made from the same dist-git tree used for the scratch build:

  https://github.com/andreabolognani/kubevirt/tree/bz1916346.memlock

Note that QEMU has *not* been updated, so you'll be driving an old
QEMU with a new libvirt: this is a supported scenario from the
upstream point of view, but not necessarily a recommended one and
also not one that matches what is going to be shipped in whatever
version of KubeVirt switches over to RHEL AV 8.4 contents, but I
thought it would be better to isolate the libvirt update from the
QEMU one for testing purposes.

A COPR that contains a newer QEMU version is also available, and you
can switch to it locally for additional testing if you want to:

  https://copr.fedorainfracloud.org/coprs/g/kubevirt/qemu-5.2.0-14.el8/
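
If you would rather pull that QEMU onto a plain RHEL/CentOS 8 host for a quick look instead of rebuilding the KubeVirt images, something along these lines should work; this is only a sketch, and it assumes the dnf copr plugin (dnf-plugins-core) is installed and that the COPR provides builds for your release:

  # Enable the group COPR carrying the newer QEMU build
  dnf copr enable @kubevirt/qemu-5.2.0-14.el8
  # Pull in the QEMU packages from the COPR
  dnf upgrade qemu-kvm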

That's the one that is consumed by my ongoing PR for updating libvirt
and QEMU in KubeVirt:

  https://github.com/kubevirt/kubevirt/pull/5328

(In reply to yalzhang from comment #53)
> I have tested with scratch build in comment 50 with basic scenarios with
> VFIO on rhel, no regressions found till now. The scenarios includes: 1)
> start vm with hostdev interface or hostdev device, then do hotplug and
> unplug; 2) start vm without any hostdev interface or device, then do hotplug
> and then unplug; I will also trigger one round auto test with it.

This is excellent news, thank you!

Comment 55 omergi 2021-04-06 11:00:08 UTC
> Gotcha! I have prepared a KubeVirt branch that consumes a libvirt
> build made from the same dist-git tree used for the scratch build:
> 
>   https://github.com/andreabolognani/kubevirt/tree/bz1916346.memlock
> 
> Note that QEMU has *not* been updated, so you'll be driving an old
> QEMU with a new libvirt: this is a supported scenario from the
> upstream point of view, but not necessarily a recommended one and
> also not one that matches what is going to be shipped in whatever
> version of KubeVirt switches over to RHEL AV 8.4 contents, but I
> thought it would be better to isolate the the libvirt update from the
> QEMU one for testing purposes.
> 
> A COPR that contains a newer QEMU version is also available, and you
> can switch to it locally for additional testing if you want to:
> 
>   https://copr.fedorainfracloud.org/coprs/g/kubevirt/qemu-5.2.0-14.el8/
> 
> That's the one that is consumed by my ongoing PR for updating libvirt
> and QEMU in KubeVirt:
> 
>   https://github.com/kubevirt/kubevirt/pull/5328
> 

@abologna I tested again now with the new branch and it works;
I verified it the same way I did before: https://bugzilla.redhat.com/show_bug.cgi?id=1916346#c21

I also updated my branch; it's rebased on the latest KubeVirt with the changes from
https://github.com/andreabolognani/kubevirt/tree/bz1916346.memlock
It also includes:
- adjusting the QEMU process memory lock limit (see the sketch below for one way to inspect it)
- removing the feature-gate code (no SYS_RESOURCE capability)
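
As a quick sanity check for the memory lock limit adjustment, something like the following can be run from inside the launcher pod; this is a sketch that assumes the QEMU binary there is named qemu-kvm and that util-linux's prlimit and pidof are available:

  # Show the soft/hard MEMLOCK limits of the running QEMU process
  prlimit --memlock --pid "$(pidof qemu-kvm)"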

Comment 59 yalzhang@redhat.com 2021-04-08 10:35:30 UTC
Tested on libvirt-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64 with basic scenarios: 1) start a VM with a hostdev, then hotplug -> hot-unplug; 2) start a VM without any hostdev interface, then hotplug and unplug. No regression issues found, and the automated job finished successfully.

Comment 60 chhu 2021-04-09 07:58:30 UTC
Reproduced on libvirt-daemon-driver-qemu-6.6.0-13.module+el8.3.1+9548+0a8fede5.x86_64 on CNV 2.6.1,
then tested on libvirt-daemon-driver-qemu-7.0.0-12.module+el8.4.0+10596+32ba7df3.x86_64 on CNV 2.6.1 (with RHEL-AV 8.4.0 libvirt and qemu-kvm).

Hot-unplugged the VF, then hotplugged it back successfully.

Test steps:
1. Start a VM with an SR-IOV VF in the CNV 2.6.1 environment with RHEL-AV 8.4.0 libvirt and qemu-kvm.
2. Check that the VM is running, log in to the guest console, run ifconfig, touch a file, and confirm the guest can be pinged from another host.
   Check the hostdev XML in the VM.
#./oc get vmi
NAME    AGE     PHASE     IP             NODENAME
rhel8   7m23s   Running   10.73.33.123   bootp-73-33-166.lab.eng.pek2.redhat.com

#./oc get pod|grep virt-lau
virt-launcher-rhel8-89dhj                             1/1     Running   0          17m

#./oc rsh virt-launcher-rhel8-89dhj
sh-4.4# virsh list --all
 Id   Name                  State
-------------------------------------
 1    openshift-cnv_rhel8   running

sh-4.4# virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 8
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x1'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

3. Hot-unplug the VF successfully; check the VM XML (no hostdev device) and the VM console (no interface).
sh-4.4# virsh detach-device openshift-cnv_rhel8 hostdev.xml
Device detached successfully
sh-4.4#  virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 8
sh-4.4#

4. Hotplug the VF back successfully. Check in the VM XML that the hostdev device is present,
   and in the VM console that the "Ethernet Virtual Function 700 Series [8086:154c]" device is there; the VM can be pinged from another host.

sh-4.4# cat hostdev.xml
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x1'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

sh-4.4# virsh attach-device openshift-cnv_rhel8 hostdev.xml --live
Device attached successfully

sh-4.4# virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 8
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x1'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

[root@dell-per730-45 ocp4.7.4]# ./oc get vmi
NAME    AGE   PHASE     IP             NODENAME
rhel8   24m   Running   10.73.33.123   bootp-73-33-166.lab.eng.pek2.redhat.com

[root@dell-per730-45 ocp4.7.4]# ping 10.73.33.123
PING 10.73.33.123 (10.73.33.123) 56(84) bytes of data.
64 bytes from 10.73.33.123: icmp_seq=1 ttl=61 time=0.206 ms
......
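
As an additional in-guest check, one way to confirm the VF is visible again after the hotplug might be the following (a sketch, assuming pciutils is installed in the guest):

  # Inside the guest: look for the re-attached Virtual Function
  lspci -nn | grep -i 'virtual function'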

Comment 61 chhu 2021-04-09 08:07:24 UTC
Ran some other tests on this environment;
I think the behaviors are by design in the current CNV 2.6.1.

1. Restarted the VM with the VF successfully in the CNV web console.
   The virt-launcher pod is terminated and a new pod is created.
   The VF device is changed and the network IP address is changed; the VM can be pinged from another host after rebooting.

#./oc get vmi
NAME    AGE     PHASE     IP             NODENAME
rhel8   6m41s   Running   10.73.33.152   bootp-73-33-166.lab.eng.pek2.redhat.com

#./oc get pod|grep virt-launcher
virt-launcher-rhel8-7r4hs                             1/1     Running   0          17s : it was virt-launcher-rhel8-89dhj before restarting the VM

#./oc rsh virt-launcher-rhel8-7r4hs
sh-4.4# virsh list --all
 Id   Name                  State
-------------------------------------
 1    openshift-cnv_rhel8   running

sh-4.4# virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 6
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x3'/>  : it was function='0x1' before restarting the VM
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

2. Stopped and started the VM with the VF successfully.
   The virt-launcher pod is terminated and a new pod is created;
   the VF device is changed and the network IP address is changed; the VM can be pinged from another host.

3. Attaching another VF to the VM failed. I think this is by design in the current CNV 2.6.1.

sh-4.4#  virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 8
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x3'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

sh-4.4# cat hostdev.xml
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x1'/>
      </source>
      <alias name='hostdev0'/>
    </hostdev>

sh-4.4# virsh attach-device openshift-cnv_rhel8 hostdev.xml --live
error: Failed to attach device from hostdev.xml
error: unable to stat: /dev/vfio/63: No such file or directory
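
When this error shows up, a quick way to see which VFIO group device nodes the launcher pod actually exposes is something like the following (sketch):

  # List the VFIO group device nodes available inside the pod
  ls -l /dev/vfio/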

Comment 62 chhu 2021-04-09 08:13:27 UTC
Hi, Andrea

Do you think the testing in comments 55, 59, 60 and 61 is enough?
Would you like us to check any other test scenarios in the CNV environment with RHEL-AV 8.4.0 libvirt/qemu-kvm for this bug?


Regards,
Chenli Hu

Comment 63 Andrea Bolognani 2021-04-09 11:40:41 UTC
(In reply to chhu from comment #62)
> Hi, Andrea
> 
> Do you think the testing in comment 55,59,60,61 is enough or not ?
> Would you like to us to check some other test scenarios in CNV env with
> RHEL-AV 8.4.0 libvirt/qemu-kvm for this bug ?

The libvirt part (Comment 59) looks reasonable, as does the KubeVirt
test from Comment 55.

As for your own tests (Comment 60 and Comment 61), the details are a
bit too KubeVirt-specific for me to judge - perhaps @omergi can take
a look? And if there are no objections from that side, we can move
the bug to VERIFIED.

Comment 66 yalzhang@redhat.com 2021-04-12 02:31:56 UTC
Considering the schedule, we will verify the bug now. And we will do more tests if needed after @omergi's feedback. Thank you!

Comment 67 omergi 2021-04-12 11:43:35 UTC
@chhu Thanks for testing this; no objections from my side.

Comment 68 chhu 2021-04-13 01:00:36 UTC
(In reply to omergi from comment #67)
> @chhu Thanks for testing this no objections from my side
@omergi, thank you!

Comment 70 errata-xmlrpc 2021-05-25 06:46:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098

