Created attachment 1747452 [details]
Domain XML with VFIO host device

Description of problem:
KubeVirt currently runs libvirt as root in a non-privileged container. This container has neither the CAP_SYS_ADMIN nor the CAP_SYS_RESOURCE capability.
On a running KubeVirt VM, when we detach a VFIO host device (hot-unplug) and try to attach it again (hot-plug), the operation fails with the following error.

From the pod that hosts the libvirt daemon:

# virsh list
 Id   Name            State
-------------------------------
 1    default_test3   running

# virsh detach-device 1 hostdev.xml
Device detached successfully

# virsh attach-device 1 hostdev.xml --live
error: Failed to attach device from hostdev.xml
error: cannot limit locked memory of process 102 to 2074083328: Operation not permitted

# error from debug logs
cannot get locked memory limit of process 102: Operation not permitted","pos":"virProcessGetMaxMemLock:818","subcomponent":"libvirt","thread":"44"
cannot limit locked memory of process 102 to 2074083328: Operation not permitted","pos":"virProcessSetMaxMemLock:774","subcomponent":"libvirt","thread":"44"

We identified that the lack of the CAP_SYS_RESOURCE capability in the container that hosts the libvirt daemon process, combined with the fact that the libvirtd and qemu processes run as different users/groups, causes this behavior.

Version-Release number of selected component (if applicable):
libvirt-daemon-driver-qemu-6.6.0-3

Steps to Reproduce:
1. Create a domain with an assigned VFIO device using KubeVirt (attachment 1747452 [details]). For example:

kind: VirtualMachineInstance
spec:
  networks:
  - name: sriov3
    multus:
      networkName: sriov-network
  domain:
    devices:
      interfaces:
      - sriov: {}
        name: sriov3

The above configuration is converted to:

<domain type='kvm' id='1'>
...
  <devices>
  ...
    <hostdev mode='subsystem' type='pci' managed='no'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x03' function='0x1'/>
      </source>
      <alias name='ua-sriov-sriov3'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x01' function='0x0'/>
    </hostdev>
  ...
  </devices>
...
</domain>

2. Connect to the non-privileged container that runs the libvirt daemon.
3. Detach a VFIO host device (attachment 1747455 [details]):
   virsh detach-device 1 hostdev.xml
4. Attach that device again:
   virsh attach-device 1 hostdev.xml --live

Actual results:
Hot-unplugging and then hot-plugging a VFIO host device to a domain in a non-privileged container fails.

Expected results:
Hot-unplugging and then hot-plugging a VFIO host device to a domain in a non-privileged container succeeds.

Additional info:
Created attachment 1747454 [details] Domain XML after hot-unplug VFIO host device
Created attachment 1747455 [details] VFIO host device XML
I would like to suggest a workaround for this issue.

After reviewing [1] and [2], I noticed that we could make the max memlock limit adjustment best-effort when the limit is being decreased.

1. Fetching the current max memlock limit of the qemu domain process [1]
Getting the current max memlock limit is done with the prlimit syscall, which requires the calling process to have the CAP_SYS_RESOURCE capability (correct me if I am wrong).
In case of a failure we could fall back to best-effort and read the max memlock limit from the /proc/<PID>/limits file instead.

2. Setting the requested max memlock limit on the qemu domain process [2]
In case setting the requested max memlock limit fails, instead of failing immediately we could check whether the requested limit is lower than the current limit; if it is, do not fail, otherwise fail.

[1] https://gitlab.com/libvirt/libvirt/-/blob/v6.6.0/src/util/virprocess.c#L785
[2] https://gitlab.com/libvirt/libvirt/-/blob/v6.6.0/src/qemu/qemu_domain.c#L8869
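As a rough illustration of the fallback idea, here is a minimal sketch (in Python rather than libvirt's C, with a made-up function name) that reads a process's max locked memory limit from /proc/<PID>/limits, which is readable without CAP_SYS_RESOURCE:

```python
import os
import resource

def get_max_memlock_from_proc(pid):
    """Best-effort fallback: parse /proc/<pid>/limits instead of prlimit().

    Returns the soft limit in bytes, or None for "unlimited".
    """
    with open("/proc/%d/limits" % pid) as f:
        for line in f:
            if line.startswith("Max locked memory"):
                # Columns: "Max locked memory  <soft>  <hard>  bytes"
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    raise LookupError("Max locked memory not found for pid %d" % pid)

# Cross-check against the in-process view of the same limit
own_soft = resource.getrlimit(resource.RLIMIT_MEMLOCK)[0]
print(get_max_memlock_from_proc(os.getpid()), own_soft)
```

This only demonstrates the read path; it is Linux-specific and does not address setting the limit.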
Right, I think this is a reasonable thing to do. Failing to set the limit is not an issue if the container already has a big enough limit.
(In reply to omergi from comment #3) > I would like to suggest a workaround for this issue > After reviewing [1] and [2], I noticed that we could make > max memlock limit adjustment to do best-effort when it should > be decreased. > > 1. Fetching current qemu domain process max memlock limit [1] > Getting the current max memlock limit is done by using > prlimit syscall requires a process to have CAP_SYS_RESOURCE > capability (correct me if I am wrong) This is the case according to prlimit(2), and indeed also how it appears to work in practice: $ prlimit -p $$ RESOURCE DESCRIPTION SOFT HARD UNITS AS address space limit unlimited unlimited bytes ... $ prlimit -p 1 prlimit: failed to get the AS resource limit: Operation not permitted > In case of a failure we could do best-effort and get the > max memlock limits from /proc/<PID>/limits file instead. Yeah, that should be viable: $ cat /proc/1/limits Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited seconds ... It feels quite silly, though: if I can look up the limits for a random process by parsing a specific /proc file, why would I not be allowed to obtain the same information via a system call? > 2. Setting requested max memlock limit to qemu domain process [2] > In case setting the requested max memlock limit fails, > instead of failing immediately: > We could check if the requested limit is lower then the current limit, > if true do not fail, in any other case fail. 
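The permission asymmetry shown above with the prlimit command can also be observed from Python's resource.prlimit(), which wraps the same syscall (a sketch; whether the cross-process call fails depends on the caller's capabilities, so the failing case is only probed, not assumed):

```python
import os
import resource

# Reading our own limits always works; no capability is needed.
soft, hard = resource.prlimit(os.getpid(), resource.RLIMIT_MEMLOCK)
print("own memlock limit:", soft, hard)

# Reading another process's limits needs CAP_SYS_RESOURCE (or matching
# credentials); without it the kernel returns EPERM.
try:
    print("pid 1 memlock limit:", resource.prlimit(1, resource.RLIMIT_MEMLOCK))
except PermissionError:
    print("pid 1: Operation not permitted")
```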
After spending a bit of time looking into this, I'm afraid making it work is not going to be as easy as one would hope :(

Basically this is how libvirt expects things to work:

1) libvirtd is started with the default, usually very low, memory locking limit;
2) when a VM that requires locking additional memory is started, the limit for the QEMU process is increased accordingly;
3) if the requirement is removed, for example because the last VFIO device that was assigned to the VM is hotunplugged, the limit is restored to the original, low value;
4) if the requirement is later re-introduced, the limit is raised once more.

KubeVirt, on the other hand, expects things to work this way:

1) KubeVirt raises the memory limit of the libvirtd process much higher than the default, and in fact higher than it expects libvirt to set it for the QEMU process;
2) libvirt calculates the more accurate, lower limit and applies it to the QEMU process;
3) further changes to the limit are performed without requiring elevated privileges or special capabilities such as CAP_SYS_RESOURCE.

While this summary fails to capture a lot of nuance - such as the fact that, when the VM is first started, there is more we are allowed to do if we make sure it happens before exec() is called - the basic problem is: if libvirtd is not given enough privilege to raise the limit as needed, how can it cope with hotplugging?

I can imagine a scenario where libvirt does not attempt to lower the limit once the last VFIO device is hotunplugged from it, but that doesn't address the scenario where a VM that initially had no VFIO devices attached to it gets one hotplugged. In fact, based on my understanding of the code I would expect this last scenario to fail already - is that not the case? Am I missing something?
(In reply to Andrea Bolognani from comment #10)
> I can imagine a scenario where libvirt does not attempt to lower the
> limit once the last VFIO device is hotunplugged from it, but that
> doesn't address the scenario where a VM that initially had no VFIO
> devices attached to it gets one hotplugged. In fact, based on my
> understanding of the code I would expect this last scenario to fail
> already - is that not the case? Am I missing something?

I think you are right, but KubeVirt can get out of this.

KubeVirt is currently able to increase libvirtd's locked memory limit from outside the container by identifying the libvirtd pid and raising the locked memory limit on it. Therefore, it can do the same for the qemu pid.

Furthermore, there is another scenario where this fix will help without any special workaround from KubeVirt: on the source host, the devices are detached before the migration, but in case of a migration failure it would be preferable to re-attach them to the domain. This currently does not work, although the locked memory has the appropriate (exact) size.

Originally I hoped that the migration process would create the domain with the original locked memory limit, but that is indeed incorrect, and KubeVirt must intervene to raise it before attempting to attach a VFIO device. It would have been nicer not to do that, but unless we can add a "fake" VFIO device or another parameter to trick libvirt into thinking a VFIO device is to be taken into account, I see no other way.
Andrea, has Edward answered your concerns? Is there anything else we could help you clarify?
This is a bit of a tangent, but just FYI, I just posted libvirt patches to support the <teaming> element with a plain <hostdev> (so it doesn't need access to the PF). This doesn't solve the problem described in this BZ, but it would permit a much simpler implementation of migration with a VFIO-assigned VF (in particular, it eliminates the need for hotplugging the device just to migrate). Patches are here in case anyone wants to take a look:

https://listman.redhat.com/archives/libvir-list/2021-February/msg00651.html
(In reply to Edward Haas from comment #11)
> (In reply to Andrea Bolognani from comment #10)
> > I can imagine a scenario where libvirt does not attempt to lower the
> > limit once the last VFIO device is hotunplugged from it, but that
> > doesn't address the scenario where a VM that initially had no VFIO
> > devices attached to it gets one hotplugged. In fact, based on my
> > understanding of the code I would expect this last scenario to fail
> > already - is that not the case? Am I missing something?
>
> I think you are right, but kubevirt can get out of this.
>
> Kubevirt currently is able to increase libvirtd locked memory from outside
> the container
> by identifying the libvirtd pid and increasing the locked memory on it.
> Therefore, it can do the same for the qemu pid.
>
> Furthermore, there is another scenario where this fix will help without
> any special workaround from kubevirt: On the source host,
> the devices are detached before the migration, but in case of
> a migration failure, it would be preferred to re-attach them
> again to the domain. This currently does not work, although
> the locked memory has the appropriate (exact) size.
>
> Originally I hopped that the migration process will create the
> domain with the original locked memory, but that is indeed
> incorrect and kubevirt must intervene to raise it before
> attempting to attach a VFIO device.
> It would have been nicer not to do that, but unless we can
> add a "fake" VFIO device or other parameter to trick libvirt
> in thinking a VFIO is to be taken into account, I see no other
> way.

Looking again at the initial report, I was confused as to how the detach-device call could succeed, considering that after hot-unplug libvirt will attempt to return the memory locking limit to its original value, which it shouldn't be able to do.
What happens is, when qemuDomainAdjustMaxMemLock() is initially called, it will ignore the failure in virProcessGetMaxMemLock(), and that in turn will result in the call made to virProcessSetMaxMemLock() at hot-unplug time becoming a no-op. Hence the lack of error.

Anyway, going back to the proposed solution of turning a failure to lower the memlock limit from a hard error into a soft one: I don't think we can do that in the general case. While in the context of KubeVirt libvirtd and QEMU are running in a constrained environment, in a classic virtualization scenario they aren't, and giving the QEMU process the ability to lock a lot more memory than we know it needs strikes me as a potential security issue.

At the same time, with the ability to call prlimit() removed from libvirtd, there is not much it can do to either lower or raise the memlock limit... So I think the only viable way forward is to

1) change libvirt so that, when CAP_SYS_RESOURCE is not available, it doesn't even attempt to change the memlock limit;

2) change KubeVirt so that it consistently updates the memlock limit for the QEMU process as devices are hot(un)plugged.

Maybe instead of quietly not updating the memory limits when CAP_SYS_RESOURCE is not available we might want to have a qemu.conf setting that can be used to opt into this behavior? I don't have an exceedingly strong opinion on the subject.

Does this proposal sound reasonable?

As sort of a side question, can you recommend convenient ways to test VFIO scenarios for KubeVirt? Ideally something that doesn't require having actual VFIO-capable hardware handy, like assigning emulated devices to a nested VM or something along those lines?
(In reply to Andrea Bolognani from comment #14)
> Looking again at the initial report, I was confused as to how the
> detach-device call could succeed, considering that libvirt will after
> hot-unplug attempt to return the memory locking limit to its original
> value, which it shouldn't be able to do.
>
> What happens is, when qemuDomainAdjustMaxMemLock() is initially
> called, it will ignore the failure in virProcessGetMaxMemLock(), and
> that in turn will result in the call made to
> virProcessSetMaxMemLock() at hot-unplug time to become a no-op. Hence
> the lack of error.

Yes, it will just silently fail and leave the memory as is.

> Anyway, going back to the proposed solution of turning lowering the
> memlock limit from a hard failure to a soft one: I don't think we can
> do that in the general case. While in the context of KubeVirt
> libvirtd and QEMU are running in a constrained environment, in a
> classic virtualization scenario they aren't, and giving the QEMU
> process the ability to lock a lot more memory than we know it needs
> strikes me as a potential security issue.

I think the suggested logic prioritizes keeping the existing behavior as is, and only extends it to be "softer" when declaring an error. For a classic setup it will work as it does today, and setups with the missing capability will fail only when the end result cannot sustain the requirements (i.e. only on a failure to increase the memlock limit).

And it should do that only if the relevant error is thrown, not on any error.

> At the same time, with the ability to call prlimit() removed from
> libvirtd, there is not much it can do to either lower or raise the
> memlock limit...
>
> So I think the only viable way forward is to
>
> 1) change libvirt so that, when CAP_SYS_RESOURCE is not available,
> it doesn't even attempt to change the memlock limit;
>
> 2) change KubeVirt so that it consistently updates the memlock
> limit for the QEMU process as devices are hot(un)plugged.
> > Maybe instead of quietly not updating the memory limits when
> > CAP_SYS_RESOURCE is not available we might want to have a qemu.conf
> > setting that can be used to opt into this behavior? I don't have an
> > exceedingly strong opinion on the subject.
> >
> > Does this proposal sound reasonable?

From a client point of view, I would like to see hot-unplug keep its current behavior (this also makes sense for backward compatibility). That means it should not lower the memlock limit if it cannot.

On hot-plug I would like libvirt not to fail if there is enough memlock on the domain process; if there is not enough, it should fail.

On how this is solved internally we have little say. I can only recommend the "ask for forgiveness, not permission" approach. I do not know if there are other constraints on when the memlock is not available, and to me it would seem nicer to solve it in a consistent manner. By consistent, I mean that lowering the memlock is optional, raising it is a must, and reading the current value can have an alternative path. Be it CAP_SYS_RESOURCE or not, this logic sounds safe to me.

Regarding the configuration option: I would prefer to see an option to create a domain with a "dummy" VFIO device, or a field that tells libvirt to always assume there is a VFIO device in the config. This should also solve the problem in cases where we are interested in detaching/attaching, be it locally or through a migration. But this is only optional, and can even be an alternative to the previous logic.

> > As sort of a side-question, can you recommend convenient ways to test
> > VFIO scenarios for KubeVirt? Ideally something that doesn't require
> > having actual VFIO-capable hardware handy, like assigning emulated
> > devices to a nested VM or something along those lines?

I think that passing a simple NIC to the VM should do the trick, no? I do not know if it can be passed through two levels.
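The policy described here (lowering is best-effort, raising must succeed) could be sketched like this. This is an illustration of the proposal, not libvirt's actual implementation; the function name is made up, and Python's resource.prlimit() stands in for virProcessSetMaxMemLock():

```python
import os
import resource

def adjust_max_memlock(pid, desired):
    """Hypothetical policy: lowering the limit is optional, raising it is a must."""
    soft, _hard = resource.prlimit(pid, resource.RLIMIT_MEMLOCK)
    try:
        resource.prlimit(pid, resource.RLIMIT_MEMLOCK, (desired, desired))
    except PermissionError:
        raising = soft != resource.RLIM_INFINITY and desired > soft
        if raising:
            # The existing limit cannot sustain the requirement: hard failure.
            raise
        # We merely failed to lower an already-sufficient limit: tolerate it.

# On our own process, re-applying the current soft limit is always allowed.
adjust_max_memlock(os.getpid(), resource.getrlimit(resource.RLIMIT_MEMLOCK)[0])
```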
(In reply to Edward Haas from comment #15)
> (In reply to Andrea Bolognani from comment #14)
> > Anyway, going back to the proposed solution of turning lowering the
> > memlock limit from a hard failure to a soft one: I don't think we can
> > do that in the general case. While in the context of KubeVirt
> > libvirtd and QEMU are running in a constrained environment, in a
> > classic virtualization scenario they aren't, and giving the QEMU
> > process the ability to lock a lot more memory than we know it needs
> > strikes me as a potential security issue.
>
> I think the suggested logic prioritize keeping the existing behavior
> as is and only extending it to be "softer" when declaring an error.
> For a classic setup, it will work as today and for setups with
> the missing capability, to fail only in case the end result cannot
> sustain the requirements (i.e. only failure to increase the memlock).
>
> And it should do that only if the relevant error is thrown, not on any
> error.

Well, deciding whether to call prlimit() based on the availability of CAP_SYS_RESOURCE, and calling the function unconditionally and then deciding whether to report an error based on whether -EPERM is returned, are just two slightly different shades of the same bikeshed color, aren't they? :)

> > At the same time, with the ability to call prlimit() removed from
> > libvirtd, there is not much it can do to either lower or raise the
> > memlock limit...
> >
> > So I think the only viable way forward is to
> >
> > 1) change libvirt so that, when CAP_SYS_RESOURCE is not available,
> > it doesn't even attempt to change the memlock limit;
> >
> > 2) change KubeVirt so that it consistently updates the memlock
> > limit for the QEMU process as devices are hot(un)plugged.
> >
> > Maybe instead of quietly not updating the memory limits when
> > CAP_SYS_RESOURCE is not available we might want to have a qemu.conf
> > setting that can be used to opt into this behavior?
> > I don't have an
> > exceedingly strong opinion on the subject.
> >
> > Does this proposal sound reasonable?
>
> From a a client point of view, I would like to see the hot-unplug
> keeping its behavior as is (makes sense also for backward comparability).
> That means that it should not lower the memlock in case it cannot.
>
> On hot-plug I will like libvirt not to fail if there is enough memlock
> on the domain process. If there is not enough, to fail.

This last part makes sense: if we're not calling prlimit() ourselves, then we should very well make sure that the existing limit is at least as high as we would have set it.

> On how this is solved internally we have little say here.
> I can only recommend here to use "ask for forgiveness and not permission".
> I do not know if there other constraints on when the memlock is not
> available,
> and to me, it would seem to be nicer to solve it in a consistent manner.
> By consistent, I mean that lowering the memlock is optional, raising it
> is a must and reading the current value can have an alternative path.
> Be it CAP_SYS_RESOURCE or not, this logic sounds safe to me.

The problem with applying this approach unconditionally, as I've tried to explain earlier, is that in the general case failing to keep the memlock limit as low as possible is a potential security risk, and so it should be considered a hard failure rather than a soft one. The fact that libvirt currently doesn't behave that way is a bug, although one that users are quite unlikely to actually hit outside of KubeVirt.

This same reason also makes me think that we should have a configuration option for this after all, rather than basing the behavior exclusively on whether CAP_SYS_RESOURCE is available: if someone is running libvirt in a containerized environment but without an external entity such as KubeVirt taking care of modifying the memlock limit, then all failures should result in hard errors.
> Regarding the configuration option: I would prefer to see an option
> to create a domain with a "dummy" VFIO device or a field that tells
> libvirt to always assume there is a VFIO device in the config.
> This should also solve the problem in cases where we are interested to
> detach/attach, be it locally or through a migration.
> But this is only optional, and can even be an alternative to the prev
> logic.

Note that, while things are relatively simple for x86, they can be much more nuanced on other architectures: see for example what happens on ppc64[1], where such a binary option would not be effective because having zero, one, two or more VFIO devices assigned to the VM results in different memlock limits.

In addition to that, note that using mdev and NVMe devices might also require a higher-than-normal memlock limit, so you can't just solve everything with a dummy VFIO device.

> > As sort of a side-question, can you recommend convenient ways to test
> > VFIO scenarios for KubeVirt? Ideally something that doesn't require
> > having actual VFIO-capable hardware handy, like assigning emulated
> > devices to a nested VM or something along those lines?
>
> I think that passing a simple NIC to the VM should do the trick, no?
> I do not know if it can be passed through two levels.

My laptop doesn't have any VFIO-capable devices in it, so that rules that out.

Either way, I would assume the KubeVirt functest suite has some tests that ensure VFIO functionality is not broken by code changes, no? And that those would run even on cloud machines which don't have assignable PCI devices of their own?

I think pkg/virt-handler/device-manager/pci_device_test.go might be close to what I'm looking for, but I was hoping you could just point me straight to the interesting part so that I could avoid having to navigate the uncharted territory first :)

[1] https://gitlab.com/libvirt/libvirt/-/blob/f28a652a32e0764e1ab528a004c33a49937a0c45/src/qemu/qemu_domain.c#L8935-9060
Patches posted upstream: https://listman.redhat.com/archives/libvir-list/2021-March/msg00293.html Sorry this took so long! I got kinda lost in the woods trying to make sense of our internal APIs, and ended up spending a significant chunk of time making them more reasonable. Edward, if I provided you with a build of libvirt that includes this new feature, would you be able to work on the KubeVirt changes and validate that it does in fact solve your issue?
(In reply to Andrea Bolognani from comment #17) > Edward, if I provided you with a build of libvirt that includes this > new feature, would you be able to work on the KubeVirt changes and > validate that it does in fact solve your issue? Yes, we can validate this. We will need it based on libvirt 6.6 as this is what we are currently deploying.
(In reply to Edward Haas from comment #18) > (In reply to Andrea Bolognani from comment #17) > > Edward, if I provided you with a build of libvirt that includes this > > new feature, would you be able to work on the KubeVirt changes and > > validate that it does in fact solve your issue? > > Yes, we can validate this. > We will need it based on libvirt 6.6 as this is what we are currently > deploying. Great! Please grab the branch at https://github.com/andreabolognani/kubevirt/tree/memlock-test and let me know whether that improves things for you.
(In reply to Andrea Bolognani from comment #19)
> (In reply to Edward Haas from comment #18)
> > Yes, we can validate this.
> > We will need it based on libvirt 6.6 as this is what we are currently
> > deploying.
>
> Great!
>
> Please grab the branch at
>
> https://github.com/andreabolognani/kubevirt/tree/memlock-test
>
> and let me know whether that improves things for you.

Edward, did you get a chance to validate the libvirt changes? They have now been accepted upstream, but it would be great if I could get a thumbs up from the KubeVirt side before actually merging them, on the off chance that further tweaks are necessary. Thanks!
Hi Andrea,

I can confirm the version on the 'memlock-test' branch works, verified by:
- creating a KubeVirt VM with an SRIOV VF device
- connecting to the non-privileged container
- detaching the SRIOV VF hostdev
- re-attaching the device

This finished successfully without the "Operation not permitted" errors. I also checked the libvirtd and qemu-kvm processes' memlock rlimits, and they won't change on detaching/attaching the device.

Looking forward to the release with those bits. Thank you very much! :)
(In reply to omergi from comment #21)
> I can confirm the version on 'memlock-test' branch works,verified by:
> creating kubevirt VM with SRIOV VF device
> connected to the non-privileged container
> detach SRIOV VF hostdev
> re-attach the device
> Finished successfully w/o the Operation not permitted errors.
> Also checked libvirtd and qemu-kvm processes memlock rlimits
> and they wont change on detach/attach the device.

Thanks for testing!

I have now pushed this as

  commit af41380672a1c1c7ce5531f761bf8834452ff58e
  Author: Andrea Bolognani <abologna>
  Date:   Tue Mar 9 11:40:21 2021 +0100

      qemu: Only raise memlock limit if necessary

      Attempting to set the memlock limit might fail if we're running
      in a containerized environment where CAP_SYS_RESOURCE is not
      available, and if the limit is already high enough there's no
      point in trying to raise it anyway.

      https://bugzilla.redhat.com/show_bug.cgi?id=1916346

      Signed-off-by: Andrea Bolognani <abologna>
      Reviewed-by: Michal Privoznik <mprivozn>

  v7.1.0-280-gaf41380672

and it will be included in the upcoming libvirt 7.2.0 release.

Have you also tested the scenario where a VFIO device is hotplugged into a VM that did *not* have one assigned to it when it was created? This should work fine from the libvirt point of view, but I think KubeVirt might need to be modified so that it pokes inside the virt-launcher container, looks for the QEMU process and raises its memlock limit the same way it would for the libvirtd process if a VFIO device had been configured from the start.
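The idea behind the commit, restated as a sketch (hypothetical Python, not the actual C code in qemu_domain.c): check whether the current limit is already sufficient, and skip the prlimit() call entirely if so, which is what makes the operation work without CAP_SYS_RESOURCE:

```python
import os
import resource

def raise_memlock_if_needed(pid, needed):
    """Only call prlimit() when the current limit is actually too low."""
    soft, hard = resource.prlimit(pid, resource.RLIMIT_MEMLOCK)
    if soft == resource.RLIM_INFINITY or soft >= needed:
        # Already high enough: nothing to do, so no capability is needed.
        return
    # Only raise the hard limit if it is finite and below the requirement.
    if hard != resource.RLIM_INFINITY and hard < needed:
        hard = needed
    resource.prlimit(pid, resource.RLIMIT_MEMLOCK, (needed, hard))

# A requirement of 0 bytes is always covered, so this is a no-op.
raise_memlock_if_needed(os.getpid(), 0)
```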
> and it will be included in the upcoming libvirt 7.2.0 release. > > Have you also tested the scenario where a VFIO device is hotplugged > into a VM that did *not* have one assigned to it when it was created? This is not a supported scenario in KubeVirt. But I think we actually tested it as part of the migration flow which starts a domain on the target without any VFIO (SR-IOV in our case) devices, then attaches them. I think this is pretty much the same flow as you described except that it is done on a migrated domain. > > This should work fine from the libvirt point of view, but I think > KubeVirt might need to be modified so that it pokes inside the > virt-launcher container, looks for the QEMU process and raises its > memlock limit the same way it would for the libvirtd process if a > VFIO device had been configured from the start. Yes, this is what was done by Or M. when testing.
(In reply to Edward Haas from comment #25)
> > and it will be included in the upcoming libvirt 7.2.0 release.
> >
> > Have you also tested the scenario where a VFIO device is hotplugged
> > into a VM that did *not* have one assigned to it when it was created?
>
> This is not a supported scenario in KubeVirt. But I think we actually tested
> it as part of the migration flow which starts a domain on the target without
> any VFIO (SR-IOV in our case) devices, then attaches them.
> I think this is pretty much the same flow as you described except that
> it is done on a migrated domain.

I think the difference is that, in the migration case, KubeVirt knows in advance that a VFIO device is going to be added to the VM, and so it will raise the memlock limit for libvirtd on the target host before the VM is started; in a pure hotplug scenario, the requirement to increase the memlock limit would be introduced during the lifetime of the VM, and so the limit would need to be increased for the qemu process rather than the libvirtd process.

That said, if hotplug of VFIO devices is not supported in KubeVirt then we don't need to change anything I guess :)

(In reply to Edward Haas from comment #26)
> [...] even with the current
> state, we have a workaround (an nasty one) that gives the CAP_SYS_RESOURCE capability
> to the container *if* SR-IOV live migration feature is enabled (it is feature-gated).

Was this workaround disabled when testing the libvirt changes? If not, we might not have validated that the fix actually does what it's intended to.

IIUC, most of

https://github.com/kubevirt/kubevirt/pull/4874

would have to be reverted, and the feature gate (if retained) would become a no-op.
> I think the difference is that, in the migration case, KubeVirt knows
> in advance that a VFIO device is going to be added to the VM, and so
> it will raise the memlock limit for libvirtd on the target host
> before the VM is started; in a pure hotplug scenario, the requirement
> to increase the memlock limit would be introduced during the lifetime
> of the VM, and so the limit would need to be increased for the qemu
> process rather than the libvirtd process.
>
> That said, if hotplug of VFIO devices is not supported in KubeVirt
> then we don't need to change anything I guess :)

When I tested migration with your patch, I had to make adjustments to KubeVirt so that it changes the QEMU process's memlock limit as well as the libvirtd process's.

> Was this workaround disabled when testing the libvirt changes? If
> not, we might not have validated that the fix actually does what it's
> intended to.
>
> IIUC, most of
>
> https://github.com/kubevirt/kubevirt/pull/4874
>
> would have to be reverted, and the feature gate (if retained) would
> become a no-op.

When I tested it I removed the feature-gate code (the PR you mentioned) and verified the VM pod runs without the SYS_RESOURCE capability.
(In reply to omergi from comment #28)
> > I think the difference is that, in the migration case, KubeVirt knows
> > in advance that a VFIO device is going to be added to the VM, and so
> > it will raise the memlock limit for libvirtd on the target host
> > before the VM is started; in a pure hotplug scenario, the requirement
> > to increase the memlock limit would be introduced during the lifetime
> > of the VM, and so the limit would need to be increased for the qemu
> > process rather than the libvirtd process.
> >
> > That said, if hotplug of VFIO devices is not supported in KubeVirt
> > then we don't need to change anything I guess :)
>
> When I tested migration with your patch I had to do adjustments to Kubevirt
> so it will change QEMU process memlock limit as well as libvirtd process.

Oh, great!

> > Was this workaround disabled when testing the libvirt changes? If
> > not, we might not have validated that the fix actually does what it's
> > intended to.
> >
> > IIUC, most of
> >
> > https://github.com/kubevirt/kubevirt/pull/4874
> >
> > would have to be reverted, and the feature gate (if retained) would
> > become a no-op.
>
> When I tested it I removed the feature-gate code (the PR you mentioned)
> and verified the VM pod runs without SYS_RESOURCE capability.

Very cool! I assume you'll send in both these changes as a follow-up PR once libvirt has been updated to contain the fix?
Setting UPSTREAM keyword based on comment 22.
Hi Andrea, I have tried to reproduce it on libvirt-libs-7.0.0-11.module+el8.4.0+10505+3a8d753f.x86_64. The details are as below; could you please help to check if the configuration and steps are correct? Thank you!

1. Prepare a running vm with a hostdev device:

# virsh nodedev-detach pci_0000_82_10_7
Device pci_0000_82_10_7 detached

Start a vm with a hostdev device as below:

<hostdev mode='subsystem' type='pci' managed='no'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</hostdev>

Stop the libvirtd service:

# systemctl stop libvirtd

2. Modify qemu.conf to start the qemu process as root:

# cat /etc/libvirt/qemu.conf
user = "root"
group = "root"

3. Create a normal user:

# useradd test
# passwd test

4. Add permissions on the libvirt-related dirs for user test:

# chown -R test:test /etc/libvirt
# chown -R test:test /var/lib/libvirt
# chown -R test:test /var/cache/libvirt
# chown -R test:test /run/
# chown -R test:test /var/run

5. Start a container with the test user and then start the virtqemud process as root inside the container:

$ unshare --map-root-user -U -- virtqemud --verbose

6. Open another terminal and set libvirt.conf to connect to virtqemud:

$ cat /etc/libvirt/libvirt.conf
uri_default = "qemu:///system"
remote_mode = "direct"

7. After virtqemud has started, open another terminal and check the device for the vm:

$ unshare --map-root-user -U -- virsh -c qemu:///system edit rhel
...
<hostdev mode='subsystem' type='pci' managed='no'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</hostdev>
...

8. Prepare the xml and do hot unplug and hotplug:

$ cat net.xml
<hostdev mode='subsystem' type='pci' managed='no'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
  </source>
  <alias name='hostdev0'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</hostdev>

$ unshare --map-root-user -U -- virsh -c qemu:///system detach-device rhel net.xml
Device detached successfully

$ unshare --map-root-user -U -- virsh -c qemu:///system attach-device rhel net.xml
error: Failed to attach device from net.xml
error: cannot limit locked memory of process 32852 to 2147483648: Operation not permitted
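As a side note, before retrying the attach it can help to confirm what limit the target QEMU process currently has; a small sketch (the pgrep pattern is an assumption, adjust it to however QEMU shows up on your host):

```shell
# Locate the QEMU process for the 'rhel' guest (pattern is a guess) and
# print its locked-memory limit, i.e. the value libvirt tries to raise.
pid=$(pgrep -f 'qemu.*guest=rhel' | head -n1)
[ -n "$pid" ] || pid=$$   # fall back to the current shell for illustration
grep 'Max locked memory' /proc/$pid/limits
```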
(In reply to yalzhang from comment #36)
> Hi Andrea, I have tried to reproduce it on
> libvirt-libs-7.0.0-11.module+el8.4.0+10505+3a8d753f.x86_64. The details are
> as below, could you please help to check if the configuration and steps are
> correct? Thank you!

Do I understand correctly that the steps above are intended to reproduce the
issue in the current official build of libvirt, so that you can use that as
a baseline when verifying that the backport did successfully address the
bug?

> 1. Prepare a running vm with hostdev device:
> # virsh nodedev-detach pci_0000_82_10_7
> Device pci_0000_82_10_7 detached
>
> Start a vm with hostdev device as below:
> <hostdev mode='subsystem' type='pci' managed='no'>
>   <driver name='vfio'/>
>   <source>
>     <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
>   </source>
>   <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
> </hostdev>
>
> Stop libvirtd service:
> # systemctl stop libvirtd
>
> 2. Modify qemu.conf to start qemu process with root:
> # cat /etc/libvirt/qemu.conf
> user = "root"
> group = "root"

This doesn't match the configuration used by KubeVirt or RHEL, both of which
use qemu:qemu for the QEMU process. Why are you changing this? Do things
fail otherwise? The VM is already running at this point, so changing the
configuration file is not going to have any effect as far as I can tell.
> 3. Create a normal user:
> # useradd test
> # passwd test
>
> 4. Add permission of libvirt related dirs for user test:
> # chown -R test:test /etc/libvirt
> # chown -R test:test /var/lib/libvirt
> # chown -R test:test /var/cache/libvirt
> # chown -R test:test /run/
> # chown -R test:test /var/run

Hopefully you're doing this in a throwaway test environment, because it's
definitely not the kind of change one wants to make to a regular machine :)

> 5. Start a container with test user and then start virtqemud process with
> root user in container:
> $ unshare --map-root-user -U -- virtqemud --verbose

May I ask why you're using virtqemud here? While the expectation is that
everything will work just fine when using the modular daemons, the reality
is that what's currently used by KubeVirt, as well as the default setup for
RHEL, is the monolithic daemon, so it seems to me that it would make more
sense to follow the path of least resistance and use that one? Especially
since that's how you started the VM in the first place, so you're sort of
mixing and matching.

> 6. Open another terminal, set libvirt.conf to connect to virtqemud:
> $ cat /etc/libvirt/libvirt.conf
> uri_default = "qemu:///system"

This is unnecessary, as you're passing the URI to virsh every single time.

> remote_mode = "direct"

virsh should be smart enough to connect to the standalone virtqemud daemon
if it can't connect to libvirtd, but you can also force that without editing
libvirt.conf by using qemu:///system?mode=direct as the connection URI.
However, see the point made above for using virtqemud as opposed to
libvirtd.

> 7. After virtqemud started, open another terminal and check the device for
> the vm:
> $ unshare --map-root-user -U -- virsh -c qemu:///system edit rhel
> ...
> <hostdev mode='subsystem' type='pci' managed='no'>
>   <driver name='vfio'/>
>   <source>
>     <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
>   </source>
>   <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
> </hostdev>
> ...
>
> 8. Prepare the xml and do hot unplug and hotplug:
> $ cat net.xml
> <hostdev mode='subsystem' type='pci' managed='no'>
>   <driver name='vfio'/>
>   <source>
>     <address domain='0x0000' bus='0x82' slot='0x10' function='0x7'/>
>   </source>
>   <alias name='hostdev0'/>
>   <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
> </hostdev>
>
> $ unshare --map-root-user -U -- virsh -c qemu:///system detach-device rhel net.xml
> Device detached successfully
>
> $ unshare --map-root-user -U -- virsh -c qemu:///system attach-device rhel net.xml
> error: Failed to attach device from net.xml
> error: cannot limit locked memory of process 32852 to 2147483648: Operation
> not permitted

This last part looks about right.

Taking a step back, what you're doing to reproduce the issue seems to be
more complicated than I would assume to be necessary. Is starting the VM
with the regular, fully-privileged libvirtd something that you need to do in
order to be able to use the VFIO device initially?
Another thing: since the whole issue is about libvirtd not having the
CAP_SYS_RESOURCE capability inside KubeVirt's virt-launcher container, it
might make sense to drop that capability explicitly when (re-)launching the
daemon, using something like

  $ unshare --map-root-user -U -- \
      capsh --drop=cap_sys_resource -- \
      libvirtd

When using KubeVirt, the set of capabilities is actually restricted further,
and looks like

  cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
  cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,
  cap_net_admin,cap_sys_chroot,cap_sys_nice,cap_mknod,
  cap_audit_write,cap_setfcap+eip

but I haven't managed to figure out the exact unshare/capsh invocation that
would result in the same set.
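Whatever invocation ends up being used, it should be possible to verify which capabilities the resulting daemon actually holds by reading its /proc status. A minimal sketch (using the current shell's PID as a placeholder for libvirtd's; the bit position for CAP_SYS_RESOURCE is 24 per linux/capability.h):

```shell
# Print the capability sets of a process; substitute libvirtd's PID for $$.
pid=$$
grep '^Cap' /proc/$pid/status

# CAP_SYS_RESOURCE is capability bit 24; test it directly in the effective
# set, without needing capsh/getpcaps to be installed:
capeff=$(awk '/^CapEff:/ {print $2}' /proc/$pid/status)
if [ $(( 0x$capeff >> 24 & 1 )) -eq 1 ]; then
  echo "CAP_SYS_RESOURCE present"
else
  echo "CAP_SYS_RESOURCE absent"
fi
```

With the libcap tools installed, `capsh --decode=$capeff` prints the same information symbolically.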
Hi Andrea, thank you for your feedback about this. Yes, I tried to reproduce
the bug in comment 36, and all the settings are there to make virtqemud run
successfully in the namespace. The reason to use virtqemud instead of
libvirtd is that libvirtd needs more permissions to be set up.

The method you have mentioned is very convenient to use, but I can not run
it successfully; could you provide some detailed configurations and
preparations? We have seen such bugs about kubevirt before (Bug 1901685),
but we do not have a kubevirt environment. How to emulate the environment to
modify and check a specific permission is a problem. Do you know how to
check if a specific permission is available or not, and whether that is
expected when we start libvirtd in the namespace?

Here are my steps:

1. Start the libvirtd service, detach the nodedev device, then start the vm
with a hostdev device, then stop the libvirtd service, switch to the test
user, and try to run libvirtd without any configuration of the directories
or config files:

[test@dell-per730-36 ~]$ unshare --map-root-user -U -- capsh --drop=cap_sys_resource -- libvirtd
/usr/sbin/libvirtd: /usr/sbin/libvirtd: cannot execute binary file

[test@dell-per730-36 ~]$ unshare --map-root-user -U -- libvirtd
2021-03-30 09:23:50.703+0000: 2704: info : libvirt version: 7.0.0, package: 11.module+el8.4.0+10505+3a8d753f (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2021-03-27-04:28:24, )
2021-03-30 09:23:50.703+0000: 2704: info : hostname: dell-per730-36.lab.eng.pek2.redhat.com
2021-03-30 09:23:50.703+0000: 2704: error : main:915 : Can't load config file: Failed to open file '/etc/libvirt/libvirtd.conf': Permission denied: /etc/libvirt/libvirtd.conf

2. Then I modified the permissions:

[root@dell-per730-36 ~]# chown -R test:test /etc/libvirt/
[root@dell-per730-36 ~]# chown -R test:test /run
chown: changing ownership of '/run/user/0/gvfs': Function not implemented
[root@dell-per730-36 ~]# chown -R test:test /var/lib/
[root@dell-per730-36 ~]# chown -R test:test /var/cache/libvirt/

3. After these modifications, libvirtd can start, but with errors:

[test@dell-per730-36 ~]$ unshare --map-root-user -U -- libvirtd
2021-03-30 09:27:11.504+0000: 2969: info : libvirt version: 7.0.0, package: 11.module+el8.4.0+10505+3a8d753f (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2021-03-27-04:28:24, )
2021-03-30 09:27:11.504+0000: 2969: info : hostname: dell-per730-36.lab.eng.pek2.redhat.com
2021-03-30 09:27:11.504+0000: 2969: error : virGDBusGetSystemBus:101 : internal error: Unable to get system bus connection: Exhausted all available authentication mechanisms (tried: EXTERNAL) (available: EXTERNAL)
2021-03-30 09:27:11.504+0000: 2969: warning : networkStateInitialize:796 : DBus not available, disabling firewalld support in bridge_network_driver: internal error: Unable to get system bus connection: Exhausted all available authentication mechanisms (tried: EXTERNAL) (available: EXTERNAL)
2021-03-30 09:27:11.522+0000: 2969: error : virCommandWait:2760 : internal error: Child process (dmidecode -q -t 0,1,2,3,4,11,17) unexpected exit status 1: /sys/firmware/dmi/tables/smbios_entry_point: Permission denied /dev/mem: Permission denied
2021-03-30 09:27:11.589+0000: 2982: error : virCgroupSetValueRaw:535 : Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2d2\x2drhel.scope/libvirt/cpuset.mems': Permission denied
^C

4. But when I run with the option "-- capsh --drop=cap_sys_resource", it
still fails:

[test@dell-per730-36 ~]$ unshare --map-root-user -U -- capsh --drop=cap_sys_resource -- libvirtd
/usr/sbin/libvirtd: /usr/sbin/libvirtd: cannot execute binary file

5. So I tried to start it without the option:

$ unshare --map-root-user -U -- libvirtd

On another terminal:

[test@dell-per730-36 ~]$ unshare --map-root-user -U -- virsh -c qemu:///system detach-device rhel net.xml
Device detached successfully

[test@dell-per730-36 ~]$ unshare --map-root-user -U -- virsh -c qemu:///system attach-device rhel net.xml
error: Failed to attach device from net.xml
error: cannot limit locked memory of process 2314 to 2147483648: Operation not permitted

Does this show that cap_sys_resource is dropped by default? And are these
steps proper for testing and verifying the bug?
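The "cannot execute binary file" error above likely comes from how capsh handles its arguments: everything after `--` is handed to /bin/bash, so a bare binary name gets treated as a shell script. If that is indeed the cause, passing the command via `-c` should work (a guess, untested in this exact setup):

```shell
# capsh execs /bin/bash with the arguments after "--"; without -c, bash
# tries to read the libvirtd ELF binary as a script and fails. Wrapping
# the command in -c '...' should avoid that:
unshare --map-root-user -U -- \
  capsh --drop=cap_sys_resource -- -c 'libvirtd'
```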
(In reply to yalzhang from comment #38)
> Hi Andrea, thank you for your feedback about this. Yes, I tried to
> reproduce the bug in comment 36. And all the settings are to make the
> virtqemud run successfully in the namespace. The reason to use virtqemud
> instead of libvirtd is that libvirtd needs more permissions to set.
>
> The method you have mentioned is very convenient to use, but I can not run
> it successfully, could you provide some detailed configurations and
> preparations?

Sorry if I was not clear on this, but the unshare/capsh command line I
provided was not one that I have actually tested but rather my guess at what
sort of combination might do the trick! I wish I had ready-made instructions
to share with you, but the harsh truth is that I simply don't :(

> We have seen such bugs about kubevirt before (Bug 1901685), but we do not
> have a kubevirt environment. How to emulate the environment to modify and
> check a specific permission is a problem. Do you know how to check if a
> specific permission is available or not, and if it is expected when we
> start libvirtd in the namespace?

This is another question that I unfortunately don't have a very good answer
for :( What I usually do is start a VM in KubeVirt, then enter the
corresponding container via `kubectl exec` and poke around. To find out what
capabilities a certain process has, I use `getpcaps`.
> Here are my steps:
> 1. Start the libvirtd service, detach the nodedev device, then start the
> vm with a hostdev device, then stop the libvirtd service, switch to the
> test user, and try to run libvirtd without any configuration of the
> directories or config files:
>
> [test@dell-per730-36 ~]$ unshare --map-root-user -U -- capsh --drop=cap_sys_resource -- libvirtd
> /usr/sbin/libvirtd: /usr/sbin/libvirtd: cannot execute binary file
>
> [test@dell-per730-36 ~]$ unshare --map-root-user -U -- libvirtd
> 2021-03-30 09:23:50.703+0000: 2704: info : libvirt version: 7.0.0, package: 11.module+el8.4.0+10505+3a8d753f (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2021-03-27-04:28:24, )
> 2021-03-30 09:23:50.703+0000: 2704: info : hostname: dell-per730-36.lab.eng.pek2.redhat.com
> 2021-03-30 09:23:50.703+0000: 2704: error : main:915 : Can't load config file: Failed to open file '/etc/libvirt/libvirtd.conf': Permission denied: /etc/libvirt/libvirtd.conf
>
> 2. Then I modified the permissions:
> [root@dell-per730-36 ~]# chown -R test:test /etc/libvirt/
> [root@dell-per730-36 ~]# chown -R test:test /run
> chown: changing ownership of '/run/user/0/gvfs': Function not implemented
> [root@dell-per730-36 ~]# chown -R test:test /var/lib/
> [root@dell-per730-36 ~]# chown -R test:test /var/cache/libvirt/
>
> 3. After these modifications, libvirtd can start, but with errors:
> [test@dell-per730-36 ~]$ unshare --map-root-user -U -- libvirtd
> 2021-03-30 09:27:11.504+0000: 2969: info : libvirt version: 7.0.0, package: 11.module+el8.4.0+10505+3a8d753f (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2021-03-27-04:28:24, )
> 2021-03-30 09:27:11.504+0000: 2969: info : hostname: dell-per730-36.lab.eng.pek2.redhat.com
> 2021-03-30 09:27:11.504+0000: 2969: error : virGDBusGetSystemBus:101 : internal error: Unable to get system bus connection: Exhausted all available authentication mechanisms (tried: EXTERNAL) (available: EXTERNAL)
> 2021-03-30 09:27:11.504+0000: 2969: warning : networkStateInitialize:796 : DBus not available, disabling firewalld support in bridge_network_driver: internal error: Unable to get system bus connection: Exhausted all available authentication mechanisms (tried: EXTERNAL) (available: EXTERNAL)
> 2021-03-30 09:27:11.522+0000: 2969: error : virCommandWait:2760 : internal error: Child process (dmidecode -q -t 0,1,2,3,4,11,17) unexpected exit status 1: /sys/firmware/dmi/tables/smbios_entry_point: Permission denied /dev/mem: Permission denied
> 2021-03-30 09:27:11.589+0000: 2982: error : virCgroupSetValueRaw:535 : Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2d2\x2drhel.scope/libvirt/cpuset.mems': Permission denied
> ^C
>
> 4. But when I run with the option "-- capsh --drop=cap_sys_resource", it
> still fails.
> [test@dell-per730-36 ~]$ unshare --map-root-user -U -- capsh --drop=cap_sys_resource -- libvirtd
> /usr/sbin/libvirtd: /usr/sbin/libvirtd: cannot execute binary file
>
> 5. So I tried to start it without the option:
> $ unshare --map-root-user -U -- libvirtd
>
> On another terminal:
> [test@dell-per730-36 ~]$ unshare --map-root-user -U -- virsh -c qemu:///system detach-device rhel net.xml
> Device detached successfully
>
> [test@dell-per730-36 ~]$ unshare --map-root-user -U -- virsh -c qemu:///system attach-device rhel net.xml
> error: Failed to attach device from net.xml
> error: cannot limit locked memory of process 2314 to 2147483648: Operation
> not permitted
>
> Does this show that the cap_sys_resource is dropped by default?
I think it just shows that, regardless of whether or not the container
runtime presents CAP_SYS_RESOURCE as available to the process, the reality
is that you're running libvirtd as an unprivileged process and your user
simply lacks the necessary permissions to execute that operation.

> If these steps are proper for testing and verifying the bug?

I don't think they are, because the environment libvirt is running in looks
significantly different from the one it'd be running in as part of KubeVirt.

Back when the base container for virt-launcher was still built separately
instead of along with the rest of KubeVirt, the README file for the
repository contained the following command line:

  $ docker run \
      --name libvirtd \
      --rm \
      --net=host \
      --pid=host \
      --ipc=host \
      --user=root \
      --privileged \
      --volume=/:/host:Z \
      --volume=/dev:/host-dev \
      --volume=/sys:/host-sys \
      --volume=/var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
      --volume=/var/run/libvirt:/var/run/libvirt \
      --tty=true \
      --detach=true \
      kubevirt/libvirt@sha256:1234567890abcdef...

I wonder if it could be adapted so that you could have a simple Dockerfile
along the lines of

  FROM registry.access.redhat.com/ubi8/ubi
  RUN dnf install -y libvirt-daemon-kvm

and then run libvirtd inside the container in a way that's reasonably close
to how KubeVirt would run it? That would help a lot, both for libvirt
developers when it comes to preparing fixes / features targeted at KubeVirt,
and for libvirt QE when the time comes to validate them.

@rmohr, what do you think?
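One way to approximate KubeVirt's restricted capability set without working out an unshare/capsh incantation would be to let the container runtime set it directly. A sketch, assuming a hypothetical image name ("libvirt-test") built from a Dockerfile like the one quoted earlier, with the capability list taken from this bug:

```shell
# Drop all capabilities, then add back only the ones KubeVirt grants to
# virt-launcher. The image name "libvirt-test" is a placeholder.
docker run --rm -it \
  --cap-drop=ALL \
  --cap-add=CHOWN --cap-add=DAC_OVERRIDE --cap-add=FOWNER \
  --cap-add=FSETID --cap-add=KILL --cap-add=SETGID --cap-add=SETUID \
  --cap-add=SETPCAP --cap-add=NET_BIND_SERVICE --cap-add=NET_ADMIN \
  --cap-add=SYS_CHROOT --cap-add=SYS_NICE --cap-add=MKNOD \
  --cap-add=AUDIT_WRITE --cap-add=SETFCAP \
  libvirt-test libvirtd
```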
@abologna, @yalzhang I agree that having a simple container to make these
kinds of verification easier should be the end goal.

In the meantime, is there another way for you to test this with libvirt
alone? Would it help if we gave you access to a cluster that runs this
patched version of libvirt, so you can poke around there?
(In reply to Petr Horáček from comment #42)
> @abologna, @yalzhang I agree that having a simple container to make these
> kinds of verification easier should be the end goal.
>
> For the meantime, is there another way for you to test this with libvirt
> alone? Would it help if we gave you access to a cluster that runs this
> patched version of libvirt, so you can poke around there?

Yes, it would be great if we could use the cluster to test the bug, as it's
difficult for us to prepare such an environment in a short time. Or we may
ask for @edwardh's help to verify the bug.
(In reply to yalzhang from comment #43)
> (In reply to Petr Horáček from comment #42)
> > @abologna, @yalzhang I agree that having a simple container to make
> > these kinds of verification easier should be the end goal.
> >
> > For the meantime, is there another way for you to test this with libvirt
> > alone? Would it help if we gave you access to a cluster that runs this
> > patched version of libvirt, so you can poke around there?
>
> Yes, it will be great that we can use the cluster to test the bug as it's
> difficult for us to prepare such a env. in a short time. Or we may ask
> @edwardh's help to verify the bug.

How is this different from what @omergi performed here:
https://bugzilla.redhat.com/show_bug.cgi?id=1916346#c28 ?
(In reply to Edward Haas from comment #44)
> (In reply to yalzhang from comment #43)
> > Yes, it will be great that we can use the cluster to test the bug as
> > it's difficult for us to prepare such a env. in a short time. Or we may
> > ask @edwardh's help to verify the bug.
>
> How is this different from what @omergi performed here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1916346#c28 ?

In that case the libvirt package used was based on 6.6.0, which is the
version used upstream at least until [kubevirt/kubevirt#5328][] has been
merged, with the memlock fix patches backported by addressing a non-trivial
amount of merge conflicts; the way the fix will land in RHEL AV 8.4, on the
other hand, is by performing a very straightforward backport on top of the
much newer libvirt 7.0.0.

Maybe we should approach this from a different angle: can we split QE
responsibilities between libvirt QE and OpenShift Virtualization QE?

The former would verify that, in a RHEL AV environment, the changes don't
cause any regressions, whereas the latter would make sure that the fix
actually addresses the issue KubeVirt faces, just like @omergi did earlier
(but with a proper scratch build this time).

This way, each QE person would be able to work in an environment they're
comfortable with, and getting the bug to VERIFIED state should not see
unnecessary delays.

Does that make sense?

[kubevirt/kubevirt#5328]: https://github.com/kubevirt/kubevirt/pull/5328
(In reply to Andrea Bolognani from comment #47)
> (In reply to Edward Haas from comment #44)
> > How is this different from what @omergi performed here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1916346#c28 ?
>
> In that case the libvirt package used was based on 6.6.0, which is
> the version used upstream at least until [kubevirt/kubevirt#5328][]
> has been merged, with the memlock fix patches backported by
> addressing a non-trivial amount of merge conflicts; the way the fix
> will land in RHEL AV 8.4, on the other hand, is by performing a very
> straightforward backport on top of the much newer libvirt 7.0.0.
>
> Maybe we should approach this from a different angle: can we split QE
> responsibilities between libvirt QE and OpenShift Virtualization QE?
>
> The former would verify that, in a RHEL AV environment, the changes
> don't cause any regressions, whereas the latter would make sure that
> the fix actually addresses the issue KubeVirt faces, just like
> @omergi did earlier (but with a proper scratch build this time).
>
> This way, each QE person would be able to work in an environment
> they're comfortable with and getting the bug to VERIFIED state should
> not see unnecessary delays.
>
> Does that make sense?

Aha, so that was libvirt 6.6 and we are now talking about 7.0.

It is pretty complicated to test this on RHEL. @omergi can test this in the
same manner you requested it last time; that should be close enough to what
we need. Right?
(In reply to Edward Haas from comment #48)
> (In reply to Andrea Bolognani from comment #47)
> > Maybe we should approach this from a different angle: can we split QE
> > responsibilities between libvirt QE and OpenShift Virtualization QE?
> >
> > The former would verify that, in a RHEL AV environment, the changes
> > don't cause any regressions, whereas the latter would make sure that
> > the fix actually addresses the issue KubeVirt faces, just like
> > @omergi did earlier (but with a proper scratch build this time).
> >
> > This way, each QE person would be able to work in an environment
> > they're comfortable with and getting the bug to VERIFIED state should
> > not see unnecessary delays.
> >
> > Does that make sense?
>
> Aha, so that was libvirt 6.6 and we now talking about 7.0.
>
> It is pretty complicated to test this on RHEL.
> @omergi can test this in the same manner you requested it last time, that
> should be close enough to what we need. Right?

Yes, that's exactly what I had in mind! And @yalzhang can simply go through
the existing test scenarios for VFIO with libvirt on RHEL and verify that
the changes don't introduce any regression there.

I'm not familiar with the build process for OpenShift Virtualization, so I
wouldn't know where to start with creating a downstream branch that can be
used for testing, but I assume @omergi will know how to do that when
provided with a bunch of fresh libvirt RPMs.

Does that sound good?

The scratch build that I made a few days ago appears to be still available.
I'll repost the link below.
(In reply to Andrea Bolognani from comment #14)
> As sort of a side-question, can you recommend convenient ways to test
> VFIO scenarios for KubeVirt? Ideally something that doesn't require
> having actual VFIO-capable hardware handy, like assigning emulated
> devices to a nested VM or something along those lines?

@omergi shared the following steps outside of bugzilla, and I'm reporting
them here for reference:

  export KUBEVIRT_PROVIDER=kind-k8s-sriov-1.17.0
  export KUBEVIRT_NUM_NODES=3
  export KUBEVIRT_E2E_FOCUS=SRIOV

  # Create containerized k8s cluster with one control-plane and two workers
  # with sriov-device-plugin and sriov-cni
  make cluster-up

  # Build and deploy Kubevirt operator
  make cluster-sync

  # Run Kubevirt SRIOV E2E tests including the migration test:
  E2E_FOCUS=SRIOV make functest

When I tried following them on my machine, I immediately got

  ./cluster-up/up.sh
  SR-IOV cluster can be only started with 2 nodes
  make: *** [Makefile:122: cluster-up] Error 1

which was easy to work around by setting KUBEVIRT_NUM_NODES to 2 instead of
3; even after I did that, however, 'make cluster-up' failed with

  + SRIOV_NODE=sriov-worker
  + NODE_PFS=($(move_sriov_pfs_netns_to_node "$SRIOV_NODE" "$PF_COUNT_PER_NODE"))
  ++ move_sriov_pfs_netns_to_node sriov-worker 1
  ++ local -r node=sriov-worker
  ++ local -r pf_count_per_node=1
  +++ docker inspect -f '{{.State.Pid}}' sriov-worker
  ++ local -r pid=499048
  ++ pf_array=()
  ++ local pf_array
  ++ mkdir -p /var/run/netns/
  mkdir: cannot create directory '/var/run/netns/': Permission denied
  ++ ln -sf /proc/499048/ns/net /var/run/netns/sriov-worker
  ln: failed to create symbolic link '/var/run/netns/sriov-worker': No such file or directory
  ++ sriov_pfs=($(find /sys/class/net/*/device/sriov_numvfs))
  +++ find '/sys/class/net/*/device/sriov_numvfs'
  find: '/sys/class/net/*/device/sriov_numvfs': No such file or directory
  ++ local -r sriov_pfs
  ++ '[' 0 -eq 0 ']'
  ++ echo 'FATAL: Could not find available sriov PFs'
  FATAL: Could not find available sriov PFs
  ++ return 1
  make: *** [Makefile:122: cluster-up] Error 1

So I guess it's necessary for the machine this is run on to have at least
one SRIOV-capable device after all? I'll try to see if I can get my hands on
a suitable machine and try again.
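The failing `find` in the trace above is simply probing for SR-IOV-capable NICs, so the same check can be run by hand before attempting cluster-up:

```shell
# List physical functions that support SR-IOV; an empty result means the
# host has no SR-IOV capable NICs and the kind-based SRIOV provider will
# fail as shown above.
find /sys/class/net/*/device/sriov_numvfs 2>/dev/null \
  || echo "no SR-IOV capable devices found"
```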
(In reply to Andrea Bolognani from comment #49)
> (In reply to Edward Haas from comment #48)
> > Aha, so that was libvirt 6.6 and we now talking about 7.0.
> >
> > It is pretty complicated to test this on RHEL.
> > @omergi can test this in the same manner you requested it last time,
> > that should be close enough to what we need. Right?
>
> Yes, that's exactly what I had in mind! And @yalzhang can simply go
> through the existing test scenarios for VFIO with libvirt on RHEL and
> verify that the changes don't introduce any regression there.
>
> I'm not familiar with the build process for OpenShift Virtualization,
> so I wouldn't know where to start with creating a downstream branch
> that can be used for testing, but I assume @omergi will know how to
> do that when provided with a bunch of fresh libvirt RPMs.
>
> Does that sound good?

We are able to test this against kubevirt; this is what @omergi has done
previously. I suggest we focus there, the same as we did when you provided
the PR for libvirt 6.6.
I have tested with the scratch build in comment 50 with basic VFIO scenarios
on rhel, and no regressions have been found so far. The scenarios include:

1) start a vm with a hostdev interface or hostdev device, then do hotplug
and unplug;
2) start a vm without any hostdev interface or device, then do hotplug and
then unplug.

I will also trigger one round of automated tests with it.
(In reply to Edward Haas from comment #52)
> (In reply to Andrea Bolognani from comment #49)
> > I'm not familiar with the build process for OpenShift Virtualization,
> > so I wouldn't know where to start with creating a downstream branch
> > that can be used for testing, but I assume @omergi will know how to
> > do that when provided with a bunch of fresh libvirt RPMs.
>
> We are able to test this against kubevirt, this is what @omergi has done
> previously.
> I suggest we focus there, the same as we did when you provided the PR for
> libvirt 6.6

Gotcha! I have prepared a KubeVirt branch that consumes a libvirt build made
from the same dist-git tree used for the scratch build:

  https://github.com/andreabolognani/kubevirt/tree/bz1916346.memlock

Note that QEMU has *not* been updated, so you'll be driving an old QEMU with
a new libvirt: this is a supported scenario from the upstream point of view,
but not necessarily a recommended one, and also not one that matches what is
going to be shipped in whatever version of KubeVirt switches over to RHEL AV
8.4 contents; however, I thought it would be better to isolate the libvirt
update from the QEMU one for testing purposes.

A COPR that contains a newer QEMU version is also available, and you can
switch to it locally for additional testing if you want to:

  https://copr.fedorainfracloud.org/coprs/g/kubevirt/qemu-5.2.0-14.el8/

That's the one that is consumed by my ongoing PR for updating libvirt and
QEMU in KubeVirt:

  https://github.com/kubevirt/kubevirt/pull/5328

(In reply to yalzhang from comment #53)
> I have tested with scratch build in comment 50 with basic scenarios with
> VFIO on rhel, no regressions found till now. The scenarios includes: 1)
> start vm with hostdev interface or hostdev device, then do hotplug and
> unplug; 2) start vm without any hostdev interface or device, then do
> hotplug and then unplug; I will also trigger one round auto test with it.

This is excellent news, thank you!
> Gotcha! I have prepared a KubeVirt branch that consumes a libvirt
> build made from the same dist-git tree used for the scratch build:
>
>   https://github.com/andreabolognani/kubevirt/tree/bz1916346.memlock
>
> Note that QEMU has *not* been updated, so you'll be driving an old
> QEMU with a new libvirt: this is a supported scenario from the
> upstream point of view, but not necessarily a recommended one and
> also not one that matches what is going to be shipped in whatever
> version of KubeVirt switches over to RHEL AV 8.4 contents, but I
> thought it would be better to isolate the libvirt update from the
> QEMU one for testing purposes.
>
> A COPR that contains a newer QEMU version is also available, and you
> can switch to it locally for additional testing if you want to:
>
>   https://copr.fedorainfracloud.org/coprs/g/kubevirt/qemu-5.2.0-14.el8/
>
> That's the one that is consumed by my ongoing PR for updating libvirt
> and QEMU in KubeVirt:
>
>   https://github.com/kubevirt/kubevirt/pull/5328

@abologna I tested again now with the new branch and it works; I verified it
the same way as I did before:
https://bugzilla.redhat.com/show_bug.cgi?id=1916346#c21

I also updated my branch; it is rebased on the latest Kubevirt with changes
from https://github.com/andreabolognani/kubevirt/tree/bz1916346.memlock

It also includes:
- Adjusting the QEMU process memory lock limit
- Removing the feature-gate code (no SYS_RESOURCE capability)
Tested on libvirt-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64 with basic
scenarios:

1) start a vm with a hostdev, then do hotplug -> hotunplug;
2) start a vm without any hostdev interface, and do hotplug and unplug.

No regression issues were found, and the automated job finished
successfully.
Reproduced on libvirt-daemon-driver-qemu-6.6.0-13.module+el8.3.1+9548+0a8fede5.x86_64 on CNV 2.6.1.
Tested on libvirt-daemon-driver-qemu-7.0.0-12.module+el8.4.0+10596+32ba7df3.x86_64 on CNV 2.6.1 (with RHEL-AV 8.4.0 libvirt, qemu-kvm).

Hot-unplugged the VF, then hotplugged the VF back successfully.

Test steps:

1. Start a VM with an SR-IOV VF in the CNV 2.6.1 environment with RHEL-AV
8.4.0 libvirt, qemu-kvm.

2. Check the VM is running, log in to the guest console, run ifconfig, touch
a file, and ping the guest from another host. Check the hostdev xml in the
VM.

#./oc get vmi
NAME    AGE     PHASE     IP             NODENAME
rhel8   7m23s   Running   10.73.33.123   bootp-73-33-166.lab.eng.pek2.redhat.com

#./oc get pod|grep virt-lau
virt-launcher-rhel8-89dhj   1/1     Running   0          17m

#./oc rsh virt-launcher-rhel8-89dhj
sh-4.4# virsh list --all
 Id   Name                  State
-------------------------------------
 1    openshift-cnv_rhel8   running

sh-4.4# virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 8
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x1'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

3. Hot-unplug the VF successfully; check the VM xml, there is no hostdev
device; check in the VM console, there is no interface.

sh-4.4# virsh detach-device openshift-cnv_rhel8 hostdev.xml
Device detached successfully

sh-4.4# virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 8
sh-4.4#

4. Hotplug the VF back successfully. Check in the VM xml that there is a
hostdev device; check in the VM console that there is an "Ethernet Virtual
Function 700 Series [8086:154c]" device, and the VM can be pinged from
another host.
sh-4.4# cat hostdev.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x04' slot='0x02' function='0x1'/>
  </source>
  <alias name='hostdev0'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</hostdev>

sh-4.4# virsh attach-device openshift-cnv_rhel8 hostdev.xml --live
Device attached successfully

sh-4.4# virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 8
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x1'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

[root@dell-per730-45 ocp4.7.4]# ./oc get vmi
NAME    AGE   PHASE     IP             NODENAME
rhel8   24m   Running   10.73.33.123   bootp-73-33-166.lab.eng.pek2.redhat.com

[root@dell-per730-45 ocp4.7.4]# ping 10.73.33.123
PING 10.73.33.123 (10.73.33.123) 56(84) bytes of data.
64 bytes from 10.73.33.123: icmp_seq=1 ttl=61 time=0.206 ms
......
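One detail worth noting when scripting this round-trip: "virsh detach-device" returns as soon as the unplug request is sent to the guest, which completes the unplug asynchronously. A sketch of a defensive check before re-attaching polls the live XML until the device alias is gone (the helper below reads XML from stdin so it can be tested standalone; the domain name and alias in the usage comment are the ones from this report):

```shell
#!/bin/sh
# Hedged sketch: succeed when the given device alias does NOT appear in
# the domain XML read from stdin.
alias_absent() {
  ! grep -q "alias name='$1'"
}

# Illustrative polling loop around the commands shown above:
#   virsh detach-device openshift-cnv_rhel8 hostdev.xml --live
#   for i in $(seq 1 30); do
#     virsh dumpxml openshift-cnv_rhel8 | alias_absent hostdev0 && break
#     sleep 1
#   done
#   virsh attach-device openshift-cnv_rhel8 hostdev.xml --live
```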
Ran other tests on this environment; I think the behaviors are as designed in the current CNV2.6.1.

1. Restart the VM with the VF successfully in the CNV web console. The virt-launcher pod is terminated and a new pod is created. The VF device is changed and the network IP address is changed; can ping the VM from another host after rebooting.

# ./oc get vmi
NAME    AGE     PHASE     IP             NODENAME
rhel8   6m41s   Running   10.73.33.152   bootp-73-33-166.lab.eng.pek2.redhat.com

# ./oc get pod|grep virt-launcher
virt-launcher-rhel8-7r4hs   1/1   Running   0   17s    <- it was virt-launcher-rhel8-89dhj before restarting the VM

# ./oc rsh virt-launcher-rhel8-7r4hs
sh-4.4# virsh list --all
 Id   Name                  State
-------------------------------------
 1    openshift-cnv_rhel8   running

sh-4.4# virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 6
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x3'/>    <- it was function='0x1' before restarting the VM
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

2. Stop and start the VM with the VF successfully. The virt-launcher pod is terminated and a new pod is created; the VF device is changed and the network IP address is changed; can ping the VM from another host.

3. Attaching another VF to the VM failed. I think this is as designed in the current CNV2.6.1.
sh-4.4# virsh dumpxml openshift-cnv_rhel8|grep hostdev -A 8
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x02' function='0x3'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>

sh-4.4# cat hostdev.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x04' slot='0x02' function='0x1'/>
  </source>
  <alias name='hostdev0'/>
</hostdev>

sh-4.4# virsh attach-device openshift-cnv_rhel8 hostdev.xml --live
error: Failed to attach device from hostdev.xml
error: unable to stat: /dev/vfio/63: No such file or directory
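The error in the last step comes from the container, not the guest: only the VFIO group device nodes granted to the pod at creation time exist under /dev/vfio, so attaching a VF whose group node (here /dev/vfio/63) is missing fails before QEMU is involved. A sketch of a pre-attach check follows; the optional directory argument exists purely so the sketch can be exercised outside a real host (the real path is /dev/vfio, and a device's group number can be read from the /sys/bus/pci/devices/<address>/iommu_group symlink):

```shell
#!/bin/sh
# Hedged sketch: verify the VFIO group node for a device is visible inside
# the container before attempting "virsh attach-device". The second argument
# overrides /dev/vfio only for testing this sketch.
vfio_group_ready() {
  group=$1
  dir=${2:-/dev/vfio}
  [ -e "$dir/$group" ]   # real group nodes are character devices
}

# Example, with the group number from the error above:
#   vfio_group_ready 63 || echo "VFIO group 63 not granted to this pod"
```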
Hi, Andrea

Do you think the testing in comments 55, 59, 60, and 61 is enough? Would you like us to check any other test scenarios in a CNV env with RHEL-AV 8.4.0 libvirt/qemu-kvm for this bug?

Regards,
Chenli Hu
(In reply to chhu from comment #62)
> Hi, Andrea
>
> Do you think the testing in comments 55, 59, 60, and 61 is enough?
> Would you like us to check any other test scenarios in a CNV env with
> RHEL-AV 8.4.0 libvirt/qemu-kvm for this bug?

The libvirt part (Comment 59) looks reasonable, as does the KubeVirt test from Comment 55.

As for your own tests (Comment 60 and Comment 61), the details are a bit too KubeVirt-specific for me to judge - perhaps @omergi can take a look? If there are no objections from that side, we can move the bug to VERIFIED.
Considering the schedule, we will verify the bug now, and we will do more tests if needed after @omergi's feedback. Thank you!
@chhu Thanks for testing this; no objections from my side.
(In reply to omergi from comment #67)
> @chhu Thanks for testing this; no objections from my side.

@omergi, thank you!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2098