Created attachment 1510562 [details]
UI Error Message

Description of problem:
VFs are not released at the source host after migrating a VM with a Passthrough vNIC.

Version-Release number of selected component (if applicable):
4.3.0-0.2.master.20181121071050.gita8fcd23.el7

How reproducible:
100%

Steps to Reproduce:
1. Use two hosts with SR-IOV support
2. Enable 2 VFs on each host
3. Create a logical network
4. Enable Passthrough and Migratable on the assigned vNIC profile
5. Attach the vNIC profile to a VM
6. Start the VM
7. Migrate the VM to the second host
8. Verify that the VFs at the source host are released, as follows:
   - Go to Setup Host Networks | Show virtual functions and verify that 2 VFs are displayed
   - Try to reduce the number of VFs from 2 to 0

Actual results:
- One VF is displayed instead of 2
- Can't reduce the number of VFs to zero; got the error:
  "Cannot edit host NIC VFs configuration. The selected network interface enp5s0f0 has VFs that are in use"

Expected results:
- Two VFs should be displayed
- Reducing the number of VFs from 2 to 0 should succeed

Additional info:
- To release the VFs, the host must be rebooted.
- See the attached error message
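Besides the UI check in step 8, VF release can be cross-checked on the source host itself: a VF that was not released typically stays bound to the vfio-pci driver instead of being handed back to the host. A minimal sketch reading the driver symlink from sysfs; the PCI address 0000:08:10.0 is taken from the logs below, and the helper names and default sysfs root are illustrative assumptions, not part of oVirt:

```python
import os


def vf_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver the VF is bound to, or None if unbound."""
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


def is_released(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """A VF still bound to vfio-pci after migration was not given back."""
    return vf_driver(pci_addr, sysfs_root) != "vfio-pci"
```

On a healthy source host after migration, `is_released("0000:08:10.0")` would be True; in the failure described here it stays False until the host is rebooted.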
Version: 4.3.0-0.2.master.20181121071050.gita8fcd23.el7

Host version info:

[root@cinteg24 vdsm]# rpm -qa | grep qemu
qemu-kvm-common-ev-2.12.0-18.el7_6.1.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-10.el7.x86_64
qemu-kvm-ev-2.12.0-18.el7_6.1.1.x86_64
qemu-img-ev-2.12.0-18.el7_6.1.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
[root@cinteg24 vdsm]# uname -r
3.10.0-957.el7.x86_64
[root@cinteg24 vdsm]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Complete logs are attached.

Source host logs:
-----------------
2018-12-03 14:09:58,001+0200 INFO (jsonrpc/6) [api.virt] START hotunplugNic(params={u'xml': u'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><hotunplug><devices><interface><alias name="ua-a16720d0-bf85-4b34-97a9-1ea093658f31"/></interface></devices></hotunplug>', u'vmId': u'02f1b936-de2c-4c0b-a0d6-695ac9711890'}) from=::ffff:10.35.162.63,42572, flow_id=46b285a3, vmId=02f1b936-de2c-4c0b-a0d6-695ac9711890 (api:48)
2018-12-03 14:09:58,005+0200 INFO (jsonrpc/6) [virt.vm] (vmId='02f1b936-de2c-4c0b-a0d6-695ac9711890') Hotunplug NIC xml: <?xml version='1.0' encoding='utf-8'?>
<interface managed="no" type="hostdev">
    <address bus="0x00" domain="0x0000" function="0x0" slot="0x09" type="pci" />
    <mac address="00:1a:4a:16:20:b4" />
    <source>
        <address bus="0x08" domain="0x0000" function="0x0" slot="0x10" type="pci" />
    </source>
    <link state="up" />
    <driver name="vfio" />
    <alias name="ua-a16720d0-bf85-4b34-97a9-1ea093658f31" />
</interface>
 (vm:3309)
2018-12-03 14:09:58,212+0200 INFO (libvirt/events) [virt.vm] (vmId='02f1b936-de2c-4c0b-a0d6-695ac9711890') Device removal reported: ua-a16720d0-bf85-4b34-97a9-1ea093658f31 (vm:5996)
2018-12-03 14:09:58,212+0200 INFO (libvirt/events) [virt.vm] (vmId='02f1b936-de2c-4c0b-a0d6-695ac9711890') Reattaching device pci_0000_08_10_0 to host. (network:375)
2018-12-03 14:09:58,218+0200 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:688)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 678, in dispatchLibvirtEvents
    v.onDeviceRemoved(device_alias)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6007, in onDeviceRemoved
    device.teardown()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/network.py", line 378, in teardown
    reattach_detachable(self.hostdev)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hostdev.py", line 695, in reattach_detachable
    libvirt_device.reAttach()
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 5624, in reAttach
    if ret == -1: raise libvirtError ('virNodeDeviceReAttach() failed')
libvirtError: Requested operation is not valid: PCI device 0000:08:10.0 is in use by driver QEMU, domain golden_env_mixed_virtio_6
Created attachment 1510911 [details] [Logs] VM Migration With Passthrough
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Roni, can you please check on an RHEL 7.6 host if hotplugging of SR-IOV vNICs via oVirt fails, and if hotplugging SR-IOV is working via plain libvirt on command line?
(In reply to Dominik Holler from comment #4)
> Roni, can you please check on an RHEL 7.6 host if hotplugging of SR-IOV
> vNICs via oVirt fails, and if hotplugging SR-IOV is working via plain
> libvirt on command line?

Hi Dominik, SR-IOV is working fine on 4.2 with RHEL 7.6; the regression is only in 4.3. I don't think this is related to RHEL 7.6, as we have been testing on it for some time now.
The bug is also reproduced with a plug/unplug-only scenario: after a few plug/unplug cycles, the VF is leaked. But again, it has nothing to do with RHEL; the bug is on our side. 4.2 works as expected.
Milan, might this issue be related to the hotunplug-cleanup?
I suspect it is related. Not that the cleanup would be broken; we just use a different (better) mechanism to handle device removal, and the related libvirt/QEMU behavior is suspicious. Let's see what happens in the log snippet from comment 1. This device is hot unplugged:

  <interface managed="no" type="hostdev">
    ...
    <address bus="0x08" domain="0x0000" function="0x0" slot="0x10" type="pci" />
    ...
    <alias name="ua-a16720d0-bf85-4b34-97a9-1ea093658f31" />
  </interface>

We receive a device removal event for that device from libvirt:

  Device removal reported: ua-a16720d0-bf85-4b34-97a9-1ea093658f31

But when we try to return the device to the host we get:

  libvirtError: Requested operation is not valid: PCI device 0000:08:10.0 is in use by driver QEMU

Considering what Michael writes in comment 6, it looks like a timing issue in libvirt/QEMU. We have seen various problems with event inconsistencies in the past.

I think this could be confirmed by adding something like time.sleep(1) at an appropriate place in the traceback above, before device reattachment is performed. If that helps, a libvirt bug should be reported and a workaround added to Vdsm: something like checking that the device has disappeared from the domain XML before reattaching it, or re-attempting the reattachment after a short delay.
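The domain-XML check mentioned above could look roughly like this. This is a sketch only, not vdsm code; it assumes the hostdev interface is identified by its alias name, as in the hotunplug XML from comment 1:

```python
import xml.etree.ElementTree as ET


def device_gone(domain_xml, alias):
    """True if no <interface> in the domain XML still carries the given alias."""
    root = ET.fromstring(domain_xml)
    for iface in root.iter("interface"):
        a = iface.find("alias")
        if a is not None and a.get("name") == alias:
            return False
    return True
```

The idea would be to call something like this on the live domain XML after the removal event and only reattach the host device once the interface is really gone.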
(In reply to Milan Zamazal from comment #8)
> I suspect it is related. Not that the cleanup would be broken, we just use a
> different (better) mechanism to handle device removal and the related
> libvirt/QEMU behavior is suspicious. Let's see what happens in the log
> snippet from comment 1. This device is hot unplugged:
>
>   <interface managed="no" type="hostdev">
>     ...
>     <address bus="0x08" domain="0x0000" function="0x0" slot="0x10" type="pci" />
>     ...
>     <alias name="ua-a16720d0-bf85-4b34-97a9-1ea093658f31" />
>   </interface>
>
> We receive a device removal event for that device from libvirt:
>
>   Device removal reported: ua-a16720d0-bf85-4b34-97a9-1ea093658f31
>
> But when we try to return the device to the host we get:
>
>   libvirtError: Requested operation is not valid: PCI device 0000:08:10.0 is
>   in use by driver QEMU
>
> Considering what Michael writes in comment 6, it looks like a timing issue
> in libvirt/QEMU. We have seen various problems with event inconsistencies in
> the past.

Did you consider that this bug only occurs on oVirt 4.3, but not on 4.2?

> I think that could be confirmed by adding something like time.sleep(1) at
> some place in the traceback above, before device reattachment is performed.
> If it helps then a libvirt bug should be reported and some workaround should
> be added to Vdsm. Something like a check for a device disappearance in
> domain XML before device removal or re-attempting the device removal after a
> short delay.
(In reply to Dominik Holler from comment #9)
> Did you consider that this bug only occurs on oVirt 4.3, but not on 4.2?

Yes, as I wrote above, we use a different device removal mechanism in 4.3: we used polling in 4.2, while events are used in 4.3.
Hi Milan, it seems that your patch (https://gerrit.ovirt.org/96092) fixes the issue. I've tried plug/unplug and also migration, and the problem was not reproduced.
Hi Roni, thank you for testing. So it indeed looks like a libvirt bug. Now we need a reproducer with a libvirt debug log, and to report a bug against libvirt.
Created attachment 1513364 [details] libvirt_logs_during_unplug_and_plug
Opened libvirt bug: https://bugzilla.redhat.com/show_bug.cgi?id=1658198
Due to the underlying libvirt bug, Milan suggested retrying remove(device) once if it fails.
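The retry idea can be sketched as a small wrapper: attempt the operation, and on failure wait briefly and try exactly once more. This is illustrative only and not the actual Vdsm patch; the helper name, delay, and exception handling are assumptions:

```python
import time


def retry_once(operation, delay=1.0, retriable=(Exception,)):
    """Run operation(); on a retriable failure, wait `delay` seconds and
    try exactly one more time, letting a second failure propagate."""
    try:
        return operation()
    except retriable:
        time.sleep(delay)
        return operation()
```

In the scenario above, the operation would be the libvirt reAttach() call that failed with "PCI device ... is in use by driver QEMU"; a short delay gives QEMU time to actually let go of the device.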
Do we need to introduce a workaround into Vdsm or can we simply wait for libvirt fix?
Dan, see Milan's question in comment 18. In any case, please keep this bug open for our internal tracking (we use it as a skip in our automation).
(In reply to Milan Zamazal from comment #18) > Do we need to introduce a workaround into Vdsm or can we simply wait for > libvirt fix? We wanted to release 4.3.0 by year end. This is a blocker. I think we need a workaround.
Vdsm patch is merged, but it's just a workaround that should be removed once libvirt is fixed.
It is good enough to renew testing.
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Tag 'v4.30.5' doesn't contain patch 'https://gerrit.ovirt.org/96652'] gitweb: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=shortlog;h=refs/tags/v4.30.5 For more info please contact: infra
The SR-IOV flow has been verified and no new regression was found.

We did find a new bug in 'TestSriovVm06.test_vm_with_sriov_network_and_hostdev': if a VF is used as a host device, on VM shutdown the VF leaks and is considered taken by the engine. See BZ 1672252.

Verified on vdsm-4.30.7-1.el7ev, vdsm-4.30.8-1.el7 and 4.3.0.4-0.1.el7.
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.