Bug 1655276 - [Work Around only] [SR-IOV] VFs are not released on hotunplug due to a premature libvirt event
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.30.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.0
Assignee: Milan Zamazal
QA Contact: Michael Burman
Depends On: 1658198
 
Reported: 2018-12-02 10:04 UTC by Roni
Modified: 2019-03-24 06:29 UTC
CC: 8 users

Fixed In Version: v4.30.6
Last Closed: 2019-02-13 07:43:15 UTC
oVirt Team: Network
rule-engine: ovirt-4.3+
rule-engine: blocker+


Attachments
UI Error Message (91.79 KB, image/png)
2018-12-02 10:04 UTC, Roni
[Logs] VM Migration With Passthrough (2.03 MB, application/x-gzip)
2018-12-03 13:14 UTC, Roni
libvirt_logs_during_unplug_and_plug (167.17 KB, application/gzip)
2018-12-11 12:34 UTC, Roni


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 96092 0 master MERGED virt: Hot unplug workaround of misplaced libvirt events 2020-12-14 09:22:44 UTC
oVirt gerrit 96652 0 master MERGED virt: Set device hotunplug_event only after device teardown 2020-12-14 09:22:46 UTC

Description Roni 2018-12-02 10:04:51 UTC
Created attachment 1510562 [details]
UI Error Message

Description of problem:
VFs are not released on the source host after migrating a VM with a passthrough vNIC


Version-Release number of selected component (if applicable):
4.3.0-0.2.master.20181121071050.gita8fcd23.el7

How reproducible:
100%

Steps to Reproduce:
1. Use two hosts with SR-IOV support
2. Enable 2 VFs at each host
3. Create a logical network
4. Enable Passthrough and Migratable on the assigned vNIC profile
5. Attach the vNIC profile to a VM
6. Start the VM
7. Migrate the VM to the second host
8. Verify VFs at the source host are released as follows:
   - Go to Setup Host Networks | Show virtual functions
     and verify that 2 VFs are displayed
   - Try to reduce the number of VFs from 2 to 0

Actual results:
- One VF is displayed instead of 2
- Can't reduce the number of VFs to zero, got the error:
  "Cannot edit host NIC VFs configuration. The selected network interface 
  enp5s0f0 has VFs that are in use"

Expected results:
- Two VFs should be displayed
- Reducing the number of VFs from 2 to 0 should succeed

Additional info:
- To release the VFs, the host must be rebooted.
- See attached error message
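
As a host-side cross-check for step 8 above, here is a minimal sketch (sriov_numvfs is the standard kernel SR-IOV sysfs attribute; the PF name enp5s0f0 is taken from the error message above):

    # Print how many VFs are currently enabled on PF enp5s0f0.
    # After a clean unplug/migration, writing 0 to this file should succeed.
    with open('/sys/class/net/enp5s0f0/device/sriov_numvfs') as f:
        print('VFs enabled:', f.read().strip())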

Comment 1 Roni 2018-12-03 13:10:44 UTC
Version:
4.3.0-0.2.master.20181121071050.gita8fcd23.el7

Host version info:

[root@cinteg24 vdsm]# rpm -qa |grep qemu
qemu-kvm-common-ev-2.12.0-18.el7_6.1.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-10.el7.x86_64
qemu-kvm-ev-2.12.0-18.el7_6.1.1.x86_64
qemu-img-ev-2.12.0-18.el7_6.1.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch

[root@cinteg24 vdsm]# uname -r
3.10.0-957.el7.x86_64

[root@cinteg24 vdsm]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Complete logs are attached.

Source host logs:
-----------------
2018-12-03 14:09:58,001+0200 INFO  (jsonrpc/6) [api.virt] START hotunplugNic(params={u'xml': u'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><hotunplug><devices><interface><alias name="ua-a16720d0-bf85-4b34-97a9-1ea093658f31"/></interface></devices></hotunplug>', u'vmId': u'02f1b936-de2c-4c0b-a0d6-695ac9711890'}) from=::ffff:10.35.162.63,42572, flow_id=46b285a3, vmId=02f1b936-de2c-4c0b-a0d6-695ac9711890 (api:48)
2018-12-03 14:09:58,005+0200 INFO  (jsonrpc/6) [virt.vm] (vmId='02f1b936-de2c-4c0b-a0d6-695ac9711890') Hotunplug NIC xml: <?xml version='1.0' encoding='utf-8'?>
<interface managed="no" type="hostdev">
    <address bus="0x00" domain="0x0000" function="0x0" slot="0x09" type="pci" />
    <mac address="00:1a:4a:16:20:b4" />
    <source>
        <address bus="0x08" domain="0x0000" function="0x0" slot="0x10" type="pci" />
    </source>
    <link state="up" />
    <driver name="vfio" />
    <alias name="ua-a16720d0-bf85-4b34-97a9-1ea093658f31" />
</interface>
 (vm:3309)
2018-12-03 14:09:58,212+0200 INFO  (libvirt/events) [virt.vm] (vmId='02f1b936-de2c-4c0b-a0d6-695ac9711890') Device removal reported: ua-a16720d0-bf85-4b34-97a9-1ea093658f31 (vm:5996)
2018-12-03 14:09:58,212+0200 INFO  (libvirt/events) [virt.vm] (vmId='02f1b936-de2c-4c0b-a0d6-695ac9711890') Reattaching device pci_0000_08_10_0 to host. (network:375)
2018-12-03 14:09:58,218+0200 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:688)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 678, in dispatchLibvirtEvents
    v.onDeviceRemoved(device_alias)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6007, in onDeviceRemoved
    device.teardown()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/network.py", line 378, in teardown
    reattach_detachable(self.hostdev)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hostdev.py", line 695, in reattach_detachable
    libvirt_device.reAttach()
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 5624, in reAttach
    if ret == -1: raise libvirtError ('virNodeDeviceReAttach() failed')
libvirtError: Requested operation is not valid: PCI device 0000:08:10.0 is in use by driver QEMU, domain golden_env_mixed_virtio_6

Comment 2 Roni 2018-12-03 13:14:13 UTC
Created attachment 1510911 [details]
[Logs] VM Migration With Passthrough

Comment 3 Red Hat Bugzilla Rules Engine 2018-12-04 11:11:52 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 4 Dominik Holler 2018-12-05 14:08:04 UTC
Roni, can you please check on an RHEL 7.6 host if hotplugging of SR-IOV vNICs via oVirt fails, and if hotplugging SR-IOV is working via plain libvirt on command line?

Comment 5 Michael Burman 2018-12-05 14:14:23 UTC
(In reply to Dominik Holler from comment #4)
> Roni, can you please check on an RHEL 7.6 host if hotplugging of SR-IOV
> vNICs via oVirt fails, and if hotplugging SR-IOV is working via plain
> libvirt on command line?

Hi Dominik, SR-IOV is working fine on 4.2 with 7.6; the regression is only in 4.3.
I don't think this is related to RHEL 7.6, we have been testing this for some time now.

Comment 6 Michael Burman 2018-12-05 14:26:02 UTC
The bug is also reproduced with a plug/unplug-only scenario; after a few plug/unplug cycles the VF is leaked.
But again, it has nothing to do with RHEL, the bug is on our side. 4.2 works as expected.

Comment 7 Dominik Holler 2018-12-05 16:54:45 UTC
Milan, might this issue be related to the hotunplug-cleanup?

Comment 8 Milan Zamazal 2018-12-07 21:00:47 UTC
I suspect it is related. Not that the cleanup is broken; we just use a different (better) mechanism to handle device removal, and the related libvirt/QEMU behavior is suspicious. Let's see what happens in the log snippet from comment 1. This device is hot unplugged:

  <interface managed="no" type="hostdev">
    ...
        <address bus="0x08" domain="0x0000" function="0x0" slot="0x10" type="pci" />
    ...
    <alias name="ua-a16720d0-bf85-4b34-97a9-1ea093658f31" />
  </interface>

We receive a device removal event for that device from libvirt:

  Device removal reported: ua-a16720d0-bf85-4b34-97a9-1ea093658f31

But when we try to return the device to the host we get:

  libvirtError: Requested operation is not valid: PCI device 0000:08:10.0 is in use by driver QEMU

Considering what Michael writes in comment 6, it looks like a timing issue in libvirt/QEMU. We have seen various problems with event inconsistencies in the past.

I think that could be confirmed by adding something like time.sleep(1) at the right place in the traceback above, before device reattachment is performed. If it helps, a libvirt bug should be reported and a workaround added to Vdsm: something like a check for device disappearance in the domain XML before device removal, or re-attempting the device removal after a short delay.
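
For illustration, a minimal sketch of the re-attempt idea (hypothetical helper name; not the merged patch), wrapping the reAttach() call that fails in the traceback above:

    import time

    import libvirt

    def reattach_with_retry(libvirt_device, attempts=3, delay=1.0):
        # Work around the premature DEVICE_REMOVED event: while QEMU still
        # holds the VF, reAttach() raises libvirtError, so retry a few
        # times with a short delay before giving up.
        for i in range(attempts):
            try:
                libvirt_device.reAttach()
                return
            except libvirt.libvirtError:
                if i == attempts - 1:
                    raise
                time.sleep(delay)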

Comment 9 Dominik Holler 2018-12-10 08:24:52 UTC
(In reply to Milan Zamazal from comment #8)
> I suspect it is related. Not that the cleanup is broken; we just use a
> different (better) mechanism to handle device removal, and the related
> libvirt/QEMU behavior is suspicious. Let's see what happens in the log
> snippet from comment 1. This device is hot unplugged:
> 
>   <interface managed="no" type="hostdev">
>     ...
>         <address bus="0x08" domain="0x0000" function="0x0" slot="0x10"
> type="pci" />
>     ...
>     <alias name="ua-a16720d0-bf85-4b34-97a9-1ea093658f31" />
>   </interface>
> 
> We receive a device removal event for that device from libvirt:
> 
>   Device removal reported: ua-a16720d0-bf85-4b34-97a9-1ea093658f31
> 
> But when we try to return the device to the host we get:
> 
>   libvirtError: Requested operation is not valid: PCI device 0000:08:10.0 is
> in use by driver QEMU
> 
> Considering what Michael writes in comment 6, it looks like a timing issue
> in libvirt/QEMU. We have seen various problems with event inconsistencies in
> the past.
> 

Did you consider that this bug only occurs on oVirt 4.3, but not on 4.2?

> I think that could be confirmed by adding something like time.sleep(1) at
> the right place in the traceback above, before device reattachment is
> performed. If it helps, a libvirt bug should be reported and a workaround
> added to Vdsm: something like a check for device disappearance in the
> domain XML before device removal, or re-attempting the device removal
> after a short delay.

Comment 10 Milan Zamazal 2018-12-10 08:46:49 UTC
(In reply to Dominik Holler from comment #9)

> Did you consider that this bug only occurs on oVirt 4.3, but not on 4.2?

Yes, as I wrote above, we use a different mechanism for device removal in 4.3: we used polling in 4.2, while events are used in 4.3.
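
For context, a minimal sketch of what the 4.2-style polling amounts to (hypothetical helper name; not the actual 4.2 Vdsm code): instead of trusting the event, wait until the device alias disappears from the live domain XML:

    import time
    import xml.etree.ElementTree as ET

    def wait_for_device_removal(domain, alias, timeout=10.0, interval=0.5):
        # Poll the live domain XML until the device alias is gone, i.e.
        # QEMU has really detached the device; return False on timeout.
        deadline = time.time() + timeout
        while time.time() < deadline:
            root = ET.fromstring(domain.XMLDesc(0))
            aliases = [a.get('name') for a in root.iter('alias')]
            if alias not in aliases:
                return True
            time.sleep(interval)
        return False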

Comment 13 Roni 2018-12-11 11:29:48 UTC
Hi Milan, it seems that your patch (https://gerrit.ovirt.org/96092) fixes the issue.
I've tried plug/unplug and also migration, and the problem was not reproduced.

Comment 14 Milan Zamazal 2018-12-11 11:47:10 UTC
Hi Roni, thank you for testing. So it indeed looks like a libvirt bug. Now we need a reproducer with libvirt debug log and to report a bug on libvirt.

Comment 15 Roni 2018-12-11 12:34:57 UTC
Created attachment 1513364 [details]
libvirt_logs_during_unplug_and_plug

Comment 16 Roni 2018-12-11 13:51:30 UTC
Opened libvirt bug: https://bugzilla.redhat.com/show_bug.cgi?id=1658198

Comment 17 Dan Kenigsberg 2018-12-12 10:38:39 UTC
Due to the underlying bug, Milan suggested retrying remove(device) if it fails the first time.

Comment 18 Milan Zamazal 2018-12-12 14:37:09 UTC
Do we need to introduce a workaround into Vdsm, or can we simply wait for the libvirt fix?

Comment 19 Michael Burman 2018-12-12 14:51:26 UTC
Dan, see Milan's question in comment 18. Anyhow, please keep this bug for our internal tracking (we use it as a skip in our automation).

Comment 20 Dan Kenigsberg 2018-12-12 15:45:38 UTC
(In reply to Milan Zamazal from comment #18)
> Do we need to introduce a workaround into Vdsm or can we simply wait for
> libvirt fix?

We wanted to release 4.3.0 by year end. This is a blocker. I think we need a workaround.

Comment 21 Milan Zamazal 2018-12-14 14:33:11 UTC
The Vdsm patch is merged, but it's just a workaround that should be removed once libvirt is fixed.

Comment 22 Dan Kenigsberg 2018-12-26 11:36:10 UTC
It is good enough to renew testing.

Comment 32 RHV bug bot 2019-01-15 23:36:18 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Tag 'v4.30.5' doesn't contain patch 'https://gerrit.ovirt.org/96652']
gitweb: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=shortlog;h=refs/tags/v4.30.5

For more info please contact: infra

Comment 35 Michael Burman 2019-02-04 11:27:13 UTC
The SR-IOV flow has been verified and no new regression was found.

We did find a new bug in 'TestSriovVm06.test_vm_with_sriov_network_and_hostdev':
if a VF is used as a host device, on VM shutdown the VF leaks and is considered taken by the engine. See BZ 1672252.

Verified on vdsm-4.30.7-1.el7ev, vdsm-4.30.8-1.el7 and 4.3.0.4-0.1.el7

Comment 36 Sandro Bonazzola 2019-02-13 07:43:15 UTC
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

