Bug 1672252 - [Host Device] - VF that is used as Host device leaking on VM shut down
Summary: [Host Device] - VF that is used as Host device leaking on VM shut down
Keywords:
Status: CLOSED DUPLICATE of bug 1446058
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.3.4
Target Release: ---
Assignee: Nobody
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-04 11:22 UTC by Michael Burman
Modified: 2019-04-15 04:43 UTC
CC List: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-04-15 04:43:04 UTC
oVirt Team: Network
Embargoed:
mtessun: ovirt-4.3?
rule-engine: blocker?
mtessun: planning_ack+
mtessun: devel_ack?
mburman: testing_ack-


Attachments
logs (590.66 KB, application/gzip)
2019-02-04 11:22 UTC, Michael Burman

Description Michael Burman 2019-02-04 11:22:33 UTC
Created attachment 1526737 [details]
logs

Description of problem:
[Host Device] - VF that is used as Host Device leaking on VM shut down

In our SR-IOV test run plan we have a test which includes running a VM with an SR-IOV vNIC (passthrough) plus a VF used as a host device.
"TestSriovVm06.test_vm_with_sriov_network_and_hostdev"

This test failed and a new bug was found: on VM shutdown, the VF host device was not released on the host and leaked. The engine considers the VF as non-free, and only a host reboot releases it. The bug reproduces 100% of the time.

Version-Release number of selected component (if applicable):
4.3.0.4-0.1.el7
vdsm-4.30.7-1.el7ev

How reproducible:
100%

Steps to Reproduce:
1. Enable 1 VF on an SR-IOV host (a low-level sketch of this step follows the list)
2. Add this VF to a VM as a host device - VM > Host Devices sub-tab > choose the
82599 Ethernet Controller Virtual Function (0x10ed), ixgbevf driver
3. Start the VM - the VF disappears from the host
4. Shut down the VM
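
The sketch below shows only a low-level equivalent of step 1 via sysfs, for illustration; in oVirt the VFs are normally enabled through the Setup Host Networks dialog, and the PF interface name "ens1f0" is a hypothetical example.

# Minimal sketch, assuming root access on the host and a hypothetical PF name.
PF = "ens1f0"
path = "/sys/class/net/{}/device/sriov_numvfs".format(PF)
with open(path, "w") as f:
    f.write("0")  # the kernel requires resetting to 0 before setting a new count
with open(path, "w") as f:
    f.write("1")  # enable exactly one VF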

Actual results:
The VF was not released as expected and leaked; the host device was not released on VM shutdown.

Expected results:
VM shutdown should release the host devices

Comment 1 Red Hat Bugzilla Rules Engine 2019-02-05 11:17:49 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 2 Dominik Holler 2019-02-19 09:31:58 UTC
Hi Milan, can you please have a look?

Comment 3 Milan Zamazal 2019-02-22 15:15:15 UTC
Hi Dominik, I can't reproduce the error, and I can't see any related error in the provided vdsm.log, nor any obvious reason why the device shouldn't be detached. Maybe the libvirt debug log could reveal whether the detach call was performed and succeeded.
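
For reference, one common way to get such logs (the filter list below is only an example, not a prescribed configuration) is to set debug logging in /etc/libvirt/libvirtd.conf and restart libvirtd:

log_level = 1
log_filters = "1:libvirt 1:qemu 1:util"
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"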

Comment 4 Milan Zamazal 2019-03-01 14:28:39 UTC
For the record, after further discussion with Dominik: after playing with it, I ended up with "Enabled virtual functions: 4, Free virtual functions: 3" being displayed in the Setup Host Networks dialog. This is with a rebooted host, and a host capabilities refresh doesn't help. The Host.hostdevListByCaps result from Vdsm looks fine to me, so it doesn't seem to be a host problem.
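
A quick way to inspect that result directly on the host, assuming the vdsm-client tool is available there, is:

vdsm-client Host hostdevListByCaps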

Comment 5 Michael Burman 2019-03-31 12:15:27 UTC
Milan, I'm not sure exactly what you did, but it doesn't sound right. This is indeed a host problem. The reason I think so is that the VF doesn't appear on the host after VM shutdown, which is evidence that this is a host issue (vdsm-virt, libvirt, I don't know which).

Before I start the VM, I see the VF with ip link. After VM shutdown the VF no longer exists on the host when I run ip link, so how exactly is this not a host issue? It is a host issue for sure. The VF is visible again only after a host reboot.

Comment 6 Milan Zamazal 2019-04-02 13:23:09 UTC
Hi Michael, I can't reproduce what you describe. Aleš and I ran your scenario together and our findings are:

- `ip link' always reports all the VFs, whether the VM is not yet running, already running, or stopped. Maybe your problem is specific to certain hardware?
- All the VFs are reported from Vdsm as expected.
- All the VFs are displayed in Engine host devices as expected, including their assignments.
- The VM can be restarted normally, even after removing and re-adding the host device(s). And the devices are visible in the guest.
- However number of VFs shown in Setup Host Networks dialog tooltip is reduced by the number of different VFs previously used (but it doesn't seem to influence anything).
- The driver displayed in Engine changes from igbvf to vfio-pci on VM start and remains as such. The driver is reported from libvirt. It seems to be consistent with what Engine asks for in the VM domain XML: <hostdev ...><driver name='vfio'/>...</hostdev>.
- The number of VFs shown in Setup Host Networks and the igbvf driver come back (only) after a host reboot (but they don't cause any real problems before the reboot).

So everything looks fine on my host and there is at least one minor problem in Engine. Since your VF disappears from `ip link' while my VFs are always present in `ip link', it looks like a difference in OS drivers or another problem below Vdsm.

I'd suggest enabling libvirt debug logs and looking there whether libvirt device reattach to host is called and succeeds on VM shutdown.
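
Regarding the driver change noted above, the sketch below is only an illustrative way to check from the host which kernel driver a VF is currently bound to (e.g. igbvf/ixgbevf before VM start, vfio-pci while the VM holds the device); the PCI address is an example.

# Minimal sketch: print the kernel driver the given PCI device is bound to.
import os

addr = "0000:05:10.0"  # example address; substitute the VF's PCI address
link = "/sys/bus/pci/devices/{}/driver".format(addr)
if os.path.islink(link):
    print(addr, "is bound to", os.path.basename(os.readlink(link)))
else:
    print(addr, "is not bound to any driver")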

Comment 7 Michael Burman 2019-04-03 04:20:08 UTC
(In reply to Milan Zamazal from comment #6)
> Hi Michael, I can't reproduce what you describe. Aleš and I ran your
> scenario together and our findings are:
> 
> - `ip link' always reports all the VFs, whether the VM is not yet running,
> already running, or stopped. Maybe your problem is specific to certain
> hardware?
No. The issue reproduced on several HW machines, in both manual and automated tests. This flow has been tested for over 2 years (since we started testing SR-IOV) and it always passed; this is how we discovered this issue.
> - All the VFs are reported from Vdsm as expected.
> - All the VFs are displayed in Engine host devices as expected, including
> their assignments.
> - The VM can be restarted normally, even after removing and re-adding the
> host device(s). And the devices are visible in the guest.
> - However number of VFs shown in Setup Host Networks dialog tooltip is
> reduced by the number of different VFs previously used (but it doesn't seem
> to influence anything).
I don't know why you think this is OK. It means that 1 VF is in use.
> - The driver displayed in Engine changes from igbvf to vfio-pci on VM start
> and remains as such. The driver is reported from libvirt. It seems to be
> consistent with what Engine asks for in the VM domain XML: <hostdev
> ...><driver name='vfio'/>...</hostdev>.
> - The number of VFs shown in Setup Host Networks and the igbvf driver come
> back (only) after a host reboot (but they don't cause any real problems
> before the reboot).
> 
> So everything looks fine on my host and there is at least one minor problem
> in Engine. Since your VF disappears from `ip link' while my VFs are always
> present in `ip link', it looks like a difference in OS drivers or another
> problem below Vdsm.
> 
> I'd suggest enabling libvirt debug logs and looking there whether libvirt
> device reattach to host is called and succeeds on VM shutdown.

Hi Milan, 
I will try to reach out to you when the issue is live. I believe that you are not doing what our test is doing.

Comment 8 Milan Zamazal 2019-04-03 07:33:52 UTC
(In reply to Michael Burman from comment #7)
> (In reply to Milan Zamazal from comment #6)

> > - However number of VFs shown in Setup Host Networks dialog tooltip is
> > reduced by the number of different VFs previously used (but it doesn't seem
> > to influence anything).
> I don't know why you think this is OK.

I don't think it's OK -- Engine reports the wrong number of free VFs there.

> It means that 1 VF is in use.

No, the VF is not in use on my setup and Engine apparently knows it.

> I will try to reach out to you when the issue is live. I believe that you are
> not doing what our test is doing.

OK, let's see if we do anything differently.

Comment 9 Milan Zamazal 2019-04-03 10:59:02 UTC
We have clarified the confusion with Michael: the PCI VF host devices are fine, the problems are with the corresponding network interfaces. I can observe them disappearing on my setup too.

I can't see any obvious failure in libvirtd.log. On VM start:

2019-04-03 07:42:04.736+0000: 19681: debug : virPCIDeviceNew:1807 : 8086 10ca 0000:05:10.0: initialized
2019-04-03 07:42:04.736+0000: 19681: debug : virPCIDeviceConfigOpen:318 : 8086 10ca 0000:05:10.0: opened /sys/bus/pci/devices/0000:05:10.0/config
2019-04-03 07:42:04.736+0000: 19681: debug : virFileClose:111 : Closed fd 26
2019-04-03 07:42:04.736+0000: 19681: debug : virHostdevPreparePCIDevices:774 : Not detaching unmanaged PCI device 0000:05:10.0
2019-04-03 07:42:04.736+0000: 19681: debug : virHostdevPreparePCIDevices:828 : Resetting PCI device 0000:05:10.0
2019-04-03 07:42:04.736+0000: 19681: debug : virPCIDeviceConfigOpen:318 : 8086 10ca 0000:05:10.0: opened /sys/bus/pci/devices/0000:05:10.0/config
2019-04-03 07:42:04.736+0000: 19681: debug : virFileClose:111 : Closed fd 26
2019-04-03 07:42:04.736+0000: 19681: debug : virPCIDeviceReset:963 : Device 0000:05:10.0 is bound to vfio-pci - skip reset
2019-04-03 07:42:04.736+0000: 19681: debug : virHostdevPreparePCIDevices:850 : Removing PCI device 0000:05:10.0 from inactive list
2019-04-03 07:42:04.736+0000: 19681: debug : virHostdevPreparePCIDevices:854 : Adding PCI device 0000:05:10.0 to active list
2019-04-03 07:42:04.736+0000: 19681: debug : virHostdevPreparePCIDevices:870 : Setting driver and domain information for PCI device 0000:05:10.0
2019-04-03 07:42:04.737+0000: 19681: debug : virHostdevPreparePCIDevices:900 : Saving network configuration of PCI device 0000:05:10.0

On VM shutdown:

2019-04-03 07:44:41.238+0000: 20677: debug : virPCIDeviceNew:1807 : 8086 10ca 0000:05:10.0: initialized
2019-04-03 07:44:41.238+0000: 20677: debug : virHostdevReAttachPCIDevices:1061 : Removing PCI device 0000:05:10.0 from active list
2019-04-03 07:44:41.238+0000: 20677: debug : virHostdevReAttachPCIDevices:1065 : Adding PCI device 0000:05:10.0 to inactive list
2019-04-03 07:44:41.238+0000: 20677: debug : virHostdevReAttachPCIDevices:1110 : Resetting PCI device 0000:05:10.0
2019-04-03 07:44:41.238+0000: 20677: debug : virPCIDeviceConfigOpen:318 : 8086 10ca 0000:05:10.0: opened /sys/bus/pci/devices/0000:05:10.0/config
2019-04-03 07:44:41.238+0000: 20677: debug : virFileClose:111 : Closed fd 26
2019-04-03 07:44:41.238+0000: 20677: debug : virPCIDeviceReset:963 : Device 0000:05:10.0 is bound to vfio-pci - skip reset
2019-04-03 07:44:41.238+0000: 20677: debug : virHostdevReAttachPCIDevices:1134 : Not reattaching unmanaged PCI device 0000:05:10.0

Both device detachment and reattachment are called as expected from Vdsm, and I don't know if anything else is needed for the network interfaces. I can't see a "Saving network configuration" counterpart on VM shutdown, which may or may not be the problem. But I don't know how networking and VF interface setup work in libvirt or elsewhere; someone from the networking team should take a look.

Comment 10 Milan Zamazal 2019-04-04 11:35:22 UTC
I looked at the wrong places again. The log messages cited above are from VM shutdown, not from a reAttach call.

The fact is that reattach is not called for PCI devices in Vdsm. I couldn't find any comment explaining why, but it's clearly intentional (see https://gerrit.ovirt.org/73145). I've heard there were problems with reattaching some PCI devices, so reattach mustn't be called. It is generally harmless since the host devices keep working. But in the case of VF PCI host devices it means the corresponding net link is not returned to the host when the VM is stopped, and that confuses Engine, as discussed above.
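
For illustration only (this is not Vdsm's code): a minimal sketch of what an explicit reattach of such an unmanaged PCI host device would look like through the libvirt Python bindings, using the PCI address from the logs above.

# Minimal sketch, assuming the libvirt Python bindings and a local qemu driver.
import libvirt

conn = libvirt.open("qemu:///system")
dev = conn.nodeDeviceLookupByName("pci_0000_05_10_0")  # address from the logs above
dev.reAttach()  # rebind the device to its host driver (e.g. igbvf/ixgbevf)
conn.close()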

Comment 11 Milan Zamazal 2019-04-04 12:03:01 UTC
Indeed, PCI host device reattaching was disabled because it used to crash hosts with some devices.

Dominik, I think the network team should decide what to do with it and to make a proper fix, such as disabling VF host device attachment in Engine or handling VF devices in a different way in Vdsm or something else.

Comment 12 Dominik Holler 2019-04-04 12:43:34 UTC
(In reply to Milan Zamazal from comment #11)
> Indeed, PCI host device reattaching was disabled because it used to crash
> hosts with some devices.
> 

Can you please help to find the related bug?
I understand that the related code changes introduced this regression.

> Dominik, I think the network team should decide what to do with it and to
> make a proper fix, 

Before that, we have to understand what was changed between 4.2 and 4.3 and why it was changed.
Milan, can you please help me find the related code changes or the related developers?

> such as disabling VF host device attachment in Engine or
> handling VF devices in a different way in Vdsm or something else.

Comment 13 Dominik Holler 2019-04-04 12:48:35 UTC
Michael, the description of bug 1672252 sounds like the bug was detected
by an automated test.
Do you know if 4.2.8 passed this test?
Do you know since when, or since which build, 4.3 does not pass this test?

Comment 14 Milan Zamazal 2019-04-04 12:59:19 UTC
(In reply to Dominik Holler from comment #12)
> (In reply to Milan Zamazal from comment #11)
> > Indeed, PCI host device reattaching was disabled because it used to crash
> > hosts with some devices.
> > 
> 
> Can you please help to find the related bug?

Michal, can you help?

> I understand that the related code changes introduced this regression.
> 
> > Dominik, I think the network team should decide what to do with it and to
> > make a proper fix, 
> 
> Before that, we have to understand what was changed between 4.2 and 4.3 and
> why it was changed.
> Milan, can you please help me find the related code changes or the
> related developers?

The corresponding code change is https://gerrit.ovirt.org/73145 (and related patches). It was made by Martin, who no longer works on Vdsm, but Michal should know what was done and why.

> > such as disabling VF host device attachment in Engine or
> > handling VF devices in a different way in Vdsm or something else.

Comment 15 Michal Skrivanek 2019-04-04 13:38:14 UTC
(In reply to Milan Zamazal from comment #14)
> (In reply to Dominik Holler from comment #12)
> > (In reply to Milan Zamazal from comment #11)
> > > Indeed, PCI host device reattaching was disabled because it used to crash
> > > hosts with some devices.
> > > 
> > 
> > Can you please help to find the related bug?
> 
> Michal, can you help?

No bug; that was part of the original implementation. PCI passthrough was primarily done for GPUs, and they don't generally support reinitialization.

> 
> > I understand that the related code changes introduced this regression.
> > 
> > > Dominik, I think the network team should decide what to do with it and to
> > > make a proper fix, 
> > 
> > Before that, we have to understand what was changed between 4.2 and 4.3 and
> > why it was changed.
> > Milan, can you please help me find the related code changes or the
> > related developers?
> 
> The corresponding code change is https://gerrit.ovirt.org/73145 (and related
> patches). It was made by Martin, who no longer works on Vdsm, but Michal
> should know what was done and why.

I do not think there was any change in vdsm between 4.2 and 4.3. No one touched that part for 2 years.

> 
> > > such as disabling VF host device attachment in Engine or
> > > handling VF devices in a different way in Vdsm or something else.

Comment 16 Milan Zamazal 2019-04-04 14:07:52 UTC
(In reply to Michal Skrivanek from comment #15)
> (In reply to Milan Zamazal from comment #14)

> > The corresponding code change is https://gerrit.ovirt.org/73145 (and related
> > patches). It was made by Martin, who no longer works on Vdsm, but Michal
> > should know what was done and why.
> 
> I do not think there was any change in vdsm between 4.2 and 4.3. No one
> touched that part for 2 years.

Yes, the code in 4.2 is the same. A change in libvirt behavior is still an option.

Comment 17 Michael Burman 2019-04-04 14:52:25 UTC
(In reply to Dominik Holler from comment #13)
> Michael, the description of bug 1672252 sounds like the bug was detected
> by an automated test.
Yes. 
> Do you know if 4.2.8 passed this test?
4.2.8 did pass
> Do you know since when, or since which build, 4.3 does not pass this test?
We first saw this failure after the 4.3.0-14 release.
But I did manage to reproduce it manually just now on 4.2.8, which confirms your comments. Maybe it was masked; I need to check previous 4.2 executions.

If this is the case, I'm for disabling VF host device attachment in RHV. There is no point in allowing it if the VF won't be released on VM shutdown.
I'm also going to remove the host device SR-IOV tests (2 of them) from our suite.

Comment 18 Michael Burman 2019-04-04 15:04:04 UTC
OK, so everything is clear now. The test was skipped on 4.2 because we found the issue on 4.1 :). We requested an RFE to have an option to re-attach devices, and it was closed as WONTFIX: BZ 1446058.
This means that this bug is a duplicate and we were already aware of this issue. For some reason the test was running in our 4.3 suite.

Comment 19 Michael Burman 2019-04-15 04:43:04 UTC
All host device + SR-IOV tests were removed. This issue/bug/request is not going to be fixed or addressed. Closing as a duplicate of BZ 1446058.

See comment 15 above:
"no bug. that was part of the original implementation. PCI pt was primarily done for GPUs and they don't generally support reinitialization."

*** This bug has been marked as a duplicate of bug 1446058 ***

