Bug 1817001 - [SR-IOV] [I40E] Hotunplug doesn't release the VF on the host
Summary: [SR-IOV] [I40E] Hotunplug doesn't release the VF on the host
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.4.0
: ---
Assignee: Ales Musil
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-03-25 11:31 UTC by Michael Burman
Modified: 2020-05-20 20:00 UTC (History)
10 users

Fixed In Version: vdsm-4.40.12
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 20:00:50 UTC
oVirt Team: Network
Embargoed:
michal.skrivanek: ovirt-4.4?




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 108106 0 master MERGED virt, net: Wait for link up after hostdev reattach 2020-12-14 09:33:39 UTC
oVirt gerrit 108125 0 master MERGED net: Add pci link up monitor 2020-12-14 09:33:38 UTC

Description Michael Burman 2020-03-25 11:31:06 UTC
Description of problem:
[SR-IOV] [I40E] Hotunplug doesn't release the VF on the host

We face new SR-IOV regression with i40e driver HW.
Hot unplug doesn't release the VF back to the host; the VF leaks in the engine and is still considered taken.

Version-Release number of selected component (if applicable):


How reproducible:
100% on i40e HW

Steps to Reproduce:
1. Enable 1 VF on i40e host
2. Run a VM using this VF as a passthrough vNIC
3. Unplug the vNIC from the VM

Actual results:
VF is not released 

Expected results:
VF must be released on hot unplug

Additional info:
On VM shutdown the VF is released as expected
On an Intel igb SR-IOV host the VF is also released as expected on VM shutdown

Comment 1 Michael Burman 2020-03-25 11:36:22 UTC
nmstate-0.2.6-4.8.el8.noarch
vdsm-4.40.7-1.el8ev.x86_64
NetworkManager-1.22.8-4.el8.x86_64
kernel-4.18.0-191.el8.x86_64
libvirt-daemon-6.0.0-14.module+el8.2.0+6069+78a1cb09.x86_64

Comment 3 Michael Burman 2020-03-25 11:48:40 UTC
We had a similar issue in the past: the VF was released on the host, but the engine was not aware of it. A refresh caps fixed it.

Comment 5 Ryan Barry 2020-03-25 15:38:08 UTC
Well, if it were a race, it wouldn't be 100% reproducible. This is just timing, as the operation seemingly takes longer with this driver. Milan, any thoughts?

Comment 11 Milan Zamazal 2020-03-26 14:11:32 UTC
I checked that the libvirt device reattach is called before the NIC hot unplug API call returns. The corresponding libvirt function documentation (https://libvirt.org/html/libvirt-libvirt-nodedev.html#virNodeDeviceReAttach) doesn't tell us what the success of the call means; most likely the returned device is not visible immediately after the call returns. If it works similarly to device hot unplug, then maybe a node device lifecycle event could be used to get a notification about completion; again, the libvirt documentation is very vague (https://libvirt.org/html/libvirt-libvirt-nodedev.html#virConnectNodeDeviceEventLifecycleCallback). I think it should be clarified with the libvirt developers how it is supposed to work. Then we can discuss possible solutions.

On closer inspection, I can't recall any change between 4.3 and 4.4 that would cause a regression. It's a timing issue and everything is different in 4.4 (kernel, QEMU, libvirt, Vdsm, Python, ...), something perhaps faster, something slower, which may cause the change. The hot unplug performs the same actions as before.

Comment 12 Laine Stump 2020-03-26 15:54:15 UTC
It would be helpful for me to know exactly what vdsm is doing (and in response to what) in libvirt terms. In particular, what triggers the "hotplug event" and "refresh caps", and what does vdsm do in response to those events.

I see from the libvirt XML that vdsm is using managed='no', so if there is any "virNodeDeviceReAttach" happening, then vdsm is doing it. A few points:

1) if you're only using the VFs in guests, then there is no need to constantly do virNodeDeviceDetach()/virNodeDeviceReattach(); that actually does funky things with the host's networking stack, as it keeps seeing netdevs disappear and reappear.

2) vdsm should not be calling virNodeDeviceReAttach() until it sees the DEVICE_DELETED event from libvirt. Until that event is seen, the device hasn't yet been released by QEMU, and so libvirt still holds it on its list of in-use devices, and won't be able to reattach the host net driver (which is what virNodeDeviceReAttach() is doing)

Note that there have been several reports of timing-related problems with the i40e cards/drivers, and I can't keep track of the current state. See Bug 1376907 for a very early instance of the general problem. I think there was later an instance specifically with the i40e driver, but I don't have the search fu to find it right now - sassmann may be able to help there.


As for what virNodeDeviceReAttach() does - it is essentially the same thing as this short shell script (assuming a PCI device at address 0000:03:00.0):

 echo driver-name > /sys/bus/pci/devices/0000:03:00.0/driver_override
 echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
 echo 0000:03:00.0 > /sys/bus/pci/drivers_probe

(libvirt just directly opens the files and writes to them rather than calling a shell script). It doesn't return until the drivers_probe file has been closed. I recall being told by someone [kernel people?] in the past that everything with the device should be ready by the time drivers_probe has closed, but I will include my standard disclaimer that sometimes I remember exactly the opposite of what actually happened :-)

If the driver probe isn't synchronous, I don't think there is any libvirt nodedev event that could help you (unless maybe you knew the name of the netdev associated with the PCI device, and watched for the DEVICE_EVENT_CREATED for that device? I haven't ever used these events, so I'm just guessing).

In the end, the fact that it's working properly with igb but not with i40e says that kernel driver people should at least be called in to look (sorry sassmann :-))

Comment 13 Michal Skrivanek 2020-03-26 16:12:05 UTC
IIUC it's not leaking; it really is released eventually, and a subsequent refresh caps fixes that.

For automation you can perhaps wait and issue another refresh later. Not sure how long it takes, but you have the hardware, so use whatever value seems to work there.

Comment 15 Milan Zamazal 2020-03-26 16:48:37 UTC
(In reply to Laine Stump from comment #12)
> It would be helpful for me to know exactly what vdsm is doing (and in
> response to what) in libvirt terms. In particular, what triggers the
> "hotplug event" and "refresh caps", and what does vdsm do in response to
> those events.

Engine makes an API call on Vdsm to hot unplug the device. Vdsm calls virDomainDetachDevice and waits for VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED event. Once the event arrives, Vdsm looks up the device using virNodeDeviceLookupByName and calls virNodeDeviceReAttach on it. Then Vdsm replies to the Engine API call. Then Engine makes another API call on Vdsm to ask about the status of the host and its devices. As part of that, Vdsm networking code gathers information about available interfaces etc., the networking team can provide details on how this works.
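The ordering described above can be sketched in Python. This is illustrative only, not vdsm code: the `HotUnplug` class and the injected callables are hypothetical stand-ins, where detach() plays the role of virDomainDetachDevice and reattach() the role of virNodeDeviceReAttach.

```python
import threading

class HotUnplug:
    """Sketch of the vdsm ordering: reattach runs only inside the
    DEVICE_REMOVED callback, so it can never run before the event
    arrives, and the Engine API call is answered only afterwards."""

    def __init__(self, detach, reattach):
        self._detach = detach        # stand-in for virDomainDetachDevice
        self._reattach = reattach    # stand-in for virNodeDeviceReAttach
        self._removed = threading.Event()

    def on_device_removed(self, dev):
        # Callback registered for VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED.
        self._reattach(dev)
        self._removed.set()

    def unplug(self, dev, timeout=30.0):
        self._detach(dev)
        # Reply to the Engine API call only once the event has been
        # handled (i.e. after reattach); returns False on timeout.
        return self._removed.wait(timeout)
```

Note that nothing in this ordering guarantees the VF is usable on the host when unplug() returns; it only guarantees the reattach call itself has completed.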

> I see from the libvirt XML that vdsm is using managed='no', so if there is
> any "virNodeDeviceReAttach" happening, then vdsm is doing it. A few points:
> 
> 1) if you're only using the VFs in guests, then there is no need to
> constantly do virNodeDeviceDetach()/virNodeDeviceReattach(); that actually
> does funky things with the host's networking stack, as it keeps seeing
> netdevs disappear and reappear.

I can't comment on this, the networking team should know what they need.

> 2) vdsm should not be calling virNodeDeviceReAttach() until it sees the
> DEVICE_DELETED event from libvirt. Until that event is seen, the device
> hasn't yet been released by QEMU, and so libvirt still holds it on its list
> of in-use devices, and won't be able to reattach the host net driver (which
> is what virNodeDeviceReAttach() is doing)

Re-attach is called within the callback on VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED, so it can't be called before the event occurs.

Comment 16 Stefan Assmann 2020-03-27 08:15:23 UTC
Thanks Laine for dragging me into this! ;)

I'll check what's going on, but first I need a little bit more info on what exactly is not working as expected.
"VF must released on hotplug" is not something I can work with.

Can you provide a reproducer, i.e. shell script, that demonstrates what does not work as expected?
Also, has it ever worked as you expected? If so, please provide the last working kernel version.
Also please provide a sosreport of the system having issues.

Comment 17 Michael Burman 2020-03-27 09:15:37 UTC
(In reply to Stefan Assmann from comment #16)
> Thanks Laine for dragging me into this! ;)
> 
> I'll check what's going on, but first I need a little bit more info on what
> exactly is not working as expected.
> "VF must released on hotplug" is not something I can work with.
> 
> Can you provide a reproducer, i.e. shell script, that demonstrates what does
> not work as expected?
> Also, has it ever worked as you expected? If so, please provide the last
> working kernel version.
> Also please provide a sosreport of the system having issues.

Hi Stefan and thank you.

When we trigger a hot unplug of a vNIC on a VM that uses an SR-IOV passthrough NIC, we expect the VF to be released on the host and this to be reflected in the RHV engine UI.
When a passthrough vNIC is attached to a running VM, the VF is taken directly from the host to the guest and can't be used by other VMs; it also disappears from the engine UI.
When we hot unplug the vNIC from the VM, the VF is expected to be visible again on the host, free to use, and visible again in the engine UI.

What is not working on the RHEL 8.2 kernel with this driver is that the event for the successful hot unplug arrives after we execute the refresh caps of the host devices to the engine, so the engine UI misses the real VF state on the host: even though the VF is free and visible on the host again, the change is not visible in the RHV engine UI. That is what is not working.
This works fine on this hardware with RHEL 7.7, but not with RHEL 8, with the same i40e driver. Maybe the driver behaves slower on RHEL 8.
So what is expected on a VM NIC hot unplug is that it succeeds, the event reaches vdsm, and then we initiate another event to refresh the capabilities of the host devices to the engine UI.
As it is now, by the time the event arrives we have already initiated the refresh caps, so it looks like the hot unplug takes more time on RHEL 8 than on RHEL 7 with this driver.

The flow is also explained by Milan in comment 15.

Why do you need a full sosreport; which logs are required? I don't have the setup at the moment, but I will have it and give you a reproducer.

Comment 20 Stefan Assmann 2020-03-27 10:09:12 UTC
(In reply to Michael Burman from comment #17)
[...]
> Hi Stefan and thank you.
> 
> When we trigger a hotunplug vNIC to a VM that uses a SR-IOV passthroguh nic,
> we expect that the VF will be released on the host and reflected in the RHV
> engine UI. 
> When a passthrough vNIC is attached to a running VM, the VF is taken
> directly from the host to the guest and it's can't be used by other VMs, it
> is also disappears from the engine UI.

I get the idea of what you're saying, still this is too high level for me to work with. I need to know exactly what RHV engine is doing, then we need to isolate that into a reproducer I can run on a test host to debug. Similar to what Laine was describing in comment #12.

> When we hotunplug the vNIC from the VM, the VF expected to be visible again
> on the host, free to use and the VF should be visible in the engine UI.

I'm pretty sure the VF reappears on the host once released by the guest.
 
> What is not working on rhel8.2 kernel with this driver, is the the event of
> the successful hotunplug comes after we execute the refresh caps of the host
> devices to the engine and this way we miss to reflect the engine UI the real
> VF state on the host, even if the VF is free and visible on the host again,
> the change not visible in the RHV engine UI. And this is what not working. 
> This is working fine on the HW with rhel7.7, but not rhel8. Same i40e
> driver. Maybe with this driver on rhel8 it behaves slower now.

As mentioned previously by Laine and in 
https://bugzilla.redhat.com/show_bug.cgi?id=1376907
afaik the procedure we're dealing with is asynchronous, which means there are no guarantees for how long it takes for the VF device to reappear on the host. RHV/vdsm need to cope with this one way or another.

> So what expected on VM hotunplug nic, is that it will succeed, the event
> come to vdsm and then we initiate another event to refresh the capabilities
> of the host devices to the engine UI. 
> Now the event is coming we already initiated the refresh caps, so it looks
> like the hotunplug taking more time on rhel8 than on rhel7 with this driver.
> 
> Also explained the flow by Milan in comment 15.
> 
> Why do you need a full sos report, which logs are required? i dont have the
> setup at the moment, but i will have it and give you a reproducer.

A sosreport is better in this case as we can look at information we currently don't think of later on, w/o the need to request individual logs. I'm currently interested in dmesg from after the problem occurred, lspci -nnvv, ethtool -i <PF>

Comment 21 Stefan Assmann 2020-03-27 10:11:22 UTC
Also, as mentioned earlier, please test if this ever worked with any RHEL8 kernel. Having a known good RHEL8 kernel would be a great help.

Comment 22 Michael Burman 2020-03-27 10:18:30 UTC
(In reply to Stefan Assmann from comment #20)
> (In reply to Michael Burman from comment #17)
> [...]
> > Hi Stefan and thank you.
> > 
> > When we trigger a hotunplug vNIC to a VM that uses a SR-IOV passthroguh nic,
> > we expect that the VF will be released on the host and reflected in the RHV
> > engine UI. 
> > When a passthrough vNIC is attached to a running VM, the VF is taken
> > directly from the host to the guest and it's can't be used by other VMs, it
> > is also disappears from the engine UI.
> 
> I get the idea of what you're saying, still this is too high level for me to
> work with. I need to know exactly what RHV engine is doing, then we need to
> isolate that into a reproducer I can run on a test host to debug. Similar to
> what Laine was describing in comment #12.
> 
> > When we hotunplug the vNIC from the VM, the VF expected to be visible again
> > on the host, free to use and the VF should be visible in the engine UI.
> 
> I'm pretty sure the VF reappears on the host once released by the guest.
Yes, it reappears on the host, but it's not visible in the engine UI.
>  
> > What is not working on rhel8.2 kernel with this driver, is the the event of
> > the successful hotunplug comes after we execute the refresh caps of the host
> > devices to the engine and this way we miss to reflect the engine UI the real
> > VF state on the host, even if the VF is free and visible on the host again,
> > the change not visible in the RHV engine UI. And this is what not working. 
> > This is working fine on the HW with rhel7.7, but not rhel8. Same i40e
> > driver. Maybe with this driver on rhel8 it behaves slower now.
> 
> As mentioned previously by Laine and in 
> https://bugzilla.redhat.com/show_bug.cgi?id=1376907
> afaik the procedure we're dealing with is asynchronous. Which means, there
> are no guarantees for how long it takes for the VF device to reappear in the
> host. RHV, vdsm need to cope with this one way or another.
Agreed, RHV/vdsm need to handle this.
> 
> > So what expected on VM hotunplug nic, is that it will succeed, the event
> > come to vdsm and then we initiate another event to refresh the capabilities
> > of the host devices to the engine UI. 
> > Now the event is coming we already initiated the refresh caps, so it looks
> > like the hotunplug taking more time on rhel8 than on rhel7 with this driver.
> > 
> > Also explained the flow by Milan in comment 15.
> > 
> > Why do you need a full sos report, which logs are required? i dont have the
> > setup at the moment, but i will have it and give you a reproducer.
> 
> A sosreport is better in this case as we can look at information we
> currently don't think of later on, w/o the need to request individual logs.
> I'm currently interested in dmesg from after the problem occurred, lspci
> -nnvv, ethtool -i <PF>
driver: i40e
version: 2.8.20-k
firmware-version: 6.80 0x80003d1e 18.8.9
expansion-rom-version: 
bus-info: 0000:b3:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

I have no proof it worked fine on RHEL 8, because we started testing it very late in our automation.
But I do have proof it works on RHEL 7.

Comment 24 Stefan Assmann 2020-03-27 10:22:13 UTC
(In reply to Michael Burman from comment #22)
> I have no proof it worked fine on rhel8, becasue we started testing very
> late in our automation.
> But i have a proof it working on rhel7.

Understood, but RHEL 7 doesn't help us here. Please test the RHEL 8.0 and RHEL 8.1 kernels.

Comment 25 Milan Zamazal 2020-03-27 10:47:48 UTC
(In reply to Michael Burman from comment #17)

> What is not working on rhel8.2 kernel with this driver, is the the event of
> the successful hotunplug comes after we execute the refresh caps of the host
> devices to the engine and this way we miss to reflect the engine UI the real
> VF state on the host,

The hot unplug event comes before the refresh and before the device reattachment. And the refresh caps comes only after we return from the libvirt reattachment call.

As I understand Laine's and Stefan's comments and the referred bug, some reattachment-related actions are asynchronous, and even udev notifications may come before everything is ready. It's a bit unfortunate that all upper layers must deal with that and perform polling/retries, but unless it can be fixed, we indeed have to deal with it in Vdsm or Engine. I think Vdsm networking code could add a wait loop (with a proper timeout etc.), checking interface availability, to the NIC device teardown after the device reattachment call. If that isn't possible (e.g. in case the corresponding interface is not known), then some (ugly) workaround such as a repeated caps refresh call from Engine after a small delay could be acceptable. Or maybe the network team can find another solution.
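The wait loop mentioned above could look roughly like the following sketch. The function names, the injected readiness predicate, and the sysfs path usage are all hypothetical; this is a minimal illustration of a poll-with-timeout, not the actual vdsm patch.

```python
import os
import time

def wait_for_vf(is_ready, timeout=10.0, interval=0.2,
                clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or the timeout expires.

    Returns True if the device became ready in time, False on timeout.
    clock/sleep are injectable to keep the loop testable.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

def vf_netdev_present(pci_addr):
    # One possible readiness check (illustrative): after reattach, the
    # VF's netdev directory reappears under its sysfs device node.
    return os.path.isdir("/sys/bus/pci/devices/%s/net" % pci_addr)
```

A caller would then do something like `wait_for_vf(lambda: vf_netdev_present("0000:b3:02.0"))` during NIC teardown, right after the reattach call, and log or raise on a False return.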

Comment 33 Lukas Svaty 2020-04-01 10:00:04 UTC
Targeting 4.4.0 as it's a regression

Comment 34 Michael Burman 2020-04-05 13:19:23 UTC
Unfortunately vdsm-4.40.11-1.el8ev.x86_64 doesn't include the fix

Comment 35 Michael Burman 2020-04-12 11:19:21 UTC
Verified on - vdsm-4.40.13-1.el8ev.x86_64 with nmstate-0.2.6-6.el8.noarch

Comment 36 Sandro Bonazzola 2020-05-20 20:00:50 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

