Bug 2084530 - Vm vnic with port-mirroring hot-plug succeeds but is not reported in RHV DB for over 60 sec
Summary: Vm vnic with port-mirroring hot-plug succeeds but is not reported in RHV DB f...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Network
Version: 4.5.0.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.5.2
Target Release: ---
Assignee: eraviv
QA Contact: msheena
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-12 11:26 UTC by msheena
Modified: 2022-08-30 08:47 UTC (History)

Fixed In Version: ovirt-engine-4.5.2
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-30 08:47:42 UTC
oVirt Team: Network
Embargoed:
mperina: ovirt-4.5+
lsvaty: blocker-


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | oVirt ovirt-engine pull 419 | 0 | None | open | core: vnic hot un\plug - avoid race | 2022-06-07 11:26:04 UTC
Github | oVirt ovirt-system-tests pull 175 | 0 | None | Merged | network: ensure multiple hot plug\unplug vnics is safe | 2022-07-13 15:25:27 UTC
Red Hat Issue Tracker | RHV-46016 | 0 | None | None | None | 2022-05-12 11:38:54 UTC

Description msheena 2022-05-12 11:26:02 UTC
Description of problem
======================
Given I have a vm with an unplugged vnic with port-mirroring
When I hot-plug the vnic via REST API
Then for at least 60 seconds while GET-ting the vnic 'plugged' state it is reported as 'false'

Version-Release number of selected component (if applicable)
============================================================
ovirt-engine-4.5.0.6-0.7.el8ev.noarch

How reproducible
================
Roughly 30% in automated test

Steps to Reproduce
==================
1. Create a VLAN tagged '5' network 'net'
2. Create a vnic-profile 'vnic-profile' pointing to 'net' with port-mirroring
3. Attach 'net' to an empty nic on a host
4. Start a vm on the host
5. Add 3 vnics to a vm (which already has 1 vnic pointing to ovirtmgmt) with all 3 vnics pointing to 'vnic-profile'
6. Unplug all 3 vnics
7. Hot-plug the 3rd vnic
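
For reference, the hot-plug in the last step can be sketched as a REST call. This is a minimal sketch, not the test suite's code: it assumes the vnic is plugged through the 'activate' action on the nic resource, and the helper name, base URL, and bearer-token authentication are assumptions made for illustration. The request is only built here, not sent.

```python
import urllib.request


def build_nic_action_request(base_url, vm_id, nic_id, action="activate", token="TOKEN"):
    """Build (but do not send) the request that plugs or unplugs a vnic.

    Assumption: 'activate' plugs the vnic and 'deactivate' unplugs it.
    The helper name and the token handling are hypothetical.
    """
    if action not in ("activate", "deactivate"):
        raise ValueError("action must be 'activate' or 'deactivate'")
    url = f"{base_url}/ovirt-engine/api/vms/{vm_id}/nics/{nic_id}/{action}"
    return urllib.request.Request(
        url,
        data=b"<action/>",  # empty action body
        method="POST",
        headers={
            "Content-Type": "application/xml",
            "Accept": "application/xml",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )
```

Passing the returned object to urllib.request.urlopen() would send it; an OK reply only means the engine accepted the request, which is exactly the situation described under "Actual results".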

Actual results
==============
RHV replies with OK to the hot-plug request, yet for 60 seconds (and possibly more) when querying /ovirt-engine/api/vms/<VM_ID>/nics then the nic is reported with <plugged>false</plugged>

Expected results
================
RHV replies with OK and the nic is reported as <plugged>true</plugged> within 5 seconds
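
The check behind the actual and expected results can be sketched as a polling loop. This is a sketch under assumptions, not the automated test itself: the function names are hypothetical, and the XML is assumed to be a list of <nic id="..."> elements, each with a <plugged> child, as returned by /ovirt-engine/api/vms/<VM_ID>/nics. The fetch function is passed in so the loop can be exercised without an engine.

```python
import time
import xml.etree.ElementTree as ET


def nic_is_plugged(nics_xml, nic_id):
    """Return True if the nic with the given id reports <plugged>true</plugged>."""
    root = ET.fromstring(nics_xml)
    for nic in root.iter("nic"):
        if nic.get("id") == nic_id:
            return (nic.findtext("plugged") or "").strip() == "true"
    raise LookupError(f"nic {nic_id} not found in response")


def wait_until_plugged(fetch_nics_xml, nic_id, timeout=5.0, interval=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until the nic is reported plugged or the timeout expires.

    fetch_nics_xml is a caller-supplied function returning the XML text
    of the nics collection (hypothetical; e.g. a wrapper around a GET).
    """
    deadline = clock() + timeout
    while True:
        if nic_is_plugged(fetch_nics_xml(), nic_id):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With timeout=5 this matches the expected result above; the failing runs correspond to the loop returning False even with timeout=60.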

Additional info
===============
* The vm having 3 vnics pointing to 'net' is a result of our test suite creating these for the purpose of shared setup. While the specific test that caught this regression is executed, the first 2 vnics pointing to 'net' are unplugged.

* The expected result of the vnic being reported as plugged within 5 seconds is debatable. Previously this test ran on 4.4 with a waiting time of 10 seconds, and the wait would usually amount to 1-2 seconds or less.

* Both RHV and VDSM report the hot-plug as successful immediately so there might be an issue with the changing of the 'plugged' attribute of the vnic in RHV DB

Comment 2 Martin Perina 2022-05-12 11:30:22 UTC
(In reply to msheena from comment #0)
> Description of problem
> ======================
> Given I have a vm with an unplugged vnic with port-mirroring
> When I hot-plug the vnic via REST API
> Then for at least 60 seconds while GET-ting the vnic 'plugged' state it is
> reported as 'false'
> 
> Version-Release number of selected component (if applicable)
> ============================================================
> ovirt-engine-4.5.0.6-0.7.el8ev.noarch
> 
> How reproducible
> ================
> Roughly 30% in automated test
> 
> Steps to Reproduce
> ==================
> 1. Create a VLAN tagged '5' network 'net'
> 2. Create a vnic-profile 'vnic-profile' pointing to 'net' with port-mirroring
> 3. Attach 'net' to an empty nic on a host
> 4. Start a vm on the host
> 5. Add 3 vnics to a vm (which already has 1 vnic pointing to ovirtmgmt) with
> all 3 vnics pointing to 'vnic-profile'
> 6. Unplug all 3 vnics
> 7. Hot-plug the 3rd vnic
> 
> Actual results
> ==============
> RHV replies with OK to the hot-plug request, yet for 60 seconds (and
> possibly more) when querying /ovirt-engine/api/vms/<VM_ID>/nics then the nic
> is reported with <plugged>false</plugged>

Is this really a bug, or does the hotplug just take a long time? I don't think we wait for the hotplug action to finish, or do we? And even if we do, isn't the status changed only after another round of VM monitoring?

Comment 3 Michael Burman 2022-05-12 11:36:42 UTC
(In reply to Martin Perina from comment #2)
> (In reply to msheena from comment #0)
> > Description of problem
> > ======================
> > Given I have a vm with an unplugged vnic with port-mirroring
> > When I hot-plug the vnic via REST API
> > Then for at least 60 seconds while GET-ting the vnic 'plugged' state it is
> > reported as 'false'
> > 
> > Version-Release number of selected component (if applicable)
> > ============================================================
> > ovirt-engine-4.5.0.6-0.7.el8ev.noarch
> > 
> > How reproducible
> > ================
> > Roughly 30% in automated test
> > 
> > Steps to Reproduce
> > ==================
> > 1. Create a VLAN tagged '5' network 'net'
> > 2. Create a vnic-profile 'vnic-profile' pointing to 'net' with port-mirroring
> > 3. Attach 'net' to an empty nic on a host
> > 4. Start a vm on the host
> > 5. Add 3 vnics to a vm (which already has 1 vnic pointing to ovirtmgmt) with
> > all 3 vnics pointing to 'vnic-profile'
> > 6. Unplug all 3 vnics
> > 7. Hot-plug the 3rd vnic
> > 
> > Actual results
> > ==============
> > RHV replies with OK to the hot-plug request, yet for 60 seconds (and
> > possibly more) when querying /ovirt-engine/api/vms/<VM_ID>/nics then the nic
> > is reported with <plugged>false</plugged>
> 
> Is this really a bug, or does the hotplug just take a long time? I don't think
> we wait for the hotplug action to finish, or do we? And even if we do, isn't
> the status changed only after another round of VM monitoring?

Hi Martin, we suspect this is an engine bug. We started to see it on 4.5 only.
The engine doesn't behave as expected and doesn't report the correct state for many minutes. Investigation from DEV is needed here. Thanks!

Comment 4 Michal Skrivanek 2022-05-13 11:21:52 UTC
- "doesn't report the correct state for many minutes" - IIUC your test kills the VM after 1 minute. So how many minutes you've seen this not reported?
- does it happen with simple NIC plug? no mirroring, nothing fancy
- what is the OS in the VM? Does the VM actually appear in guest OS?
- in which build was it last succeeding? did it pass in any 4.5 test or does it fail since the first one - which one?

Comment 6 msheena 2022-05-15 13:46:33 UTC
(In reply to Michal Skrivanek from comment #4)
> - "doesn't report the correct state for many minutes" - IIUC your test kills
> the VM after 1 minute. So how many minutes you've seen this not reported?
This will need to be tested as the furthest we've gone is 1 minute (this test used to pass with a 10-second wait where usually it took RHV 1-2 seconds or even less to change the plugged attribute)
I felt like for RHV and VDSM to report nothing wrong in the action, whilst for an entire minute the plugged attribute for the vnic remains false, this is enough to open a bug.
> - does it happen with simple NIC plug? no mirroring, nothing fancy
No this does not reproduce for a simple nic plug - it seems port-mirroring is the only flow for which it reproduces
> - what is the OS in the VM? Does the VM actually appear in guest OS?
RHEL 8.6
> - in which build was it last succeeding? did it pass in any 4.5 test or does
> it fail since the first one - which one?
We saw it fail like so on 4.4 builds: 4.4.8-3 , 4.4.9-9
For 4.5 it seems to be reproducing at a significantly higher rate, though it is flaky since it passed on ovirt-engine-4.5.0.6-0.7.el8ev.noarch

Comment 7 Michal Skrivanek 2022-05-16 11:45:00 UTC
(In reply to msheena from comment #6)
> (In reply to Michal Skrivanek from comment #4)
> > - "doesn't report the correct state for many minutes" - IIUC your test kills
> > the VM after 1 minute. So how many minutes you've seen this not reported?
> This will need to be tested as the furthest we've gone is 1 minute (this
> test used to pass with a 10-second wait where usually it took RHV 1-2
> seconds or even less to change the plugged attribute)

can you try to increase it then to see if it's just slow? 

> I felt like for RHV and VDSM to report nothing wrong in the action, whilst
> for an entire minute the plugged attribute for the vnic remains false, this
> is enough to open a bug.
> > - does it happen with simple NIC plug? no mirroring, nothing fancy
> No this does not reproduce for a simple nic plug - it seems port-mirroring
> is the only flow for which it reproduces

interesting. Can you also add supervdsm.log?

> > - what is the OS in the VM? Does the VM actually appear in guest OS?
> RHEL 8.6

I meant "Does the NIC actually appear in guest OS". I.e. was it actually plugged?
if it was not, i guess you can try hotplug again?

> > - in which build was it last succeeding? did it pass in any 4.5 test or does
> > it fail since the first one - which one?
> We saw it fail like so on 4.4 builds: 4.4.8-3 , 4.4.9-9
> For 4.5 it seems to be reproducing at a significantly higher rate, though it
> is flaky since it passed on ovirt-engine-4.5.0.6-0.7.el8ev.noarch

good to know it's happening on various RHELs.

Comment 8 msheena 2022-05-16 14:58:22 UTC
(In reply to Michal Skrivanek from comment #7)
> (In reply to msheena from comment #6)
> > (In reply to Michal Skrivanek from comment #4)
> > > - "doesn't report the correct state for many minutes" - IIUC your test kills
> > > the VM after 1 minute. So how many minutes you've seen this not reported?
> > This will need to be tested as the furthest we've gone is 1 minute (this
> > test used to pass with a 10-second wait where usually it took RHV 1-2
> > seconds or even less to change the plugged attribute)
> 
> can you try to increase it then to see if it's just slow? 

I was able to notice that after several minutes the 'plugged' attribute did change to true

> 
> > I felt like for RHV and VDSM to report nothing wrong in the action, whilst
> > for an entire minute the plugged attribute for the vnic remains false, this
> > is enough to open a bug.
> > > - does it happen with simple NIC plug? no mirroring, nothing fancy
> > No this does not reproduce for a simple nic plug - it seems port-mirroring
> > is the only flow for which it reproduces
> 
> interesting. Can you also add supervdsm.log?
Will look for one with it from today's reproduction
> 
> > > - what is the OS in the VM? Does the VM actually appear in guest OS?
> > RHEL 8.6
> 
> I meant "Does the NIC actually appear in guest OS". I.e. was it actually
> plugged?
> if it was not, i guess you can try hotplug again?

When I reproduced it today I opened a console to the guest and noticed the vnic is there and is plugged. UI was reporting it as unplugged.

> 
> > > - in which build was it last succeeding? did it pass in any 4.5 test or does
> > > it fail since the first one - which one?
> > We saw it fail like so on 4.4 builds: 4.4.8-3 , 4.4.9-9
> > For 4.5 it seems to be reproducing at a significantly higher rate, though it
> > is flaky since it passed on ovirt-engine-4.5.0.6-0.7.el8ev.noarch
> 
> good to know it's happening on various RHELs.

Comment 9 Lukas Svaty 2022-05-17 07:50:53 UTC
removing automation blocker, only 1 TC is blocked.

Comment 10 msheena 2022-05-18 09:58:08 UTC
@mperina we suspect that we are seeing this issue due to the monitoring mechanism (which may explain the delayed reporting of the vnic plugged attribute).
As of now we are investigating further, and looking for a deeper understanding of this issue.

Comment 11 Martin Perina 2022-05-18 11:05:18 UTC
OK, so please provide a correct reproducer, or close the bug if it cannot be reproduced reliably

Comment 13 Martin Perina 2022-05-18 11:15:05 UTC
OK, lowering the severity till investigation is finished

Comment 16 eraviv 2022-06-07 10:40:19 UTC
The most likely cause of the above symptoms is a race condition in the engine process between plug \ unplug commands because there is no lock preventing ActivateDeactivateVmNicCommand commands from running in parallel. Such a lock exists for working with other devices [1], but not with the vNICs. So our first attempt at solving the problem will be to apply the existing device lock to the vNICs as well.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/HotPlugDiskToVmCommand.java#L171
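
To illustrate the idea only (this is not the engine's Java code, and it does not reproduce the locking used in [1]): a minimal sketch of serializing plug \ unplug commands per VM with an exclusive lock. The class and function names are hypothetical.

```python
import threading
from collections import defaultdict


class VmDeviceLocks:
    """Hands out one exclusive lock per VM id (hypothetical helper)."""

    def __init__(self):
        self._guard = threading.Lock()          # protects the lock table itself
        self._locks = defaultdict(threading.Lock)

    def for_vm(self, vm_id):
        with self._guard:
            return self._locks[vm_id]


def run_nic_command(locks, vm_id, command):
    """Run a plug or unplug command while holding the VM's device lock.

    Commands for the same VM run one at a time; commands for
    different VMs are not blocked by each other.
    """
    with locks.for_vm(vm_id):
        return command()
```

Without such a lock, two commands for the same VM can interleave their read-modify-write of the device state, which would fit a vnic that is plugged on the host but still reported as unplugged.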

Comment 19 msheena 2022-07-27 07:13:32 UTC
Verified on ovirt-engine-4.5.2-0.3.el8ev.noarch

Comment 20 Sandro Bonazzola 2022-08-30 08:47:42 UTC
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

