Description of problem
======================
Given I have a vm with an unplugged vnic with port-mirroring
When I hot-plug the vnic via REST API
Then for at least 60 seconds while GET-ting the vnic 'plugged' state it is reported as 'false'

Version-Release number of selected component (if applicable)
============================================================
ovirt-engine-4.5.0.6-0.7.el8ev.noarch

How reproducible
================
Roughly 30% in automated test

Steps to Reproduce
==================
1. Create a VLAN tagged '5' network 'net'
2. Create a vnic-profile 'vnic-profile' pointing to 'net' with port-mirroring
3. Attach 'net' to an empty nic on a host
4. Start a vm on the host
5. Add 3 vnics to the vm (which already has 1 vnic pointing to ovirtmgmt) with all 3 vnics pointing to 'vnic-profile'
6. Unplug all 3 vnics
7. Hot-plug the 3rd vnic

Actual results
==============
RHV replies with OK to the hot-plug request, yet for 60 seconds (and possibly more) querying /ovirt-engine/api/vms/<VM_ID>/nics reports the nic with <plugged>false</plugged>

Expected results
================
RHV replies with OK and the nic is reported as <plugged>true</plugged> within 5 seconds

Additional info
===============
* The vm having 3 vnics pointing to 'net' is a result of our test suite creating these for the purpose of shared setup. While the specific test that caught this regression is executed, the first 2 vnics pointing to 'net' are unplugged.
* The expected result of the vnic being reported as plugged within 5 seconds is debatable. For reference, this test used to run on 4.4 with a waiting time of 10 seconds, and the wait would usually amount to 1-2 seconds or less.
* Both RHV and VDSM report the hot-plug as successful immediately, so there might be an issue with the changing of the 'plugged' attribute of the vnic in the RHV DB
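The check described above (GET the nics collection, look at the 'plugged' element, give up after a timeout) can be sketched roughly as follows. This is not the test suite's actual code. The helper names `nic_plugged` and `wait_until_plugged`, the nic name 'nic3', and the assumed response shape (a list of <nic> elements, each with <name> and <plugged> children) are illustrative assumptions. The HTTP call itself is left to a caller-supplied function so the sketch stays independent of any particular client library.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, Optional


def nic_plugged(nics_xml: str, nic_name: str) -> Optional[bool]:
    """Return the 'plugged' flag of the named nic, or None if the nic is absent.

    Assumes the body is XML containing <nic> elements that each have
    <name> and <plugged> children (an assumption about the response shape).
    """
    root = ET.fromstring(nics_xml)
    for nic in root.iter("nic"):
        if nic.findtext("name") == nic_name:
            return nic.findtext("plugged") == "true"
    return None


def wait_until_plugged(
    fetch_nics_xml: Callable[[], str],
    nic_name: str,
    timeout: float = 10.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until the nic is reported as plugged.

    Returns the elapsed time in seconds on success, or None if the nic
    was still not reported as plugged once the timeout expired.
    """
    start = clock()
    while True:
        if nic_plugged(fetch_nics_xml(), nic_name):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Here `fetch_nics_xml` would be whatever performs the authenticated GET on /ovirt-engine/api/vms/<VM_ID>/nics and returns the response body. Returning the elapsed time rather than a boolean makes it easy to log how long the engine took in passing runs, which helps tell a slow update from a missing one.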
(In reply to msheena from comment #0)
> Actual results
> ==============
> RHV replies with OK to the hot-plug request, yet for 60 seconds (and
> possibly more) when querying /ovirt-engine/api/vms/<VM_ID>/nics then the nic
> is reported with <plugged>false</plugged>

Is this really a bug, or does hotplug just take a long time? I don't think we wait for the hotplug action to finish, or do we? And even if we do, isn't the status changed only after another round of VM monitoring?
(In reply to Martin Perina from comment #2)
> Is this really a bug or just hotplug takes long time? I don't think that we
> are waiting for hotplug action to finish or do we? And even if we do isn't
> the status changed only after another round of VM monitoring?

Hi Martin, we suspect this is an engine bug. We started to see it on 4.5 only. Engine doesn't behave as expected and doesn't report the correct state for many minutes. Investigation from DEV is needed here. Thanks!
- "doesn't report the correct state for many minutes" - IIUC your test kills the VM after 1 minute. So for how many minutes have you seen this not reported?
- does it happen with a simple NIC plug? no mirroring, nothing fancy
- what is the OS in the VM? Does the VM actually appear in guest OS?
- in which build was it last succeeding? did it pass in any 4.5 test, or has it failed since the first one - and which one?
(In reply to Michal Skrivanek from comment #4)
> - "doesn't report the correct state for many minutes" - IIUC your test kills
> the VM after 1 minute. So how many minutes you've seen this not reported?

This will need to be tested, as the furthest we've gone is 1 minute (this test used to pass with a 10 second wait, where it usually took RHV 1-2 seconds or even less to change the plugged attribute).
I felt that RHV and VDSM reporting nothing wrong with the action, while the plugged attribute of the vnic remains false for an entire minute, is enough to open a bug.

> - does it happen with simple NIC plug? no mirroring, nothing fancy

No, this does not reproduce for a simple nic plug - it seems port-mirroring is the only flow for which it reproduces

> - what is the OS in the VM? Does the VM actually appear in guest OS?

RHEL 8.6

> - in which build was it last succeeding? did it pass in any 4.5 test or does
> it fail since the first one - which one?

We saw it fail like so on 4.4 builds: 4.4.8-3, 4.4.9-9
For 4.5 it seems to be reproducing at a significantly higher rate, though it is flaky since it passed on ovirt-engine-4.5.0.6-0.7.el8ev.noarch
(In reply to msheena from comment #6)
> (In reply to Michal Skrivanek from comment #4)
> > - "doesn't report the correct state for many minutes" - IIUC your test kills
> > the VM after 1 minute. So how many minutes you've seen this not reported?
> This will need to be tested as the furthest we've gone is 1 minute (this
> test used to pass with 10 seconds wait where usually it took RHV 1-2
> seconds or even less to change the plugged attribute)

Can you try to increase it then, to see if it's just slow?

> I felt like for RHV and VDSM to report nothing wrong in the action, whilst
> for an entire minute the plugged attribute for the vnic remains false, this
> is enough to open a bug.
> > - does it happen with simple NIC plug? no mirroring, nothing fancy
> No this does not reproduce for a simple nic plug - it seems port-mirroring
> is the only flow for which it reproduces

Interesting. Can you also add supervdsm.log?

> > - what is the OS in the VM? Does the VM actually appear in guest OS?
> RHEL 8.6

I meant "Does the NIC actually appear in guest OS", i.e. was it actually plugged?
If it was not, I guess you can try hotplug again?

> > - in which build was it last succeeding? did it pass in any 4.5 test or does
> > it fail since the first one - which one?
> We saw it fail like so on 4.4 builds: 4.4.8-3 , 4.4.9-9
> For 4.5 it seems to be reproducing in a significantly higher rate, though it
> is flaky since it passed on ovirt-engine-4.5.0.6-0.7.el8ev.noarch

Good to know it's happening on various RHELs.
(In reply to Michal Skrivanek from comment #7)
> (In reply to msheena from comment #6)
> > This will need to be tested as the furthest we've gone is 1 minute (this
> > test used to pass with 10 seconds wait where usually it took RHV 1-2
> > seconds or even less to change the plugged attribute)
>
> can you try to increase it then to see if it's just slow?

I was able to observe that after several minutes the 'plugged' attribute did change to true

> > No this does not reproduce for a simple nic plug - it seems port-mirroring
> > is the only flow for which it reproduces
>
> interesting. Can you also add supervdsm.log?

Will look for one that includes it from today's reproduction

> > > - what is the OS in the VM? Does the VM actually appear in guest OS?
> > RHEL 8.6
>
> I meant "Does the NIC actually appear in guest OS". I.e. was it actually
> plugged?
> if it was not, i guess you can try hotplug again?

When I reproduced it today I opened a console to the guest and saw that the vnic is there and is plugged. The UI was reporting it as unplugged.
Removing the automation blocker; only 1 TC is blocked.
@mperina we suspect that we are seeing this issue due to the monitoring mechanism (which may explain the delayed reporting of the vnic plugged attribute). As of now we are investigating further, and looking for a deeper understanding of this issue.
OK, so please provide a correct reproducer, or close the bug if it cannot be reproduced clearly
OK, lowering the severity till investigation is finished
The most likely cause of the above symptoms is a race condition in the engine process between plug/unplug commands, because there is no lock preventing ActivateDeactivateVmNicCommand commands from running in parallel. Such a lock exists for working with other devices [1], but not for vNICs. So our first attempt at solving the problem will be to apply the existing device lock to vNICs as well.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/HotPlugDiskToVmCommand.java#L171
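To illustrate the idea of serializing plug/unplug commands per VM, here is a toy sketch. It is not the engine's Java code and does not reflect how ActivateDeactivateVmNicCommand or the existing device lock are actually implemented. The classes `VmDeviceLocks` and `ToyEngine`, and the short sleep standing in for the hypervisor call, are hypothetical and only show the locking pattern.

```python
import threading
import time
from collections import defaultdict


class VmDeviceLocks:
    """Hands out one lock per VM id (hypothetical stand-in for a device lock)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def for_vm(self, vm_id):
        # The guard protects the dictionary itself while a lock is looked up
        # or created for a VM seen for the first time.
        with self._guard:
            return self._locks[vm_id]


class ToyEngine:
    """Records the 'plugged' state of nics; plug/unplug is serialized per VM."""

    def __init__(self):
        self.locks = VmDeviceLocks()
        self.plugged = {}  # (vm_id, nic_name) -> bool
        self.log = []      # (nic_name, "start" | "end") in execution order

    def set_nic_plugged(self, vm_id, nic_name, plugged):
        # Only one plug/unplug command per VM runs inside this block at a time.
        with self.locks.for_vm(vm_id):
            self.log.append((nic_name, "start"))
            time.sleep(0.001)  # stands in for the call to the host
            self.plugged[(vm_id, nic_name)] = plugged
            self.log.append((nic_name, "end"))
```

With the per-VM lock held for the whole command, every "start" entry in the log is immediately followed by the matching "end" entry, even when several commands for the same VM are issued from different threads. Commands for different VMs still run in parallel because each VM has its own lock.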
Verified on ovirt-engine-4.5.2-0.3.el8ev.noarch
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.