Bug 1717390 - [REST] VM interface hot-unplug right after VM boot up fails over missing vnic alias name
Summary: [REST] VM interface hot-unplug right after VM boot up fails over missing vnic...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.3.4.2
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ovirt-4.4.0
: ---
Assignee: eraviv
QA Contact: msheena
URL:
Whiteboard:
Depends On:
Blocks: 1766666
TreeView+ depends on / blocked
 
Reported: 2019-06-05 11:53 UTC by msheena
Modified: 2020-06-22 19:59 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if you hot-unplugged a virtual machine interface shortly after booting the virtual machine, the unplugging action failed with an error. When this happened, it was because VM monitoring did not report the alias of the interface soon enough; and VDSM could not identify the vNIC to unplug. The current release fixes this issue: If if the alias is missing during hot unplug, the Engine generates one on the fly.
Clone Of:
: 1766666 (view as bug list)
Environment:
Last Closed: 2020-05-20 20:03:27 UTC
oVirt Team: Network
Embargoed:
pm-rhel: ovirt-4.4+
pm-rhel: blocker?
pelauter: planning_ack+
dholler: devel_ack+
mburman: testing_ack+


Attachments (Terms of Use)
ENGINE and VDSM logs (1.53 MB, application/zip)
2019-06-05 11:53 UTC, msheena
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 100630 0 master ABANDONED core: Validate alias in ActivateDeactivateVmNicCommand 2020-11-02 13:58:28 UTC
oVirt gerrit 103553 0 'None' MERGED core: generate vnic alias if missing on hot unplug 2020-11-02 13:58:12 UTC
oVirt gerrit 103560 0 'None' MERGED core: generate vnic alias if missing on hot unplug 2020-11-02 13:58:28 UTC

Description msheena 2019-06-05 11:53:54 UTC
Created attachment 1577540 [details]
ENGINE and VDSM logs

Description of problem:
------------------------
Given I have a VM which is 'down' with a plugged virtIO interface bridged to the cluster management network
When I start the VM via REST and perform a hot-unplug of the interface
Then the unplug fails when VDSM throws the following Exception:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 417, in hotunplugNic
    return self.vm.hotunplugNic(params)
  File "<string>", line 2, in hotunplugNic
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 101, in method
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 3325, in hotunplugNic
    nicParams = {'macAddr': nic.macAddr}
AttributeError: macAddr


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV 4.3.4.2-0.1.el7
vdsm-4.30.17-1.el7ev.x86_64

How reproducible:
-----------------
Only in automation testing

Steps to Reproduce (in automation test):
-----------------------------------------
1. Provision a VM with a virtIO interface
2. Start the VM
3. HotUnplug the interface

Actual results:
----------------
HotUnplug fails

Expected results:
------------------
HotUnplug should succeed

Additional info:
----------------
I believe there are problems here with Engine and VDSM both.
On Engine side I observed in the logs that the XML sent to VDSM is the following:
<?xml version="1.0" encoding="UTF-8"?><hotunplug>
  <devices>
    <interface>
      <alias name=""/>
    </interface>
  </devices>
</hotunplug>

Clearly there is something wrong here since the alias name is empty.

On VDSM side looking at lib/vdsm/virt/vm.py 3005:9 it seems that VDSM is looking to receive the 'nic' object from the XML sent by engine, and does not defend itself (output a more concise error) in case there is no value as 'macAddr' (as is happening here) or possibly the 'nic' object is empty

Comment 1 Michal Skrivanek 2019-06-06 07:08:04 UTC
I see the unplug request is

  <devices>
    <interface>
      <alias name=""/>
    </interface>
  </devices>
</hotunplug>

which is certainly wrong. Reassigning to network team

Comment 2 msheena 2019-06-06 10:00:07 UTC
Adding some more details:
-------------------------

In the test the exact flow is:
==============================
1. Create a MAC pool with range: 00:00:00:10:10:10 - 00:00:00:10:10:11 (exactly 2 MAC addresses).
2. Update host cluster to use the newly provisioned MAC pool.
3. Add an interface to the guest with ovirtmgmt network and MAC address 00:00:00:10:10:10
4. Create a snapshot of the guest and wait for it to be in status 'OK'.
5. Power up the guest and wait for it to be in status 'up'

Now we send a request to engine to unplug the guest's interface as such:
PUT /ovirt-engine/api/vms/<ID>/nics/<ID>
<nic>
    <mac>
        <address>00:00:00:10:10:13</address>  # intentional out-of-scope MAC
    </mac>
    <plugged>false</plugged>
</nic>

And that's when we come across the error: (and all of the above in the bug description)
Status: 400
Reason: Bad Request
Detail: [General Exception]

Comment 3 Dominik Holler 2019-06-06 10:11:57 UTC
(In reply to msheena from comment #2)
> Adding some more details:
> -------------------------
> 
> In the test the exact flow is:
> ==============================
> 1. Create a MAC pool with range: 00:00:00:10:10:10 - 00:00:00:10:10:11
> (exactly 2 MAC addresses).
> 2. Update host cluster to use the newly provisioned MAC pool.
> 3. Add an interface to the guest with ovirtmgmt network and MAC address
> 00:00:00:10:10:10
> 4. Create a snapshot of the guest and wait for it to be in status 'OK'.
> 5. Power up the guest and wait for it to be in status 'up'
> 
> Now we send a request to engine to unplug the guest's interface as such:
> PUT /ovirt-engine/api/vms/<ID>/nics/<ID>
> <nic>
>     <mac>
>         <address>00:00:00:10:10:13</address>  # intentional out-of-scope MAC
>     </mac>
>     <plugged>false</plugged>
> </nic>
> 
> And that's when we come across the error: (and all of the above in the bug
> description)
> Status: 400
> Reason: Bad Request
> Detail: [General Exception]

Does this flow triggers the VDSM problem "AttributeError: macAddr", or is this flow handled inside Engine?

Comment 5 msheena 2019-06-06 11:03:28 UTC
(In reply to Dominik Holler from comment #3)
> (In reply to msheena from comment #2)
> > Adding some more details:
> > -------------------------
> > 
> > In the test the exact flow is:
> > ==============================
> > 1. Create a MAC pool with range: 00:00:00:10:10:10 - 00:00:00:10:10:11
> > (exactly 2 MAC addresses).
> > 2. Update host cluster to use the newly provisioned MAC pool.
> > 3. Add an interface to the guest with ovirtmgmt network and MAC address
> > 00:00:00:10:10:10
> > 4. Create a snapshot of the guest and wait for it to be in status 'OK'.
> > 5. Power up the guest and wait for it to be in status 'up'
> > 
> > Now we send a request to engine to unplug the guest's interface as such:
> > PUT /ovirt-engine/api/vms/<ID>/nics/<ID>
> > <nic>
> >     <mac>
> >         <address>00:00:00:10:10:13</address>  # intentional out-of-scope MAC
> >     </mac>
> >     <plugged>false</plugged>
> > </nic>
> > 
> > And that's when we come across the error: (and all of the above in the bug
> > description)
> > Status: 400
> > Reason: Bad Request
> > Detail: [General Exception]
> 
> Does this flow triggers the VDSM problem "AttributeError: macAddr", or is
> this flow handled inside Engine?

This is the flow of the test that in my opinion causes engine to send a missing xml.
About VDSM this is due to what I wrote in the description where VDSM code simply does not handle empty objects, and throws this error instead of throwing something in the like of: "Hotunplug NIC failed - NIC not found"

Comment 6 Dominik Holler 2019-06-07 08:25:51 UTC
(In reply to msheena from comment #4)
> Created attachment 1577845 [details]
> engine log in debug level

Looks like the flow in this logfile is
1. libvirtXml, including the alias for the VM interface, is generated for the first VM and send to the host
2. the first VM is started
3. another VM is started the same way
4. VM monitoring fetches data of the second VM
5. XML for hotunplug on the first VM is generated without alias, because the alias is not yet in the db, because VMMonitoring is not yet triggered for the first VM.

From my understanding, the events from the host ("processing event for host") are missing in the flow, which should trigger the VM monitoring for the first VM.

Michal, what is your view on this logfile?

Comment 7 Michal Skrivanek 2019-06-07 08:36:46 UTC
Seems 5 is wrong, the alias is defined at the VM start by LibvirtVmXmlBuilder::writeAlias() for each device (since 4.2). If the command would use the right alias then the operation should work even early before monitoring reports anything.
Additionally, blocking hotplug/unplug early might make sense too. It's usually cooperative, and you may want to avoid it during bios or OS boot. In other places we have a crude but simple check on UP VM state.

Comment 8 Dominik Holler 2019-06-07 08:46:43 UTC
(In reply to Michal Skrivanek from comment #7)
> Seems 5 is wrong, the alias is defined at the VM start by
> LibvirtVmXmlBuilder::writeAlias() for each device (since 4.2).
>  If the
> command would use the right alias then the operation should work even early
> before monitoring reports anything.

Ack, this would work but might hide the problem that VMMonitoring is not working as expected.

> Additionally, blocking hotplug/unplug early might make sense too. It's
> usually cooperative, and you may want to avoid it during bios or OS boot. In
> other places we have a crude but simple check on UP VM state.

In this case, the VM is in state 'up', see line 18517.

Comment 9 Michal Skrivanek 2019-06-07 09:44:05 UTC
the only interface in VM has alias <alias name='ua-838f52d8-6806-4189-8554-9e7635edd383'/> which means the device id is 838f52d8-6806-4189-8554-9e7635edd383. Later it is deviceId='b6ac415b-49f9-4ec0-bb6d-1304e644c3bf' which is trying to be unplugged. That's a different device then

Comment 10 Dominik Holler 2019-06-07 09:50:54 UTC
(In reply to Michal Skrivanek from comment #9)
> the only interface in VM

There seems to be two VMs.

>  has alias <alias
> name='ua-838f52d8-6806-4189-8554-9e7635edd383'/> which means the device id
> is 838f52d8-6806-4189-8554-9e7635edd383. 

This is in the second VM mac_pool_vm_1.

> Later it is
> deviceId='b6ac415b-49f9-4ec0-bb6d-1304e644c3bf' which is trying to be
> unplugged. That's a different device then

This is in the first VM mac_pool_vm_0.

Comment 11 Michal Skrivanek 2019-06-07 10:04:06 UTC
oh, I'm sorry! Yeah, that's correct and the command matches for mac_pool_vm_0.

Still, doesn't change a thing that the problem is in ActivateDeactivateVmNicCommand most likely...you just shouldn't send an invalid(empty) device to unplug

Comment 12 Dominik Holler 2019-06-07 10:11:36 UTC
(In reply to Michal Skrivanek from comment #11)
> oh, I'm sorry! Yeah, that's correct and the command matches for
> mac_pool_vm_0.
> 
> Still, doesn't change a thing that the problem is in
> ActivateDeactivateVmNicCommand most likely...you just shouldn't send an
> invalid(empty) device to unplug

Ack, I will provide a code change which extends the validation of
ActivateDeactivateVmNicCommand. We can discuss on the code change about
the error message.

But never the less the reported flow should work, and the validation
rule will change only the error message.

Comment 13 RHEL Program Management 2019-06-10 10:50:16 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 15 Dominik Holler 2019-08-06 13:37:21 UTC
Moshe, can you please try to have a sleep before unplugging the vNIC as a workaround?

Comment 17 msheena 2019-08-20 11:12:57 UTC
Hi Dominik,
Unfortunately waiting will not solve the issue.
Tried waiting for 60, 90 and 120 seconds without any success.

Comment 18 eraviv 2019-09-09 09:58:44 UTC
Could not reproduce manually on master in following cases:

A. flow as in description of this bug:
"Steps to Reproduce (in automation test):
-----------------------------------------
1. Provision a VM with a virtIO interface
2. Start the VM
3. HotUnplug the interface"

tried without mac address, with mac address in range, with mac address out of range


B. flow as in comment #1

1. Create a MAC pool with range: 00:00:00:10:10:10 - 00:00:00:10:10:11 (exactly 2 MAC addresses).
2. Update host cluster to use the newly provisioned MAC pool.
3. Add an interface to the guest with ovirtmgmt network and MAC address 00:00:00:10:10:10
4. Create a snapshot of the guest and wait for it to be in status 'OK'.
5. Power up the guest and wait for it to be in status 'up'
Now we send a request to engine to unplug the guest's interface...

tried with mac address out of range, in range, no mac address

Comment 26 Dominik Holler 2019-10-29 16:00:53 UTC
*** Bug 1766666 has been marked as a duplicate of this bug. ***

Comment 27 msheena 2020-03-22 10:59:40 UTC
Verified on
===========
ovirt-engine-4.4.0-0.25.master.el8ev.noarch
vdsm-4.40.5-1.el8ev.x86_64

Comment 28 Sandro Bonazzola 2020-05-20 20:03:27 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 29 Rolfe Dlugy-Hegwer 2020-06-22 19:51:12 UTC
Notes: Previously, if you hot-unplugged a virtual machine interface shortly after booting the virtual machine, the unplugging action failed with an error. 

Cause: happened because the vNIC alias was not set for hot unplug to accommodate old VMs which have been running in a user's environment with a libvirt alias. But for new VMs, the vNIC alias name is generated by the engine, sent to the host, reported back by VM monitoring, and only then set in the DB. So in some cases, the alias was not reported by host monitoring in time for the hot unplug to have it on hand, which caused an empty alias to be sent. 

Result: The missing alias name prevented VDSM from identifying the vNIC that required hot unplug, thereby failing the execution thereof.

Fix: If no alias name exists in the Engine during hot unplug, the Engine generates it on the fly.


Note You need to log in before you can comment on or make changes to this bug.