Bug 1766666

Summary: [z-stream clone - 4.3.7] [REST] VM interface hot-unplug right after VM boot up fails over missing vnic alias name
Product: Red Hat Enterprise Virtualization Manager
Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: ovirt-engine
Assignee: eraviv
Status: CLOSED ERRATA
QA Contact: msheena
Severity: high
Priority: high
Docs Contact:
Version: unspecified
CC: bugs, dholler, eraviv, mburman, michal.skrivanek, nhalevy, pelauter, rbarry, Rhev-m-bugs
Target Milestone: ovirt-4.3.7
Target Release: 4.3.7
Keywords: Automation, Regression, ZStream
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ovirt-engine-4.3.7.2
Doc Type: Bug Fix
Doc Text:
A missing alias name prevented the Virtual Desktop Server Manager from identifying the VNIC which required a hot unplug. As a result, the hot unplug failed. In this release, if an alias name is not defined in the RHV Manager, it will be generated on the fly, and the hot unplug will succeed.
Story Points: ---
Clone Of: 1717390
Environment:
Last Closed: 2019-12-12 10:36:35 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1717390
Bug Blocks:

Description RHV bug bot 2019-10-29 15:46:05 UTC
+++ This bug is an upstream to downstream clone. The original bug is: +++
+++   bug 1717390 +++
======================================================================

Created attachment 1577540 [details]
ENGINE and VDSM logs

Description of problem:
------------------------
Given I have a VM which is 'down' with a plugged virtIO interface bridged to the cluster management network
When I start the VM via REST and perform a hot-unplug of the interface
Then the unplug fails when VDSM throws the following Exception:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 417, in hotunplugNic
    return self.vm.hotunplugNic(params)
  File "<string>", line 2, in hotunplugNic
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 101, in method
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 3325, in hotunplugNic
    nicParams = {'macAddr': nic.macAddr}
AttributeError: macAddr


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV 4.3.4.2-0.1.el7
vdsm-4.30.17-1.el7ev.x86_64

How reproducible:
-----------------
Only in automation testing

Steps to Reproduce (in automation test):
-----------------------------------------
1. Provision a VM with a virtIO interface
2. Start the VM
3. HotUnplug the interface

Actual results:
----------------
HotUnplug fails

Expected results:
------------------
HotUnplug should succeed

Additional info:
----------------
I believe there are problems here with both Engine and VDSM.
On the Engine side, I observed in the logs that the XML sent to VDSM is the following:
<?xml version="1.0" encoding="UTF-8"?>
<hotunplug>
  <devices>
    <interface>
      <alias name=""/>
    </interface>
  </devices>
</hotunplug>

Clearly there is something wrong here since the alias name is empty.

On the VDSM side, looking at lib/vdsm/virt/vm.py 3005:9, it seems that VDSM expects to receive the 'nic' object from the XML sent by Engine, and does not defend itself (by emitting a more concise error) when there is no 'macAddr' value (as happens here) or when the 'nic' object is empty.
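The defensive check suggested above could look roughly like the sketch below. This is an editor's illustration, not the actual VDSM code: the helper name `find_nic` is hypothetical, and only the `macAddr` attribute name is taken from the traceback.

```python
def find_nic(devices, mac_addr):
    """Return the NIC device matching mac_addr, or fail with a clear error.

    'devices' is assumed to be a list of device objects that may or may
    not carry a 'macAddr' attribute; all names here are illustrative.
    """
    for dev in devices:
        # getattr with a default avoids the bare AttributeError seen
        # in the traceback when a device has no 'macAddr' attribute.
        if getattr(dev, 'macAddr', None) == mac_addr:
            return dev
    raise LookupError(
        'Hot-unplug NIC failed - NIC with MAC %s not found' % mac_addr)
```

A lookup like this would turn the opaque "AttributeError: macAddr" into the kind of "NIC not found" message the reporter asks for.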

(Originally by Moshe Sheena)

Comment 1 RHV bug bot 2019-10-29 15:46:09 UTC
I see the unplug request is

<hotunplug>
  <devices>
    <interface>
      <alias name=""/>
    </interface>
  </devices>
</hotunplug>

which is certainly wrong. Reassigning to the network team.

(Originally by michal.skrivanek)

Comment 4 RHV bug bot 2019-10-29 15:46:16 UTC
Adding some more details:
-------------------------

In the test the exact flow is:
==============================
1. Create a MAC pool with range: 00:00:00:10:10:10 - 00:00:00:10:10:11 (exactly 2 MAC addresses).
2. Update host cluster to use the newly provisioned MAC pool.
3. Add an interface to the guest with ovirtmgmt network and MAC address 00:00:00:10:10:10
4. Create a snapshot of the guest and wait for it to be in status 'OK'.
5. Power up the guest and wait for it to be in status 'up'

Now we send a request to Engine to unplug the guest's interface as follows:
PUT /ovirt-engine/api/vms/<ID>/nics/<ID>
<nic>
    <mac>
        <address>00:00:00:10:10:13</address>  # intentional out-of-scope MAC
    </mac>
    <plugged>false</plugged>
</nic>

And that's when we hit the error (in addition to everything described above in the bug description):
Status: 400
Reason: Bad Request
Detail: [General Exception]
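The request body from the steps above can be built programmatically. This is a minimal sketch using only the Python standard library; the helper name `build_unplug_body` is ours and is not part of any oVirt SDK.

```python
import xml.etree.ElementTree as ET

def build_unplug_body(mac_address):
    """Build the <nic> body for PUT /ovirt-engine/api/vms/<ID>/nics/<ID>,
    mirroring the request shown in the reproduction flow above."""
    nic = ET.Element('nic')
    mac = ET.SubElement(nic, 'mac')
    ET.SubElement(mac, 'address').text = mac_address
    # plugged=false asks Engine to hot-unplug the vNIC
    ET.SubElement(nic, 'plugged').text = 'false'
    return ET.tostring(nic, encoding='unicode')

body = build_unplug_body('00:00:00:10:10:13')  # intentional out-of-scope MAC
```

Sending `body` with an authenticated PUT to the NIC's URL reproduces the request from the test flow.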

(Originally by Moshe Sheena)

Comment 5 RHV bug bot 2019-10-29 15:46:19 UTC
(In reply to msheena from comment #2)
> Adding some more details:
> -------------------------
> 
> In the test the exact flow is:
> ==============================
> 1. Create a MAC pool with range: 00:00:00:10:10:10 - 00:00:00:10:10:11
> (exactly 2 MAC addresses).
> 2. Update host cluster to use the newly provisioned MAC pool.
> 3. Add an interface to the guest with ovirtmgmt network and MAC address
> 00:00:00:10:10:10
> 4. Create a snapshot of the guest and wait for it to be in status 'OK'.
> 5. Power up the guest and wait for it to be in status 'up'
> 
> Now we send a request to engine to unplug the guest's interface as such:
> PUT /ovirt-engine/api/vms/<ID>/nics/<ID>
> <nic>
>     <mac>
>         <address>00:00:00:10:10:13</address>  # intentional out-of-scope MAC
>     </mac>
>     <plugged>false</plugged>
> </nic>
> 
> And that's when we come across the error: (and all of the above in the bug
> description)
> Status: 400
> Reason: Bad Request
> Detail: [General Exception]

Does this flow trigger the VDSM problem "AttributeError: macAddr", or is this flow handled inside Engine?

(Originally by Dominik Holler)

Comment 7 RHV bug bot 2019-10-29 15:46:25 UTC
(In reply to Dominik Holler from comment #3)
> (In reply to msheena from comment #2)
> > Adding some more details:
> > -------------------------
> > 
> > In the test the exact flow is:
> > ==============================
> > 1. Create a MAC pool with range: 00:00:00:10:10:10 - 00:00:00:10:10:11
> > (exactly 2 MAC addresses).
> > 2. Update host cluster to use the newly provisioned MAC pool.
> > 3. Add an interface to the guest with ovirtmgmt network and MAC address
> > 00:00:00:10:10:10
> > 4. Create a snapshot of the guest and wait for it to be in status 'OK'.
> > 5. Power up the guest and wait for it to be in status 'up'
> > 
> > Now we send a request to engine to unplug the guest's interface as such:
> > PUT /ovirt-engine/api/vms/<ID>/nics/<ID>
> > <nic>
> >     <mac>
> >         <address>00:00:00:10:10:13</address>  # intentional out-of-scope MAC
> >     </mac>
> >     <plugged>false</plugged>
> > </nic>
> > 
> > And that's when we come across the error: (and all of the above in the bug
> > description)
> > Status: 400
> > Reason: Bad Request
> > Detail: [General Exception]
> 
> Does this flow trigger the VDSM problem "AttributeError: macAddr", or is
> this flow handled inside Engine?

This is the flow of the test that, in my opinion, causes Engine to send XML with a missing alias.
As for VDSM, this is due to what I wrote in the description: the VDSM code simply does not handle empty objects, and throws this error instead of something like "Hot-unplug NIC failed - NIC not found".

(Originally by Moshe Sheena)

Comment 8 RHV bug bot 2019-10-29 15:46:28 UTC
(In reply to msheena from comment #4)
> Created attachment 1577845 [details]
> engine log in debug level

Looks like the flow in this logfile is:
1. libvirt XML, including the alias for the VM interface, is generated for the first VM and sent to the host
2. the first VM is started
3. another VM is started the same way
4. VM monitoring fetches data for the second VM
5. XML for the hot unplug on the first VM is generated without an alias, because the alias is not yet in the DB, because VM monitoring has not yet been triggered for the first VM.

From my understanding, the events from the host ("processing event for host") that should trigger VM monitoring for the first VM are missing from the flow.

Michal, what is your view on this logfile?

(Originally by Dominik Holler)

Comment 9 RHV bug bot 2019-10-29 15:46:30 UTC
Seems step 5 is wrong; the alias is defined at VM start by LibvirtVmXmlBuilder::writeAlias() for each device (since 4.2). If the command used the right alias, the operation should work even early, before monitoring reports anything.
Additionally, blocking hot plug/unplug early might make sense too. It's usually cooperative, and you may want to avoid it during BIOS or OS boot. In other places we have a crude but simple check that the VM state is 'up'.

(Originally by michal.skrivanek)

Comment 10 RHV bug bot 2019-10-29 15:46:33 UTC
(In reply to Michal Skrivanek from comment #7)
> Seems 5 is wrong, the alias is defined at the VM start by
> LibvirtVmXmlBuilder::writeAlias() for each device (since 4.2).
>  If the
> command would use the right alias then the operation should work even early
> before monitoring reports anything.

Ack, this would work but might hide the problem that VMMonitoring is not working as expected.

> Additionally, blocking hotplug/unplug early might make sense too. It's
> usually cooperative, and you may want to avoid it during bios or OS boot. In
> other places we have a crude but simple check on UP VM state.

In this case, the VM is in state 'up', see line 18517.

(Originally by Dominik Holler)

Comment 11 RHV bug bot 2019-10-29 15:46:36 UTC
The only interface in the VM has alias <alias name='ua-838f52d8-6806-4189-8554-9e7635edd383'/>, which means the device id is 838f52d8-6806-4189-8554-9e7635edd383. Later, it is deviceId='b6ac415b-49f9-4ec0-bb6d-1304e644c3bf' that is being unplugged. That's a different device, then.
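Based on the 'ua-' alias pattern quoted above (an assumption drawn from this log snippet, not from documented API; the helper names are ours), the mapping between a user alias and the Engine device id can be sketched as:

```python
UA_PREFIX = 'ua-'

def device_id_from_alias(alias_name):
    """Extract the device id from a user alias such as
    'ua-838f52d8-6806-4189-8554-9e7635edd383'.
    Returns None if the alias does not follow the observed convention."""
    if alias_name.startswith(UA_PREFIX):
        return alias_name[len(UA_PREFIX):]
    return None

def alias_from_device_id(device_id):
    """Inverse mapping: how an alias could be regenerated on the fly
    from the device id, as the eventual fix describes."""
    return UA_PREFIX + device_id
```

This regeneration is what lets an unplug succeed even when the alias has not yet been persisted in the Engine database.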

(Originally by michal.skrivanek)

Comment 12 RHV bug bot 2019-10-29 15:46:39 UTC
(In reply to Michal Skrivanek from comment #9)
> the only interface in VM

There seems to be two VMs.

>  has alias <alias
> name='ua-838f52d8-6806-4189-8554-9e7635edd383'/> which means the device id
> is 838f52d8-6806-4189-8554-9e7635edd383. 

This is in the second VM mac_pool_vm_1.

> Later it is
> deviceId='b6ac415b-49f9-4ec0-bb6d-1304e644c3bf' which is trying to be
> unplugged. That's a different device then

This is in the first VM mac_pool_vm_0.

(Originally by Dominik Holler)

Comment 13 RHV bug bot 2019-10-29 15:46:42 UTC
Oh, I'm sorry! Yeah, that's correct, and the command matches for mac_pool_vm_0.

Still, that doesn't change the fact that the problem is most likely in ActivateDeactivateVmNicCommand... you just shouldn't send an invalid (empty) device to unplug.

(Originally by michal.skrivanek)

Comment 14 RHV bug bot 2019-10-29 15:46:45 UTC
(In reply to Michal Skrivanek from comment #11)
> oh, I'm sorry! Yeah, that's correct and the command matches for
> mac_pool_vm_0.
> 
> Still, doesn't change a thing that the problem is in
> ActivateDeactivateVmNicCommand most likely...you just shouldn't send an
> invalid(empty) device to unplug

Ack, I will provide a code change which extends the validation of
ActivateDeactivateVmNicCommand. We can discuss on the code change about
the error message.

But nevertheless, the reported flow should work, and the validation rule will only change the error message.

(Originally by Dominik Holler)

Comment 15 RHV bug bot 2019-10-29 15:46:48 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

(Originally by pm-rhel)

Comment 17 RHV bug bot 2019-10-29 15:46:52 UTC
Moshe, can you please try adding a sleep before unplugging the vNIC, as a workaround?

(Originally by Dominik Holler)

Comment 19 RHV bug bot 2019-10-29 15:46:56 UTC
Hi Dominik,
Unfortunately, waiting does not solve the issue.
Tried waiting 60, 90, and 120 seconds without success.

(Originally by Moshe Sheena)

Comment 20 RHV bug bot 2019-10-29 15:47:00 UTC
Could not reproduce manually on master in the following cases:

A. flow as in description of this bug:
"Steps to Reproduce (in automation test):
-----------------------------------------
1. Provision a VM with a virtIO interface
2. Start the VM
3. HotUnplug the interface"

Tried without a MAC address, with a MAC address in range, and with a MAC address out of range.


B. flow as in comment #1

1. Create a MAC pool with range: 00:00:00:10:10:10 - 00:00:00:10:10:11 (exactly 2 MAC addresses).
2. Update host cluster to use the newly provisioned MAC pool.
3. Add an interface to the guest with ovirtmgmt network and MAC address 00:00:00:10:10:10
4. Create a snapshot of the guest and wait for it to be in status 'OK'.
5. Power up the guest and wait for it to be in status 'up'
Now we send a request to engine to unplug the guest's interface...

Tried with a MAC address out of range, with one in range, and with no MAC address.

(Originally by Eitan Raviv)

Comment 27 Dominik Holler 2019-10-29 16:00:53 UTC
This bug should have been the z-stream clone instead of the downstream clone.

*** This bug has been marked as a duplicate of bug 1717390 ***

Comment 28 Sandro Bonazzola 2019-10-30 07:48:57 UTC
This is the z-stream for 4.3.7.

Comment 30 msheena 2019-11-25 10:36:11 UTC
Verified with
=============
ovirt-engine-4.3.7.2-0.1.el7.noarch

Comment 31 Nadav Halevy 2019-12-10 12:40:34 UTC
Hi Eitan, can you please review this doc text as soon as possible as we need it for errata approval today?

A missing alias name prevented the Virtual Desktop Server Manager from identifying the VNIC which required a hot unplug.
As a result, the hot unplug failed.
In this release, if an alias name is not defined in the RHV Manager, it will be generated on the fly, and the hot unplug will succeed.

Comment 32 eraviv 2019-12-10 13:04:29 UTC
ack

Comment 34 errata-xmlrpc 2019-12-12 10:36:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4229