Bug 1405761

Summary: MAC addresses are not freed when a storage domain is detached from dc
Product: [oVirt] ovirt-engine
Reporter: Michael Burman <mburman>
Component: BLL.Storage
Assignee: Maor <mlipchuk>
Status: CLOSED CURRENTRELEASE
QA Contact: Michael Burman <mburman>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.1.0
CC: bugs, danken, mburman, mlipchuk, ratamir, yzaspits
Target Milestone: ovirt-4.1.0-beta
Keywords: Regression
Target Release: 4.1.0
Flags: rule-engine: ovirt-4.1+, rule-engine: blocker+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: http://resources.ovirt.org/repos/ovirt/experimental/4.1/latest.tested/rpm/el7/noarch/ovirt-engine-4.1.0-0.4.master.20170101173945.git3d2abd2.el7.centos.noarch.rpm
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-01 14:58:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1475272
Attachments:
  engine log (flags: none)
  new engine log_ (flags: none)
  mac leakage (flags: none)

Description Michael Burman 2016-12-18 07:58:44 UTC
Created attachment 1233037 [details]
engine log

Description of problem:
Failed to import a VM because the engine complains that the MAC address is already in use, while no MACs are actually in use.

It's not possible to import a VM from a data domain, because the engine thinks the VM's MAC address is in use on the destination cluster, while there are no VMs there at all.


2016-12-18 09:56:48,979+02 WARN  [org.ovirt.engine.core.bll.exportimport.ImportVmFromConfigurationCommand] (default task-13) [5c48763e-973e-4a1f-b8ed-bb6f3d4ffafc] Validation of action 'ImportVmFromConfiguration' failed for user admin@internal-authz. Reasons: VAR__ACTION__IMPORT,VAR__TYPE__VM,NETWORK_MAC_ADDRESS_IN_USE_DETAILED,$NETWORK_MAC_ADDRESS_IN_USE_DETAILED_LIST     00:00:00:00:00:21,$NETWORK_MAC_ADDRESS_IN_USE_DETAILED_LIST_COUNTER 1
2016-12-18 09:56:48,981+02 INFO  [org.ovirt.engine.core.bll.exportimport.ImportVmFromConfigurationCommand] (default task-13) [5c48763e-973e-4a1f-b8ed-bb6f3d4ffafc] Lock freed to object 'EngineLock:{exclusiveLocks='[a8f359d8-5cc8-40df-bfbc-886d7c453eea=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_IMPORTED$VmName t2>, t2=<VM_NAME, ACTION_TYPE_FAILED_NAME_ALREADY_USED>]', sharedLocks='[a8f359d8-5cc8-40df-bfbc-886d7c453eea=<REMOTE_VM, ACTION_TYPE_FAILED_VM_IS_BEING_IMPORTED$VmName t2>]'}'


Maybe related to the latest changes in the import VM feature...

Version-Release number of selected component (if applicable):
4.1.0-0.2.master.20161216212250.gitc040969.el7.centos

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a disk and a network interface
2. Set the storage domain to maintenance mode, detach it from the DC and remove it (the VM should be removed as well)
3. Import the data domain back to the DC and try to import the VM back (a REST sketch of this step follows below)
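
For reference, a minimal sketch of step 3 driven through the REST API instead of the UI (plain Java 11 HttpClient, no SDK). The endpoint paths, the unregistered query parameter and the register action body are written from memory of the v4 API and should be checked against the API documentation of the installed version; the engine URL, credentials and UUIDs are placeholders, and the engine CA is assumed to be trusted by the JVM.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch: re-import (register) a VM that lives on a re-attached data domain.
// Paths and bodies follow the oVirt v4 REST API as I recall it -- verify before use.
public class RegisterVmSketch {
    public static void main(String[] args) throws Exception {
        String engine = "https://engine.example.com/ovirt-engine/api";   // placeholder
        String auth = "Basic " + Base64.getEncoder()
                .encodeToString("admin@internal:password".getBytes());   // placeholder credentials
        String sdId = "SD-UUID";            // storage domain UUID (placeholder)
        String vmId = "VM-UUID";            // unregistered VM UUID (placeholder)
        String clusterId = "CLUSTER-UUID";  // destination cluster UUID (placeholder)

        HttpClient http = HttpClient.newHttpClient();

        // 1. List the VMs that exist only on the storage domain (not yet registered in the engine).
        HttpRequest list = HttpRequest.newBuilder()
                .uri(URI.create(engine + "/storagedomains/" + sdId + "/vms?unregistered=true"))
                .header("Authorization", auth)
                .header("Accept", "application/xml")
                .GET()
                .build();
        System.out.println(http.send(list, HttpResponse.BodyHandlers.ofString()).body());

        // 2. Register one of them into the destination cluster -- this is the call that fails
        //    with NETWORK_MAC_ADDRESS_IN_USE_DETAILED when the MAC was never freed.
        String body = "<action><cluster id=\"" + clusterId + "\"/></action>";
        HttpRequest register = HttpRequest.newBuilder()
                .uri(URI.create(engine + "/storagedomains/" + sdId + "/vms/" + vmId + "/register"))
                .header("Authorization", auth)
                .header("Content-Type", "application/xml")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        System.out.println(http.send(register, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}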

Actual results:
Failed with Error while executing action: 

t2:
MAC Address is already in use: 00:00:00:00:00:21.

The engine for some reason complains that the MAC is in use, but there are no VMs in the destination cluster.

Expected results:
The VM should be imported successfully.

Comment 1 Yevgeny Zaspitsky 2016-12-18 13:10:51 UTC
Looks like detaching/removing a storage domain doesn't remove the MACs of the VMs (whose disks are on the storage domain and are removed together with it) from the MAC pool.

@see StorageHandlingCommandBase.removeEntitiesFromStorageDomain
that uses 
@see Remove_Entities_From_storage_domain @ storages_sp.sql (stored procedure)

IMHO the bug exists since 3.6 and could be reproduced by the following scenario (in 3.6 & 4.0):
* create a MAC pool of n MACs.
* create a VM with n MACs and a disk.
* remove the storage that contains the VM's disk.
* attach the storage back and try to import the VM back.
The result would be: no free MACs available.
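
For illustration, a minimal sketch of the shape of the missing step described in this comment: when the storage-domain removal path drops the VM entities, the vNIC MAC addresses of those VMs should be returned to the MAC pool as well. The types and helper names below (MacPool, VmNic, freeMac) are hypothetical stand-ins, not the actual ovirt-engine classes.

import java.util.List;

// Illustrative only -- hypothetical types, not the real ovirt-engine code.
// The idea: before Remove_Entities_From_storage_domain deletes the VMs, walk their
// vNICs and give the MAC addresses back to the pool so a later re-import can reuse them.
interface MacPool {                        // hypothetical stand-in for the engine's MAC pool
    void freeMac(String mac);
}

class VmNic {                              // hypothetical stand-in for a vNIC entity
    final String macAddress;
    VmNic(String macAddress) { this.macAddress = macAddress; }
}

class StorageDomainRemovalSketch {

    // Called while detaching/removing the storage domain, once per VM removed with it.
    static void releaseMacsOfRemovedVm(List<VmNic> vnics, MacPool poolOfVmsCluster) {
        for (VmNic nic : vnics) {
            if (nic.macAddress != null && !nic.macAddress.isEmpty()) {
                poolOfVmsCluster.freeMac(nic.macAddress);  // the MAC is no longer "in use"
            }
        }
    }
}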

Comment 2 Yevgeny Zaspitsky 2016-12-18 13:54:43 UTC
The last step in the (proposed for 3.6 & 4.0) scenario could be replaced by creating a VM + vNIC.

Comment 3 Allon Mureinik 2016-12-18 22:54:37 UTC
Sounds like a reasonable analysis.
However, making sure the MAC pool is managed correctly is definitely the network team's domain.

Dan - if Yevgeny can't or won't fix this, please assign to someone who can.

Comment 4 Dan Kenigsberg 2016-12-22 12:01:27 UTC
I understand that Yevgeny and Maor have taken care of the issue together.

Comment 5 Red Hat Bugzilla Rules Engine 2016-12-23 14:20:03 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 6 Michael Burman 2016-12-26 12:21:14 UTC
This fix didn't make it into the current 4.1 downstream build, 4.1.0-0.3.beta2.el7.

Comment 7 Michael Burman 2017-01-05 06:55:27 UTC
Verified on - 4.1.0-0.4.master.20170104122005.git51b1bcf.el7.centos

Comment 8 Michael Burman 2017-01-08 07:52:08 UTC
How could it be that I see this bug again on the latest.tested 4.1 upstream version?
4.1.0-0.4.master.20170105161132.gitf4e2c11.el7.centos

Comment 9 Dan Kenigsberg 2017-01-08 09:33:41 UTC
Michael, does this reproduce as often (100%) as before? Maybe something else leaks mac addresses?

Comment 10 Maor 2017-01-08 09:37:21 UTC
(In reply to Michael Burman from comment #8)
> How could it be that i see this bug again on latest.tested 4.1 upstream
> version??
> 4.1.0-0.4.master.20170105161132.gitf4e2c11.el7.centos

Hi Michael,

Can you please add the engine logs?

Comment 11 Michael Burman 2017-01-08 10:10:49 UTC
No, only sometimes. It happened to me a few times this morning; since then I can't reproduce it.
Maybe something else leaks MAC addresses, I really don't know.
Attaching a log, but I don't have a reproduction from the last few hours.

Comment 12 Michael Burman 2017-01-08 10:11:31 UTC
Created attachment 1238325 [details]
new engine log_

Comment 13 Michael Burman 2017-01-09 07:17:28 UTC
Attaching a new log.
We have a MAC leakage for sure, but I can't understand what triggers it.
I have a situation in which no MACs are in use on the destination cluster, but the engine complains that no MACs are left in the pool when trying to import the VM.
When I add a vNIC to a VM on the destination cluster, the vNIC is added successfully from the pool, but when I try to import a VM to this cluster it complains that all MACs are in use.

Comment 14 Michael Burman 2017-01-09 07:17:52 UTC
Created attachment 1238574 [details]
mac leakage

Comment 15 Michael Burman 2017-01-09 08:10:29 UTC
OK, so I managed to reproduce it with the following steps. I'm still not sure exactly what the trigger is here, but I think we should consider re-opening this bug, as the MAC addresses are not freed when a storage domain is detached in this scenario:

1) Create two MAC pools, with a range of 2 MACs in each
2) Create a VM with 1 vNIC from pool1
3) Edit the cluster and switch it to the second MAC pool, pool2 (a REST sketch of steps 1 and 3 follows after this comment)
4) Add a second vNIC to the VM, from pool2
5) Detach the storage domain, remove it and import it back.
6) The engine should warn that one MAC is out of range; re-assign it and import the VM.
7) Detach the storage domain again and re-import.

Result:
The MAC addresses didn't get freed and are reported as in use on the destination, although they should be free.
Failed to import the VM.

It is 100% reproducible with the above steps.
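
For reference, a rough REST sketch of steps 1 and 3 above (creating the two MAC pools and switching the cluster to pool2). The /macpools collection and the cluster's <mac_pool> element are written from memory of the 4.1 API and should be verified against the installed API documentation; the URL, credentials, ranges and UUIDs are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch of steps 1 and 3 via the REST API -- placeholders throughout, verify the
// /macpools and /clusters payloads against the installed API documentation.
public class TwoMacPoolsSketch {
    static final String ENGINE = "https://engine.example.com/ovirt-engine/api";  // placeholder
    static final String AUTH = "Basic " + Base64.getEncoder()
            .encodeToString("admin@internal:password".getBytes());               // placeholder

    static final HttpClient HTTP = HttpClient.newHttpClient();

    static String send(String method, String path, String xml) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(ENGINE + path))
                .header("Authorization", AUTH)
                .header("Content-Type", "application/xml")
                .method(method, HttpRequest.BodyPublishers.ofString(xml))
                .build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Step 1: two MAC pools with a 2-address range each (ranges are placeholders).
        send("POST", "/macpools",
             "<mac_pool><name>pool1</name><ranges><range>"
             + "<from>00:1a:4a:00:00:01</from><to>00:1a:4a:00:00:02</to>"
             + "</range></ranges></mac_pool>");
        send("POST", "/macpools",
             "<mac_pool><name>pool2</name><ranges><range>"
             + "<from>00:1a:4a:00:00:11</from><to>00:1a:4a:00:00:12</to>"
             + "</range></ranges></mac_pool>");

        // Step 3: point the cluster at pool2 (UUIDs are placeholders).
        send("PUT", "/clusters/CLUSTER-UUID",
             "<cluster><mac_pool id=\"POOL2-UUID\"/></cluster>");
    }
}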

Comment 16 Maor 2017-01-09 09:54:56 UTC
Once a storage domain is detached, the engine looks up the VM's MAC addresses based on the cluster of the VM.
Since this cluster's MAC pool was changed, the engine was not able to fetch the original pool and free the MACs.

Dan and Yevgeny,
is it possible to block updating the MAC pool of a cluster once there are VMs on this cluster that hold MAC addresses?
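
To make the failure mode concrete, a hypothetical sketch (not the ovirt-engine implementation): if the release path resolves the pool from the VM's current cluster, a MAC allocated before the cluster was switched to another pool is never found and leaks. The "search all pools" variant is only one possible mitigation shown for illustration, not what the engine does.

import java.util.List;
import java.util.Optional;

// Illustrative only -- hypothetical types and helpers, not the ovirt-engine implementation.
class MacPoolLookupSketch {

    interface MacPool {
        boolean containsMac(String mac);   // is this MAC currently allocated from this pool?
        void freeMac(String mac);
    }

    // Fragile: if the cluster was switched to another pool after the MAC was allocated,
    // this frees nothing and the address stays marked as "in use".
    static void freeByVmsCluster(MacPool poolOfVmsCluster, String mac) {
        poolOfVmsCluster.freeMac(mac);
    }

    // One possible mitigation (illustration only): find whichever pool actually holds the MAC.
    static void freeWhereverAllocated(List<MacPool> allPools, String mac) {
        Optional<MacPool> owner = allPools.stream().filter(p -> p.containsMac(mac)).findFirst();
        owner.ifPresent(p -> p.freeMac(mac));
    }
}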

Comment 17 Michael Burman 2017-01-10 07:05:51 UTC
Hi

OK, so the scenario for the MAC leakage is simple (and is not related to updating the MAC pool ranges in the cluster).
A MAC address that has been re-assigned during VM import from a data domain, no matter for what reason (MAC in use or out of range), won't get freed the next time the domain is detached.

So it actually looks like another scenario that is not covered by BZ 1405761, and I believe we should track it here:
'MACs that have been re-assigned are not freed when detaching a data domain.'
This is 100% reproducible.

Please decide how you would like to continue and sorry for the noise.

Comment 18 Michael Burman 2017-01-10 13:10:24 UTC
We decided to track the new issue in a new bug - BZ 1411780