Bug 1497242 - oVirt duplicates MAC addresses
Summary: oVirt duplicates MAC addresses
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Network
Version: 4.1.5.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.1.8
Assignee: Alona Kaplan
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-29 14:41 UTC by nicolas
Modified: 2017-10-20 06:14 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-20 06:14:38 UTC
oVirt Team: Network
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments
engine.log (299.05 KB, application/x-gzip), 2017-09-29 14:41 UTC, nicolas
Script to find out duplicate IPs (1.23 KB, text/x-python), 2017-10-18 07:35 UTC, nicolas
Script to find out duplicate MACs (1.03 KB, text/x-python), 2017-10-18 07:36 UTC, nicolas
Script to find out duplicate MACs and solve problems (4.74 KB, text/x-python), 2017-10-18 07:36 UTC, nicolas
Read-only MAC pool submenu (28.29 KB, image/png), 2017-10-18 11:27 UTC, nicolas

Description nicolas 2017-09-29 14:41:06 UTC
Created attachment 1332428 [details]
engine.log

Description of problem:

We created a VmPool with one VM using the Python SDK, then resized the VmPool to 170 machines. Soon afterwards we started receiving a lot of complaints that users were being disconnected from their VMs (over SSH). Having a look at the VMs, I found out that many of them have duplicated MAC addresses, and consequently the same IP addresses as well.
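
For reference, this is roughly what the pool creation and resize looked like via the SDK (a minimal sketch only; the engine URL, credentials, cluster and template names below are placeholders, not our real values):

# Minimal sketch of the VmPool creation/resize using ovirtsdk4.
# The engine URL, credentials, cluster and template names are placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder engine URL
    username='admin@internal',
    password='password',
    insecure=True,  # skip CA verification, for the sketch only
)

pools_service = connection.system_service().vm_pools_service()

# Create the pool with a single VM first...
pool = pools_service.add(
    types.VmPool(
        name='LPP1718',
        cluster=types.Cluster(name='my-cluster'),        # placeholder cluster name
        template=types.Template(name='my-template'),     # placeholder template name
        size=1,
    )
)

# ...then resize it to 170 VMs, which is roughly when the duplicates appeared.
pool_service = pools_service.pool_service(pool.id)
pool_service.update(types.VmPool(size=170))

connection.close()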

We have these MAC ranges on the Cluster:

Default:
00:1a:4a:4d:cc:00 - 00:1a:4a:4d:cc:ff
00:1a:4a:4d:dd:00 - 00:1a:4a:4d:dd:ff
00:1a:4a:97:5f:00 - 00:1a:4a:97:5f:ff
00:1a:4a:97:6e:00 - 00:1a:4a:97:6f:ff

As far as I can tell, none of these ranges overlap.

I created a Python script to list VMs that share MAC addresses, and this is the result:

LPP1718-142 and LPP1718-10: 00:1a:4a:4d:cc:10
LPP1718-143 and LPP1718-15: 00:1a:4a:97:5f:11
LPP1718-140 and STIC-...-windows7-desktop: 00:1a:4a:4d:cc:0f
LPP1718-141 and LPP1718-13: 00:1a:4a:97:5f:10
LPP1718-147 and LPP1718-19: 00:1a:4a:97:5f:13
LPP1718-144 and LPP1718-12: 00:1a:4a:4d:cc:11
LPP1718-145 and LPP1718-17: 00:1a:4a:97:5f:12
LPP1718-148 and ASR_PROF-Base_Centos: 00:1a:4a:4d:cc:13
LPP1718-149 and LPP1718-21: 00:1a:4a:97:5f:14
ASR_PROF-routercentos and LPP1718-111: 00:1a:4a:97:5f:01
LPP1718-5 and LPP1718-133: 00:1a:4a:97:5f:0c
LPP1718-4 and LPP1718-134: 00:1a:4a:4d:cc:0c
LPP1718-7 and LPP1718-135: 00:1a:4a:97:5f:0d
LPP1718-6 and LPP1718-136: 00:1a:4a:4d:cc:0d
LPP1718-3 and LPP1718-127: 00:1a:4a:97:5f:09
LPP1718-2 and LPP1718-130: 00:1a:4a:4d:cc:0a
LPP1718-9 and LPP1718-137: 00:1a:4a:97:5f:0e
LPP1718-8 and LPP1718-138: 00:1a:4a:4d:cc:0e
ASR_DHCP and LPP1718-114: 00:1a:4a:4d:cc:02
LPP1718-132 and ubuntu-1404: 00:1a:4a:4d:cc:0b
LPP1718-131 and proysiena.ull.es: 00:1a:4a:97:5f:0b
ASR_Maquina1 and LPP1718-117: 00:1a:4a:97:5f:04
LPP1718-139 and LPP1718-11: 00:1a:4a:97:5f:0f
ASR_Prof_Ansible and LPP1718-119: 00:1a:4a:97:5f:05
LPP1718-18 and LPP1718-166: 00:1a:4a:4d:cc:1c
LPP1718-14 and LPP1718-150: 00:1a:4a:4d:cc:14
LPP1718-16 and LPP1718-164: 00:1a:4a:4d:cc:1b
LPP1718-168 and LPP1718-20: 00:1a:4a:4d:cc:1d
LPP1718-169 and FG-w1: 00:1a:4a:97:5f:1e
LPP1718-161 and LPP1718-33: 00:1a:4a:97:5f:1a
LPP1718-162 and ASR_PROF-cliente1: 00:1a:4a:4d:cc:1a
LPP1718-163 and LPP1718-35: 00:1a:4a:97:5f:1b
LPP1718-165 and LPP1718-37: 00:1a:4a:97:5f:1c
LPP1718-167 and LPP1718-39: 00:1a:4a:97:5f:1d
debian-9 and LPP1718-118: 00:1a:4a:4d:cc:04
LPP1718-122 and ASR_PROF-Server_centos: 00:1a:4a:4d:cc:06
LPP1718-123 and DOCKSPARK-tpl: 00:1a:4a:97:5f:07
LPP1718-129 and STIC-...-windows8.1-desktop: 00:1a:4a:97:5f:0a
LPP1718-23 and LPP1718-151: 00:1a:4a:97:5f:15
LPP1718-25 and LPP1718-153: 00:1a:4a:97:5f:16
LPP1718-27 and LPP1718-155: 00:1a:4a:97:5f:17
LPP1718-29 and LPP1718-157: 00:1a:4a:97:5f:18
AS_PROF_CD1_ASVM11 and LPP1718-45: 00:1a:4a:97:5f:21
LPP1718-159 and LPP1718-31: 00:1a:4a:97:5f:19
ASR_PROF-cliente3 and LPP1718-113: 00:1a:4a:97:5f:02
LPP1718-115 and lubuntu-1604-desktop: 00:1a:4a:97:5f:03
LPP1718-116 and ...: 00:1a:4a:4d:cc:03

As you can see, even within the pool there are several VMs with the same MAC, which is a huge issue, as users cannot work on their VMs normally.

Version-Release number of selected component (if applicable):

We created this VmPool with version 4.1.5.2 but we upgraded to 4.1.6.2 recently.

How reproducible:

We had a similar issue previously in BZ #1462198, but in that case I reported it as a problem between 2 Clusters; in this case, as you can see, the problem happens inside the same Cluster.

I'm also attaching an engine log covering the time of the pool creation (the pool name is LPP1718), although I'm not sure whether it will be useful at all.

Comment 1 Dan Kenigsberg 2017-09-29 19:23:55 UTC
This sounds a lot like the recently-fixed Bug 1492723, whose fix would be available in the not-yet-released ovirt-engine-4.1.7.2.

Do you have any means to test a nightly build from http://plain.resources.ovirt.org/pub/ovirt-4.1-snapshot/rpm/el7/noarch/ ?

Comment 2 nicolas 2017-09-29 19:39:59 UTC
I'm not sure this is the same issue, since we never unplug any NICs. From what I read in the other BZ, it seems to be an isolated case, but in my case, as you can see, this happened with a lot of VMs.

I'll try to deploy that build in our pre-production environment, but I can't say how soon, as we're having some quite busy days (and the problem happened in production).

Comment 3 Dan Kenigsberg 2017-10-15 18:53:29 UTC
Nicolas, have you managed to reproduce the bug with a 4.1.7 candidate? Or have any further info?

Comment 4 nicolas 2017-10-15 19:15:44 UTC
I'm sorry, I haven't been able to, and I probably won't be able to in the near future, because replicating the very same configuration in our staging environment takes time that we currently don't have...

However, I created a pool a few days ago and I could reproduce the same behavior (not as many duplicated MACs as in the original case, but still...).

Tomorrow or on Tuesday I will create a big pool again and check whether I can reproduce the same behavior. Maybe you would find it interesting to see some DB values before and after creating the pool?

If so, I can take a DB snapshot before and after creating the VmPool and send both to you, along with any other configuration you need.

Also, if you prefer, we can wait until we upgrade to 4.1.7 (when you release it) and try to reproduce it afterwards.

Comment 5 Dan Kenigsberg 2017-10-16 07:13:05 UTC
I wanted to check whether 4.1.7 fixes your issue *before* it is released, as MAC duplication in VM pools is one of the major bugs it supposedly fixes.

If you cannot check 4.1-snapshot, I would appreciate it if you could indeed produce a DB snapshot - I hope we can then try to reproduce your bug on our side.

Comment 6 nicolas 2017-10-16 09:55:59 UTC
I just sent a link to both DB snapshots to your e-mail (danken); feel free to share it with other RH staff members.

Right after I created the pool I ran the script I implemented to detect duplicate MACs, and this is the result:

MAC: 00:1a:4a:4d:cc:a8, VMs: ['TARO-...-42', 'windows-server2012R2']
MAC: 00:1a:4a:4d:cc:a6, VMs: ['AS_PROF_..._inst', 'TARO-...-40']
MAC: 00:1a:4a:4d:cc:9e, VMs: ['TARO-...-38', 'windows-81']
MAC: 00:1a:4a:4d:cc:55, VMs: ['AS_PROF_CD1_ASVM11', 'TARO-...-28']
MAC: 00:1a:4a:4d:cc:54, VMs: ['TARO-...-26', 'windows-10']
MAC: 00:1a:4a:4d:cc:53, VMs: ['ASR-PROF-win81', 'TARO-...-24']
MAC: 00:1a:4a:4d:cc:4e, VMs: ['ASR-PROF-win2012R2', 'TARO-...-20']
MAC: 00:1a:4a:4d:cc:4f, VMs: ['ASR-PROF-win10', 'TARO-...-22']

If you need the unmangled data, feel free to ask.

Also, don't hesitate to ask for anything else you might need.

Comment 7 nicolas 2017-10-16 15:30:10 UTC
Hi Dan, I've resent the links to the DB snapshots to your e-mail addresses. I'm not sure whether my mails are ending up in your spam folder; just in case, please check it.

If you still don't get that e-mail I'll send it again from a different address.

By the way, the new pool is called TARO-ULL.

Comment 8 Dan Kenigsberg 2017-10-17 15:09:05 UTC
Oddly, I got your mail only when you replied to mine; it's not even in my spam box.

Would you be so kind as to share the script that finds the duplicates (as an attachment to this bug)? Others may find it useful.

Comment 9 nicolas 2017-10-18 07:35:36 UTC
Created attachment 1340010 [details]
Script to find out duplicate IPs

Comment 10 nicolas 2017-10-18 07:36:26 UTC
Created attachment 1340011 [details]
Script to find out duplicate MACs

Comment 11 nicolas 2017-10-18 07:36:56 UTC
Created attachment 1340012 [details]
Script to find out duplicate MACs and solve problems

Comment 12 nicolas 2017-10-18 07:37:53 UTC
I've added 3 scripts:

1) Finds duplicate IPs (useful when using DHCP and you have duplicate MACs)
2) Finds duplicate MACs (this is the one I used above; a rough sketch of the approach is shown below)
3) Finds duplicate MACs and tries to resolve the conflicts
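
For reference, a minimal sketch of the approach script 2 takes; the actual attached script differs in detail, and the engine URL and credentials below are placeholders:

# Minimal sketch of a duplicate-MAC finder using ovirtsdk4.
# The engine URL and credentials are placeholders.
from collections import defaultdict

import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder engine URL
    username='admin@internal',
    password='password',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
macs = defaultdict(list)  # MAC address -> names of the VMs using it

for vm in vms_service.list():
    nics_service = vms_service.vm_service(vm.id).nics_service()
    for nic in nics_service.list():
        if nic.mac is not None and nic.mac.address:
            macs[nic.mac.address.lower()].append(vm.name)

# Report only the MACs assigned to more than one VM.
for mac, vm_names in sorted(macs.items()):
    if len(vm_names) > 1:
        print('MAC: {}, VMs: {}'.format(mac, vm_names))

connection.close()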

Comment 13 Alona Kaplan 2017-10-18 10:58:13 UTC
Hi Nicolas,
Looking at your 'after' db dump, it seems that the duplicate macs belong to different mac pools. That is ok and not a bug.

There are 8 pairs of vnics with duplicate macs. In all the pairs, one vnic belongs to cluster 'VDI' and the other to cluster 'Cluster-Rojo'. Each of the clusters uses a different mac pool, so there is no duplication inside the mac pools.

The only odd thing I found is that the duplicate macs that belong to the 'VDI' cluster ('Adicional-DOCINT2' mac pool) are not in the range of the pool.
This can happen if the range of the pool was changed or if the mac address of the vnic was specified manually.

If you want to avoid those duplications, you have to find all the vnics with macs that are outside the range of their mac pool and change them to be within the range of the pool. (And of course you have to make sure the ranges of your mac pools don't overlap, which is already the case in your setup.)
A script that can help find the out-of-range macs - https://github.com/dankenigsberg/ovirt-python-sdk-scripts/blob/master/src/main/org/ovirt/sdk/scipt/externalMacsVmsV4.py
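
If it helps, here is a minimal sketch of the range check such a script performs (this is not the linked script itself; the ranges below are just the Default pool ranges from the bug description, used for illustration, and should be replaced with the ranges of the mac pool you actually want to check):

# Minimal sketch of an out-of-range MAC check. The ranges below are the
# Default pool ranges from the bug description and are only an example;
# use the ranges of the mac pool you actually want to check.
def mac_to_int(mac):
    """Convert a colon-separated MAC address to an integer for comparison."""
    return int(mac.replace(':', ''), 16)

# (start, end) pairs of the pool, inclusive.
POOL_RANGES = [
    ('00:1a:4a:4d:cc:00', '00:1a:4a:4d:cc:ff'),
    ('00:1a:4a:4d:dd:00', '00:1a:4a:4d:dd:ff'),
    ('00:1a:4a:97:5f:00', '00:1a:4a:97:5f:ff'),
    ('00:1a:4a:97:6e:00', '00:1a:4a:97:6f:ff'),
]

def in_pool(mac):
    """Return True if the MAC falls inside any of the pool ranges."""
    value = mac_to_int(mac)
    return any(mac_to_int(start) <= value <= mac_to_int(end)
               for start, end in POOL_RANGES)

print(in_pool('00:1a:4a:4d:cc:a8'))  # True, inside the first range
print(in_pool('00:1a:4a:00:00:01'))  # False, outside all ranges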

Comment 14 nicolas 2017-10-18 11:06:57 UTC
Hi Alona,

Ok, I'll have a look at it, thanks. Anyway, check the first message, where I listed a lot of duplicate MACs; there are cases like:

LPP1718-5 and LPP1718-133: 00:1a:4a:97:5f:0c
LPP1718-4 and LPP1718-134: 00:1a:4a:4d:cc:0c
LPP1718-7 and LPP1718-135: 00:1a:4a:97:5f:0d
LPP1718-6 and LPP1718-136: 00:1a:4a:4d:cc:0d
LPP1718-3 and LPP1718-127: 00:1a:4a:97:5f:09
LPP1718-2 and LPP1718-130: 00:1a:4a:4d:cc:0a
LPP1718-9 and LPP1718-137: 00:1a:4a:97:5f:0e
LPP1718-8 and LPP1718-138: 00:1a:4a:4d:cc:0e

These are from the same cluster (and the same VmPool). This might be caused by the bug Dan mentioned; however, we never unplugged any NICs on our side, so I'm still not sure whether it's caused by the same bug, and unfortunately the DB dumps I provided don't include a case where 2 machines from the same VmPool have the same MAC.

Comment 15 nicolas 2017-10-18 11:27:10 UTC
Created attachment 1340128 [details]
Read-only MAC pool submenu

A bit of additional info: the people who created the VMs in the VDI Cluster affirm they never specified a MAC manually, and I don't remember changing the range either.

I also see something strange that I'm not sure is related: as I understand from your comment, a MAC pool is associated with a Cluster. When I edit a Cluster, in the MAC Address Pool submenu I see the same MAC ranges (Default and Adicional-DOCINT2) for both clusters (i.e., no matter which Cluster I edit, I see both sets of MAC ranges). Shouldn't I only see the ones associated with that Cluster?

Also, I cannot edit anything related to MAC pools: I cannot add a range to an existing MAC pool, nor edit it, nor delete it... I'm not sure, but something seems to be wrong here.

I'm attaching a screenshot FWIW.

Comment 16 Alona Kaplan 2017-10-19 07:08:29 UTC
Hi,
* The mac pool side tab in the cluster dialog is editable only for selecting the desired mac pool. The "allow duplicates" and "ranges" fields are read-only.
Creating/deleting/editing a mac pool is done via "Administration->Configure->Mac Address Pools".

Anyway, the read-only mac ranges you see in the cluster dialog should represent the actual ranges of the selected pool (they should be identical to the ranges you see in the "Administration->Configure->Mac Address Pools" dialog). If they are not, that is a (separate) bug.

* Regarding the out-of-range macs, there may be multiple causes: using old snapshots, importing vms, etc. (If you think there is a bug, please open a separate one.)

* Regarding the duplicate macs: we think we fixed the scenarios that cause them in 4.1.7.
We are not 100% sure your scenario is covered, since we don't have the relevant logs or db dump.
It would be really helpful if you could either use 4.1.7 and see whether you manage to reproduce it,
or try to reproduce it (maybe by creating multiple vm pools) on your current environment and attach the log + db dump.

Comment 17 nicolas 2017-10-19 11:42:48 UTC
Hi Alona,

I did a bit more research and things are starting to become clearer now. I think I know what happened and why we see strange MACs in the VDI Cluster.

We created a second DC (VDI) some time after we created the first one (KVMRojo), and at that time there was just one MAC Pool (Default). After keeping this configuration for a while, some students started complaining that they ran out of MAC addresses in the pool when deploying machines in VDI, so we created a new MAC Pool for the VDI Cluster and reassigned it. That is why there are MACs from both MAC pools mixed together; you were right in your assumption and this is not a bug.

Regarding the other part (duplicate MACs in the same Cluster + pool): I've been creating VmPools all day long (with 200-300 VMs) and this time I cannot reproduce the issue... I'm not sure what conditions were in place the first time I ran the script, but they are definitely not happening this time.

If you wish, you can close the BZ as NOTABUG and, after you release version 4.1.7, I can upgrade our oVirt infrastructure and test again. If we find this situation happening again, we can either reopen this BZ or open a new one.

Comment 18 Dan Kenigsberg 2017-10-20 06:14:38 UTC
Based on comment 17 I am closing this bug. Please reopen if the issue reproduces.

