Red Hat Bugzilla – Bug 1474399
engine auto-selects MAC addresses that are already in use
Last modified: 2017-08-23 03:36:47 EDT
Description of problem:
MACs that are auto-selected for newly added NICs are sometimes already in use, even when there are other free MACs in the pool. This causes unnecessary error messages and failures.
Version-Release number of selected component (if applicable):
How reproducible:
We see this quite often in an environment that has two MAC ranges defined in a single global MAC pool. The first range is almost full while the second is free. For about one in 10 added NICs, a duplicate MAC is selected from the used range, causing an error.
Steps to Reproduce:
1. create VM
2. go to Network Interfaces
3. click New, press OK
Once every couple of tries this fails with the error:
Error while executing action:
MAC Address 00:1a:4a:16:13:68 is already in use.
Expected results:
Duplicate MACs are not selected by Engine for new interfaces.
As noted above, we have two MAC ranges defined in the Default MAC pool and have stateless desktops as well as VM pools in place.
Despite these errors on manual creation, we still see multiple duplicate MACs in the database. We are not sure how that happens, but as soon as the Engine detects a duplicate on VM start it detaches the NIC in question, and we then have to look through the DB to find out which other VM was assigned this MAC in order to fix it.
So you have just one MAC pool with 2 ranges, one almost full and one (almost) empty. I tried these reproduction steps:
0. fresh installation.
1. updated mac pool to have 1 range: 00:1a:4a:16:01:00-00:1a:4a:16:01:06
2. created new vm (1) with 5 nics. …00-…04 were selected (as expected)
3. added new range to default pool, 00:1a:4a:16:02:00-00:1a:4a:16:02:0a
4. new vm (2) with 3 macs. …02:00-…02:02 were selected (as expected)
5. added new nic to vm (2), 00:1a:4a:16:02:03 was selected (not ok, relates to unmerged fix in patch 78857, but does not cause any problem)
6. added new nic to vm (1), …:02:04 was selected (not ok … same)
7. added 5 new nics, one by one to vm (1), …:02:05-…02:0a was selected (not ok … same)
8. added 2 new nics, one by one to vm (1), …:01:05-…01:06 was selected (not ok … same)
mac pool should be full now
9. cannot add new nics, mac pool is full.
10. released all four nics from vm (2) — the 'more empty range at start'
11. added all released nics to vm (1) one by one. …02:00-…02:03 was selected (as expected).
mac pool should be full now
12. cannot add new nics, mac pool is full.
==> cannot reproduce, not a single issue.
But let's try to figure out what could be wrong in your setup:
Based on listing:
[org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand] (default task-43) [796f0daf] Validation of action 'ActivateDeactivateVmNic' failed for user email@example.com. Reasons:
I'd guess that the given MAC address is free in the MAC pool, yet there is a plugged interface with that address. The check that ends in this error is driven by the following SQL:
WHERE mac_addr = v_mac_address
AND is_plugged = true;
This means that if an interface with this MAC address is plugged anywhere in the system, you won't be able to activate your NIC, even if the MAC is free in the current MAC pool. That can happen if this MAC address is served by another MAC pool: the MAC is free in one pool, but because of this SQL it cannot be activated, and you get your error message. Coincidentally, we're currently discussing this behavior and it will probably be replaced. If this is your cause, we can probably backport the change.
If you really have just one MAC pool, then there may be some inconsistency between the in-memory MAC pool and the DB. With each version it should be less and less possible to reach this state, but if you're somehow in it, an engine restart should help. Did you restart the service lately? If you did, I'm probably out of guesses. A race should not be a problem, and if the MAC pool claims a MAC is free, it should be free. I'd bet that you have more MAC pools and the problematic MAC is used in another pool.
About duplicate MACs in the DB: this is (wrong) behavior from the past, and I'm working on blocking it from happening. Patches should be ready for merging. But this can't cause your problems.
First of all, thanks for the great explanation of what may be triggering this error. We do have just a single MAC pool, and the engine service has been running for at least a few weeks. Our DB is contaminated by MAC duplicates from earlier Engine versions, where VDI pools were causing them, yet you noted that this shouldn't be causing this issue.
How is a free MAC selected? If I know that, I can run both SELECTs in a loop across our MAC pool to check if there are any more such collisions.
1) You should try to get rid of the duplicate MACs; that way you spare yourself potential issues. However, duplicates shouldn't result in the engine selecting a used MAC. The engine selects (or should select) only unused MACs, so a MAC being used twice shouldn't make a difference. New duplicates might still occur though; the effort to block this is not finished yet.
2) About how a free MAC is selected. First, how the MAC pool is initialized: the given ranges are taken, and potential overlaps and multicast MACs are removed; this gives the MACs the pool has to work with. Then all vds_interface records belonging to that MAC pool are read from the DB and marked as used. That concludes initialization. When there is a request for a free MAC, the pool loops over all ranges, looking for the first range that still has an unused MAC. Once it finds one, it marks it, so the next search starts from this one +1 (actually there's currently a pending patch, so right now it's erroneously +2); this is done for 'fairness', so that MACs from all ranges are used equally. So now we have a range selected, which should have some free MACs. A free MAC is searched for 'from left to right', and the last used position is stored again. That's done to ensure that a recently released MAC isn't used again too soon, which might confuse services on the network.
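The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the Engine's actual code: the names `MacRange`, `MacPool`, `cursor` and `next_range` are hypothetical, overlap and multicast filtering are left out, and the range rotation uses the intended +1 step rather than the erroneous +2.

```python
from typing import Iterable, List, Optional, Tuple


def mac_to_int(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer."""
    return int(mac.replace(":", ""), 16)


def int_to_mac(value: int) -> str:
    """Convert an integer back to colon-separated lowercase hex."""
    raw = f"{value:012x}"
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2))


class MacRange:
    """One contiguous range with a used flag per MAC (hypothetical)."""

    def __init__(self, start: str, end: str) -> None:
        self.start = mac_to_int(start)
        self.used = [False] * (mac_to_int(end) - self.start + 1)
        self.cursor = 0  # position right after the last MAC handed out

    def offset(self, mac: str) -> Optional[int]:
        off = mac_to_int(mac) - self.start
        return off if 0 <= off < len(self.used) else None

    def allocate(self) -> Optional[str]:
        # Search 'from left to right', starting at the stored position,
        # so a recently released MAC is not handed out again right away.
        size = len(self.used)
        for step in range(size):
            idx = (self.cursor + step) % size
            if not self.used[idx]:
                self.used[idx] = True
                self.cursor = (idx + 1) % size
                return int_to_mac(self.start + idx)
        return None


class MacPool:
    """In-memory pool built once from ranges plus MACs found in the DB."""

    def __init__(self, ranges: List[Tuple[str, str]],
                 macs_in_db: Iterable[str]) -> None:
        self.ranges = [MacRange(s, e) for s, e in ranges]
        self.next_range = 0  # range where the next search starts
        for mac in macs_in_db:
            self._set_used(mac, True)

    def _set_used(self, mac: str, value: bool) -> None:
        for rng in self.ranges:
            off = rng.offset(mac)
            if off is not None:
                rng.used[off] = value
                return

    def release(self, mac: str) -> None:
        self._set_used(mac, False)

    def allocate(self) -> Optional[str]:
        # Rotate over the ranges for 'fairness': after a hit in range i,
        # the next request starts looking in range i + 1.
        count = len(self.ranges)
        for step in range(count):
            i = (self.next_range + step) % count
            mac = self.ranges[i].allocate()
            if mac is not None:
                self.next_range = (i + 1) % count
                return mac
        return None
```

In this sketch a MAC that exists in the DB but is never passed in as `macs_in_db` stays marked as free, so `allocate()` can hand it out even though it is in use.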
One more idea about what might cause your problem even if you have just one pool. It's just the old one rephrased, but still ;) Based on what was said before: if you have some old data in the DB, such that some vds_interface record isn't related to the single MAC pool you have, its MAC won't be registered in the MAC pool and will potentially be offered as free, but based on the (to be removed) select shared in comment #2 it will be refused in the NIC activation process. Can you check whether you have such cluster-orphaned vm_interfaces in the DB?
Thank you for the description. Does this mean that the MAC pool init happens just once, after Engine is started? If yes - at which point are MACs removed from it when a vNIC is created and how are they added back to the pool once it is removed?
If the pool is initialized every time a request for a MAC comes in, where is the MAC pointer stored then? What would happen if it points to a MAC but a user manually assigned MAC+1 to a vNIC?
Sorry for all these questions, but I'm just trying to follow the process to understand which conditions cause this duplicate MAC assignment. Right before filing this BZ I got a duplicate MACs assigned to vNICs like twice in a minute but now the pointer seems to have moved to the empty second MAC range and I'm no longer getting duplicates. Soon it should get back to the start and we may start getting dupes again.
As for existing duplicates, that indeed probably appeared in older versions of the Engine, there's a bunch of them - out of ~200 vNICs we have 13 MACs appear in vm_interfaces more than once (with one appearing 4 times total). Peculiarly, MAC 00:1a:4a:16:13:68 listed in this BZ is listed twice as well, on VMs that were created before my experiment.
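For reference, a count like the one above can be produced with a simple GROUP BY query. The following is a minimal sketch that runs against an in-memory SQLite table with made-up sample rows, not against the real Engine database: the table name `vm_interface` and the `vm_name` column are assumptions, while `mac_addr` and `is_plugged` are taken from the SQL fragment quoted earlier in this bug.

```python
import sqlite3
from typing import List, Tuple


def find_duplicate_macs(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Return (mac_addr, uses) for every MAC that appears more than once."""
    query = """
        SELECT mac_addr, COUNT(*) AS uses
        FROM vm_interface
        GROUP BY mac_addr
        HAVING COUNT(*) > 1
        ORDER BY uses DESC, mac_addr
    """
    return conn.execute(query).fetchall()


def build_sample_db() -> sqlite3.Connection:
    """Create a throwaway table with illustrative rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vm_interface "
        "(vm_name TEXT, mac_addr TEXT, is_plugged INTEGER)"
    )
    conn.executemany(
        "INSERT INTO vm_interface VALUES (?, ?, ?)",
        [
            ("vm-a", "00:1a:4a:16:13:68", 1),
            ("vm-b", "00:1a:4a:16:13:68", 0),
            ("vm-c", "00:1a:4a:16:13:69", 1),
        ],
    )
    return conn


if __name__ == "__main__":
    for mac, uses in find_duplicate_macs(build_sample_db()):
        print(mac, uses)
```

The same SELECT should carry over to other SQL databases with little change, provided the table and column names are adjusted to the actual schema.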
(In reply to Evgheni Dereveanchin from comment #5)
> Thank you for the description. Does this mean that the MAC pool init happens
> just once, after Engine is started?
± yes. It also gets reinitialized every time you change MAC pool settings, such as adding or changing a range.
> If yes - at which point are MACs removed
> from it when a vNIC is created and how are they added back to the pool once
> it is removed?
It happens within the same transaction. I.e., if you see a NIC added to the DB or removed from it, the MAC pool should be in sync. This wasn't part of the initial messy implementation, but quite some effort went into making sure this behaves correctly. So I'd like to say that you can count on this behavior, although there can be (very, very improbable) violations of it. Still, I believe the problem is elsewhere.
> If the pool is initialized every time a request for a MAC comes in, where is
> the MAC pointer stored then? What would happen if it points to a MAC but a
> user manually assigned MAC+1 to a vNIC?
Nothing would happen, even if it worked the way you described (which it doesn't). Every MAC has a 0/1 flag: used or unused. If it's used multiple times, it also has an associated counter.
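To illustrate the flag-plus-counter idea, here is a minimal sketch. It is not the Engine's data structure: `MacUsage` is a hypothetical helper, and the `allow_duplicates` flag is a hypothetical stand-in for any flow that permits a MAC to be used more than once.

```python
from collections import Counter


class MacUsage:
    """Tracks how many NICs use each MAC (hypothetical helper)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def is_used(self, mac: str) -> bool:
        # The 0/1 'used' flag is simply "counter greater than zero".
        return self._counts[mac] > 0

    def add(self, mac: str, allow_duplicates: bool = False) -> bool:
        """Register one use of a MAC; refuse a second use by default."""
        if self.is_used(mac) and not allow_duplicates:
            return False
        self._counts[mac] += 1
        return True

    def remove(self, mac: str) -> None:
        """Drop one use; the MAC becomes free once the counter hits zero."""
        if self._counts[mac] > 0:
            self._counts[mac] -= 1
```

With a structure like this, a manually assigned MAC is marked as used like any other, so a later left-to-right search skips over it instead of handing it out again.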
> Sorry for all these questions, but I'm just trying to follow the process to
> understand which conditions cause this duplicate MAC assignment. Right
> before filing this BZ I got a duplicate MACs assigned to vNICs like twice in
> a minute but now the pointer seems to have moved to the empty second MAC
> range and I'm no longer getting duplicates. Soon it should get back to the
> start and we may start getting dupes again.
Please monitor which action generates the duplicates. I'd bet it has something to do with snapshots or stateless VMs. If I'm right, it is impossible to get a duplicate MAC just by adding a new vmNic. Flows with snapshots and stateless VMs relate to the method forceAddMac, which by (former) design willingly destroyed MAC pool consistency. I'm trying to remove this method; the change is in the review process.
> As for existing duplicates, that indeed probably appeared in older versions
> of the Engine, there's a bunch of them - out of ~200 vNICs we have 13 MACs
> appear in vm_interfaces more than once (with one appearing 4 times total).
> Peculiarly, MAC 00:1a:4a:16:13:68 listed in this BZ is listed twice as well,
> on VMs that were created before my experiment.
I'd suggest the following steps:
• if you're having problems with duplicate MACs, allegedly from former versions, it's time to get rid of them. There are 13 of them, so it's not that big a deal.
• monitor the appearance of new duplicates and what created them. If you figure out which exact action did that, it would help us greatly. If you find that snapshots/stateless VMs are the reason, you'll have to wait for my fix and upgrade. If even a single vmNic creation causes the problem, then we have a problem we don't know about and we have to fix it (really improbable; I tested this behavior even on a really messy env and it worked as expected).