Bug 1371699 - dhcp_release isn't run on the originating compute when live migrating a VM from computeA to computeB
Summary: dhcp_release isn't run on the originating compute when live migrating a VM from...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.0 (RHEL 7)
Assignee: Eoghan Glynn
QA Contact: Prasanth Anbalagan
URL:
Whiteboard:
Depends On: 1335277
Blocks: 1371636 1371664 1371673
 
Reported: 2016-08-30 21:07 UTC by Artom Lifshitz
Modified: 2019-12-16 06:34 UTC (History)
15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1335277
Environment:
Last Closed: 2017-02-03 16:40:40 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1585601 0 None None None 2016-08-30 21:07:40 UTC
OpenStack gerrit 325361 0 None None None 2016-08-30 21:07:40 UTC

Description Artom Lifshitz 2016-08-30 21:07:40 UTC
+++ This bug was initially created as a clone of Bug #1335277 +++

Description of problem:
dhcp_release isn't run on the originating compute when live migrating a VM from computeA to computeB. If dhcp_lease_time is set to a high value, the lease never expires: the originating compute retains the DHCP lease in dnsmasq, which then fails to re-allocate the same IP on the originating compute once the original VM is destroyed.
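
For reference, dhcp_release is the small utility from dnsmasq-utils that nova-network normally invokes to tell the local dnsmasq to drop a lease. A minimal manual sketch of what would need to run on the originating compute looks roughly like this (the bridge name, IP and MAC are placeholders, not values taken from this report):

    # Run on the originating compute (computeA). dhcp_release takes the
    # interface dnsmasq listens on, the leased IP and the client MAC.
    sudo dhcp_release br100 10.0.0.5 fa:16:3e:00:00:01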

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1) launch a VM with a known IP on node 1
2) live-migrate the VM away to node 2
3) delete the VM while it's still on node 2
4) launch a new VM on node 1 using the same IP

Node 1 refuses to give the IP to the new VM, thinking that the IP is owned by the old VM.
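
On the command line, that sequence is roughly the following sketch (the network UUID, fixed IP, flavor, image and host names are placeholders; host targeting uses the standard nova availability-zone syntax):

    # 1) Boot a VM with a known fixed IP on node 1
    nova boot --flavor m1.tiny --image cirros \
        --nic net-id=NET_UUID,v4-fixed-ip=10.0.0.5 \
        --availability-zone nova:node1 vm1
    # 2) Live-migrate it to node 2
    nova live-migration vm1 node2
    # 3) Delete it while it is still on node 2
    nova delete vm1
    # 4) Boot a new VM on node 1 with the same fixed IP
    nova boot --flavor m1.tiny --image cirros \
        --nic net-id=NET_UUID,v4-fixed-ip=10.0.0.5 \
        --availability-zone nova:node1 vm2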

It's a live-migration bug. After a VM is live-migrated to node 2, the lease cache on node 1 is not cleared.

This is usually not an issue with the default 2-minute DHCP lease time, but in this environment dhcp_lease_time is set to 604800s (7 days).
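
A hedged way to check the configured lease time and look for the stale entry on an affected compute is something like the following (the dnsmasq lease-file path is an assumption based on nova-network keeping per-bridge lease files under its state_path, commonly /var/lib/nova/networks/; the bridge name and IP are placeholders):

    # Configured lease time (604800 = 7 days in this environment)
    grep ^dhcp_lease_time /etc/nova/nova.conf
    # Look for a stale entry for the migrated VM's IP in the local dnsmasq lease cache
    grep 10.0.0.5 /var/lib/nova/networks/nova-br100.leases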

Actual results:
The IP is still in the lease cache of dnsmasq

Expected results:
The IP should be free

Additional info:

--- Additional comment from Artom Lifshitz on 2016-05-24 19:10:23 EDT ---

Hello,

First of all, just to make sure, can we explicitly confirm that nova-network is in use here and not Neutron?

I've reproduced what I think is the same behaviour in Nova upstream master. 

1. Boot an instance
2. Live-migrate it
3. Delete it
4. Boot another instance with the same IP

This fails with "Fixed IP address is already in use on instance"

As a control, I tried:

1. Boot an instance
2. Delete it
3. Boot another instance with the same IP

This succeeds.

However, I'm not sure it has anything to do with the DHCP lease not being released. Rather, it seems live-migrating an instance somehow causes its fixed IPs to remain associated with the deleted instance in the database even after the instance itself has been deleted. To confirm this, would it be possible to attach sosreports to this BZ?
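
A hedged way to inspect that association directly is a query like the one below (table and column names are taken from the Nova fixed_ips schema of that era and may differ; the database name, credentials and IP are placeholders):

    # Shows which instance the fixed IP is still tied to, plus its allocation/lease flags
    mysql nova -e "SELECT address, instance_uuid, allocated, leased, deleted FROM fixed_ips WHERE address='10.0.0.5';"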

If we confirm that I've indeed observed the same behaviour in Nova master as you're seeing in RHOS 5, I'll need to submit an upstream bugfix and then do a downstream-only backport to RHOS 5, as Icehouse is no longer supported upstream.

Cheers!

--- Additional comment from David Hill on 2016-05-24 19:13:21 EDT ---

Hello sir,

   I can confirm it is openstack-nova-network that is being used and that killing dnsmasq and restarting nova-network solves this issue.
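
For reference, that workaround amounts to roughly the following on the affected compute (the service name is the RHEL 7 / RHOS packaging name; this is a sketch, not a supported procedure):

    # Kill the stale dnsmasq, then restart nova-network so it respawns dnsmasq
    sudo pkill dnsmasq
    sudo systemctl restart openstack-nova-network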

Thank you very much,

David Hill

--- Additional comment from David Hill on 2016-05-24 19:13:58 EDT ---

Logs are in collab-shell.

--- Additional comment from David Hill on 2016-06-01 10:24:45 EDT ---

Is there a fix for this issue yet, or do we have to create one?

Thank you sir,

Dave

--- Additional comment from Artom Lifshitz on 2016-06-01 10:27:58 EDT ---

There's no fix - the issue is still present in Nova master. I'm working on a fix for master. Once that's submitted for review I'll backport it to RHOS 6.

Cheers!

--- Additional comment from Bryan Yount on 2016-06-07 15:30:02 EDT ---

(In reply to Artom Lifshitz from comment #5)
> There's no fix - the issue is still present in Nova master. I'm working on a
> fix for master. Once that's submitted for review I'll backport it to RHOS 6.

Did you mean RHOS 5? The customer is running RHOS 5 on RHEL 7.

--- Additional comment from Artom Lifshitz on 2016-06-07 15:40:09 EDT ---

Yep, RHOS5 - sorry for the typo.

--- Additional comment from Andreas Karis on 2016-06-20 11:58:34 EDT ---

Hello,

Customer asked for an ETA for this BZ. Can we give them any new info?

Regards,

Andreas

--- Additional comment from Artom Lifshitz on 2016-06-20 15:34:29 EDT ---

Hello,

We still haven't settled on the best approach to fix this in upstream Nova master, so I can't put forward any ETA. I know I said I'd do the backport as soon as a fix is submitted upstream, but the fix currently submitted upstream has a very slim chance of being accepted as is, so it's not backportable.

Cheers!

--- Additional comment from Bryan Yount on 2016-07-22 01:33:13 EDT ---

It's been a month since the last status update. Is there any possibility of a backport?

--- Additional comment from Artom Lifshitz on 2016-07-22 10:59:36 EDT ---

Hi Bryan,

Apologies for the lack of status updates. I've recently submitted a modified patch for upstream review. Next week, once folks are back from the midcycle summit currently taking place, I'll be able to ask for reviews and have a better answer for you here.

Cheers!

--- Additional comment from Artom Lifshitz on 2016-07-28 14:12:24 EDT ---

Good news: the upstream fix has merged, and I've submitted it for review downstream. I've reproduced the bug locally on a two-node devstack and confirmed the fix resolves the issue. Do you want to get a scratch build out to the customer to validate that it also fixes the bug in their environment?

Cheers!

