Bug 1654641 - Random failures at host evacuate time (> 30 vms), not live-migration.
Summary: Random failures at host evacuate time (> 30 vms), not live-migration.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 12.0 (Pike)
Assignee: Lee Yarwood
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Duplicates: 1649253
Depends On: 1656369
Blocks: 1264181 1570429 1582827
 
Reported: 2018-11-29 10:19 UTC by Pierre-Andre MOREY
Modified: 2023-03-21 19:08 UTC
CC List: 12 users

Fixed In Version: openstack-nova-16.1.5-5.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1656369
Environment:
Last Closed: 2019-03-20 15:29:07 UTC
Target Upstream Version:
Embargoed:



Description Pierre-Andre MOREY 2018-11-29 10:19:32 UTC
Description of problem:

Our customer is testing HA. When they place more than 25-30 VMs on two hosts and shoot one host to force an evacuation of its VMs to the other host, failures start to appear above that VM count. The hosts are easily beefy enough to run 100 VMs each of this flavor (and the VMs are idle, as this is for testing purposes).

The last attempt was with 100 VMs; 47 of them failed to migrate.

Nova reports the VMs as being in error state, yet they are pingable and they exist on the target host (at the libvirt/virsh level); nova, however, still reports them in error state on the source host, not the target one.

Trying to reset the state did nothing, and neither did tagging them as 'evacuable'. The only remaining solution is manual database modification, which we do not want to turn into a process in the production environment.
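
For reference, the reset attempt was roughly equivalent to the sketch below (openstacksdk; the cloud name "overcloud" is only a placeholder for an admin clouds.yaml entry, not the customer's actual configuration):

import openstack

conn = openstack.connect(cloud='overcloud')

# Instances left in ERROR across all projects after the evacuation
# (on older SDK releases the keyword is all_tenants=True).
for server in conn.compute.servers(all_projects=True, status='ERROR'):
    print(server.id, server.status, server.compute_host)
    # Equivalent of `nova reset-state --active <uuid>`: it clears the ERROR
    # state but, as described above, does not move the record off the dead
    # source host.
    conn.compute.reset_server_state(server, 'active')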

We have the logs from the 100-VM run.

Additionally, the VMs are freshly created for each test, so this is not the source->target->source issue that we are already tracking for RHOSP13 (bug 1567606).

Version-Release number of selected component (if applicable):


How reproducible:
Create 50-60 VMs on a host, kill it, and let the VMs be evacuated to another node.

Steps to Reproduce:
1. Create 50-60 VMs on a two-compute-node deployment (the customer will test with 4)
2. Shoot one node (power, network, whatever)
3. The VMs will be evacuated (a scripted equivalent is sketched below)
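
A scripted equivalent of the steps above could look like the sketch below (openstacksdk; cloud, image, flavor, network and host names are placeholders that must match the local environment):

import openstack

conn = openstack.connect(cloud='overcloud')

image = conn.compute.find_image('cirros')
flavor = conn.compute.find_flavor('m1.tiny')
network = conn.network.find_network('private')

# Step 1: boot 50-60 small, idle test instances.
for i in range(50):
    conn.compute.create_server(
        name='evac-test-%02d' % i,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{'uuid': network.id}])

# Step 2: fence / power off one compute node (done outside the API).

# Step 3: evacuate everything nova still maps to the dead host. Instance HA
# triggers this automatically when configured; the loop below assumes the
# installed SDK exposes evacuate_server(), otherwise the
# `nova host-evacuate <dead-host>` CLI does the same thing.
for server in conn.compute.servers(all_projects=True,
                                   host='overcloud-compute-0'):
    conn.compute.evacuate_server(server)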

Actual results:
Random VMs end up in error state. They have technically been migrated, but nova still considers them to be on the dead host; libvirt has started them and they are live (i.e. virsh console, ping, etc.).
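
The mismatch can be made visible by comparing nova's record with what libvirt on the surviving compute is actually running, e.g. with a sketch along these lines (libvirt-python plus openstacksdk; the qemu+ssh URI and host names are placeholders):

import libvirt
import openstack

conn = openstack.connect(cloud='overcloud')

# Domains actually active on the target hypervisor; the libvirt domain UUID
# matches the nova instance UUID, so the two sides can be compared directly.
virt = libvirt.open('qemu+ssh://root@overcloud-compute-1/system')
running = {dom.UUIDString() for dom in virt.listAllDomains() if dom.isActive()}

for server in conn.compute.servers(all_projects=True, status='ERROR'):
    # nova still points at the dead source host even though libvirt on the
    # target host is running the instance.
    print(server.id,
          'nova host:', server.compute_host,
          'active on target:', server.id in running)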

Expected results:
All VM database entries are in sync with the VMs' current location.

Case files have been yanked onto supportshell.


Additional info:
The customer goes live next week...

Comment 2 Lee Yarwood 2018-11-30 16:10:21 UTC
(In reply to Pierre-Andre MOREY from comment #0)
> Steps to Reproduce:
> 1. Create 50-60 vms on a two compute nodes deployment (customer will test
> with 4)
> 2. Shoot one node (power, network, whatever)
> 3. Vms will be evacuated

This smells like https://bugzilla.redhat.com/show_bug.cgi?id=1567606 *if* the customer is allowing the original compute to power back on before all of the instances have been rebuilt on the destination compute.
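
If it helps, the compute service states during the test window can be checked with something like the sketch below (openstacksdk; host names are placeholders). If the fenced host's nova-compute flips back to 'up' before every rebuild has finished, the 1567606 scenario applies; keeping it administratively down until the evacuations complete (e.g. `openstack compute service set --down <host> nova-compute`, which needs microversion >= 2.11) avoids that race.

import openstack

conn = openstack.connect(cloud='overcloud')

# One line per compute node: host, enabled/disabled, and up/down as nova
# currently sees it.
for svc in conn.compute.services():
    if svc.binary == 'nova-compute':
        print(svc.host, svc.status, svc.state)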

Pierre, can you confirm whether that's the case?

Comment 9 Lee Yarwood 2018-12-13 15:37:59 UTC
*** Bug 1649253 has been marked as a duplicate of this bug. ***

