Description of problem:

An HA VM ended up running on 2 hosts after the engine thought that an apparently successful migration had failed.

The engine reported the following:

2015-01-01 23:11:25,886 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-50) [6eae6da7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM vm-gfw failed. It will be restarted automatically.
2015-01-01 23:11:25,886 INFO [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-50) [6eae6da7] Highly Available VM went down. Attempting to restart. VM Name: vm-gfw, VM Id: 004e9a3e-a3e2-480f-b757-1bdb72d67555

The VM was then restarted on another host (since the VM was HA):

2015-01-01 23:13:09,895 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-61) [78ee5a99] RefreshVmList vm id 004e9a3e-a3e2-480f-b757-1bdb72d67555 status = PoweringUp on vds host-C ignoring it in the refresh until migration is done

However, on the destination host the migration was successful and the VM was up and running, and on the source host the migration completed successfully and the VM was destroyed.

Version-Release number of selected component (if applicable):
RHEV 3.3.4
RHEL 6.5 hosts with vdsm-4.14.7-3

How reproducible:
Only seen once so far.

Steps to Reproduce:
1.
2.
3.

Actual results:
The VM was seen to have "failed" and was restarted, ending up running on two hosts.

Expected results:
The migration should have been handled as a successful one.

Additional info:
Created attachment 987834 [details]
vdsm log from host 'h0080d'
The issue here is that the "migrating_to" field of the VM held the wrong host id (in this case, the id of the source host itself). When the migration later succeeded, the hand-over process updated the "run_on" field with that wrong id (the source's), making the engine think the VM was missing (because it was no longer running on the source host), and therefore restarting it because it is HA.

This was solved by fixing the retry timing of maintenance in Bug 1104030 (Failed VM migrations do not release VM resource lock properly leading to failures in subsequent migration attempts) and by clearing stale migration information in Bug 1112359 (Failed to remove host xxxxxxxx).

Both bugs are already merged to the latest 3.4.z.
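The failure mode above can be sketched in a few lines. This is a minimal, hypothetical model (the function and field names mirror the comment, not the actual ovirt-engine code): if "migrating_to" is mistakenly populated with the source host's id, the hand-over writes that id into "run_on", monitoring then fails to find the VM on its recorded host, and the HA logic restarts a VM that is in fact running elsewhere.

```python
# Hypothetical sketch of the hand-over described above, not real engine code.

def hand_over(vm, vms_running_on):
    """On migration success, move the VM to the host recorded in migrating_to."""
    vm["run_on"] = vm["migrating_to"]  # if migrating_to is wrong, run_on is wrong
    # Monitoring then looks for the VM on its recorded host:
    if vm["id"] not in vms_running_on[vm["run_on"]]:
        return "restart_ha_vm"  # engine believes the VM went down
    return "vm_up"

# Correct case: migrating_to holds the destination host id.
vm_ok = {"id": "004e9a3e", "run_on": "host-A", "migrating_to": "host-B"}
# Buggy case: migrating_to was set to the source host itself.
vm_bad = {"id": "004e9a3e", "run_on": "host-A", "migrating_to": "host-A"}
# After the migration, the VM actually runs only on the destination, host-B.
running = {"host-A": set(), "host-B": {"004e9a3e"}}

print(hand_over(vm_ok, running))   # vm_up
print(hand_over(vm_bad, running))  # restart_ha_vm -> second copy of an HA VM
```

In the buggy case the VM is healthy on host-B, but because "run_on" points at host-A the engine treats it as down and restarts it, which matches the "HA VM running on 2 hosts" symptom in the description.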