Bug 1188854

Summary: An HA VM ended up running on 2 hosts after the engine thought that an apparently successful migration had failed.
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.3.0
Reporter: Gordon Watson <gwatson>
Assignee: Nobody <nobody>
CC: ecohen, gklein, iheim, lpeer, lsurette, ofrenkel, pstehlik, rbalakri, rgolan, Rhev-m-bugs, yeylon
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard: virt
Doc Type: Bug Fix
Type: Bug
Regression: ---
oVirt Team: ---
Last Closed: 2015-02-16 09:33:16 UTC

Description Gordon Watson 2015-02-03 20:50:49 UTC
Description of problem:

An HA VM ended up running on 2 hosts after the engine thought that an apparently successful migration had failed.

The engine reported the following;

2015-01-01 23:11:25,886 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-50) [6eae6da7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM vm-gfw failed. It will be restarted automatically.

2015-01-01 23:11:25,886 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-50) [6eae6da7] Highly Available VM went down. Attempting to restart. VM Name: vm-gfw, VM Id:004e9a3e-a3e2-480f-b757-1bdb72d67555

And then was restarted on another host (since the VM was HA);

2015-01-01 23:13:09,895 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-61) [78ee5a99] RefreshVmList vm id 004e9a3e-a3e2-480f-b757-1bdb72d67555 status = PoweringUp on vds host-C ignoring it in the refresh until migration is done


However, on the destination host the migration was actually successful and the VM was up and running, while on the source host the migration had also completed successfully and the VM had been 'destroyed'.




Version-Release number of selected component (if applicable):

RHEV 3.3.4
RHEL 6.5 hosts with 'vdsm-4.14.7-3'


How reproducible:

Only seen once so far.


Steps to Reproduce:
1.
2.
3.

Actual results:

The engine treated the apparently successful migration as a failure, reported the HA VM as having failed, and restarted it on another host, leaving the VM running on two hosts.


Expected results:

The migration should have been handled as a successful one; the VM in question should not have been seen to have "failed" or been restarted.


Additional info:

Comment 3 Gordon Watson 2015-02-03 21:12:50 UTC
Created attachment 987834 [details]
vdsm log from host 'h0080d'

Comment 9 Omer Frenkel 2015-02-16 09:31:11 UTC
The issue here is that the "migrating_to" field of the VM held the wrong host id; in this case it held the id of the source host itself.
So later, when the migration succeeded, the hand-over process updated the "run_on" field with that wrong id (the source's), making the engine think the VM was missing (because it was no longer running on the source host), and therefore restarting it because it is HA.
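
For illustration only, a minimal hypothetical Java sketch of that hand-over path (this is not the actual ovirt-engine code; all class, field, and method names are made up) showing how a "migrating_to" value that points back at the source can leave "run_on" on the wrong host, so the next monitoring pass no longer finds the VM where the engine expects it and takes the HA restart path:

// Hypothetical, simplified model of the hand-over described above.
// Names are illustrative only, not the real ovirt-engine code or API.
import java.util.Set;
import java.util.UUID;

public class MigrationHandOverSketch {

    static class VmRecord {
        UUID runOnHost;          // engine's view of where the VM runs ("run_on")
        UUID migratingTo;        // destination recorded at migration start ("migrating_to")
        boolean highlyAvailable; // HA flag
    }

    /** Hand-over: called when the migration is reported as finished. */
    static void handOver(VmRecord vm) {
        // Bug scenario: migratingTo mistakenly holds the SOURCE host id,
        // so run_on is "moved" back to the source instead of the destination.
        vm.runOnHost = vm.migratingTo;
    }

    /** Next monitoring cycle for the host the engine believes runs the VM. */
    static void refresh(VmRecord vm, UUID reportingHost, Set<UUID> vmsReportedByHost,
                        UUID vmId, String vmName) {
        if (vm.highlyAvailable
                && vm.runOnHost.equals(reportingHost)
                && !vmsReportedByHost.contains(vmId)) {
            // The VM is not where run_on says it should be, so the engine
            // concludes it went down and, because it is HA, restarts it
            // elsewhere, even though it is actually up on the real destination.
            System.out.println("Highly Available VM went down. Attempting to restart. VM Name: " + vmName);
        }
    }

    public static void main(String[] args) {
        UUID source = UUID.randomUUID();
        UUID vmId = UUID.fromString("004e9a3e-a3e2-480f-b757-1bdb72d67555");

        VmRecord vm = new VmRecord();
        vm.highlyAvailable = true;
        vm.runOnHost = source;
        vm.migratingTo = source;   // wrong: should have been the destination host id

        handOver(vm);              // run_on now (incorrectly) still points at the source
        refresh(vm, source, Set.of(), vmId, "vm-gfw");  // source no longer reports the VM -> HA restart
    }
}

Running the main method prints the same "Highly Available VM went down. Attempting to restart" decision seen in the engine log above, even though nothing actually failed on either host.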

This issue was solved by fixing the retry timing of maintenance in
Bug 1104030 - Failed VM migrations do not release VM resource lock properly leading to failures in subsequent migration attempts

and by clearing old migration information in
Bug 1112359 - Failed to remove host xxxxxxxx

Both fixes have already been merged into the latest 3.4.z.