Description of problem:

An HA VM ended up running on 2 hosts after the engine thought that an apparently successful migration had failed.

The engine reported the following:

2015-01-01 23:11:25,886 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-50) [6eae6da7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM vm-gfw failed. It will be restarted automatically.
2015-01-01 23:11:25,886 INFO [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-50) [6eae6da7] Highly Available VM went down. Attempting to restart. VM Name: vm-gfw, VM Id: 004e9a3e-a3e2-480f-b757-1bdb72d67555

The VM was then restarted on another host (since the VM was HA):

2015-01-01 23:13:09,895 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-61) [78ee5a99] RefreshVmList vm id 004e9a3e-a3e2-480f-b757-1bdb72d67555 status = PoweringUp on vds host-C ignoring it in the refresh until migration is done

However, on the destination host the migration was successful and the VM was up and running, and on the source host the migration completed successfully and the VM was destroyed.

Version-Release number of selected component (if applicable):
RHEV 3.3.4
RHEL 6.5 hosts with vdsm-4.14.7-3

How reproducible:
Only seen once so far.

Steps to Reproduce:
1.
2.
3.

Actual results:
The VM was seen to have "failed" and was restarted, ending up running on two hosts.

Expected results:
The migration should have been handled as a successful one.

Additional info:
Created attachment 987834 [details]
vdsm log from host 'h0080d'
The issue here is that the "migrating_to" field of the VM held the wrong host id (in this case, the id of the source host itself). When the migration later succeeded, the hand-over process updated the "run_on" field with that wrong id (the source's), making the engine think the VM was missing (because it was no longer running on the source host), and therefore restarting it because it is HA.

This was solved by fixing the retry timing of maintenance in Bug 1104030 (Failed VM migrations do not release VM resource lock properly leading to failures in subsequent migration attempts) and by clearing stale migration information in Bug 1112359 (Failed to remove host xxxxxxxx).

Both bugs are already merged to the latest 3.4.z.
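The failure mode above can be sketched in a few lines. This is a minimal, hypothetical model (the function and field names mirror the comment, not the actual ovirt-engine code): if "migrating_to" is mistakenly populated with the source host's id, the hand-over writes that id into "run_on", monitoring then fails to find the VM on its recorded host, and the HA logic restarts a VM that is in fact running elsewhere.

```python
# Hypothetical sketch of the hand-over described above, not real engine code.

def hand_over(vm, vms_running_on):
    """On migration success, move the VM to the host recorded in migrating_to."""
    vm["run_on"] = vm["migrating_to"]  # if migrating_to is wrong, run_on is wrong
    # Monitoring then looks for the VM on its recorded host:
    if vm["id"] not in vms_running_on[vm["run_on"]]:
        return "restart_ha_vm"  # engine believes the VM went down
    return "vm_up"

# Correct case: migrating_to holds the destination host id.
vm_ok = {"id": "004e9a3e", "run_on": "host-A", "migrating_to": "host-B"}
# Buggy case: migrating_to was set to the source host itself.
vm_bad = {"id": "004e9a3e", "run_on": "host-A", "migrating_to": "host-A"}
# After the migration, the VM actually runs only on the destination, host-B.
running = {"host-A": set(), "host-B": {"004e9a3e"}}

print(hand_over(vm_ok, running))   # vm_up
print(hand_over(vm_bad, running))  # restart_ha_vm -> second copy of an HA VM
```

In the buggy case the VM is healthy on host-B, but because "run_on" points at host-A the engine treats it as down and restarts it, which matches the "HA VM running on 2 hosts" symptom in the description.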