Created attachment 647777 [details]
engine log

Description of problem:
A highly available VM that dies during live storage migration is not rerun; RunVm fails CanDoAction on the locked disk.

2012-11-19 16:02:05,194 WARN [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_DISKS_ARE_LOCKED,$diskAliases XP_iSCSI_Disk1

Version-Release number of selected component (if applicable):
si24.2

How reproducible:
100%

Steps to Reproduce:
1. Run an HA virtual server and start a live storage migration.
2. After the snapshot is created, kill -9 the VM's PID on the host.

Actual results:
The VM fails to rerun; RunVm fails CanDoAction because the disk is locked.

Expected results:
We should be able to rerun the VM.

Additional info:
2012-11-19 16:02:04,929 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-27) vm XP running in db and not running in vds - add to rerun treatment. vds gold-vdsd
2012-11-19 16:02:05,189 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] Lock Acquired to object EngineLock [exclusiveLocks= key: 3f2cb12f-2ffd-4381-81d9-872734db7c00 value: VM , sharedLocks= ]
2012-11-19 16:02:05,194 WARN [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_DISKS_ARE_LOCKED,$diskAliases XP_iSCSI_Disk1
2012-11-19 16:02:05,194 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] Lock freed to object EngineLock [exclusiveLocks= key: 3f2cb12f-2ffd-4381-81d9-872734db7c00 value: VM , sharedLocks= ]
The same happens for live snapshot:

2012-11-19 16:16:28,245 WARN [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-34) [6b4ac8e9] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT

Adding a log for that scenario as well.
Created attachment 647779 [details]
log
As far as I understand, the problem is that the HA rerun logic doesn't wait for the action to roll back and stops retrying.
Some questions to understand the problem and possible solutions. What is the behaviour of live storage migration / live snapshot when the VM fails during the process:
- Is there a rollback in every case, or are there cases where the failure is ignored?
- Is the rollback immediate, or does it wait for the tasks to end?
- Is it OK (i.e. safe/possible) to start the VM immediately, or do we need to wait for the rollback to end?
Live snapshot: you will end up with a normal snapshot, and an audit-log message that a live snapshot could not be performed.
Live storage migration: the action is rolled back. If the snapshot was already created, it will remain (since there is no rollback for live snapshot).
It sounds like we can re-run the VM even if the images are locked. Can you please verify this is also correct for the sync phase of live storage migration?
Omer: double-checked - you cannot safely restart a VM that had started syncing, since QEMU does not persist the sync state. You need to wait for the live storage migration rollback to finish, and then re-run the VM.
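The fix this implies can be sketched as a retry loop: instead of failing once on CanDoAction, the HA rerun path should keep retrying until the rollback releases the disk lock. This is a minimal, hypothetical sketch - none of these class or method names come from ovirt-engine, and the real engine's locking and scheduling are far more involved:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch (not actual ovirt-engine code): rerun an HA VM only
// after the live-storage-migration rollback has released the disk lock.
public class HaRerunSketch {
    // Stands in for the engine's disk-lock state; cleared when rollback ends.
    static final AtomicBoolean diskLocked = new AtomicBoolean(true);

    // Stands in for RunVmCommand's CanDoAction: refuse while the disk is locked.
    static boolean canDoActionRunVm() {
        return !diskLocked.get();
    }

    // Retry RunVm until the lock is freed or the attempts run out,
    // instead of giving up on the first CanDoAction failure.
    static boolean rerunHaVm(int maxAttempts) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (canDoActionRunVm()) {
                return true;        // rollback finished, VM can be rerun
            }
            Thread.sleep(10);       // back off before the next attempt
        }
        return false;               // gave up while the disk was still locked
    }

    public static void main(String[] args) throws Exception {
        // Simulate the rollback finishing on another thread after ~30 ms.
        new Thread(() -> {
            try { Thread.sleep(30); } catch (InterruptedException ignored) {}
            diskLocked.set(false);
        }).start();
        System.out.println(rerunHaVm(100));  // prints "true"
    }
}
```

The point of the loop is exactly comment 9's conclusion: the rerun must not be attempted while the sync is being rolled back, only once the lock is freed.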
Changing the title to reflect the scope of the issue.
Andrew, we need your feedback on comment 9, and on another question we have regarding the live snapshot scenario: since live snapshot is a quick operation, would it be OK to rerun an HA VM that went down during a live snapshot operation once the operation has finished?
http://gerrit.ovirt.org/#/c/10618/
http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=f4c43030f8c16b889d39fd6e71c9cef54539fe27
Verified on sf5.

There is a bug that prevents the VM from running when there is more than one snapshot (https://bugzilla.redhat.com/show_bug.cgi?id=903248). I am adding a comment on bug 903248 to test HA VMs with more than one snapshot once it is fixed, since it will prevent HA from starting.
3.2 has been released.