Bug 878041
| Field | Value |
|---|---|
| Summary | engine: rerun of HA vm fails when vm's pid is killed during live snapshot |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | ovirt-engine |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Version | 3.1.2 |
| Target Milestone | --- |
| Target Release | 3.2.0 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | virt |
| Fixed In Version | sf4 |
| Doc Type | Bug Fix |
| Reporter | Dafna Ron <dron> |
| Assignee | Arik <ahadas> |
| QA Contact | Dafna Ron <dron> |
| CC | abaron, acathrow, amureini, dyasny, hateya, iheim, italkohe, lpeer, michal.skrivanek, ofrenkel, Rhev-m-bugs, sgrinber, yeylon, ykaul |
| Keywords | ZStream |
| Type | Bug |
| Regression | --- |
| Story Points | --- |
| Clones | 891634 892638 (view as bug list) |
| Bug Blocks | 891634, 892638, 915537 |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
Same thing for live snapshot:

2012-11-19 16:16:28,245 WARN [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-34) [6b4ac8e9] CanDoAction of action RunVm failed. Reasons: VAR__ACTION__RUN, VAR__TYPE__VM, ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT

Adding a log for that as well. Created attachment 647779 [details]: log

AFAIU the problem is that HA doesn't wait for the action to roll back, and stops retrying. Some questions to understand the problem and possible solutions. What is the behaviour of live storage migration / live snapshot when the VM fails during the process:

1. Is there a rollback anyway (or are there cases where the failure is ignored)?
2. Is it immediate, or does it wait for the tasks to end?
3. Is it OK (i.e. safe/possible) to start the VM immediately, or do we need to wait for the rollback to end?

Live snapshot: you will end up with a normal snapshot, and an audit-log message that a live snapshot could not be performed. Live storage migration: the action is rolled back. If the snapshot was already created, it will remain (since there is no rollback for live snapshot).

Sounds like we can re-run the VM anyway, even if the images are locked. Can you please verify that this is also correct for the sync process during live storage migration?

Omer: Double-checked - you cannot safely restart a VM that had started syncing; QEMU does not persist it. You need to wait for live storage migration's rollback to finish, and then re-run the VM.

Changing title to reflect the scope of the issue.

Andrew, we need your feedback on comment 9 and on another question we have regarding the live snapshot scenario: since live snapshot is a quick operation, will it be OK to rerun an HA VM that went down during a live snapshot operation after the operation is finished?

Verified on sf5. There is a bug that prevents the VM from running when there is more than one snapshot (https://bugzilla.redhat.com/show_bug.cgi?id=903248). I am adding a comment on bug 903248 to test HA VMs with more than one snapshot once it is fixed, since it will prevent HA from starting.

3.2 has been released
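The direction implied by the discussion above is that the HA rerun logic should keep retrying while the disks are still locked by the rolling-back operation, instead of failing CanDoAction once and giving up. A minimal sketch of such a retry loop, assuming hypothetical `diskLocked` and `runVm` hooks (this is illustrative only, not the actual ovirt-engine code):

```java
import java.util.function.BooleanSupplier;

// Illustrative sketch: retry an HA VM rerun while its disks are still
// locked (e.g. by a live storage migration rollback in progress).
// diskLocked and runVm are hypothetical stand-ins for engine internals.
public class HaRerunSketch {
    static boolean rerunWithRetry(BooleanSupplier diskLocked,
                                  Runnable runVm,
                                  int maxAttempts,
                                  long delayMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!diskLocked.getAsBoolean()) {
                runVm.run();          // CanDoAction would now pass
                return true;
            }
            // Disks still locked (rollback in progress): wait and retry
            // instead of failing once and giving up, as described above.
            Thread.sleep(delayMillis);
        }
        return false;                 // rollback never finished in time
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a rollback that releases the disk lock after two polls.
        final int[] polls = {0};
        boolean started = rerunWithRetry(
            () -> ++polls[0] < 3,     // locked for the first two checks
            () -> System.out.println("VM rerun submitted"),
            5, 10);
        System.out.println("started=" + started);
    }
}
```

The key design point from comment 9 is that the wait is mandatory for live storage migration (QEMU does not persist the sync state), so an immediate single-shot rerun cannot be correct there.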
Created attachment 647777 [details]: engine log

Description of problem:
A highly available VM that dies during live storage migration is not rerun; RunVm fails CanDoAction on a locked disk.

2012-11-19 16:02:05,194 WARN [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] CanDoAction of action RunVm failed. Reasons: VAR__ACTION__RUN, VAR__TYPE__VM, ACTION_TYPE_FAILED_DISKS_ARE_LOCKED, $diskAliases XP_iSCSI_Disk1

Version-Release number of selected component (if applicable): si24.2

How reproducible: 100%

Steps to Reproduce:
1. Run an HA virtual server and start a live storage migration
2. After the snapshot is created, kill -9 the VM's pid on the host
3.

Actual results:
The VM fails to run with a CanDoAction failure for a locked disk.

Expected results:
We should be able to run the VM.

Additional info: log

2012-11-19 16:02:04,929 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-27) vm XP running in db and not running in vds - add to rerun treatment. vds gold-vdsd
2012-11-19 16:02:05,189 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] Lock Acquired to object EngineLock [exclusiveLocks= key: 3f2cb12f-2ffd-4381-81d9-872734db7c00 value: VM , sharedLocks= ]
2012-11-19 16:02:05,194 WARN [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] CanDoAction of action RunVm failed. Reasons: VAR__ACTION__RUN, VAR__TYPE__VM, ACTION_TYPE_FAILED_DISKS_ARE_LOCKED, $diskAliases XP_iSCSI_Disk1
2012-11-19 16:02:05,194 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] Lock freed to object EngineLock [exclusiveLocks= key: 3f2cb12f-2ffd-4381-81d9-872734db7c00 value: VM , sharedLocks= ]