Created attachment 647777 [details]
engine log

Description of problem:
A highly available VM that dies during live storage migration is not rerun; RunVm fails CanDoAction on the locked disk.

2012-11-19 16:02:05,194 WARN [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_DISKS_ARE_LOCKED,$diskAliases XP_iSCSI_Disk1

Version-Release number of selected component (if applicable):
si24.2

How reproducible:
100%

Steps to Reproduce:
1. Run an HA virtual server and start a live storage migration.
2. After the snapshot is created, kill -9 the VM's PID on the host.

Actual results:
The VM fails to rerun; RunVm fails CanDoAction because the disk is locked.

Expected results:
We should be able to rerun the VM.

Additional info:
2012-11-19 16:02:04,929 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-27) vm XP running in db and not running in vds - add to rerun treatment. vds gold-vdsd
2012-11-19 16:02:05,189 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] Lock Acquired to object EngineLock [exclusiveLocks= key: 3f2cb12f-2ffd-4381-81d9-872734db7c00 value: VM , sharedLocks= ]
2012-11-19 16:02:05,194 WARN [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_DISKS_ARE_LOCKED,$diskAliases XP_iSCSI_Disk1
2012-11-19 16:02:05,194 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-27) [6a568762] Lock freed to object EngineLock [exclusiveLocks= key: 3f2cb12f-2ffd-4381-81d9-872734db7c00 value: VM , sharedLocks= ]
The same happens for live snapshot:

2012-11-19 16:16:28,245 WARN [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-34) [6b4ac8e9] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT

Adding a log for that scenario as well.
Created attachment 647779 [details]
log
As far as I understand, the problem is that the HA rerun logic doesn't wait for the action to roll back and stops retrying.
Some questions to understand the problem and possible solutions. What is the behaviour of live storage migration / live snapshot when the VM fails during the process:
- Is there a rollback in every case, or are there cases where the failure is ignored?
- Is the rollback immediate, or does it wait for the tasks to end?
- Is it OK (i.e. safe/possible) to start the VM immediately, or do we need to wait for the rollback to end?
Live snapshot: you will end up with a normal snapshot, and an audit-log message that a live snapshot could not be performed.
Live storage migration: the action is rolled back. If the snapshot was already created, it will remain (since there is no rollback for live snapshot).
It sounds like we can re-run the VM even if the images are locked. Can you please verify this is also correct for the sync phase of live storage migration?
Omer: double-checked - you cannot safely restart a VM that had started syncing, since QEMU does not persist the sync state. You need to wait for the live storage migration rollback to finish, and then re-run the VM.
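The fix this implies can be sketched as a retry loop: instead of failing once on CanDoAction, the HA rerun path should keep retrying until the rollback releases the disk lock. This is a minimal, hypothetical sketch - none of these class or method names come from ovirt-engine, and the real engine's locking and scheduling are far more involved:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch (not actual ovirt-engine code): rerun an HA VM only
// after the live-storage-migration rollback has released the disk lock.
public class HaRerunSketch {
    // Stands in for the engine's disk-lock state; cleared when rollback ends.
    static final AtomicBoolean diskLocked = new AtomicBoolean(true);

    // Stands in for RunVmCommand's CanDoAction: refuse while the disk is locked.
    static boolean canDoActionRunVm() {
        return !diskLocked.get();
    }

    // Retry RunVm until the lock is freed or the attempts run out,
    // instead of giving up on the first CanDoAction failure.
    static boolean rerunHaVm(int maxAttempts) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (canDoActionRunVm()) {
                return true;        // rollback finished, VM can be rerun
            }
            Thread.sleep(10);       // back off before the next attempt
        }
        return false;               // gave up while the disk was still locked
    }

    public static void main(String[] args) throws Exception {
        // Simulate the rollback finishing on another thread after ~30 ms.
        new Thread(() -> {
            try { Thread.sleep(30); } catch (InterruptedException ignored) {}
            diskLocked.set(false);
        }).start();
        System.out.println(rerunHaVm(100));  // prints "true"
    }
}
```

The point of the loop is exactly comment 9's conclusion: the rerun must not be attempted while the sync is being rolled back, only once the lock is freed.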
Changing the title to reflect the scope of the issue.
Andrew, we need your feedback on comment 9, and on another question we have regarding the live snapshot scenario: since live snapshot is a quick operation, would it be OK to rerun an HA VM that went down during a live snapshot operation once the operation has finished?
http://gerrit.ovirt.org/#/c/10618/
http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=f4c43030f8c16b889d39fd6e71c9cef54539fe27
Verified on sf5.

There is a bug that prevents the VM from running when there is more than one snapshot (https://bugzilla.redhat.com/show_bug.cgi?id=903248). I am adding a comment on bug 903248 to test HA VMs with more than one snapshot once it is fixed, since it will prevent HA from starting.
3.2 has been released.