Bug 1245630 - [RFE] VM.delete() should wait for snapshot deletion
Summary: [RFE] VM.delete() should wait for snapshot deletion
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: RFEs
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: bugs@ovirt.org
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-22 12:27 UTC by nicolas
Modified: 2019-04-28 09:53 UTC
CC List: 10 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-07-06 13:04:45 UTC
oVirt Team: Virt
Embargoed:
ylavi: ovirt-future?
ylavi: planning_ack?
ylavi: devel_ack?
ylavi: testing_ack?


Attachments
ovirt-engine.log when .delete() method is issued (17.76 KB, text/plain)
2015-07-22 12:27 UTC, nicolas

Description nicolas 2015-07-22 12:27:40 UTC
Created attachment 1054822 [details]
ovirt-engine.log when .delete() method is issued

Description of problem:

We're currently running the following code inside a Django app:

vms2del = []
for vm in kvm.vms.list():
    pool = vm.get_vmpool()
    if pool:
        vm.shutdown()
        vm.shutdown()      # 2 times to force an 'ungraceful' shutdown
        vm.detach()
        vms2del.append(vm)

while vms2del:
    vm = vms2del.pop()
    vm.delete(async=True)

Version-Release number of selected component (if applicable):
3.5.2.1

Steps to reproduce:
1. A pool is created with (assuming kvm is an API object):
   pool = params.VmPool(
       name=resource,
       cluster=kvm.clusters.get(name=settings.OVIRTCLUSTERNAME),
       template=kvm.templates.get(name=ovirt_template),
       size=num_vms,
       prestarted_vms=num_vms,
       max_user_vms=ovirt_max_vm,
   )
   kvm.add(pool)
2. The pool is edited via the webadmin and saved without changes to trigger the prestarting process for the VMs.
3. The pool is deleted via the above snippet.
4. Machines are detached but not deleted from oVirt.

How reproducible:
Always

Actual results:
When the .delete() method is invoked, no deletion happens and a "Failed to log User null@N/A out." message is shown in the event board.

Expected results:
The VMs should be deleted.

Additional info:
Log attached from running the snippet above (the .delete() method is called with async=True, but calling it without the parameter makes no difference).

The relevant line seems to be: 2015-07-22 13:04:17,661 WARN  [org.ovirt.engine.core.bll.RemoveVmCommand] (org.ovirt.thread.pool-8-thread-4) [69ed9ff2] CanDoAction of action RemoveVm failed for user admin@internal. Reasons: VAR__ACTION__REMOVE,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT

However, no snapshot has ever been made on that machine.

Comment 1 Juan Hernández 2015-07-22 13:13:49 UTC
What version of the server are you using? rpm -q ovirt-engine

Comment 2 nicolas 2015-07-22 14:41:37 UTC
Sorry, forgot to mention. The ovirt-engine version is 3.5.3.1-1

Comment 3 Juan Hernández 2015-07-22 14:46:14 UTC
When you use an "automatic" pool, each VM is internally created as a snapshot of the template. Then, when you shut down the VM, the system has to revert it to its original state, so it has to delete the snapshot. This snapshot deletion happens asynchronously, regardless of how you call the RESTAPI. As you are removing many VMs there may be many of these operations running, and they can take a long time; thus, when you try to delete the first VM it may still be busy removing its snapshot. I'd suggest that you explicitly wait until the VM is actually "down" before trying to delete it, something like this:

  vm_id = vm.get_id()
  while True:
    vm = api.vms.get(id=vm_id)
    if vm is not None and vm.get_status().get_state() == "down":
       vm.delete()
    time.sleep(1)

Not sure if this is going to solve your problem, as I can't reproduce it, but it is worth trying. Please try it and report the results.

Comment 4 Juan Hernández 2015-07-22 14:46:54 UTC
Forgot the "break" after "vm.delete()":

  vm_id = vm.get_id()
  while True:
    vm = api.vms.get(id=vm_id)
    if vm is not None and vm.get_status().get_state() == "down":
       vm.delete()
       break
    time.sleep(1)

Comment 5 nicolas 2015-07-23 10:43:25 UTC
I confirm that checking the VM's status and waiting for it to be "down" before deleting works.

However, since you mention that the snapshot deletion happens asynchronously, I think it would be a great improvement for the .delete() method, when called with the parameter async=True, to check the state of the VM and run the deletion once it becomes possible, so the user doesn't need to check the VM state themselves.

Could this be implemented?
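
For reference, a client-side approximation of the behavior requested here could look like the sketch below. This is only an illustration, not part of the SDK or the engine: the helper name, the timeout/poll values, and the use of RequestError from the v3 Python SDK (ovirtsdk.infrastructure.errors) are assumptions.

  import time

  # Assumption: RequestError is the generic error the v3 Python SDK raises when
  # the engine refuses an action (e.g. ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT).
  from ovirtsdk.infrastructure.errors import RequestError

  def delete_when_possible(api, vm_id, timeout=300, poll=2):
      """Hypothetical helper: wait until the VM is 'down', then keep retrying
      delete() while the internal snapshot removal is still in flight."""
      deadline = time.time() + timeout
      while time.time() < deadline:
          vm = api.vms.get(id=vm_id)
          if vm is None:
              return True                # VM is already gone
          if vm.get_status().get_state() == "down":
              try:
                  vm.delete()
                  return True
              except RequestError:
                  pass                   # engine still busy with the snapshot; retry later
          time.sleep(poll)
      return False

  # Usage (hypothetical), with the api object and a vm from the snippet above:
  #   delete_when_possible(api, vm.get_id())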

Comment 6 Juan Hernández 2015-07-23 11:57:36 UTC
I agree, removing the VM should probably take this into account and wait till these internally created snapshots are removed, regardless of the value of the "async" parameter.

Allon, not sure if this is storage or virt, your take?

Comment 7 Allon Mureinik 2015-07-23 13:50:22 UTC
(In reply to Juan Hernández from comment #6)
> I agree, removing the VM should probably take this into account and wait
> till these internally created snapshots are removed
This differs from how the system generally behaves. 
Today, any action that cannot succeed because another action is in flight simply fails - there's no "action queue" or anything like that.
I'm not against it per se, but if we go that route we should consider it in a wider context, not just removing VMs.

> Allon, not sure if this is storage or virt, your take?
Sounds more like virt than storage off the top of my head.

Comment 8 Michal Skrivanek 2015-12-10 11:33:43 UTC
I'd like to avoid queuing, but if the snapshot operation is interruptible, we can stop it on a delete() request and immediately remove the VM.
Allon, is that currently possible?

Comment 9 Allon Mureinik 2015-12-10 11:40:23 UTC
Not in the current infrastructure.

Comment 10 Martin Tessun 2018-07-06 13:04:45 UTC
As there is an easy workaround here (see c#3 and c#4) and there are some concerns about queuing, closing this one.

If you feel this is really needed, please reopen with a justification.
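
For reference, the workaround from c#3 and c#4 in a self-contained form, applied to the teardown loop from the description (a sketch assuming the same v3 Python SDK objects used above, with kvm as the API connection):

  import time

  vms2del = []
  for vm in kvm.vms.list():
      if vm.get_vmpool():
          vm.shutdown()
          vm.shutdown()      # 2 times to force an 'ungraceful' shutdown
          vm.detach()
          vms2del.append(vm)

  while vms2del:
      vm_id = vms2del.pop().get_id()
      while True:
          # As in c#4, this polls without a timeout; the VM is expected to
          # eventually reach the 'down' state after the shutdown above.
          vm = kvm.vms.get(id=vm_id)
          if vm is not None and vm.get_status().get_state() == "down":
              vm.delete()
              break
          time.sleep(1)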

