Bug 1540107 - The engine fails to delete an existing external VM (via REST APIs) with 'Operation Failed: [Desktop does not exist]'
Summary: The engine fails to delete an existing external VM (via REST APIs) with 'Operation Failed: [Desktop does not exist]'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.1.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.2
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 1455169 1458709 1526753
 
Reported: 2018-01-30 09:56 UTC by Simone Tiraboschi
Modified: 2018-03-29 11:13 UTC (History)
4 users

Fixed In Version: ovirt-engine-4.2.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-29 11:13:09 UTC
oVirt Team: Integration
rule-engine: ovirt-4.2+
rule-engine: exception+
sbonazzo: devel_ack+


Attachments
alma04 logs (10.96 MB, application/x-xz)
2018-02-19 15:43 UTC, Nikolai Sednev


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 86954 master MERGED ansible: localvm: enforce status before deleting 2020-07-12 11:51:23 UTC
oVirt gerrit 87235 ovirt-hosted-engine-setup-2.2 MERGED ansible: localvm: enforce status before deleting 2020-07-12 11:51:23 UTC
oVirt gerrit 87949 master MERGED ansible: try deleting the bootstrap VM only if there 2020-07-12 11:51:23 UTC
oVirt gerrit 87962 ovirt-hosted-engine-setup-2.2 MERGED ansible: try deleting the bootstrap VM only if there 2020-07-12 11:51:23 UTC

Description Simone Tiraboschi 2018-01-30 09:56:56 UTC
Description of problem:
In the node-zero hosted-engine setup we bootstrap the system from a local VM, and the engine imports it as an external VM.
At the end we move the disk of that VM over the disk of a VM created by the engine (and we restart it from there), so we end up with two VMs in the DB, and we try to delete the initial bootstrap local VM by its uuid.
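
For context, the clean-up boils down to a call like the following against the engine's REST API. A minimal sketch with the Python SDK (the URL and credentials are placeholders; in the real flow an ansible module drives this and stops the VM before removing it):

import ovirtsdk4 as sdk

# Connect to the engine API (placeholder URL and credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

# Locate the bootstrap VM by its uuid and ask the engine to remove it.
# In the failing runs it is the stop issued first (the engine still sees
# the VM as Up) that errors out with 'Desktop does not exist'.
vm_service = connection.system_service().vms_service().vm_service(
    '96ebce44-c24e-40e9-9cd9-91048b3091b2'
)
vm_service.remove()
connection.close()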

In the logs we see:

2018-01-29 21:30:26,399-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [41938cfc] VM '96ebce44-c24e-40e9-9cd9-91048b3091b2'(external-HostedEngineLocal) moved from 'Up' --> 'Down'
2018-01-29 21:30:26,561-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [41938cfc] EVENT_ID: VM_DOWN(61), VM external-HostedEngineLocal is down. Exit message: User shut down from within the guest

So external-HostedEngineLocal certainly exists, with uuid '96ebce44-c24e-40e9-9cd9-91048b3091b2'.

Then we try to delete it from the REST API:
2018-01-29 21:30:26,690-05 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 43974d05


but the destroy VM command fails:
2018-01-29 21:30:26,690-05 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 43974d05
2018-01-29 21:30:26,695-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] START, DestroyVDSCommand(HostName = lago-he-basic-ansible-suite-master-host0, DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 5134ff4f
2018-01-29 21:30:26,717-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Destroy VM couldn't find VM '96ebce44-c24e-40e9-9cd9-91048b3091b2'. If after Cancel Migration and VM is UP on source, ignore next error, it's just a clean-up call
2018-01-29 21:30:26,717-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Failed in 'DestroyVDS' method
2018-01-29 21:30:26,769-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM lago-he-basic-ansible-suite-master-host0 command DestroyVDS failed: Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}
2018-01-29 21:30:26,769-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=1, message=Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}]]'
2018-01-29 21:30:26,770-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] HostName = lago-he-basic-ansible-suite-master-host0
2018-01-29 21:30:26,770-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Command 'DestroyVDSCommand(HostName = lago-he-basic-ansible-suite-master-host0, DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}, code = 1
2018-01-29 21:30:26,770-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] FINISH, DestroyVDSCommand, log id: 5134ff4f
2018-01-29 21:30:26,771-05 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Failed to destroy VM '96ebce44-c24e-40e9-9cd9-91048b3091b2' in VDS = '3c8f2061-6fb8-4707-8a88-c05b12a92f45' , error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}, code = 1'
2018-01-29 21:30:26,771-05 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] FINISH, DestroyVmVDSCommand, log id: 43974d05
2018-01-29 21:30:26,772-05 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}, code = 1 (Failed with error noVM and code 1)
2018-01-29 21:30:27,207-05 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [41938cfc] VMs initialization finished for Host: 'lago-he-basic-ansible-suite-master-host0:3c8f2061-6fb8-4707-8a88-c05b12a92f45'
2018-01-29 21:30:27,291-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM external-HostedEngineLocal (Host: lago-he-basic-ansible-suite-master-host0, User: admin@internal-authz).

and in the end we get a strange error message:

2018-01-29 21:30:27,437-05 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-12) [] Operation Failed: [Desktop does not exist]


all the logs are visible here:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-master/20/artifact/exported-artifacts/test_logs/he-basic-ansible-suite-master/post-he_deploy/lago-he-basic-ansible-suite-master-engine/_var_log/ovirt-engine/engine.log



Version-Release number of selected component (if applicable):


How reproducible:
Sometimes it works as expected, sometimes it doesn't; exact conditions TBD.

Steps to Reproduce:
1. deploy hosted-engine via ansible

Actual results:
Operation Failed: [Desktop does not exist]

Expected results:
it successfully deletes external-HostedEngineLocal by its uuid

Additional info:

Comment 1 Simone Tiraboschi 2018-01-30 14:20:15 UTC
Maybe it could be related to https://gerrit.ovirt.org/#/c/86689/

Comment 2 Arik 2018-01-30 15:11:42 UTC
(In reply to Simone Tiraboschi from comment #1)
> Maybe it could be related to https://gerrit.ovirt.org/#/c/86689/

That's a race that is hard to avoid.
Before 21:30:26,561-05 the VM was shut down from within the guest.
At 2018-01-29 21:30:26,629-05 the engine processes a request to stop the VM; it probably still sees the VM as UP at that point, so it tries to destroy it, but VDSM reports back that the VM is not running on the host.

There's a kind of philosophical question here: if I ask to stop a VM and the VM is already down, did the operation succeed or not?
That's debatable, but someone decided that it means failure. The question is what the implication is for your use case: can you simply check the status of the VM after the call to stop it?
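
A minimal sketch of that check with the Python SDK (hypothetical helper; the polling interval and timeout are arbitrary): treat a failed stop as success as long as the engine reports the VM as Down shortly afterwards.

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

def stop_and_verify(vm_service, timeout=60):
    """Issue a stop, then trust the reported status rather than the
    return code, since the stop may race with a guest-side shutdown."""
    try:
        vm_service.stop()
    except sdk.Error:
        pass  # possibly 'Desktop does not exist'; verify via status below
    deadline = time.time() + timeout
    while time.time() < deadline:
        if vm_service.get().status == types.VmStatus.DOWN:
            return True
        time.sleep(2)
    return False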

Comment 3 Simone Tiraboschi 2018-01-30 17:00:31 UTC
(In reply to Arik from comment #2)
> There's a kind of philosophical question here: if I ask to stop a VM and the
> VM is already down, did the operation succeed or not?
> That's debatable, but someone decided that it means failure. The question is
> what the implication is for your use case: can you simply check the status
> of the VM after the call to stop it?

I don't need to stop it: the VM is down for sure.
I want to delete it from the engine, and I'm using an ansible module which should in theory be idempotent.

https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/ansible/create_target_vm.yml#L310

In the node-zero flow we start an engine on a local VM and we use that engine to create the storage domain, deploy the host and create the engine VM.
At the end we shut down the local VM and copy its disk over the disk of the VM created by the engine.

The initial bootstrap VM is then just a leftover in the DB.
How can we safely delete it?

Comment 4 Michal Skrivanek 2018-01-31 08:27:27 UTC
You try to delete a VM over the REST API, and it sometimes returns an error because it tries to stop the VM, but the VM stops in the meantime and the internal destroy operation fails? Is that correct?

Why don't you ignore that specific error code? Or just check that the VM is not there afterwards?

Comment 5 Simone Tiraboschi 2018-01-31 08:42:58 UTC
(In reply to Michal Skrivanek from comment #4)
> You try to delete a VM over the REST API, and it sometimes returns an error
> because it tries to stop the VM, but the VM stops in the meantime and the
> internal destroy operation fails? Is that correct?

At that time it's certainly stopped, no doubt, but maybe the engine has yet to realize it, hence the error.

> Why don't you ignore that specific error code? Or just check that the VM is
> not there afterwards?

So: try stopping it while ignoring errors on the stop command, and then delete it as we do now?

Comment 6 Michal Skrivanek 2018-01-31 09:12:42 UTC
Who is stopping it in the first place?
If that's your scripts, then I would suggest using the API like this (sketched right after the list):
1) shutdown VM
2) wait for status being "Down" with a timeout
3) delete VM
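
Sketched with the Python SDK (illustrative only: the uuid and connection parameters are placeholders), that sequence would look roughly like this:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

BOOTSTRAP_VM_UUID = '96ebce44-c24e-40e9-9cd9-91048b3091b2'  # placeholder uuid

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
vm_service = connection.system_service().vms_service().vm_service(BOOTSTRAP_VM_UUID)

# 1) shutdown VM (tolerating an error if it is already down)
try:
    vm_service.shutdown()
except sdk.Error:
    pass

# 2) wait for status being "Down", with a timeout
deadline = time.time() + 300
while vm_service.get().status != types.VmStatus.DOWN:
    if time.time() > deadline:
        raise RuntimeError('timed out waiting for the VM to go Down')
    time.sleep(5)

# 3) delete VM
vm_service.remove()
connection.close()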

Comment 7 Simone Tiraboschi 2018-01-31 09:44:56 UTC
(In reply to Michal Skrivanek from comment #6)
> Who is stopping it in the first place?

hosted-engine-setup, and the engine was running on that VM.

> If that's your scripts, then I would suggest using the API like this:
> 1) shutdown VM
> 2) wait for status being "Down" with a timeout
> 3) delete VM

OK

Comment 8 Yaniv Kaul 2018-02-01 09:51:50 UTC
(In reply to Simone Tiraboschi from comment #7)
> (In reply to Michal Skrivanek from comment #6)
> > Who is stopping it in the first place?
> 
> hosted-engine-setup and the engine was on that VM.
> 
> > If that's your scripts, then I would suggest using the API like this:
> > 1) shutdown VM
> > 2) wait for status being "Down" with a timeout
> > 3) delete VM
> 
> OK

So when will it delete 'itself'? When it boots back up from shared storage?

Comment 9 Simone Tiraboschi 2018-02-01 10:30:39 UTC
(In reply to Yaniv Kaul from comment #8)
> So when will it delete 'itself'? When it boots back up from shared storage?

Yes; otherwise, if we delete the external local VM while the engine is still there, we hit a race: we'd be trying to shut down the engine before it imports the external VM again, since it's still running.

On the other hand, deleting it from the engine running on the shared storage can fail, as in this bug, if the engine still thinks that the VM is up.
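
The gerrit patches linked above address this on the ansible side ('enforce status before deleting', and attempting the deletion only if the VM is still there). A rough Python-SDK equivalent of that idempotent clean-up (hypothetical helper; the 404 check is an assumption about how a missing VM surfaces through sdk.Error):

import ovirtsdk4 as sdk

def remove_bootstrap_vm(vms_service, vm_id):
    """Idempotent clean-up: delete the bootstrap VM only if the engine
    still knows about it, treating 'not found' as success."""
    vm_service = vms_service.vm_service(vm_id)
    try:
        vm_service.get()
    except sdk.Error as e:
        if getattr(e, 'code', None) == 404:
            return  # already gone: nothing to clean up
        raise
    vm_service.remove()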

Comment 10 Nikolai Sednev 2018-02-19 14:05:55 UTC
Due to https://bugzilla.redhat.com/show_bug.cgi?id=1527866#c4, moving to verified, as the external VM in Node 0 is now properly removed on the latest components:
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-693.19.1.el7.x86_64 #1 SMP Thu Feb 1 12:34:44 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 11 Nikolai Sednev 2018-02-19 15:42:32 UTC
Moving back to assigned, as I've just reproduced it on my new NFS Node 0 deployment with these components:
rhvm-appliance-4.2-20180202.0.el7.noarch
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
Linux 3.10.0-693.19.1.el7.x86_64 #1 SMP Thu Feb 1 12:34:44 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Logs from host attached.

Comment 12 Nikolai Sednev 2018-02-19 15:43:13 UTC
Created attachment 1397914 [details]
alma04 logs

Comment 13 Nikolai Sednev 2018-02-21 15:22:56 UTC
NFS deployment was successful, although with these errors:
[ ERROR ] Error: Fault reason is "Operation Failed". Fault detail is "[Desktop does not exist]". HTTP response code is 400.
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Desktop does not exist]\". HTTP response code is 400."}
For more details please see https://bugzilla.redhat.com/attachment.cgi?bugid=1547595 .

Works for me on these components:
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux alma03.qa.lab.tlv.redhat.com 3.10.0-851.el7.x86_64 #1 SMP Mon Feb 12 07:53:52 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

Moving to verified.

Comment 14 Sandro Bonazzola 2018-03-29 11:13:09 UTC
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

