Bug 1540107 - The engine fails to delete an existing external VM (via REST APIs) with 'Operation Failed: [Desktop does not exist]'
Summary: The engine fails to delete an existing external VM (via REST APIs) with 'Operation Failed: [Desktop does not exist]'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.1.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.2
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 1455169 1458709 1526753
 
Reported: 2018-01-30 09:56 UTC by Simone Tiraboschi
Modified: 2018-03-29 11:13 UTC (History)
4 users

Fixed In Version: ovirt-engine-4.2.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-29 11:13:09 UTC
oVirt Team: Integration
rule-engine: ovirt-4.2+
rule-engine: exception+
sbonazzo: devel_ack+


Attachments
alma04 logs (10.96 MB, application/x-xz)
2018-02-19 15:43 UTC, Nikolai Sednev


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 86954 master MERGED ansible: localvm: enforce status before deleting 2020-07-12 11:51:23 UTC
oVirt gerrit 87235 ovirt-hosted-engine-setup-2.2 MERGED ansible: localvm: enforce status before deleting 2020-07-12 11:51:23 UTC
oVirt gerrit 87949 master MERGED ansible: try deleting the bootstrap VM only if there 2020-07-12 11:51:23 UTC
oVirt gerrit 87962 ovirt-hosted-engine-setup-2.2 MERGED ansible: try deleting the bootstrap VM only if there 2020-07-12 11:51:23 UTC

Description Simone Tiraboschi 2018-01-30 09:56:56 UTC
Description of problem:
In the node-zero hosted-engine setup we bootstrap the system from a local VM, and the engine imports it as an external VM.
At the end we move the disk of that VM over the disk of a VM created by the engine (and we restart it from there), so we end up with two VMs in the DB, and we try to delete the initial bootstrap local VM by its uuid.
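
For context, the clean-up boils down to a call like the following against the engine's REST API. A minimal sketch with the Python SDK (the URL and credentials are placeholders; in the real flow an ansible module drives this and stops the VM before removing it):

import ovirtsdk4 as sdk

# Connect to the engine API (placeholder URL and credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

# Locate the bootstrap VM by its uuid and ask the engine to remove it.
# In the failing runs it is the stop issued first (the engine still sees
# the VM as Up) that errors out with 'Desktop does not exist'.
vm_service = connection.system_service().vms_service().vm_service(
    '96ebce44-c24e-40e9-9cd9-91048b3091b2'
)
vm_service.remove()
connection.close()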

In the logs we see:

2018-01-29 21:30:26,399-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [41938cfc] VM '96ebce44-c24e-40e9-9cd9-91048b3091b2'(external-HostedEngineLocal) moved from 'Up' --> 'Down'
2018-01-29 21:30:26,561-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [41938cfc] EVENT_ID: VM_DOWN(61), VM external-HostedEngineLocal is down. Exit message: User shut down from within the guest

So external-HostedEngineLocal certainly exists, with uuid '96ebce44-c24e-40e9-9cd9-91048b3091b2'.

Then we try to delete it from the REST API:
2018-01-29 21:30:26,690-05 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 43974d05


but the destroy VM command fails:
2018-01-29 21:30:26,690-05 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 43974d05
2018-01-29 21:30:26,695-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] START, DestroyVDSCommand(HostName = lago-he-basic-ansible-suite-master-host0, DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 5134ff4f
2018-01-29 21:30:26,717-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Destroy VM couldn't find VM '96ebce44-c24e-40e9-9cd9-91048b3091b2'. If after Cancel Migration and VM is UP on source, ignore next error, it's just a clean-up call
2018-01-29 21:30:26,717-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Failed in 'DestroyVDS' method
2018-01-29 21:30:26,769-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM lago-he-basic-ansible-suite-master-host0 command DestroyVDS failed: Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}
2018-01-29 21:30:26,769-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=1, message=Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}]]'
2018-01-29 21:30:26,770-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] HostName = lago-he-basic-ansible-suite-master-host0
2018-01-29 21:30:26,770-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Command 'DestroyVDSCommand(HostName = lago-he-basic-ansible-suite-master-host0, DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}, code = 1
2018-01-29 21:30:26,770-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] FINISH, DestroyVDSCommand, log id: 5134ff4f
2018-01-29 21:30:26,771-05 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Failed to destroy VM '96ebce44-c24e-40e9-9cd9-91048b3091b2' in VDS = '3c8f2061-6fb8-4707-8a88-c05b12a92f45' , error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}, code = 1'
2018-01-29 21:30:26,771-05 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] FINISH, DestroyVmVDSCommand, log id: 43974d05
2018-01-29 21:30:26,772-05 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}, code = 1 (Failed with error noVM and code 1)
2018-01-29 21:30:27,207-05 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [41938cfc] VMs initialization finished for Host: 'lago-he-basic-ansible-suite-master-host0:3c8f2061-6fb8-4707-8a88-c05b12a92f45'
2018-01-29 21:30:27,291-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM external-HostedEngineLocal (Host: lago-he-basic-ansible-suite-master-host0, User: admin@internal-authz).

and in the end we get a strange error message:

2018-01-29 21:30:27,437-05 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-12) [] Operation Failed: [Desktop does not exist]


all the logs are visible here:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-master/20/artifact/exported-artifacts/test_logs/he-basic-ansible-suite-master/post-he_deploy/lago-he-basic-ansible-suite-master-engine/_var_log/ovirt-engine/engine.log



Version-Release number of selected component (if applicable):


How reproducible:
Sometimes it works as expected, sometimes it doesn't; exact conditions TBD.

Steps to Reproduce:
1. deploy hosted-engine via ansible

Actual results:
Operation Failed: [Desktop does not exist]

Expected results:
it successfully deletes external-HostedEngineLocal by its uuid

Additional info:

Comment 1 Simone Tiraboschi 2018-01-30 14:20:15 UTC
Maybe it could be related to https://gerrit.ovirt.org/#/c/86689/

Comment 2 Arik 2018-01-30 15:11:42 UTC
(In reply to Simone Tiraboschi from comment #1)
> Maybe it could be related to https://gerrit.ovirt.org/#/c/86689/

That's a race that is hard to avoid.
Before 21:30:26,561-05 the VM was shut down from within the guest.
At 2018-01-29 21:30:26,629-05 the engine processes a request to stop the VM; it probably still sees the VM as UP at that point, so it tries to destroy it, but VDSM reports back that the VM is not running on the host.

There's a kind of philosophical question here: if I ask to stop a VM and the VM is already down, did the operation succeed or not?
That's debatable, but someone decided that it means failure. The question is what the implication is for your use case: can you simply check the status of the VM after the call to stop it?
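
A minimal sketch of that check with the Python SDK (hypothetical helper; the polling interval and timeout are arbitrary): treat a failed stop as success as long as the engine reports the VM as Down shortly afterwards.

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

def stop_and_verify(vm_service, timeout=60):
    """Issue a stop, then trust the reported status rather than the
    return code, since the stop may race with a guest-side shutdown."""
    try:
        vm_service.stop()
    except sdk.Error:
        pass  # possibly 'Desktop does not exist'; verify via status below
    deadline = time.time() + timeout
    while time.time() < deadline:
        if vm_service.get().status == types.VmStatus.DOWN:
            return True
        time.sleep(2)
    return False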

Comment 3 Simone Tiraboschi 2018-01-30 17:00:31 UTC
(In reply to Arik from comment #2)
> There's a kind of philosophical question here: if I ask to stop a VM and the
> VM is already down, did the operation succeed or not?
> That's debatable, but someone decided that it means failure. The question is
> what the implication is for your use case: can you simply check the status
> of the VM after the call to stop it?

I don't need to stop it: the VM is down for sure.
I want to delete it from the engine, and I'm using an ansible module which should in theory be idempotent.

https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/ansible/create_target_vm.yml#L310

In the node-zero flow we start an engine on a local VM and we use that engine to create the storage domain, deploy the host and create the engine VM.
At the end we shut down the local VM and copy its disk over the disk of the VM created by the engine.

The initial bootstrap VM is then just a leftover in the DB.
How can we safely delete it?

Comment 4 Michal Skrivanek 2018-01-31 08:27:27 UTC
You try to delete a VM over the REST API, and it sometimes returns an error because it tries to stop the VM, but the VM stops in the meantime and the internal destroy operation fails? Is that correct?

Why don't you ignore that specific error code? Or just check that the VM is not there afterwards?

Comment 5 Simone Tiraboschi 2018-01-31 08:42:58 UTC
(In reply to Michal Skrivanek from comment #4)
> You try to delete a VM over the REST API, and it sometimes returns an error
> because it tries to stop the VM, but the VM stops in the meantime and the
> internal destroy operation fails? Is that correct?

At that time it's certainly stopped, no doubt, but maybe the engine has yet to realize it, hence the error.

> Why don't you ignore that specific error code? Or just check that the VM is
> not there afterwards?

So: try stopping it while ignoring errors on the stop command, and then delete it as we do now?

Comment 6 Michal Skrivanek 2018-01-31 09:12:42 UTC
Who is stopping it in the first place?
If that's your scripts, then I would suggest using the API like this (sketched right after the list):
1) shutdown VM
2) wait for status being "Down" with a timeout
3) delete VM
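
Sketched with the Python SDK (illustrative only: the uuid and connection parameters are placeholders), that sequence would look roughly like this:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

BOOTSTRAP_VM_UUID = '96ebce44-c24e-40e9-9cd9-91048b3091b2'  # placeholder uuid

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
vm_service = connection.system_service().vms_service().vm_service(BOOTSTRAP_VM_UUID)

# 1) shutdown VM (tolerating an error if it is already down)
try:
    vm_service.shutdown()
except sdk.Error:
    pass

# 2) wait for status being "Down", with a timeout
deadline = time.time() + 300
while vm_service.get().status != types.VmStatus.DOWN:
    if time.time() > deadline:
        raise RuntimeError('timed out waiting for the VM to go Down')
    time.sleep(5)

# 3) delete VM
vm_service.remove()
connection.close()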

Comment 7 Simone Tiraboschi 2018-01-31 09:44:56 UTC
(In reply to Michal Skrivanek from comment #6)
> Who is stopping it in the first place?

hosted-engine-setup, and the engine was running on that VM.

> If that's your scripts, then I would suggest using the API like this:
> 1) shutdown VM
> 2) wait for status being "Down" with a timeout
> 3) delete VM

OK

Comment 8 Yaniv Kaul 2018-02-01 09:51:50 UTC
(In reply to Simone Tiraboschi from comment #7)
> (In reply to Michal Skrivanek from comment #6)
> > Who is stopping it in the first place?
> 
> hosted-engine-setup and the engine was on that VM.
> 
> > If that's your scripts, then I would suggest using the API like this:
> > 1) shutdown VM
> > 2) wait for status being "Down" with a timeout
> > 3) delete VM
> 
> OK

So when will it delete 'itself'? When it boots back up from shared storage?

Comment 9 Simone Tiraboschi 2018-02-01 10:30:39 UTC
(In reply to Yaniv Kaul from comment #8)
> So when will it delete 'itself'? When it boots back up from shared storage?

Yes; otherwise, if we delete the external local VM while the engine is still there, we hit a race: we'd be trying to shut down the engine before it imports the external VM again, since it's still running.

On the other hand, deleting it from the engine running on the shared storage can fail, as in this bug, if the engine still thinks that the VM is up.
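
The gerrit patches linked above address this on the ansible side ('enforce status before deleting', and attempting the deletion only if the VM is still there). A rough Python-SDK equivalent of that idempotent clean-up (hypothetical helper; the 404 check is an assumption about how a missing VM surfaces through sdk.Error):

import ovirtsdk4 as sdk

def remove_bootstrap_vm(vms_service, vm_id):
    """Idempotent clean-up: delete the bootstrap VM only if the engine
    still knows about it, treating 'not found' as success."""
    vm_service = vms_service.vm_service(vm_id)
    try:
        vm_service.get()
    except sdk.Error as e:
        if getattr(e, 'code', None) == 404:
            return  # already gone: nothing to clean up
        raise
    vm_service.remove()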

Comment 10 Nikolai Sednev 2018-02-19 14:05:55 UTC
Due to https://bugzilla.redhat.com/show_bug.cgi?id=1527866#c4, moving to verified, as the external VM in Node 0 is now properly removed on the latest components:
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-693.19.1.el7.x86_64 #1 SMP Thu Feb 1 12:34:44 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 11 Nikolai Sednev 2018-02-19 15:42:32 UTC
Moving back to assigned, as I've just reproduced it on my new NFS Node 0 deployment with these components:
rhvm-appliance-4.2-20180202.0.el7.noarch
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
Linux 3.10.0-693.19.1.el7.x86_64 #1 SMP Thu Feb 1 12:34:44 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Logs from host attached.

Comment 12 Nikolai Sednev 2018-02-19 15:43:13 UTC
Created attachment 1397914 [details]
alma04 logs

Comment 13 Nikolai Sednev 2018-02-21 15:22:56 UTC
NFS deployment was successful, although with these errors:
[ ERROR ] Error: Fault reason is "Operation Failed". Fault detail is "[Desktop does not exist]". HTTP response code is 400.
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Desktop does not exist]\". HTTP response code is 400."}
For more details please see https://bugzilla.redhat.com/attachment.cgi?bugid=1547595 .

Works for me on these components:
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux alma03.qa.lab.tlv.redhat.com 3.10.0-851.el7.x86_64 #1 SMP Mon Feb 12 07:53:52 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

Moving to verified.

Comment 14 Sandro Bonazzola 2018-03-29 11:13:09 UTC
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

