Description of problem:

In the node-zero hosted-engine setup we bootstrap the system from a local VM and the engine imports it as an external VM. At the end we move the disk of that VM over the disk of a VM created by the engine (and we restart it from there), so we have two VMs in the DB and we try to delete the initial bootstrap local VM by its uuid.

In the logs we see:

2018-01-29 21:30:26,399-05 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [41938cfc] VM '96ebce44-c24e-40e9-9cd9-91048b3091b2'(external-HostedEngineLocal) moved from 'Up' --> 'Down'
2018-01-29 21:30:26,561-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [41938cfc] EVENT_ID: VM_DOWN(61), VM external-HostedEngineLocal is down. Exit message: User shut down from within the guest

So external-HostedEngineLocal exists for sure with uuid '96ebce44-c24e-40e9-9cd9-91048b3091b2'.

Then we try to delete it from the REST API:

2018-01-29 21:30:26,690-05 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 43974d05

but the destroy VM command fails:

2018-01-29 21:30:26,690-05 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 43974d05
2018-01-29 21:30:26,695-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] START, DestroyVDSCommand(HostName = lago-he-basic-ansible-suite-master-host0, DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 5134ff4f
2018-01-29 21:30:26,717-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Destroy VM couldn't find VM '96ebce44-c24e-40e9-9cd9-91048b3091b2'. If after Cancel Migration and VM is UP on source, ignore next error, it's just a clean-up call
2018-01-29 21:30:26,717-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Failed in 'DestroyVDS' method
2018-01-29 21:30:26,769-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM lago-he-basic-ansible-suite-master-host0 command DestroyVDS failed: Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}
2018-01-29 21:30:26,769-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=1, message=Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}]]'
2018-01-29 21:30:26,770-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] HostName = lago-he-basic-ansible-suite-master-host0
2018-01-29 21:30:26,770-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Command 'DestroyVDSCommand(HostName = lago-he-basic-ansible-suite-master-host0, DestroyVmVDSCommandParameters:{hostId='3c8f2061-6fb8-4707-8a88-c05b12a92f45', vmId='96ebce44-c24e-40e9-9cd9-91048b3091b2', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}, code = 1
2018-01-29 21:30:26,770-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] FINISH, DestroyVDSCommand, log id: 5134ff4f
2018-01-29 21:30:26,771-05 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Failed to destroy VM '96ebce44-c24e-40e9-9cd9-91048b3091b2' in VDS = '3c8f2061-6fb8-4707-8a88-c05b12a92f45' , error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}, code = 1'
2018-01-29 21:30:26,771-05 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] FINISH, DestroyVmVDSCommand, log id: 43974d05
2018-01-29 21:30:26,772-05 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist: {'vmId': u'96ebce44-c24e-40e9-9cd9-91048b3091b2'}, code = 1 (Failed with error noVM and code 1)
2018-01-29 21:30:27,207-05 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [41938cfc] VMs initialization finished for Host: 'lago-he-basic-ansible-suite-master-host0:3c8f2061-6fb8-4707-8a88-c05b12a92f45'
2018-01-29 21:30:27,291-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [a990a4a2-a443-4cea-8c32-adeebf76df2e] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM external-HostedEngineLocal (Host: lago-he-basic-ansible-suite-master-host0, User: admin@internal-authz).

And at the end we get a strange error message:

2018-01-29 21:30:27,437-05 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-12) [] Operation Failed: [Desktop does not exist]

All the logs are visible here:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-master/20/artifact/exported-artifacts/test_logs/he-basic-ansible-suite-master/post-he_deploy/lago-he-basic-ansible-suite-master-engine/_var_log/ovirt-engine/engine.log

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes it works as expected, sometimes not. TBD.

Steps to Reproduce:
1. deploy hosted-engine via ansible
2.
3.

Actual results:
Operation Failed: [Desktop does not exist]

Expected results:
It successfully deletes external-HostedEngineLocal by its uuid.

Additional info:
Maybe it could be related to https://gerrit.ovirt.org/#/c/86689/
(In reply to Simone Tiraboschi from comment #1)
> Maybe it could be related to https://gerrit.ovirt.org/#/c/86689/

That's a race that is hard to avoid.

Before 21:30:26,561-05 the VM was shut down from within the guest. At 2018-01-29 21:30:26,629-05 the engine processes a request to stop the VM; it probably sees that the VM is still UP at this point, so it tries to destroy it, but VDSM reports back that the VM is not running on the host.

There's a kind of philosophical question here: if I ask to stop a VM and the VM is already down, did the operation succeed or not? That's debatable, but someone decided that it means a failure. The question is what the implication of that is for your use case: can you simply check the status of the VM after the call to stop it?
(In reply to Arik from comment #2)
> There's a kind of philosophical question here: if I ask to stop a VM and the
> VM is already down, did the operation succeed or not? That's debatable, but
> someone decided that it means a failure. The question is what the implication
> of that is for your use case: can you simply check the status of the VM after
> the call to stop it?

I don't need to stop it: the VM is down for sure. I want to delete it from the engine, and I'm using an ansible module which should theoretically be idempotent:
https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/ansible/create_target_vm.yml#L310

In the node-zero flow we start an engine on a local VM and we use that engine to create the storage domain, deploy the host and create the engine VM. At the end we shut down the local VM and we copy its disk over the disk of the VM created by the engine. The initial bootstrap VM is, at that point, just a leftover in the DB. How can we safely delete it?
You try to delete a VM over the REST API, and it sometimes returns an error because, while it tries to stop the VM, the VM stops in the meantime and the internal destroy operation fails? Is that correct?

Why don't you ignore that specific error code? Or just check that the VM is not there afterwards?
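For illustration only, a minimal sketch of the "check afterwards" approach with the oVirt Python SDK (ovirtsdk4). The connection details are placeholders, and catching the generic sdk.Error on the stop call is an assumption, not necessarily the exact error code to filter on:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details; insecure=True only to keep the example short.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

vm_id = '96ebce44-c24e-40e9-9cd9-91048b3091b2'
vm_service = connection.system_service().vms_service().vm_service(vm_id)

try:
    # Hard power-off; may fail with "Virtual machine does not exist" if the
    # VM already went down on its own (the race described in this bug).
    vm_service.stop()
except sdk.Error:
    # Assumption: ignore the stop failure and rely on the check below instead.
    pass

# Verify the engine's view of the VM afterwards instead of trusting the
# return code of the stop call.
if vm_service.get().status == types.VmStatus.DOWN:
    vm_service.remove()

connection.close()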
(In reply to Michal Skrivanek from comment #4)
> You try to delete a VM over the REST API, and it sometimes returns an error
> because, while it tries to stop the VM, the VM stops in the meantime and the
> internal destroy operation fails? Is that correct?

At that point it's stopped for sure, no doubt, but maybe the engine still has to realize it, hence the error.

> Why don't you ignore that specific error code? Or just check that the VM is
> not there afterwards?

Should I try stopping it, ignoring errors on the stop command, and then deleting it as I do now?
Who is stopping it in the first place?

If that's your scripts, then I would suggest using the API like this:
1) shut down the VM
2) wait for its status to be "Down", with a timeout
3) delete the VM
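A rough sketch of that sequence, again with the Python SDK (ovirtsdk4). It assumes a connection object as in the previous snippet, and the timeout and polling interval are arbitrary; this is just an illustration of the suggested flow, not the actual hosted-engine-setup code:

import time

import ovirtsdk4.types as types


def shutdown_and_delete(connection, vm_id, timeout=300):
    # Shut the VM down, wait until the engine reports it Down, then delete it.
    vm_service = connection.system_service().vms_service().vm_service(vm_id)

    # 1) ask for a clean guest shutdown (stop() would be a hard power-off)
    vm_service.shutdown()

    # 2) wait for the status to become "Down", with a timeout
    deadline = time.time() + timeout
    while vm_service.get().status != types.VmStatus.DOWN:
        if time.time() > deadline:
            raise RuntimeError('VM %s did not reach Down within %s seconds' % (vm_id, timeout))
        time.sleep(5)

    # 3) delete the VM from the engine
    vm_service.remove()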
(In reply to Michal Skrivanek from comment #6)
> Who is stopping it in the first place?

hosted-engine-setup, and the engine was running on that VM.

> If that's your scripts, then I would suggest using the API like this:
> 1) shut down the VM
> 2) wait for its status to be "Down", with a timeout
> 3) delete the VM

OK
(In reply to Simone Tiraboschi from comment #7)
> (In reply to Michal Skrivanek from comment #6)
> > Who is stopping it in the first place?
>
> hosted-engine-setup, and the engine was running on that VM.
>
> > If that's your scripts, then I would suggest using the API like this:
> > 1) shut down the VM
> > 2) wait for its status to be "Down", with a timeout
> > 3) delete the VM
>
> OK

So when will it delete 'itself'? When it boots back up from shared storage?
(In reply to Yaniv Kaul from comment #8)
> So when will it delete 'itself'? When it boots back up from shared storage?

Yes. Otherwise, if we delete the external local VM while the engine is still there, we get a race condition: we have to shut down the engine before it imports the external VM again, since the VM is still running. On the other hand, deleting it from the engine running on the shared storage can fail, as in this bug, if the engine still thinks the VM is up.
Due to https://bugzilla.redhat.com/show_bug.cgi?id=1527866#c4, moving to verified, as the external VM in Node 0 is now properly removed on the latest components:

ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-693.19.1.el7.x86_64 #1 SMP Thu Feb 1 12:34:44 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Moving back to assigned, as I've just reproduced it on my new NFS Node 0 deployment with these components:

rhvm-appliance-4.2-20180202.0.el7.noarch
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
Linux 3.10.0-693.19.1.el7.x86_64 #1 SMP Thu Feb 1 12:34:44 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Logs from host attached.
Created attachment 1397914 [details] alma04 logs
NFS deployment was successful, although with these errors:

[ ERROR ] Error: Fault reason is "Operation Failed". Fault detail is "[Desktop does not exist]". HTTP response code is 400.
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Desktop does not exist]\". HTTP response code is 400."}

For more details please see https://bugzilla.redhat.com/attachment.cgi?bugid=1547595 .

Works for me on these components:

ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux alma03.qa.lab.tlv.redhat.com 3.10.0-851.el7.x86_64 #1 SMP Mon Feb 12 07:53:52 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

Moving to verified.
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.