Created attachment 1286130 [details]
engine.log and vdsm.logs

host_mixed_2 is the original host

Description of problem:
I'm testing VM leases with high availability. I block the connection from the engine to the host (making it non-responsive) and from the host to the storage domain holding the lease (forcing the VM to start on a new host). After the VM has restarted and the original host is back up, I power off the VM and remove it; right after that the system creates an external VM (in some cases it just pauses the VM even though it was already UP on the new host).

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170606135217.gita02bf9d.el7.centos

How reproducible:
~20%

Steps to Reproduce:
1. Create a VM with an OS and a lease on a storage domain
2. Start the VM
3. Block the connection from the engine to the host (host becomes Non Responsive)
4. Block the connection from the host to the storage domain with the lease ==> VM starts on a different host
5. After the VM is UP, unblock the connections and wait for the host to come back UP
6. Remove the VM

Actual results:
The VM is removed, but a new external VM is created in status UP. Sometimes this does not happen and the VM is just paused (I think because the engine detects the VM as paused on the original host).

Expected results:
No external VM is created.

Additional info:
I suspect the issue may be that the system is not given enough time to realize the VM is UP on a different host.

2017-06-08 14:39:07,390+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler10) [48f25b31] VM '47fa8957-e778-44ee-83d0-3fc357c62473'(vm_0_TestCase17618_REST_IS_0812293664) was unexpectedly detected as 'Paused' on VDS '983d51a1-6d09-46c7-8d74-a89bbfc338bd'(host_mixed_2) (expected on '1ba1a1a2-fb58-40da-8887-50ffd9c7aa10')
[...]
2017-06-08 14:39:32,425+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler4) [51180f25] VM '47fa8957-e778-44ee-83d0-3fc357c62473'(vm_0_TestCase17618_REST_IS_0812293664) was unexpectedly detected as 'Up' on VDS '1ba1a1a2-fb58-40da-8887-50ffd9c7aa10'(host_mixed_1) (expected on '983d51a1-6d09-46c7-8d74-a89bbfc338bd')
2017-06-08 14:39:37,427+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [18f33749] VM '47fa8957-e778-44ee-83d0-3fc357c62473'(vm_0_TestCase17618_REST_IS_0812293664) was unexpectedly detected as 'Paused' on VDS '983d51a1-6d09-46c7-8d74-a89bbfc338bd'(host_mixed_2) (expected on '1ba1a1a2-fb58-40da-8887-50ffd9c7aa10')
2017-06-08 14:39:39,455+03 INFO [org.ovirt.engine.core.bll.StopVmCommand] (default task-27) [vms_syncAction_e507f043-e80c-45be] Running command: StopVmCommand internal: false. Entities affected : ID: 47fa8957-e778-44ee-83d0-3fc357c62473 Type: VMAction group STOP_VM with role type USER
2017-06-08 14:39:39,460+03 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-27) [vms_syncAction_e507f043-e80c-45be] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{hostId='983d51a1-6d09-46c7-8d74-a89bbfc338bd', vmId='47fa8957-e778-44ee-83d0-3fc357c62473', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 57ee2535
[...]
2017-06-08 14:39:40,558+03 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-27) [vms_syncAction_e507f043-e80c-45be] FINISH, DestroyVmVDSCommand, log id: 57ee2535
2017-06-08 14:39:40,564+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-27) [vms_syncAction_e507f043-e80c-45be] EVENT_ID: USER_STOP_VM(33), VM vm_0_TestCase17618_REST_IS_0812293664 powered off by admin@internal-authz (Host: host_mixed_2).
2017-06-08 14:39:40,597+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-10) [] Failed to destroy VM '47fa8957-e778-44ee-83d0-3fc357c62473' because VM does not exist, ignoring
2017-06-08 14:39:40,597+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-10) [] FINISH, DestroyVDSCommand, log id: 29dcf9e9
2017-06-08 14:39:40,597+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [] VM '47fa8957-e778-44ee-83d0-3fc357c62473'(vm_0_TestCase17618_REST_IS_0812293664) moved from 'Paused' --> 'Down'
[...]
2017-06-08 14:39:46,440+03 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (default task-1) [vms_delete_d54e9a6e-4f11-4d92] FINISH, SetVmStatusVDSCommand, log id: 2844026f
2017-06-08 14:39:46,442+03 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (default task-1) [vms_delete_d54e9a6e-4f11-4d92] Lock freed to object 'EngineLock:{exclusiveLocks='[47fa8957-e778-44ee-83d0-3fc357c62473=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-06-08 14:39:46,447+03 INFO [org.ovirt.engine.core.bll.storage.lease.RemoveVmLeaseCommand] (default task-1) [5fad4b84] Running command: RemoveVmLeaseCommand internal: true.
Entities affected : ID: ee235130-2313-4456-a920-a298821c5005 Type: Storage
2017-06-08 14:39:46,459+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveVmLeaseVDSCommand] (default task-1) [5fad4b84] START, RemoveVmLeaseVDSCommand( VmLeaseVDSParameters:{storagePoolId='5f1856d0-21f0-416f-87ed-47c78228c0bd', ignoreFailoverLimit='false', leaseId='47fa8957-e778-44ee-83d0-3fc357c62473', storageDomainId='ee235130-2313-4456-a920-a298821c5005'}), log id: 4e0eb210
2017-06-08 14:39:46,566+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveVmLeaseVDSCommand] (default task-1) [5fad4b84] FINISH, RemoveVmLeaseVDSCommand, log id: 4e0eb210
2017-06-08 14:39:46,569+03 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (default task-1) [5fad4b84] CommandAsyncTask::Adding CommandMultiAsyncTasks object for command 'b634c98f-f4fb-4e78-bb9f-7003c7f473db'
2017-06-08 14:39:46,569+03 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (default task-1) [5fad4b84] CommandMultiAsyncTasks::attachTask: Attaching task 'c56025b8-d6b3-4734-9977-b7f302d127b6' to command 'b634c98f-f4fb-4e78-bb9f-7003c7f473db'.
2017-06-08 14:39:46,575+03 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (default task-1) [5fad4b84] Adding task 'c56025b8-d6b3-4734-9977-b7f302d127b6' (Parent Command 'RemoveVmLease', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..
2017-06-08 14:39:46,581+03 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (default task-1) [5fad4b84] BaseAsyncTask::startPollingTask: Starting to poll task 'c56025b8-d6b3-4734-9977-b7f302d127b6'.
2017-06-08 14:39:46,599+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.RemoveAllVmImagesCommand] (default task-1) [vms_delete_d54e9a6e-4f11-4d92] Running command: RemoveAllVmImagesCommand internal: true.
Entities affected : ID: 47fa8957-e778-44ee-83d0-3fc357c62473 Type: VM
[...] (default task-1) [vms_delete_d54e9a6e-4f11-4d92] Running command: RemoveImageCommand internal: true. Entities affected : ID: 00000000-0000-0000-0000-000000000000 Type: Storage
[..]
2017-06-08 14:39:47,444+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler8) [76255c8c] VM '47fa8957-e778-44ee-83d0-3fc357c62473' was discovered as 'Up' on VDS '1ba1a1a2-fb58-40da-8887-50ffd9c7aa10'(host_mixed_1)
[..]
2017-06-08 14:39:47,582+03 INFO [org.ovirt.engine.core.bll.AddVmFromScratchCommand] (DefaultQuartzScheduler8) [] Lock freed to object 'EngineLock:{exclusiveLocks='[external-vm_0_TestCase17618_REST_IS_0812293664=<VM_NAME, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-06-08 14:39:47,585+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] (DefaultQuartzScheduler8) [] Received a spice Device without an address when processing VM 47fa8957-e778-44ee-83d0-3fc357c62473 devices, skipping device: {tlsPort=5900, vmid=47fa8957-e778-44ee-83d0-3fc357c62473, specParams={fileTransferEnable=true, displayNetwork=ovirtmgmt, copyPasteEnable=true, displayIp=10.35.82.111, spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,ssmartcard,susbredir}, deviceId=2b4db5e4-8d3c-476a-b8d6-8a5ee68a4e3c, device=spice, type=graphics}
2017-06-08 14:39:47,585+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] (DefaultQuartzScheduler8) [] Received a lease Device without an address when processing VM 47fa8957-e778-44ee-83d0-3fc357c62473 devices, skipping device: {lease_id=47fa8957-e778-44ee-83d0-3fc357c62473, vmid=47fa8957-e778-44ee-83d0-3fc357c62473, sd_id=ee235130-2313-4456-a920-a298821c5005, deviceId=e863e7db-0cc4-43bf-b3b8-de36b1f2d095, offset=3145728, device=lease, path=/dev/ee235130-2313-4456-a920-a298821c5005/xleases, type=lease}
2017-06-08 14:39:47,594+03 INFO [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand]
(DefaultQuartzScheduler8) [] Importing VM 'vm_0_TestCase17618_REST_IS_0812293664' as 'external-vm_0_TestCase17618_REST_IS_0812293664', as it is running on the on Host, but does not exist in the engine.
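For anyone reproducing this: the connection blocking in steps 3 and 4 is typically done with iptables. A minimal dry-run sketch follows; all addresses are hypothetical placeholders, and with the default `RUN=echo` the script only prints the commands (set `RUN=` empty and run as root to actually apply them):

```shell
#!/bin/sh
# Dry-run sketch of steps 3-4 of the reproduction (assumed commands, not
# the exact ones used by the reporter). RUN=echo only prints the rules.
RUN=${RUN:-echo}
HOST_IP=${HOST_IP:-192.0.2.10}        # hypothetical address of the original host
STORAGE_IP=${STORAGE_IP:-192.0.2.20}  # hypothetical address of the lease storage domain

# Step 3 -- on the engine machine: drop traffic from the host, so the
# engine marks it Non Responsive.
$RUN iptables -I INPUT -s "$HOST_IP" -j DROP

# Step 4 -- on the host: drop traffic to the storage domain, so the lease
# expires and the HA VM is restarted on another host.
$RUN iptables -I OUTPUT -d "$STORAGE_IP" -j DROP
```

The rules are removed again with `iptables -D` using the same arguments once the test step calls for unblocking.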
This is somewhere between Virt and Storage - Tal, can you please take the lead on this and loop in any Virt stakeholders as required?
I haven't seen this occur in the latest build (ran multiple times):

oVirt Engine Version: 4.2.0-0.0.master.20170619142113.git9aa81c2.el7.centos

I'll close it and re-open if I encounter it again.
Reopening: actually, this only happens when the VM's boot disk and lease are on iSCSI or GlusterFS.
Again, I've only seen this with iSCSI and GlusterFS domains as the lease/boot-disk domains. This issue is preventing us from integrating VM leases into our tests for iSCSI and Gluster, so I'm raising the severity to medium.
Likely the same as bug 1488416. There is a fix POSTed
(Not the creation itself yet, just the exception)
(In reply to Michal Skrivanek from comment #5)
> Likely the same as bug 1488416. There is a fix POSTed

(In reply to Michal Skrivanek from comment #6)
> (Not the creation itself yet, just the exception)

Not sure I understand this comment. Should the fix for bug 1488416 fix this one too?
It fixes the traceback, not the fact that an external VM is created. That still feels wrong and should be further analyzed by Arik.
Let H1 be the original host and H2 the other host. The flow looks as follows:
(1) H1: VM runs on that host
(2) H1: becomes non-responsive
(3) H1: loses connection to the storage domain
(4) H1: VM switches to PAUSED
    H2: VM starts automatically on that host
(5) H2: VM is in WaitForLaunch, thus ignored by the monitoring (it is not on the host we expect the VM to run on)
(6) H2: VM runs on that host
(7) Both H1 and H2 report the VM; the VM switches between the two
(8) H2: the VM is destroyed (the user asks to stop the VM)
(9) The VM is removed in the engine
(10) H1: the VM is still reported as PAUSED
(11) The engine adds the VM as an external VM

The fact that the VM succeeded to run with a lease on H2 (step 6) means that the lease expired (as expected) on H1. We rely on the qemu process being killed at that point and the VM no longer being reported by H1. If this happens only with certain storage types, it may be related to different timeouts. AFAIR, the timeout of the lease is 80 sec; maybe the timeout of the storage is lower and thus the VM switches to PAUSED before the lease expires. In that case, maybe we fail to destroy the qemu process, or maybe we stop monitoring the lease, or maybe we lose the correlation between the lease and the qemu process that used to use it. We need people more familiar with the lower-level lease mechanism to check this, I believe.
(In reply to Arik from comment #9)
> Let H1 be the original host and H2 be the other host.
> (2) H1: Becomes non-responsive

At this point the engine stops getting data from H1.

> (3) H1: Lose connection to the storage domain
> (4) H1: VM switches to PAUSED

The lease is released.

> H2: VM starts automatically on that host

The start only happens after considerable time. But indeed, if the host first goes Not Responding due to a networking issue between the host and the engine, while the VM continues to operate and only loses the SD connection later, it can happen that on the engine side we get to the "HA with lease" rerun while still within the sanlock 80s grace period. Or at any time later, since sanlock does not kill that VM while it is paused (neither does bug 1317450 with the kill option; that only happens once the SD resumes).

> (5) H2: VM is in WaitForLaunch, thus ignored by the monitoring (it is not on
> the host we expect the VM to run on)
> (6) H2: VM runs on that host
> (7) Both H1 and H2 report the VM, the VM is switched between the two

Do you assume that only now H1 starts reporting the VM again? Host-engine connectivity can resume at any time, I suppose, so the Paused state from H1 can be received by the engine.

> (8) H2: The VM is destroyed (the user asks to stop the VM)
> (9) The VM is removed in the engine
> (10) H1: The VM is still reported as PAUSED
> (11) The engine adds the VM as an external VM
>
> The fact that VM succeeded to run with a lease on H2 (step 6) means that the
> lease expired (as expected) on H1. We rely on the qemu process being killed
> at that point and the VM no longer being reported by H1.

The problem is with the automatic release on pause. H2 succeeds even when the VM is just paused on H1. We kill it (in bug 1317450) when the 80s is exceeded, but only once the SD connectivity is resumed.
It is a bit too convoluted a sequence to happen often, but it still might happen, I think, when the host-storage connectivity problem does not overlap with the engine-host connectivity problem. The worst case is that you end up with a Down external VM; no data corruption. With bug 1317450 and the "kill" policy it at least won't happen when host-storage connectivity is resumed before host-engine connectivity, which covers part of the problem. Carlos, might be worth testing in your case... It's not ideal, but I do not see a simple fix. Maybe "do not add external VMs when they are down with error" once bug 1516322 is fixed? Further thoughts?
(In reply to Michal Skrivanek from comment #10)
> Further thoughts?

The analysis described in comment 9 was done according to the log of the particular case reported in this bug. In that case, we got the VM from H1 before the VM was shut down and removed - a scenario we should avoid, because the VM then switches on the engine side between H1 and H2. This can be done by adding the following logic to the monitoring: when we get the VM as Paused with an error on an unexpected host (i.e., we expect the VM to run on another host), we destroy the VM on that host, much like we do for Down VMs.

It may happen that the connectivity with H1 resumes after the VM was removed. In that case we should not import the VM; we can add the following logic to the monitoring: ignore unmanaged VMs that are either Down or Paused. That would cover everything we discussed offline, right?
I would include Down (all of them) and Paused with EIO only
In 4.2 we'll just skip adding the external VM for now. The fix for switching the VM between both hosts was merged into the master branch and we'll consider alternative solutions before merging that to the 4.2 branch.
Tested with the following code:
--------------------------------------------
ovirt-engine-4.2.2-0.1.el7.noarch
vdsm-4.20.18-1.el7ev.x86_64

Tested with the following scenario:
--------------------------------------------
Steps to Reproduce:
1. Create a VM with an OS and a lease on a storage domain
2. Start the VM
3. Block the connection from the engine to the host (the host becomes Non Responsive and the VM is moved to the Unknown state)
4. Block the connection from the host to the storage domain with the lease ==> the VM remains in the Unknown state

My understanding is that the VM should have moved over to the second host.
Please advise whether this is the expected behaviour or whether to move this bug back to ASSIGNED.

Engine log with the VM moved to the Unknown state:
2018-02-21 14:25:24,664+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-131936) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm_lease was set to the Unknown status.

Adding logs
Created attachment 1398713 [details] server.vdsm logs Added logs
(In reply to Kevin Alon Goldblatt from comment #14) it was rerun at 2018-02-21 14:27:20,743+02, mere 2 minutes later, on vm-83-163 but failed to start. Do you have logs from there?
Hi, I'm working on verification of the bug on build rhv-release-4.2.2-4-001.noarch. Could you please approve the below? There is a difference in behavior compared to what is described in the reported bug.

The following steps were performed for verification:

Setup: rhv-release-4.2.2-4-001.noarch, environment with three hosts - H1, H2, H3

1. Create an HA VM (disk on iSCSI) with a lease on an iSCSI SD and run it on H3.
2. Block the host from the engine: iptables -I INPUT -s H3 -j DROP
3. Wait until the host is Not Responsive. The VM turns Unknown.
4. Block the storage from H3. ==> the VM doesn't start automatically on another host as described in the bug report and comments. It just remains in the Unknown state.
5. Unblock the connection from the engine to the host. At this point, the VM starts automatically on another host, H2. Power off and remove the VM.

Result: nothing suspicious happens, no external VM.
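For completeness, the unblock in step 5 simply deletes the DROP rule inserted in step 2. A dry-run sketch (the address is a hypothetical placeholder; with the default `RUN=echo` the command is only printed, set `RUN=` empty and run as root to apply it):

```shell
#!/bin/sh
# Dry-run sketch of the unblock in step 5 (assumed command, mirroring the
# rule inserted in step 2). RUN=echo only prints the command.
RUN=${RUN:-echo}
HOST_IP=${HOST_IP:-192.0.2.10}   # hypothetical address of H3

# On the engine machine: delete the DROP rule that cut the host off,
# so the engine regains connectivity and monitoring of H3 resumes.
$RUN iptables -D INPUT -s "$HOST_IP" -j DROP
```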
Hi Arik, could you please check whether the verification scenario above matches the expectation?
(In reply to Michal Skrivanek from comment #16) > (In reply to Kevin Alon Goldblatt from comment #14) > it was rerun at 2018-02-21 14:27:20,743+02, mere 2 minutes later, on > vm-83-163 but failed to start. Do you have logs from there? No additional logs were available
(In reply to Kevin Alon Goldblatt from comment #19)
> (In reply to Michal Skrivanek from comment #16)
> > (In reply to Kevin Alon Goldblatt from comment #14)
> > it was rerun at 2018-02-21 14:27:20,743+02, mere 2 minutes later, on
> > vm-83-163 but failed to start. Do you have logs from there?
>
> No additional logs were available

Kevin, then as far as comment #14 is concerned I do not see any issue. You had an unrelated problem of not being able to launch the VM on the other host.

Polina, not sure, can you add more details, please? The VM should have been restarted in step 4, after fencing. Do you have fencing enabled, is the VM configured with a lease, and did you wait long enough (5 mins+)?
(In reply to Michal Skrivanek from comment #20)
> Polina, not sure, can you add more details, please? The VM should have been
> restarted in step 4, after fencing. Do you have fencing enabled, and the VM
> configured with a lease, and you waited long enough (5 mins+)?

Thank you Michal for your comment; it was a timing issue - the VM is restarted on another host after waiting for a while. I'm verifying the bug after trying to reproduce the problem many times (trying to catch the situation at different points in time).

The reproducing scenario:
1. Run an HA VM (iSCSI SD) with a lease on iscsi_0. Block the connection engine -> host and host -> iSCSI target.
2. Wait until the VM is restarted on another host (it turns Unknown for a while and is then restarted). Unblock all the blocked connections.
3. Stop and remove the VM.

Result: OK, no external VM created.
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.