Description of problem:

A VM with a lease fails to run if Attach + Activate + Import VM + Run VM are done in quick succession, which is what the DR playbook does for HA VMs. The engine filters out all hosts during scheduling; the VmLeasesReadyFilter policy rejects them. I believe there is more than one problem here. I'll describe the entire process and point out the things that do not make much sense to me.

Scenario:

* During DR, we attach and import an SD that was previously attached to another DC. After this, we import the VM (which has a lease) and attempt to run it.

1. First, the engine tells the SPM to detach the SD from the previous DC (force detach):

2021-06-22 10:12:37,242+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [f93f7a57-561b-4106-9d4d-3b51df8e50bc] START, DetachStorageDomainVDSCommand( DetachStorageDomainVDSCommandParameters:{storagePoolId='cf5df12c-d2e5-11eb-bff9-5254000000fa', ignoreFailoverLimit='false', storageDomainId='e824ca4c-2fd3-460a-b746-f9e80d5c11fb', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='0', force='true'}), log id: 26ff7009

2. The SPM starts acquiring the lockspace; this takes a long time because the previous release was unclean:

2021-06-22 10:12:37 5191 [59724]: s5 delta_acquire delta_large_delay 80 delay 140

3. Acquired:

2021-06-22 10:15:17 5352 [59724]: s5 delta_acquire done 1 11 5332

4. VDSM finishes the detach, releases the lockspace and replies to the engine:

2021-06-22 10:15:19 5354 [59724]: s5 delta_release done 1 11 0

5. The engine gets the answer from VDSM and moves on to attach the SD to the new pool:

2021-06-22 10:15:20,667+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [f93f7a57-561b-4106-9d4d-3b51df8e50bc] FINISH, DetachStorageDomainVDSCommand, return: , log id: 26ff7009
2021-06-22 10:15:20,670+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [f93f7a57-561b-4106-9d4d-3b51df8e50bc] START, AttachStorageDomainVDSCommand( AttachStorageDomainVDSCommandParameters:{storagePoolId='cf5df12c-d2e5-11eb-bff9-5254000000fa', ignoreFailoverLimit='false', storageDomainId='e824ca4c-2fd3-460a-b746-f9e80d5c11fb'}), log id: 37fbfe63

6. VDSM acquires the lockspace again; this time it is quick because the last release was clean:

2021-06-22 10:15:20 5355 [60857]: s6 delta_acquire delta_short_delay 20
2021-06-22 10:15:41 5376 [24733]: s6 add_lockspace done

7. VDSM then releases it again and replies to the engine; the SD is attached:

2021-06-22 10:15:43 5377 [60857]: s6 delta_release done 1 12 0
2021-06-22 10:15:44,069+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [f93f7a57-561b-4106-9d4d-3b51df8e50bc] FINISH, AttachStorageDomainVDSCommand, return: , log id: 37fbfe63

8. The engine reads the SD, retrieves the OVFs, etc.
9. The engine starts the SD activation and tells the host to connect to the pool:

2021-06-22 10:15:51,812+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [4890bba8] START, ActivateStorageDomainVDSCommand( ActivateStorageDomainVDSCommandParameters:{storagePoolId='cf5df12c-d2e5-11eb-bff9-5254000000fa', ignoreFailoverLimit='false', storageDomainId='e824ca4c-2fd3-460a-b746-f9e80d5c11fb'}), log id: 122683a
2021-06-22 10:15:51,891+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [4890bba8] FINISH, ActivateStorageDomainVDSCommand, return: true, log id: 122683a
2021-06-22 10:15:51,922+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-416) [4890bba8] START, ConnectStoragePoolVDSCommand(HostName = host2.kvm, ConnectStoragePoolVDSCommandParameters:{hostId='fbaf49bb-0936-4dc0-b912-d1d8815f0724', vdsId='fbaf49bb-0936-4dc0-b912-d1d8815f0724', storagePoolId='cf5df12c-d2e5-11eb-bff9-5254000000fa', masterVersion='1'}), log id: 7d141be
2021-06-22 10:15:51,922+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-416) [4890bba8] Executing with domain map: {6a1bd11f-0db7-4ca8-bc00-b5935f992dce=active, e824ca4c-2fd3-460a-b746-f9e80d5c11fb=active}

10. And we are suddenly done? VDSM returned ConnectStoragePool quickly:

2021-06-22 10:15:52,199+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-416) [4890bba8] FINISH, ConnectStoragePoolVDSCommand, return: , log id: 7d141be
2021-06-22 10:15:52,219+10 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-383) [4890bba8] EVENT_ID: USER_ACTIVATED_STORAGE_DOMAIN(966), Storage Domain iSCSI-A (Data Center Default) was activated by admin@internal-authz
2021-06-22 10:15:52,226+10 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-383) [4890bba8] EVENT_ID: USER_ATTACH_STORAGE_DOMAIN_TO_POOL(962), Storage Domain iSCSI-A was attached to Data Center Default by admin@internal-authz

11. On the host side, we are *not* really done yet: the connectStoragePool command only just started acquiring ids on the SD again, and we need 20s to acquire the lockspace:

2021-06-22 10:15:52 5387 [61273]: s7 delta_acquire delta_short_delay 20

12. But the engine moved on, and the VM was imported while the host was still acquiring ids:

2021-06-22 10:16:04,176+10 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-419) [e61c9891-2290-4487-9ec0-f78cd506f949] EVENT_ID: VM_IMPORT_FROM_CONFIGURATION_EXECUTED_SUCCESSFULLY(174), VM VM1 has been successfully imported from the given configuration.

13. Now the user tries to run the VM and it fails, as the lease is not acquired:

2021-06-22 10:16:09,349+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-10) [5a861725-466e-4061-85f9-f57d22f2656c] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM VM1 due to a failed validation: [Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host2.kvm did not satisfy internal filter VM leases ready because VM lease is not ready yet.] (User: admin@internal-authz).
14. Now the lease is finally acquired (4s after the run attempt):

2021-06-22 10:16:13 5408 [24733]: s7 add_lockspace done

15. And the run fails again, almost 10s after the lease was acquired:

2021-06-22 10:16:22,433+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-9) [16f737cc-f64b-49a0-ac01-67249cfe2ca3] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM VM1 due to a failed validation: [Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host2.kvm did not satisfy internal filter VM leases ready because VM lease is not ready yet.] (User: admin@internal-authz).

16. The next run succeeds, after the next monitoring cycle, once the host reported the SD lockspace as acquired:

2021-06-22 10:16:27,996+10 INFO [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-429) [2c3ff3ab-fd48-40a4-ab11-fd8f8c3f397d] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='fbaf49bb-0936-4dc0-b912-d1d8815f0724', vmId='d6623aca-28d1-48cd-8a12-72e4bf927dc8', vm='VM [VM1]'}), log id: d404c8d

So, in summary, I think:

Comment 1
---------
At step 10, connectStoragePool seems to run asynchronously on the host: the host is still acquiring the sanlock lockspace, but the engine has already moved on. This allows the user (or the DR playbook) to try to run a VM with a lease even before the lease is ready, which fails. Either the DR script needs to implement some wait here, or something must be made synchronous (see the client-side retry sketch at the end of this description).

Comment 2
---------
At step 15, the lease was ready but the engine denied scheduling anyway. The engine uses cached information for this decision, retrieved on host monitoring cycles, in backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/vdsbroker/VdsBrokerObjectsBuilder.java (around lines 1564-1594):

private static void updateVDSDomainData(VDS vds, Map<String, Object> struct) {
    ...
    if (struct.containsKey(VdsProperties.domains)) {
        Map<String, Object> domains = (Map<String, Object>)
                struct.get(VdsProperties.domains);
        ArrayList<VDSDomainsData> domainsData = new ArrayList<>();
        for (Map.Entry<String, ?> value : domains.entrySet()) {
            ...
            data.setAcquired(acquired);

So the engine only allows scheduling after the next monitoring cycle. It's hard to say where the problem is; it seems to be just a consequence of how things work. Maybe VDSM needs to return from connectStoragePool only after it acquires the lockspace, or maybe the engine needs to check whether all hosts are already reporting the domain as acquired before the SD is really considered active. It is probably a better idea to fix this somehow in the engine than to sleep in the DR playbook, as it is also reproducible via the UI if you do it quickly (see the illustrative caching sketch at the end of this description).

Version-Release number of selected component (if applicable):
rhvm-4.4.6.8-0.1.el8ev.noarch
vdsm-4.40.60.7-1.el8ev.x86_64

How reproducible:
Always

Steps to Reproduce:
Perform an Active-Passive failover, or simply do Attach + Activate + Import + Run very quickly.

Actual results:
* VM with a lease fails to run after import

Expected results:
* VM with a lease runs after import
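To make the client-side wait mentioned under Comment 1 concrete, here is a minimal sketch assuming the oVirt Java SDK v4 (org.ovirt.engine.sdk4). The engine URL, credentials, truststore, VM name and the timeout/backoff values are placeholders, and the loop simply retries the start request until it is accepted or a deadline passes; it is an illustration of the workaround idea, not part of the engine or of the DR role.

import java.util.concurrent.TimeUnit;

import org.ovirt.engine.sdk4.Connection;
import org.ovirt.engine.sdk4.ConnectionBuilder;
import org.ovirt.engine.sdk4.services.VmService;
import org.ovirt.engine.sdk4.services.VmsService;
import org.ovirt.engine.sdk4.types.Vm;

public class RunVmWithLeaseRetry {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; TLS setup simplified to a truststore file.
        Connection connection = ConnectionBuilder.connection()
                .url("https://engine.example.com/ovirt-engine/api")
                .user("admin@internal")
                .password("password")
                .trustStoreFile("truststore.jks")
                .build();
        try {
            VmsService vmsService = connection.systemService().vmsService();
            Vm vm = vmsService.list().search("name=VM1").send().vms().get(0);
            VmService vmService = vmsService.vmService(vm.id());

            // Retry the start for up to ~2 minutes. A rejected run (for example the
            // "VM lease is not ready yet" validation) is assumed to surface as an
            // exception from send(), so it is swallowed and retried after a pause.
            long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(2);
            while (true) {
                try {
                    vmService.start().send();
                    break; // start request accepted
                } catch (Exception e) {
                    if (System.nanoTime() > deadline) {
                        throw e; // give up and propagate the last failure
                    }
                    TimeUnit.SECONDS.sleep(10); // give sanlock time to acquire the lockspace
                }
            }
        } finally {
            connection.close();
        }
    }
}

An Ansible or Python SDK variant of the same bounded retry would fit the DR playbook more naturally; the point is only the retry loop around the run request.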
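To make the caching behaviour described under Comment 2 concrete, the following self-contained sketch models it. It is illustrative only (the class and method names are made up, not the real VmLeasesReadyFilter or engine monitoring code): the per-host "acquired" flags are refreshed only when a monitoring cycle runs, so a lockspace that sanlock acquired a moment ago still looks not-ready to scheduling until the next refresh.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model of the behaviour described in Comment 2, not engine code.
public class LeaseReadinessCache {

    // hostId -> (storageDomainId -> "acquired" flag from the last monitoring cycle)
    private final Map<UUID, Map<UUID, Boolean>> acquiredByHost = new ConcurrentHashMap<>();

    // Called only when a host monitoring cycle reports the domains it sees.
    public void onMonitoringCycle(UUID hostId, Map<UUID, Boolean> reportedDomains) {
        acquiredByHost.put(hostId, Map.copyOf(reportedDomains));
    }

    // What a "VM leases ready"-style filter effectively asks: according to the
    // cache, has this host already acquired the domain that holds the VM lease?
    // Between monitoring cycles this answer can lag reality, which is why the
    // run in step 15 was rejected even though sanlock had already acquired
    // the lockspace.
    public boolean isLeaseDomainReady(UUID hostId, UUID leaseStorageDomainId) {
        return acquiredByHost
                .getOrDefault(hostId, Map.of())
                .getOrDefault(leaseStorageDomainId, false);
    }
}

With this structure, the window can only be closed by refreshing the cached data on demand before filtering, or by not declaring the SD active until the hosts actually report the lockspace as acquired, which matches the options listed above.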
*** This bug has been marked as a duplicate of bug 1968433 ***
We think this may not be specific to DR but applies to every VM import: we don't wait long enough for the lease to be created. By waiting for the lease to be created in ImportVm we would solve it in all cases (a rough sketch of such a wait follows).
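As a rough illustration of what such a wait inside the import flow could look like (purely hypothetical, not engine code; the leaseReady probe, i.e. how the engine would learn that the lease has actually been created and acquired, is deliberately left abstract):

import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper: poll a "lease is ready" probe with a bounded timeout
// before the imported VM is allowed to run.
public final class LeaseWait {

    private LeaseWait() {
    }

    public static boolean waitForLease(BooleanSupplier leaseReady,
                                       long timeoutSeconds,
                                       long pollSeconds) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
        while (System.nanoTime() < deadline) {
            if (leaseReady.getAsBoolean()) {
                return true;
            }
            TimeUnit.SECONDS.sleep(pollSeconds);
        }
        // One last check after the deadline, then report the result to the caller.
        return leaseReady.getAsBoolean();
    }
}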
We believe it is fixed by other lease-related changes in 4.5.
Let's wait for the first build, then.
Verified. The VM was up in both failover and failback in the same flow.

Versions:
RHV 4.5.3-3
ovirt-engine-4.5.3.1-2.el8ev.noarch
vdsm-4.50.3.4-1.el8ev.ppc64le
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.3] bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:8502