Bug 1974535 - Virtual Machine with lease fails to run on DR failover
Summary: Virtual Machine with lease fails to run on DR failover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.5.3
Assignee: Arik
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-22 01:12 UTC by Germano Veit Michel
Modified: 2024-10-01 18:44 UTC
8 users

Fixed In Version: ovirt-engine-4.5.3.1
Doc Type: Bug Fix
Doc Text:
Previously, a highly available VM with a VM lease running on a primary site might not start on the secondary site during active-passive failover, because none of the hosts were yet reported as ready to support VM leases. With this release, when a highly available VM with a VM lease fails to start because all hosts were filtered out for not being ready to support VM leases, the engine keeps retrying the start periodically. If it takes time for the engine to discover that the storage domain containing the VM lease is ready, the attempts to start the VM continue until the status of the storage domain changes.
Clone Of:
Environment:
Last Closed: 2022-11-16 12:17:27 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
sleviim: needinfo-




Links
* Github oVirt ovirt-engine pull 660 (Merged): Handle absent of lease SD when starting ha vm (last updated 2022-09-21 21:13:30 UTC)
* Red Hat Knowledge Base (Solution) 6131621 (last updated 2021-06-22 01:48:28 UTC)
* Red Hat Product Errata RHSA-2022:8502 (last updated 2022-11-16 12:17:37 UTC)

Description Germano Veit Michel 2021-06-22 01:12:12 UTC
Description of problem:

Basically, a VM with a lease fails to run if Attach+Activate+Import VM+Run VM is done in quick succession, which is what the DR playbook does for HA VMs.
The issue is that the engine filters out all hosts during scheduling; the VmLeasesReadyFilter policy is what rejects them.

I believe there is more than one problem here. I'll describe the entire process and point out the things that don't make much sense to me.

Scenario:
* During DR, we attach and import an SD that was previously attached to another DC. After this, we import the VM (which has a lease) and attempt to run it.

1. First, the engine tells the SPM to detach the SD from the previous DC (force detach):
2021-06-22 10:12:37,242+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [f93f7a57-561b-4106-9d4d-3b51df8e50bc] START, DetachStorageDomainVDSCommand( DetachStorageDomainVDSCommandParameters:{storagePoolId='cf5df12c-d2e5-11eb-bff9-5254000000fa', ignoreFailoverLimit='false', storageDomainId='e824ca4c-2fd3-460a-b746-f9e80d5c11fb', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='0', force='true'}), log id: 26ff7009

2. The SPM starts acquiring the lock; this is a long delay because the previous release was unclean:
2021-06-22 10:12:37 5191 [59724]: s5 delta_acquire delta_large_delay 80 delay 140

3. Acquired:
2021-06-22 10:15:17 5352 [59724]: s5 delta_acquire done 1 11 5332
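
(A quick sanity check on the timing, assuming the bracketed counters in the sanlock log are seconds of uptime: 5352 - 5191 = 161s between the start and the completion of the acquire, matching 10:12:37 -> 10:15:17, i.e. the 140s delay printed above plus some overhead.)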

4. VDSM finishes the detach and releases the lock, replying to the engine:
2021-06-22 10:15:19 5354 [59724]: s5 delta_release done 1 11 0

5. The engine gets the answer from VDSM and moves on to attach the SD to the new pool:

2021-06-22 10:15:20,667+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [f93f7a57-561b-4106-9d4d-3b51df8e50bc] FINISH, DetachStorageDomainVDSCommand, return: , log id: 26ff7009
2021-06-22 10:15:20,670+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [f93f7a57-561b-4106-9d4d-3b51df8e50bc] START, AttachStorageDomainVDSCommand( AttachStorageDomainVDSCommandParameters:{storagePoolId='cf5df12c-d2e5-11eb-bff9-5254000000fa', ignoreFailoverLimit='false', storageDomainId='e824ca4c-2fd3-460a-b746-f9e80d5c11fb'}), log id: 37fbfe63

6. VDSM acquires it again; now it's a quick one, as the last release was clean:
2021-06-22 10:15:20 5355 [60857]: s6 delta_acquire delta_short_delay 20
2021-06-22 10:15:41 5376 [24733]: s6 add_lockspace done

7. VDSM then releases it again and replies to the engine; the SD is attached:
2021-06-22 10:15:43 5377 [60857]: s6 delta_release done 1 12 0
2021-06-22 10:15:44,069+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [f93f7a57-561b-4106-9d4d-3b51df8e50bc] FINISH, AttachStorageDomainVDSCommand, return: , log id: 37fbfe63

8. The engine reads the SD, retrieves OVFs, etc.

9. The engine starts SD activation and tells the host to connect to the pool:

2021-06-22 10:15:51,812+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [4890bba8] START, ActivateStorageDomainVDSCommand( ActivateStorageDomainVDSCommandParameters:{storagePoolId='cf5df12c-d2e5-11eb-bff9-5254000000fa', ignoreFailoverLimit='false', storageDomainId='e824ca4c-2fd3-460a-b746-f9e80d5c11fb'}), log id: 122683a
2021-06-22 10:15:51,891+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-383) [4890bba8] FINISH, ActivateStorageDomainVDSCommand, return: true, log id: 122683a

2021-06-22 10:15:51,922+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-416) [4890bba8] START, ConnectStoragePoolVDSCommand(HostName = host2.kvm, ConnectStoragePoolVDSCommandParameters:{hostId='fbaf49bb-0936-4dc0-b912-d1d8815f0724', vdsId='fbaf49bb-0936-4dc0-b912-d1d8815f0724', storagePoolId='cf5df12c-d2e5-11eb-bff9-5254000000fa', masterVersion='1'}), log id: 7d141be
2021-06-22 10:15:51,922+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-416) [4890bba8] Executing with domain map: {6a1bd11f-0db7-4ca8-bc00-b5935f992dce=active, e824ca4c-2fd3-460a-b746-f9e80d5c11fb=active}

10. And we are suddenly done? VDSM returned from ConnectStoragePool quickly...

2021-06-22 10:15:52,199+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-416) [4890bba8] FINISH, ConnectStoragePoolVDSCommand, return: , log id: 7d141be

2021-06-22 10:15:52,219+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-383) [4890bba8] EVENT_ID: USER_ACTIVATED_STORAGE_DOMAIN(966), Storage Domain iSCSI-A (Data Center Default) was activated by admin@internal-authz
2021-06-22 10:15:52,226+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-383) [4890bba8] EVENT_ID: USER_ATTACH_STORAGE_DOMAIN_TO_POOL(962), Storage Domain iSCSI-A was attached to Data Center Default by admin@internal-authz

11. On the host side, we are *not* really done yet: we just started acquiring IDs on the SD again upon the connect storage pool command, and we need 20s to acquire them:

2021-06-22 10:15:52 5387 [61273]: s7 delta_acquire delta_short_delay 20
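
(Same sanity check, under the same assumption about the sanlock counters: the 20s short delay starting at 10:15:52 predicts acquisition around 10:16:12, which matches the add_lockspace completion at 10:16:13 in step 14 below, and explains why the run attempt at 10:16:09 in step 13 fails.)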

12. But the engine moved on, and the VM was imported while the host was still acquiring IDs:

2021-06-22 10:16:04,176+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-419) [e61c9891-2290-4487-9ec0-f78cd506f949] EVENT_ID: VM_IMPORT_FROM_CONFIGURATION_EXECUTED_SUCCESSFULLY(174), VM VM1 has been successfully imported from the given configuration.

13. Now the user tries to run the VM and it fails, as the lease is not acquired:

2021-06-22 10:16:09,349+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-10) [5a861725-466e-4061-85f9-f57d22f2656c] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM VM1 due to a failed validation: [Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host2.kvm did not satisfy internal filter VM leases ready because VM lease is not ready yet.] (User: admin@internal-authz).

14. Now the lease is finally acquired (4s after the run attempt):

2021-06-22 10:16:13 5408 [24733]: s7 add_lockspace done

15. And the run fails again, almost 10s after the lease was acquired:

2021-06-22 10:16:22,433+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-9) [16f737cc-f64b-49a0-ac01-67249cfe2ca3] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM VM1 due to a failed validation: [Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host2.kvm did not satisfy internal filter VM leases ready because VM lease is not ready yet.] (User: admin@internal-authz).

16. The next run succeeds, after the next monitoring cycle, when the host reported the SD as acquired:

2021-06-22 10:16:27,996+10 INFO  [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-429) [2c3ff3ab-fd48-40a4-ab11-fd8f8c3f397d] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='fbaf49bb-0936-4dc0-b912-d1d8815f0724', vmId='d6623aca-28d1-48cd-8a12-72e4bf927dc8', vm='VM [VM1]'}), log id: d404c8d

So, in summary... I think:

Comment 1
---------
At step 10, connectStoragePool seems to run asynchronously on the host: the host is still acquiring the sanlock leases, but the engine has moved on. This allows the user (or the DR playbook) to try to run a VM with a lease even before the lease is ready, which fails. Either the DR script needs to implement some wait here, or something must be made synchronous. A minimal sketch of such a wait follows.
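
For illustration only, a Java sketch of the kind of wait that could close this gap, whether engine-side or in the DR tooling. All names here (LeaseReadyWaiter, waitForLeaseDomain, the injected check) are hypothetical, not real engine APIs; in the real engine the check would consult the per-host 'acquired' flag described in Comment 2 below.

    // Illustrative sketch only, not engine code: poll until the host
    // reports the lease's storage domain as acquired, or time out.
    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.BiPredicate;

    class LeaseReadyWaiter {
        private static final Duration POLL_INTERVAL = Duration.ofSeconds(2);

        // Hypothetical check; in the real engine this would consult the
        // per-host 'acquired' flag filled in by updateVDSDomainData()
        // (see Comment 2 below).
        private final BiPredicate<String, String> leaseDomainAcquiredOn;

        LeaseReadyWaiter(BiPredicate<String, String> leaseDomainAcquiredOn) {
            this.leaseDomainAcquiredOn = leaseDomainAcquiredOn;
        }

        boolean waitForLeaseDomain(String hostId, String leaseSdId, Duration timeout)
                throws InterruptedException {
            Instant deadline = Instant.now().plus(timeout);
            while (Instant.now().isBefore(deadline)) {
                if (leaseDomainAcquiredOn.test(hostId, leaseSdId)) {
                    return true;
                }
                Thread.sleep(POLL_INTERVAL.toMillis());
            }
            return false; // still not ready: fail the run with a clear error
        }
    }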

Comment 2
---------
At step 15, the lease was ready but the engine denied scheduling anyway. The engine is using cached info for this decision, retrieved on monitoring cycles here:

backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/vdsbroker/VdsBrokerObjectsBuilder.java

    private static void updateVDSDomainData(VDS vds, Map<String, Object> struct) {
        if (struct.containsKey(VdsProperties.domains)) {
            Map<String, Object> domains = (Map<String, Object>)
                    struct.get(VdsProperties.domains);
            ArrayList<VDSDomainsData> domainsData = new ArrayList<>();
            for (Map.Entry<String, ?> value : domains.entrySet()) {
                ...
                data.setAcquired(acquired);

So the engine only allows scheduling after the next monitoring cycle. 
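
For illustration, a minimal Java sketch (not the actual VmLeasesReadyFilter code) of how a check consuming this cached data behaves; the Map stands in for the per-host VDSDomainsData cache built above, and all names are illustrative.

    // Illustrative sketch only, not the actual VmLeasesReadyFilter code.
    // The Map stands in for the per-host VDSDomainsData cache built by
    // updateVDSDomainData() above; it is refreshed once per monitoring
    // cycle, which is why the run at step 15 is still rejected.
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    class VmLeasesReadyCheck {
        // domainId -> acquired flag, as last reported by the host
        boolean vmLeaseReady(Map<String, Boolean> cachedDomains, String leaseSdId) {
            return cachedDomains.getOrDefault(leaseSdId, Boolean.FALSE);
        }

        // Keep only hosts whose cached data already shows the lease SD
        // as acquired; with a freshly attached SD this is empty until
        // the next monitoring cycle, so scheduling finds no host.
        List<String> filterHosts(Map<String, Map<String, Boolean>> cacheByHost,
                String leaseSdId) {
            return cacheByHost.entrySet().stream()
                    .filter(e -> vmLeaseReady(e.getValue(), leaseSdId))
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }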

It's hard to say where the problem is; it seems to be just a consequence of how things work. Maybe VDSM needs to return from connectStoragePool only after it acquires the lock, or maybe the engine needs to check whether all hosts are already reporting the domain as acquired before the SD is really considered active.

It's probably a better idea to fix this somehow in the engine than to sleep in the DR playbook, as it's also reproducible via the UI if you do it quickly.

Version-Release number of selected component (if applicable):
rhvm-4.4.6.8-0.1.el8ev.noarch
vdsm-4.40.60.7-1.el8ev.x86_64

How reproducible:
Always

Steps to Reproduce:
It can be done by doing an Active-Passive failover, or just by Attach+Activate+Import+Run very quickly.

Actual results:
* VM with lease fails to run after import

Expected results:
* VM with lease runs after import

Comment 1 Eyal Shenitzky 2021-06-27 15:22:41 UTC

*** This bug has been marked as a duplicate of bug 1968433 ***

Comment 5 Arik 2022-01-26 13:42:39 UTC
We think this may not be specific to DR but applies to every VM import: we don't wait long enough for the lease to be created. By waiting for the lease to be created in ImportVm, we solve it in all cases; see the sketch below.
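
A minimal sketch of what such a gate at the end of the import flow could look like; ImportVmLeaseGate, waitForLeaseCreated, and IMPORT_LEASE_TIMEOUT are hypothetical names, not the actual ImportVm implementation.

    // Illustrative sketch only, hypothetical names: gate the end of the
    // import on the lease existing, so an immediate Run VM does not race
    // the lease creation.
    import java.time.Duration;

    abstract class ImportVmLeaseGate {
        static final Duration IMPORT_LEASE_TIMEOUT = Duration.ofSeconds(60);

        // Stand-in for a real check that the lease volume exists on the
        // lease storage domain.
        abstract boolean waitForLeaseCreated(String vmId, String leaseSdId,
                Duration timeout);

        void finishImport(String vmId, String leaseSdId) {
            if (leaseSdId != null
                    && !waitForLeaseCreated(vmId, leaseSdId, IMPORT_LEASE_TIMEOUT)) {
                // Don't fail the import; just make the race visible.
                System.err.println("Lease for VM " + vmId
                        + " not ready yet; Run VM may be rejected until the"
                        + " next monitoring cycle");
            }
        }
    }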

Comment 9 Michal Skrivanek 2022-05-18 14:30:48 UTC
we believe it's fixed by other lease-related changes in 4.5

Comment 11 Michal Skrivanek 2022-05-18 14:33:14 UTC
let's wait for first build then

Comment 19 sshmulev 2022-10-25 10:07:43 UTC
Verified.
VM was up both in failover and fallback in the same flow.

Versions:
RHV 4.5.3-3
ovirt-engine-4.5.3.1-2.el8ev.noarch
vdsm-4.50.3.4-1.el8ev.ppc64le

Comment 23 errata-xmlrpc 2022-11-16 12:17:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.3] bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8502

