Created attachment 1457815 [details]
logs

Description of problem:
Failed to start a VM with the LibVirtError "Failed to acquire lock: No space left on device".

2018-07-04 03:08:46,895+0300 ERROR (vm/c77ac961) [virt.vm] (vmId='c77ac961-190a-4466-9783-779ae53f93e0') The vm start process failed (vm:942)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 871, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2877, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Failed to acquire lock: No space left on device

During the test case, the storage domain is being deactivated and then activated again.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.5-0.1.el7ev.noarch

How reproducible:
Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:
The VM failed to start on any of the 3 hosts, with a LibVirtError.

Expected results:
The VM should start successfully, as there is enough available space.

Additional info:
Steps to reproduce (according to the test case):
1. Create an HA VM with a lease residing on the default storage domain
2. Move the storage domain to maintenance
3. Try to start the VM (expected result -> fail)
4. Activate the storage domain where the lease resides
5. Start the VM
Hey Yossi,

This issue looks like a race condition between the refresh of all the hosts in the storage pool, which is initiated asynchronously when activating the storage domain, and the start of the VM.

It seems that the host that runs the VM still hasn't been refreshed after you activated the storage domain.

If you try to run the VM 2 seconds after the storage domain activation ends, it will work fine.
(In reply to Eyal Shenitzky from comment #2)
> This issue seems like a race condition between the refresh of all the hosts
> in the storage pool that is initiated asynchronously when activating the
> storage domain and the run of the VM.
>
> It seems like the host that runs the VM is still doesn't refreshed after you
> activated the storage domain.

I think that if I got to a situation where the hosts are up and the storage domain is active, I should be able to start the VM without taking into account the time it takes for the host to be refreshed (even if it takes only a few seconds).
Martin,

In order to fix this bug, the engine needs to check whether the host selected to run a VM with a lease has registered on the storage domain that holds the lease.

The flow in VDSM for activating a storage domain is as follows:
- VDSM starts the domain monitor
- The domain monitor asks sanlock to acquire the host id (sanlock.add_lockspace)
- The domain monitor checks every 10 seconds whether the host id has been acquired
- Before the host id is acquired, the domain reports "acquired": false in repoStats
- Once the host id has been acquired, the domain reports "acquired": true in repoStats

A VM with a lease must not be run before the domain reports "acquired": true in repoStats.

Currently, the engine ignores this report, so a VM with a lease manages to start even though the host id is not yet registered on the storage domain that holds the lease, and the run then fails with a sanlock error.
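The activation flow above can be sketched as a small simulation. This is not VDSM's actual code; the class, the lock-manager methods (add_lockspace, inquire), and the data shapes are all illustrative assumptions, shown only to make the "acquired" reporting sequence concrete.

```python
# Minimal sketch of the flow described above: the monitor asks the lock
# manager for the host id and polls until it is acquired; the "acquired"
# flag is what repoStats would expose. All names here are hypothetical,
# not VDSM's real implementation.

class DomainMonitor:
    CHECK_INTERVAL = 10  # seconds between host-id checks, per the flow above

    def __init__(self, lock_manager, sd_uuid, host_id):
        self._lock = lock_manager
        self._sd_uuid = sd_uuid
        self._host_id = host_id
        self._acquired = False

    def start(self):
        # Ask the lock manager (sanlock in VDSM) to acquire the host id
        # for this domain's lockspace. Acquisition is asynchronous.
        self._lock.add_lockspace(self._sd_uuid, self._host_id)

    def check(self):
        # Called periodically; refreshes the acquisition state.
        self._acquired = self._lock.inquire(self._sd_uuid, self._host_id)

    def repo_stats(self):
        # Before acquisition: {"acquired": False}; afterwards: {"acquired": True}.
        return {"acquired": self._acquired}
```

The point of the sketch is the window between start() and the first successful check(): during that window the domain is active from the engine's point of view, but a VM lease on it cannot yet be acquired.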
Sure, the engine should do that. I wonder if the host should be UP when and only when all the domains have acquired their locks.
(In reply to Martin Sivák from comment #5)
> Sure, the engine should do that. I wonder if the host should be UP when and
> only when all the domains have acquired their locks.

I think so, yes. This would also prevent future race conditions that haven't been discovered yet.
So should this bug move to SLA?
Well, neither host status nor storage monitoring is SLA :) There is no SLA team anymore (it was merged with Virt), and I am not on that team either. Can somebody from storage's engine team talk to mperina's infra folks about this, please?
I'm not sure I completely understand this bug, but I think you want to prevent a host from being Up if it is not connected to the storage where the VM leases reside. If so, this is not completely possible: a host is moved to the Up status automatically once the engine is able to communicate with it and has received a response to getCaps. When a host moves to Up, we execute InitVdsOnUpCommand [1], which contains various flows (for example refreshing host devices, connecting the host to storage, ...), and if any of those flows fail, the host's status changes to NonOperational. So if you want to prevent the host from being Up, that's the place where you need to add your new code ...

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/InitVdsOnUpCommand.java#L141
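The host-activation behavior described above can be sketched like this. The real code is Java (InitVdsOnUpCommand); this is a hedged Python illustration with hypothetical function and flow names, showing only the "any failed init flow demotes the host to NonOperational" shape.

```python
# Illustrative sketch of the flow described above: the host becomes Up once
# getCaps is answered, then a series of initialization flows run, and the
# first failure moves the host to NonOperational. Names are hypothetical.

def init_vds_on_up(host, init_flows):
    """Run each initialization flow; demote the host on the first failure."""
    host["status"] = "Up"
    for flow in init_flows:
        if not flow(host):
            host["status"] = "NonOperational"
            break
    return host["status"]
```

Under this shape, a new "host id acquired on the lease's storage domain" check would be one more flow in that list, rather than a gate on reaching Up in the first place.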
Yosi, note that after this fix you will not encounter the LibVirtError; instead, you will get a validation error when trying to run the VM if the host id has not yet been acquired. You'll need to verify in your test case that the host id is acquired, in order for the test to run without hitting a race condition.
(In reply to Tal Nisan from comment #9) > Yosi, note that after this fix you will not encounter the LibVirtError but > you will get a validation error upon trying to run the VM if the host id is > not yet acquired, you'll need to verify in your test case that the host id > is acquired in order for the test to run without encountering a race > condition OK. Noted. Thanks
Tested using:
ovirt-engine-4.3.0-0.0.master.20181016132820.gite60d148.el7.noarch

The test case succeeded and the validation error appears in the engine log:

2018-11-05 13:54:19,119+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_0_TestCase25515_0513513119 is down with error. Exit message: Failed to acquire lock: No space left on device.

Moving to VERIFIED
(In reply to Yosi Ben Shimon from comment #11)
> Tested using:
> ovirt-engine-4.3.0-0.0.master.20181016132820.gite60d148.el7.noarch
>
> The test case succeeded and the validation error appears in the engine log:
> 2018-11-05 13:54:19,119+02 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_DOWN_ERROR(119), VM
> vm_0_TestCase25515_0513513119 is down with error. Exit message: Failed to
> acquire lock: No space left on device.
>
> Moving to VERIFIED

This is not the expected behavior.
When running a VM with a lease on a storage domain that was just activated, you should get an error saying that there is no host that can run this VM.
The error above means that the VM was scheduled to run on a host but failed to start there.
Verified (by mistake) on a version that doesn't include the fix. Moving back to MODIFIED for re-test.
Tested again on:
ovirt-engine-4.3.0-0.0.master.20181101091940.git61310aa.el7.noarch
* This version includes the fix.

While the storage domain was deactivated, I tried to start the VM and got the following entries in the engine log:

2018-11-05 17:02:30,829+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-0) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_VM is down with error. Exit message: Failed to acquire lock: No space left on device.

2018-11-05 17:02:30,831+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-0) [] add VM 'bf2c0a15-cfd8-4e93-ac06-d0e817cfca11'(test_VM) to rerun treatment

2018-11-05 17:02:30,841+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-0) [] Rerun VM 'bf2c0a15-cfd8-4e93-ac06-d0e817cfca11'. Called from VDS 'host_mixed_1'

2018-11-05 17:02:30,868+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1640) [] EVENT_ID: USER_INITIATED_RUN_VM_FAILED(151), Failed to run VM test_VM on Host host_mixed_1.
2018-11-05 17:02:30,882+02 INFO [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] Lock Acquired to object 'EngineLock:{exclusiveLocks='[bf2c0a15-cfd8-4e93-ac06-d0e817cfca11=VM]', sharedLocks=''}'

2018-11-05 17:02:30,953+02 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] START, IsVmDuringInitiatingVDSCommand( IsVmDuringInitiatingVDSCommandParameters:{vmId='bf2c0a15-cfd8-4e93-ac06-d0e817cfca11'}), log id: 7c00c76a

2018-11-05 17:02:30,954+02 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] FINISH, IsVmDuringInitiatingVDSCommand, return: false, log id: 7c00c76a

2018-11-05 17:02:30,995+02 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (EE-ManagedThreadFactory-engine-Thread-1640) [] Candidate host 'host_mixed_3' ('ccfe1e53-b91f-486f-b20e-17c9633d0391') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VM leases ready' (correlation id: null)

2018-11-05 17:02:31,041+02 INFO [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] Running command: RunVmCommand internal: false. Entities affected : ID: bf2c0a15-cfd8-4e93-ac06-d0e817cfca11 Type: VMAction group RUN_VM with role type USER

2018-11-05 17:02:31,076+02 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (EE-ManagedThreadFactory-engine-Thread-1640) [] Candidate host 'host_mixed_2' ('651249cc-fc95-4822-9bf4-9e40ea67f9cc') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VM leases ready' (correlation id: 04c66f4e-1faa-481d-a7ee-9fbf2d69e536)

2018-11-05 17:02:31,076+02 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (EE-ManagedThreadFactory-engine-Thread-1640) [] Candidate host 'host_mixed_3' ('ccfe1e53-b91f-486f-b20e-17c9633d0391') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VM leases ready' (correlation id: 04c66f4e-1faa-481d-a7ee-9fbf2d69e536)

2018-11-05 17:02:31,078+02 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] Can't find VDS to run the VM 'bf2c0a15-cfd8-4e93-ac06-d0e817cfca11' on, so this VM will not be run.

2018-11-05 17:02:31,091+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1640) [] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM test_VM (User: admin@internal-authz).

This is the expected behavior.
Moving to VERIFIED.
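The scheduling behavior seen in these log entries can be sketched as a filter over candidate hosts. This is a hedged illustration, not the engine's actual Java code; the function names and the repo_stats data shape are assumptions chosen to mirror the "VM leases ready" filter and the "Can't find VDS to run the VM" outcome.

```python
# Sketch of a "VM leases ready" style filter: drop every candidate host
# whose monitoring data does not yet report the host id as acquired on the
# storage domain holding the VM lease. Names and data shapes are
# illustrative only.

def filter_vm_leases_ready(hosts, lease_sd_uuid):
    """Keep only hosts that have acquired their host id on the lease's domain."""
    return [
        h for h in hosts
        if h["repo_stats"].get(lease_sd_uuid, {}).get("acquired", False)
    ]

def pick_host(hosts, lease_sd_uuid):
    candidates = filter_vm_leases_ready(hosts, lease_sd_uuid)
    if not candidates:
        # Corresponds to the engine refusing to schedule the VM at all,
        # instead of starting it and hitting the sanlock error later.
        return None
    return candidates[0]["name"]
```

When every host is filtered out, the run fails with a validation error up front, which is the behavior verified here.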
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.