Created attachment 1457815 [details]
logs

Description of problem:
Failed to start a VM with the LibVirtError "Failed to acquire lock: No space left on device".

2018-07-04 03:08:46,895+0300 ERROR (vm/c77ac961) [virt.vm] (vmId='c77ac961-190a-4466-9783-779ae53f93e0') The vm start process failed (vm:942)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 871, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2877, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Failed to acquire lock: No space left on device

During the test case, the storage domain is being deactivated and then activated again.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.5-0.1.el7ev.noarch

How reproducible:
Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:
The VM failed to start on any of the 3 hosts, with a LibVirtError.

Expected results:
The VM should start successfully, as there is enough available space.

Additional info:
Steps to reproduce (according to the test case):
1. Create an HA VM with a lease residing on the default storage domain
2. Move the storage domain to maintenance
3. Try to start the VM (expected result -> fail)
4. Activate the storage domain where the lease resides
5. Start the VM
Hey Yossi,

This issue looks like a race condition between the refresh of all the hosts in the storage pool, which is initiated asynchronously when activating the storage domain, and the start of the VM.

It seems that the host that runs the VM still hasn't been refreshed after you activated the storage domain.

If you try to run the VM 2 seconds after the storage domain activation ends, it will work fine.
(In reply to Eyal Shenitzky from comment #2)
> This issue seems like a race condition between the refresh of all the hosts
> in the storage pool that is initiated asynchronously when activating the
> storage domain and the run of the VM.
>
> It seems like the host that runs the VM is still doesn't refreshed after you
> activated the storage domain.

I think that if I got to a situation where the hosts are up and the storage domain is active, I should be able to start the VM without taking into account the time it takes for the host to be refreshed (even if it takes only a few seconds).
Martin,

In order to fix this bug, the engine needs to check whether the host selected to run a VM with a lease has registered on the storage domain that holds the lease.

The flow in VDSM for activating a storage domain is as follows:
- VDSM starts the domain monitor
- The domain monitor asks sanlock to acquire the host id (sanlock.add_lockspace)
- The domain monitor checks every 10 seconds whether the host id has been acquired
- Before the host id is acquired, the domain reports "acquired": false in repoStats
- Once the host id has been acquired, the domain reports "acquired": true in repoStats

A VM with a lease must not be run before the domain reports "acquired": true in repoStats.

Currently, the engine ignores this report, so a VM with a lease manages to start even though the host id is not yet registered on the storage domain that holds the lease, and the run then fails with a sanlock error.
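The activation flow above can be sketched as a small simulation. This is not VDSM's actual code; the class, the lock-manager methods (add_lockspace, inquire), and the data shapes are all illustrative assumptions, shown only to make the "acquired" reporting sequence concrete.

```python
# Minimal sketch of the flow described above: the monitor asks the lock
# manager for the host id and polls until it is acquired; the "acquired"
# flag is what repoStats would expose. All names here are hypothetical,
# not VDSM's real implementation.

class DomainMonitor:
    CHECK_INTERVAL = 10  # seconds between host-id checks, per the flow above

    def __init__(self, lock_manager, sd_uuid, host_id):
        self._lock = lock_manager
        self._sd_uuid = sd_uuid
        self._host_id = host_id
        self._acquired = False

    def start(self):
        # Ask the lock manager (sanlock in VDSM) to acquire the host id
        # for this domain's lockspace. Acquisition is asynchronous.
        self._lock.add_lockspace(self._sd_uuid, self._host_id)

    def check(self):
        # Called periodically; refreshes the acquisition state.
        self._acquired = self._lock.inquire(self._sd_uuid, self._host_id)

    def repo_stats(self):
        # Before acquisition: {"acquired": False}; afterwards: {"acquired": True}.
        return {"acquired": self._acquired}
```

The point of the sketch is the window between start() and the first successful check(): during that window the domain is active from the engine's point of view, but a VM lease on it cannot yet be acquired.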
Sure, the engine should do that. I wonder if the host should be UP when and only when all the domains have acquired their locks.
(In reply to Martin Sivák from comment #5)
> Sure, the engine should do that. I wonder if the host should be UP when and
> only when all the domains have acquired their locks.

I think so, yes. This would also prevent future race conditions that haven't been discovered yet.
So should this bug move to SLA?
Well, neither host status nor storage monitoring is SLA :) There is no SLA team anymore (it was merged with Virt), and I am not on that team either. Can somebody from storage's engine team talk to mperina's infra folks about this, please?
I'm not sure I completely understand this bug, but I think you want to prevent a host from being Up if it is not connected to the storage where the VM leases reside. If so, this is not completely possible: a host is moved to the Up status automatically once the engine is able to communicate with it and has received a response to getCaps. When a host moves to Up, we execute InitVdsOnUpCommand [1], which contains various flows (for example refreshing host devices, connecting the host to storage, ...), and if any of those flows fail, the host's status changes to NonOperational. So if you want to prevent the host from being Up, that's the place where you need to add your new code ...

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/InitVdsOnUpCommand.java#L141
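The host-activation behavior described above can be sketched like this. The real code is Java (InitVdsOnUpCommand); this is a hedged Python illustration with hypothetical function and flow names, showing only the "any failed init flow demotes the host to NonOperational" shape.

```python
# Illustrative sketch of the flow described above: the host becomes Up once
# getCaps is answered, then a series of initialization flows run, and the
# first failure moves the host to NonOperational. Names are hypothetical.

def init_vds_on_up(host, init_flows):
    """Run each initialization flow; demote the host on the first failure."""
    host["status"] = "Up"
    for flow in init_flows:
        if not flow(host):
            host["status"] = "NonOperational"
            break
    return host["status"]
```

Under this shape, a new "host id acquired on the lease's storage domain" check would be one more flow in that list, rather than a gate on reaching Up in the first place.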
Yosi, note that after this fix you will not encounter the LibVirtError; instead, you will get a validation error when trying to run the VM if the host id has not yet been acquired. You'll need to verify in your test case that the host id is acquired, in order for the test to run without hitting a race condition.
(In reply to Tal Nisan from comment #9) > Yosi, note that after this fix you will not encounter the LibVirtError but > you will get a validation error upon trying to run the VM if the host id is > not yet acquired, you'll need to verify in your test case that the host id > is acquired in order for the test to run without encountering a race > condition OK. Noted. Thanks
Tested using:
ovirt-engine-4.3.0-0.0.master.20181016132820.gite60d148.el7.noarch

The test case succeeded and the validation error appears in the engine log:

2018-11-05 13:54:19,119+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_0_TestCase25515_0513513119 is down with error. Exit message: Failed to acquire lock: No space left on device.

Moving to VERIFIED
(In reply to Yosi Ben Shimon from comment #11)
> Tested using:
> ovirt-engine-4.3.0-0.0.master.20181016132820.gite60d148.el7.noarch
>
> The test case succeeded and the validation error appears in the engine log:
> 2018-11-05 13:54:19,119+02 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_DOWN_ERROR(119), VM
> vm_0_TestCase25515_0513513119 is down with error. Exit message: Failed to
> acquire lock: No space left on device.
>
> Moving to VERIFIED

This is not the expected behavior.
When running a VM with a lease on a storage domain that was just activated, you should get an error saying that there is no host that can run this VM.
The error above means that the VM was scheduled to run on a host but failed to start there.
Verified (by mistake) on a version that doesn't include the fix. Moving back to MODIFIED for re-test.
Tested again on:
ovirt-engine-4.3.0-0.0.master.20181101091940.git61310aa.el7.noarch
* This version includes the fix.

While the storage domain was deactivated, I tried to start the VM and got the following entries in the engine log:

2018-11-05 17:02:30,829+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-0) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_VM is down with error. Exit message: Failed to acquire lock: No space left on device.

2018-11-05 17:02:30,831+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-0) [] add VM 'bf2c0a15-cfd8-4e93-ac06-d0e817cfca11'(test_VM) to rerun treatment

2018-11-05 17:02:30,841+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-0) [] Rerun VM 'bf2c0a15-cfd8-4e93-ac06-d0e817cfca11'. Called from VDS 'host_mixed_1'

2018-11-05 17:02:30,868+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1640) [] EVENT_ID: USER_INITIATED_RUN_VM_FAILED(151), Failed to run VM test_VM on Host host_mixed_1.
2018-11-05 17:02:30,882+02 INFO [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] Lock Acquired to object 'EngineLock:{exclusiveLocks='[bf2c0a15-cfd8-4e93-ac06-d0e817cfca11=VM]', sharedLocks=''}'

2018-11-05 17:02:30,953+02 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] START, IsVmDuringInitiatingVDSCommand( IsVmDuringInitiatingVDSCommandParameters:{vmId='bf2c0a15-cfd8-4e93-ac06-d0e817cfca11'}), log id: 7c00c76a

2018-11-05 17:02:30,954+02 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] FINISH, IsVmDuringInitiatingVDSCommand, return: false, log id: 7c00c76a

2018-11-05 17:02:30,995+02 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (EE-ManagedThreadFactory-engine-Thread-1640) [] Candidate host 'host_mixed_3' ('ccfe1e53-b91f-486f-b20e-17c9633d0391') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VM leases ready' (correlation id: null)

2018-11-05 17:02:31,041+02 INFO [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] Running command: RunVmCommand internal: false. Entities affected : ID: bf2c0a15-cfd8-4e93-ac06-d0e817cfca11 Type: VMAction group RUN_VM with role type USER

2018-11-05 17:02:31,076+02 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (EE-ManagedThreadFactory-engine-Thread-1640) [] Candidate host 'host_mixed_2' ('651249cc-fc95-4822-9bf4-9e40ea67f9cc') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VM leases ready' (correlation id: 04c66f4e-1faa-481d-a7ee-9fbf2d69e536)

2018-11-05 17:02:31,076+02 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (EE-ManagedThreadFactory-engine-Thread-1640) [] Candidate host 'host_mixed_3' ('ccfe1e53-b91f-486f-b20e-17c9633d0391') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VM leases ready' (correlation id: 04c66f4e-1faa-481d-a7ee-9fbf2d69e536)

2018-11-05 17:02:31,078+02 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1640) [] Can't find VDS to run the VM 'bf2c0a15-cfd8-4e93-ac06-d0e817cfca11' on, so this VM will not be run.

2018-11-05 17:02:31,091+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1640) [] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM test_VM (User: admin@internal-authz).

This is the expected behavior.
Moving to VERIFIED.
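The scheduling behavior seen in these log entries can be sketched as a filter over candidate hosts. This is a hedged illustration, not the engine's actual Java code; the function names and the repo_stats data shape are assumptions chosen to mirror the "VM leases ready" filter and the "Can't find VDS to run the VM" outcome.

```python
# Sketch of a "VM leases ready" style filter: drop every candidate host
# whose monitoring data does not yet report the host id as acquired on the
# storage domain holding the VM lease. Names and data shapes are
# illustrative only.

def filter_vm_leases_ready(hosts, lease_sd_uuid):
    """Keep only hosts that have acquired their host id on the lease's domain."""
    return [
        h for h in hosts
        if h["repo_stats"].get(lease_sd_uuid, {}).get("acquired", False)
    ]

def pick_host(hosts, lease_sd_uuid):
    candidates = filter_vm_leases_ready(hosts, lease_sd_uuid)
    if not candidates:
        # Corresponds to the engine refusing to schedule the VM at all,
        # instead of starting it and hitting the sanlock error later.
        return None
    return candidates[0]["name"]
```

When every host is filtered out, the run fails with a validation error up front, which is the behavior verified here.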
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.