Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1595303

Summary: HE VM migration fails with libvirtError: resource busy: Failed to acquire lock: Lease is held by another host
Product: [oVirt] ovirt-engine
Reporter: Polina <pagranat>
Component: BLL.HostedEngine
Assignee: Doron Fediuck <dfediuck>
Status: CLOSED WORKSFORME
QA Contact: Polina <pagranat>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.2.2
CC: ahadas, bugs, michal.skrivanek, msivak, pagranat
Target Milestone: ---
Keywords: Automation
Target Release: ---
Flags: pm-rhel: ovirt-4.5?
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-06-01 12:00:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- logs
- logs 4.2.8-7
- logs 4.2.8

Description Polina 2018-06-26 14:49:33 UTC
Created attachment 1454685 [details]
logs

Description of problem: HE migration sometimes fails with libvirtError: resource busy: Failed to acquire lock: Lease is held by another host.

Version-Release number of selected component (if applicable): rhv-release-4.2.4-6-001.noarch

How reproducible: intermittent; not easily reproduced.

Steps to Reproduce:
1. Run an HE environment with three hosts and the hosted-engine storage domain on iSCSI.
2. Occasionally the HE VM migration fails with the following traceback (see lynx16_vdsm.log):

2018-06-23 15:52:15,547+0300 ERROR (vm/96b4f434) [virt.vm] (vmId='96b4f434-de9e-4be6-b842-adae55933dc2') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2876, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: resource busy: Failed to acquire lock: Lease is held by another host
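
For reference, a minimal sketch of the same failing API pattern (this is not vdsm code; it assumes libvirt-python is installed and a locally defined domain named 'HostedEngine') showing how this error surfaces when the disk lease is still held by another host:

import libvirt

# Sketch only: call the same libvirt API that vdsm uses in vm._run().
# If the sanlock lease on the HE disk is still owned by another host,
# createWithFlags() raises libvirtError with "resource busy: Failed to
# acquire lock: Lease is held by another host".
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('HostedEngine')
try:
    dom.createWithFlags(0)
except libvirt.libvirtError as err:
    print("start failed: %s" % err)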

Expected results: migration succeeds

Additional info: the attachment contains agent.log, broker.log, engine.log, logs, lynx14_vdsm.log, lynx16_vdsm.log, and lynx17_vdsm.log.
The migration attempt was from lynx16 to lynx17.

Comment 1 Michal Skrivanek 2018-06-27 10:53:22 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1459829#c31 ?

Comment 2 Martin Sivák 2018-06-27 11:34:14 UTC
Hi Polina, can you please attach some additional information?

- What kind of storage domain did you use for the hosted engine (NFS3, NFS4, iSCSI, ...)?
- What did sanlock look like just before the migration (sanlock client status output from both the source and the destination)?

Meital already answered our question about how the migration was started: by clicking the Migrate button in the UI. Is that correct?

Comment 3 Martin Sivák 2018-06-27 11:35:19 UTC
(In reply to Michal Skrivanek from comment #1)
> https://bugzilla.redhat.com/show_bug.cgi?id=1459829#c31 ?

Maybe, but I do not think so.

The error here is "libvirtError: resource busy: Failed to acquire lock: Lease is held by another host", which seems to imply that sanlock on the other host knew about the lockspace.

Comment 4 Polina 2018-07-05 06:43:39 UTC
(In reply to Martin Sivák from comment #2)

Hi, the Hosted Engine disk is on iSCSI in this environment.

About "how the migration started" - there was no clicking on UI button. The failures happened by automation build running.

The tests send a REST action:

2018-06-23 15:52:16,246 - MainThread - art.ll_lib.vms - INFO - Migrate VM HostedEngine
2018-06-23 15:52:16,246 - MainThread - vms - DEBUG - Action request content is --  url:/ovirt-engine/api/vms/96b4f434-de9e-4be6-b842-adae55933dc2/migrate body:<action>
    <async>false</async>
    <force>true</force>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <host id="074db613-5fb8-4722-8801-130797dc18b1"/>
</action>
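
For illustration, the equivalent call through the Python SDK (ovirt-engine-sdk4) would look roughly like the sketch below; the URL and credentials are placeholders rather than values from this environment, and exact parameter names may vary between SDK versions:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Sketch of the same migrate action the tests issue over REST.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',                          # placeholder
    password='password',                                # placeholder
    insecure=True,
)
vm_service = connection.system_service().vms_service().vm_service(
    '96b4f434-de9e-4be6-b842-adae55933dc2')             # HostedEngine VM id
vm_service.migrate(
    host=types.Host(id='074db613-5fb8-4722-8801-130797dc18b1'),
    force=True,
)
connection.close()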

The sanlock client status is not reported in the logs.
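
In case it helps future runs, a minimal sketch (assuming the automation runner has passwordless SSH to both hosts; the hostnames below are the ones from this run) of capturing the sanlock state right before the migrate action:

import subprocess

# Collect 'sanlock client status' from the source and destination hosts so
# the lockspace/lease state ends up in the test logs before the migration.
HOSTS = {'source': 'lynx16', 'destination': 'lynx17'}

for role, host in HOSTS.items():
    try:
        out = subprocess.check_output(
            ['ssh', host, 'sanlock', 'client', 'status'],
            stderr=subprocess.STDOUT)
        print('=== sanlock client status on %s (%s) ===' % (host, role))
        print(out.decode())
    except subprocess.CalledProcessError as err:
        print('failed to query %s: %s' % (host, err.output.decode()))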

Comment 5 Ryan Barry 2019-04-08 17:36:55 UTC
Polina, still reproducible?

Comment 6 Polina 2019-04-10 05:37:31 UTC
Yes, in the last automation runs we saw this problem twice on 4.3.3.2.
Since the engine was down for a long time after this and the whole environment was unable to run the tests, we had to reprovision and rebuild everything, so the logs were not saved. The next time I see it, I will update the bug with the logs.

Comment 7 Polina 2019-04-23 08:39:28 UTC
Created attachment 1557486 [details]
logs 4.2.8-7

The migration failure was reproduced in the last ovirt-engine-4.2.8.7-0.1.el7ev.noarch run.

The time of the failed migration in the attached logs is 2019-04-18 20:12:56.

Comment 8 Polina 2019-04-23 08:45:54 UTC
The engine.log also contains many errors like "createCommand failed: WFLYWELD0039: Singleton not set for null. This means that you are trying to access a weld deployment with a Thread Context ClassLoader that is not associated with the deployment.",
though this seems unrelated to the problem (https://bugzilla.redhat.com/show_bug.cgi?id=1701898).

Comment 9 Polina 2019-05-21 08:51:43 UTC
Created attachment 1571545 [details]
logs 4.2.8

The failure happened again in the last 4.2 run (ovirt-engine-4.2.8.7-0.1.el7ev.noarch); logs_4.2.8.tar.gz is attached.

Comment 10 Michal Skrivanek 2020-03-17 12:17:10 UTC
deprecating SLA team usage, moving to Virt

Comment 11 Polina 2021-06-01 06:54:25 UTC
Hi Arik, this failure was never seen in 4.4.6.

Comment 12 Arik 2021-06-01 12:00:39 UTC
This issue has not reproduced lately (comment 11); it may have been fixed by other changes. In any case, it does not make sense to put effort into investigating it at this point. If it happens again, feel free to reopen.