Bug 994463

Summary: Failed attached Export Storage Domain - Could not obtain lock
Product: Red Hat Enterprise Virtualization Manager
Reporter: vvyazmin <vvyazmin>
Component: ovirt-engine-webadmin-portal
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA
QA Contact: yeylon <yeylon>
Severity: urgent
Priority: urgent
Version: 3.3.0
CC: abaron, acanan, acathrow, adahms, amureini, bazulay, derez, ebenahar, ecohen, fsimonce, iheim, lpeer, pnovotny, Rhev-m-bugs, scohen, srevivo, tymikk, wdaniel, yeylon
Target Milestone: ---
Flags: scohen: Triaged+
Target Release: 3.3.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: is14
Doc Type: Bug Fix
Doc Text:
Previously, the Import Domain operation executed from the Administration Portal would sometimes fail to attach the relevant Storage Domain. With this update, this issue has been resolved by eliminating a race condition in the execution of the steps required to complete the Import Domain operation.
Story Points: ---
Last Closed: 2014-01-21 17:35:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Bug Blocks: 1019461
Attachments:
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (flags: none)
sanlock (flags: none)
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (RHEVM 3.3 IS9.) (flags: none)

Description vvyazmin@redhat.com 2013-08-07 10:47:48 UTC
Created attachment 783800 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
Failed to attach Export Storage Domain - Could not obtain lock

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS8 environment:

RHEVM:  rhevm-3.3.0-0.13.master.el6ev.noarch
VDSM:  vdsm-4.12.0-rc3.13.git06ed3cc.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK:  sanlock-2.6-2.el6.x86_64
PythonSDK:  rhevm-sdk-python-3.3.0.8-1.el6ev.noarch

How reproducible:
unknown

Steps to Reproduce:
1. Import a preconfigured “Export / NFS” domain.

Actual results:
The following error is displayed:
Error while executing action Attached Storage Domain: Could not obtain lock

Attaching the Export Storage Domain fails.

Expected results:
No exception should occur.
The Export Storage Domain is attached successfully.

Impact on user:
The Export Storage Domain cannot be attached.

Workaround:
none

Additional info:

/var/log/ovirt-engine/engine.log

2013-08-07 12:00:15,643 ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper] (pool-5-thread-47) The connection with details wolf.qa.lab.tlv.redhat.com:/volumes/wolf/kipi-export failed because of error code 477 and error message is: problem while trying to mount target
2013-08-07 12:00:15,646 ERROR [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (pool-5-thread-47) Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand.
2013-08-07 12:00:15,647 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-2) START, AttachStorageDomainVDSCommand( storagePoolId = 9cde139a-d5be-4555-b054-c2edf95f33fd, ignoreFailoverLimit = false, storageDomainId = c82ed380-b1d9-4133-a380-e49f7d87b44e), log id: 3c9c261a
2013-08-07 12:00:15,722 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-2) Failed in AttachStorageDomainVDS method
2013-08-07 12:00:15,750 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-2) Error code AcquireLockFailure and error message IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Cannot obtain lock: "id=c82ed380-b1d9-4133-a380-e49f7d87b44e, rc=1, out=['error - lease file does not exist or is not writeable', 'usage: /usr/libexec/vdsm/spmprotect.sh COMMAND PARAMETERS', 'Commands:', '  start { sdUUID hostId renewal_interval_sec lease_path[:offset] lease_time_ms io_op_timeout_ms fail_retries }', 'Parameters:', '  sdUUID -                domain uuid', '  hostId -                host id in pool', '  renewal_interval_sec -  intervals for lease renewals attempts', '  lease_path -            path to lease file/volume', '  offset -                offset of lease within file', '  lease_time_ms -         time limit within which lease must be renewed (at least 2*renewal_interval_sec)', '  io_op_timeout_ms -      I/O operation timeout', '  fail_retries -          Maximal number of attempts to retry to renew the lease before fencing (<= lease_time_ms/renewal_interval_sec)'], err=[]"
2013-08-07 12:00:15,750 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-2) Command AttachStorageDomainVDS execution failed. Exception: IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Cannot obtain lock: "id=c82ed380-b1d9-4133-a380-e49f7d87b44e, rc=1, out=['error - lease file does not exist or is not writeable', 'usage: /usr/libexec/vdsm/spmprotect.sh COMMAND PARAMETERS', 'Commands:', '  start { sdUUID hostId renewal_interval_sec lease_path[:offset] lease_time_ms io_op_timeout_ms fail_retries }', 'Parameters:', '  sdUUID -                domain uuid', '  hostId -                host id in pool', '  renewal_interval_sec -  intervals for lease renewals attempts', '  lease_path -            path to lease file/volume', '  offset -                offset of lease within file', '  lease_time_ms -         time limit within which lease must be renewed (at least 2*renewal_interval_sec)', '  io_op_timeout_ms -      I/O operation timeout', '  fail_retries -          Maximal number of attempts to retry to renew the lease before fencing (<= lease_time_ms/renewal_interval_sec)'], err=[]"
2013-08-07 12:00:15,750 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-2) FINISH, AttachStorageDomainVDSCommand, log id: 3c9c261a
2013-08-07 12:00:15,750 ERROR [org.ovirt.engine.core.bll.storage.AttachStorageDomainToPoolCommand] (ajp-/127.0.0.1:8702-2) Command org.ovirt.engine.core.bll.storage.AttachStorageDomainToPoolCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Cannot obtain lock: "id=c82ed380-b1d9-4133-a380-e49f7d87b44e, rc=1, out=['error - lease file does not exist or is not writeable', 'usage: /usr/libexec/vdsm/spmprotect.sh COMMAND PARAMETERS', 'Commands:', '  start { sdUUID hostId renewal_interval_sec lease_path[:offset] lease_time_ms io_op_timeout_ms fail_retries }', 'Parameters:', '  sdUUID -                domain uuid', '  hostId -                host id in pool', '  renewal_interval_sec -  intervals for lease renewals attempts', '  lease_path -            path to lease file/volume', '  offset -                offset of lease within file', '  lease_time_ms -         time limit within which lease must be renewed (at least 2*renewal_interval_sec)', '  io_op_timeout_ms -      I/O operation timeout', '  fail_retries -          Maximal number of attempts to retry to renew the lease before fencing (<= lease_time_ms/renewal_interval_sec)'], err=[]" (Failed with VDSM error AcquireLockFailure and code 651)
2013-08-07 12:00:15,752 INFO  [org.ovirt.engine.core.bll.storage.AttachStorageDomainToPoolCommand] (ajp-/127.0.0.1:8702-2) Command [id=9f2a76ed-d258-446f-bd1b-0ac1353e31a6]: Compensating NEW_ENTITY_ID of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: storagePoolId = 9cde139a-d5be-4555-b054-c2edf95f33fd, storageId = c82ed380-b1d9-4133-a380-e49f7d87b44e.


/var/log/vdsm/vdsm.log

Thread-166883::ERROR::2013-08-07 10:00:17,385::task::850::TaskManager.Task::(_setError) Task=`847814bb-7181-49a2-a63d-5b0e73aa3740`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1149, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 989, in attachSD
    dom.acquireClusterLock(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 487, in acquireClusterLock
    self._clusterLock.acquire(hostID)
  File "/usr/share/vdsm/storage/clusterlock.py", line 112, in acquire
    raise se.AcquireLockFailure(self._sdUUID, rc, out, err)
AcquireLockFailure: Cannot obtain lock: "id=c82ed380-b1d9-4133-a380-e49f7d87b44e, rc=1, out=['error - lease file does not exist or is not writeable', 'usage: /usr/libexec/vdsm/s
pmprotect.sh COMMAND PARAMETERS', 'Commands:', '  start { sdUUID hostId renewal_interval_sec lease_path[:offset] lease_time_ms io_op_timeout_ms fail_retries }', 'Parameters:', '
  sdUUID -                domain uuid', '  hostId -                host id in pool', '  renewal_interval_sec -  intervals for lease renewals attempts', '  lease_path -          
  path to lease file/volume', '  offset -                offset of lease within file', '  lease_time_ms -         time limit within which lease must be renewed (at least 2*renew
al_interval_sec)', '  io_op_timeout_ms -      I/O operation timeout', '  fail_retries -          Maximal number of attempts to retry to renew the lease before fencing (<= lease_
time_ms/renewal_interval_sec)'], err=[]"

Thread-166883::DEBUG::2013-08-07 10:00:17,392::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.9cde139a-d5be-4555-b054-c2edf95f33fd' is free, finding out if anyone is waiting for it.
Thread-166883::DEBUG::2013-08-07 10:00:17,392::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.9cde139a-d5be-4555-b054-c2edf95f33fd', Clearing records.
Thread-166883::ERROR::2013-08-07 10:00:17,393::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': 'Cannot obtain lock: "id=c82ed380-b1d9-4133-a380-e49f7d87b44e, rc=1, out=[\'error - lease file does not exist or is not writeable\', \'usage: /usr/libexec/vdsm/spmprotect.sh COMMAND PARAMETERS\', \'Commands:\', \'  start { sdUUID hostId renewal_interval_sec lease_path[:offset] lease_time_ms io_op_timeout_ms fail_retries }\', \'Parameters:\', \'  sdUUID -                domain uuid\', \'  hostId -                host id in pool\', \'  renewal_interval_sec -  intervals for lease renewals attempts\', \'  lease_path -            path to lease file/volume\', \'  offset -                offset of lease within file\', \'  lease_time_ms -         time limit within which lease must be renewed (at least 2*renewal_interval_sec)\', \'  io_op_timeout_ms -      I/O operation timeout\', \'  fail_retries -          Maximal number of attempts to retry to renew the lease before fencing (<= lease_time_ms/renewal_interval_sec)\'], err=[]"', 'code': 651}}

Comment 3 vvyazmin@redhat.com 2013-08-08 13:56:41 UTC
Created attachment 784401 [details]
sanlock

Comment 6 vvyazmin@redhat.com 2013-08-14 09:53:40 UTC
Reproduced on  RHEVM 3.3 - IS9.1 environment:

RHEVM:  rhevm-3.3.0-0.14.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.8-1.el6ev.noarch
VDSM:  vdsm-4.12.0-52.gitce029ba.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK:  sanlock-2.6-2.el6.x86_64


Logs attached

Comment 7 vvyazmin@redhat.com 2013-08-14 09:54:46 UTC
Created attachment 786481 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (RHEVM 3.3 IS9.)

Comment 8 Federico Simoncelli 2013-08-28 10:50:15 UTC
It looks like there's a race between a disconnectStorageServer and a connectStorageServer right before the failing attachStorageDomain:

Thread-166881::INFO::2013-08-07 10:00:17,192::logUtils::44::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': 'wolf.qa.lab.tlv.redhat.com:/volumes/wolf/kipi-export', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '00000000-0000-0000-0000-000000000000', 'port': ''}], options=None)

...

Thread-166882::INFO::2013-08-07 10:00:17,244::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': 'wolf.qa.lab.tlv.redhat.com:/volumes/wolf/kipi-export', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '4bb35284-3206-4d47-8ee4-df66c215071a', 'port': ''}], options=None)
Thread-166882::WARNING::2013-08-07 10:00:17,249::fileUtils::167::Storage.fileUtils::(createdir) Dir /rhev/data-center/mnt/wolf.qa.lab.tlv.redhat.com:_volumes_wolf_kipi-export already exists

...

Thread-166882::ERROR::2013-08-07 10:00:17,300::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (1, ';mount.nfs: mount point /rhev/data-center/mnt/wolf.qa.lab.tlv.redhat.com:_volumes_wolf_kipi-export does not exist\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 222, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (1, ';mount.nfs: mount point /rhev/data-center/mnt/wolf.qa.lab.tlv.redhat.com:_volumes_wolf_kipi-export does not exist\n')

Thread-166882::INFO::2013-08-07 10:00:17,304::logUtils::47::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 477, 'id': '4bb35284-3206-4d47-8ee4-df66c215071a'}]}

...

Thread-166883::INFO::2013-08-07 10:00:17,326::logUtils::44::dispatcher::(wrapper) Run and protect: attachStorageDomain(sdUUID='c82ed380-b1d9-4133-a380-e49f7d87b44e', spUUID='9cde139a-d5be-4555-b054-c2edf95f33fd', options=None)

...

Thread-166883::ERROR::2013-08-07 10:00:17,385::task::850::TaskManager.Task::(_setError) Task=`847814bb-7181-49a2-a63d-5b0e73aa3740`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1149, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 989, in attachSD
    dom.acquireClusterLock(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 487, in acquireClusterLock
    self._clusterLock.acquire(hostID)
  File "/usr/share/vdsm/storage/clusterlock.py", line 112, in acquire
    raise se.AcquireLockFailure(self._sdUUID, rc, out, err)
AcquireLockFailure: Cannot obtain lock: "id=c82ed380-b1d9-4133-a380-e49f7d87b44e ...

...

Thread-166881::INFO::2013-08-07 10:00:27,538::logUtils::47::dispatcher::(wrapper) Run and protect: disconnectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '00000000-0000-0000-0000-000000000000'}]}


There are probably two bugs to address here:

 1. Remove the race between disconnectStorageServer and connectStorageServer.
 2. Take into account the connectStorageServer return code (which is failing here): status 477 (MountError).

Point 2 is probably on the backend side, and at first sight it looks harder to address since it is probably related to connectAllHostsToPool (StorageDomainCommandBase.java), which does not have a single return code (it is per VDS).
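
As an illustration of what taking the per-VDS return codes into account could look like, here is a minimal sketch with hypothetical types and names (not the actual backend code): the statuslist returned by connectStorageServer is inspected host by host, and the attach is aborted if any host failed to mount.

    import java.util.List;

    // Hypothetical sketch only: the real backend types and names differ.
    // It illustrates checking a per-host connectStorageServer result such as
    // {'statuslist': [{'status': 477, 'id': '4bb35284-...'}]} before attaching.
    public class ConnectResultCheck {

        // One entry of the returned "statuslist": a connection id and its status code.
        record HostConnectStatus(String connectionId, int status) {}

        // True only if every host reported status 0 (success).
        static boolean allHostsConnected(List<HostConnectStatus> statusList) {
            return statusList.stream().allMatch(s -> s.status() == 0);
        }

        public static void main(String[] args) {
            // Example taken from the log above: one host failed with 477 (MountError).
            List<HostConnectStatus> statusList = List.of(
                    new HostConnectStatus("4bb35284-3206-4d47-8ee4-df66c215071a", 477));

            if (!allHostsConnected(statusList)) {
                // In the real flow this is the point where attachStorageDomain
                // must not be issued, instead of being attempted regardless.
                System.out.println("connectStorageServer failed on at least one host; aborting attach");
            }
        }
    }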

The interesting thing is that the entire "Import Domain" flow is driven by the UI, and in fact the race can be found in importFileStorageAddDomain (StorageListModel.java):

    StoragePool dataCenter = (StoragePool) model.getDataCenter().getSelectedItem();
    if (!dataCenter.getId().equals(StorageModel.UnassignedDataCenterId)) {
        storageListModel.attachStorageToDataCenter(sdToAdd1.getId(), dataCenter.getId());
    }
    postImportFileStorage(storageListModel.context, true, storageListModel.storageModel, null);

The non-blocking call to attachStorageToDataCenter is in a race with postImportFileStorage:

    public void postImportFileStorage(TaskContext context, boolean isSucceeded, IStorageModel model, String message)
    {
        Frontend.RunAction(VdcActionType.DisconnectStorageServerConnection, ...
    ...
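
To make the shape of the fix for point 1 concrete, here is a minimal, self-contained sketch (invented names, not the actual StorageListModel change) of sequencing the two steps so that the disconnect can only run once the attach has completed:

    import java.util.concurrent.CompletableFuture;

    // Hypothetical sketch only (invented names, not the actual StorageListModel code):
    // the disconnect must be sequenced after the attach completes, instead of both
    // being fired back to back with no ordering between them.
    public class AttachThenDisconnect {

        // Stand-in for the non-blocking attachStorageToDataCenter call.
        static CompletableFuture<Void> attachStorageToDataCenter(String sdId, String dcId) {
            return CompletableFuture.runAsync(
                    () -> System.out.println("attach " + sdId + " to data center " + dcId));
        }

        // Stand-in for postImportFileStorage, which triggers DisconnectStorageServerConnection.
        static void postImportFileStorage() {
            System.out.println("disconnect storage server connection");
        }

        public static void main(String[] args) {
            // Racy shape (what the UI currently does): both calls start immediately,
            // so the disconnect can interleave with the attach on the server side.
            //   attachStorageToDataCenter("sd", "dc");
            //   postImportFileStorage();

            // Fixed shape: the disconnect only runs once the attach has finished.
            attachStorageToDataCenter("sd", "dc")
                    .thenRun(AttachThenDisconnect::postImportFileStorage)
                    .join();
        }
    }

In the GWT frontend the same ordering would presumably be obtained by moving the postImportFileStorage call into the completion callback of the attach action rather than invoking it unconditionally right after issuing the attach.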


Moving this to ovirt-engine-webadmin-portal.

Comment 10 vvyazmin@redhat.com 2013-08-28 18:44:04 UTC
Same problem was found in attached ISO domain by ebenhar in RHEVM 3.3 - IS11

Comment 14 Elad 2013-09-01 08:32:13 UTC
Reopening:
Reproduced on RHEV 3.3 - IS11:
rhevm-3.3.0-0.16.master.el6ev.noarch
vdsm-4.12.0-72.git287bb7e.el6ev.x86_64
sanlock-2.8-1.el6.x86_64
libvirt-0.10.2-18.el6_4.9.x86_64


This seems to happen when importing a pre-configured export/ISO domain that was previously removed by another host in the cluster (not the host participating in the import domain action).


Full logs will be attached

Comment 15 Elad 2013-09-01 08:36:29 UTC
(In reply to Elad from comment #14)

Sorry, this was moved to ASSIGNED by accident; moving back to POST.

Comment 17 vvyazmin@redhat.com 2013-10-01 02:23:23 UTC
Verified, tested on RHEVM 3.3 - IS16 environment:

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.22.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.14-1.el6ev.noarch
VDSM:  vdsm-4.12.0-156.git6e499d6.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-27.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.407.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

Comment 20 Charlie 2013-11-28 00:08:31 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 22 errata-xmlrpc 2014-01-21 17:35:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html

Comment 23 Jan Tymiński 2016-01-12 09:15:17 UTC
(In reply to vvyazmin from comment #10)
> Same problem was found in attached ISO domain by ebenhar in RHEVM 3.3 - IS11

I am affected by this bug in oVirt 3.6.1.
I detached the ISO domain, hit problem #1210808, and after resolving that manually I pointed to exactly the same path that was attached before detaching, and I got the same message as in this bug report.