Description of problem:
Failed to attach an "Unattached" Storage Domain to an FCP Data Center.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS19 environment:
Host OS: RHEL 6.5
RHEVM: rhevm-3.3.0-0.27.beta1.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.16-1.el6ev.noarch
VDSM: vdsm-4.13.0-0.3.beta1.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-29.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Attach an "Unattached" Storage Domain to an FCP Data Center.

Actual results:
The attach operation fails.

Expected results:
The attach operation succeeds.

Impact on user:
Users cannot attach an "Unattached" Storage Domain to an FCP Data Center.

Workaround:
None

Additional info:

[root@tigris01 ~]# service sanlock status
sanlock (pid 21652 21650) is running...

/var/log/ovirt-engine/engine.log:

2013-10-21 15:38:49,122 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Failed in CreateStoragePoolVDS method
2013-10-21 15:38:49,123 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Error code AcquireHostIdFailure and error message VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2013-10-21 15:38:49,123 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Command org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand return value StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=661, mMessage=Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))]]
2013-10-21 15:38:49,123 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-21 15:38:49,123 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Command CreateStoragePoolVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2013-10-21 15:38:49,123 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] FINISH, CreateStoragePoolVDSCommand, log id: 51eb0870
2013-10-21 15:38:49,124 ERROR [org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand] (pool-5-thread-48) [5694c659] Command org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand throw Vdc Bll exception.
With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')) (Failed with error AcquireHostIdFailure and code 661)

/var/log/vdsm/vdsm.log:

Thread-4660::ERROR::2013-10-21 13:38:49,324::task::850::TaskManager.Task::(_setError) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 981, in createStoragePool
    poolName, masterDom, domList, masterVersion, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 615, in create
    self._acquireTemporaryClusterLock(msdUUID, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 557, in _acquireTemporaryClusterLock
    msd.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
Thread-4660::DEBUG::2013-10-21 13:38:49,342::task::869::TaskManager.Task::(_run) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Task._run: 36a06b9c-b911-47f9-bc7f-ca964f62eb23 (None, 'f6b336b6-08a2-4c6f-a063-a16ad7571bd1', 'DC-aaa', 'c7812c6a-5959-44ff-a990-df1713f2aef7', ['c7812c6a-5959-44ff-a990-df1713f2aef7'], 1, None, 5, 60, 10, 3) {} failed - stopping task
Thread-4660::DEBUG::2013-10-21 13:38:49,342::task::1194::TaskManager.Task::(stop) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::stopping in state preparing (force False)
Thread-4660::DEBUG::2013-10-21 13:38:49,343::task::974::TaskManager.Task::(_decref) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::ref 1 aborting True
Thread-4660::INFO::2013-10-21 13:38:49,344::task::1151::TaskManager.Task::(prepare) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::aborting: Task is aborted: 'Cannot acquire host id' - code 661
Thread-4660::DEBUG::2013-10-21 13:38:49,346::task::1156::TaskManager.Task::(prepare) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Prepare: aborted: Cannot acquire host id
Thread-4660::DEBUG::2013-10-21 13:38:49,348::task::974::TaskManager.Task::(_decref) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::ref 0 aborting True
Thread-4660::DEBUG::2013-10-21 13:38:49,350::task::909::TaskManager.Task::(_doAbort) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Task._doAbort: force False
Thread-4660::DEBUG::2013-10-21 13:38:49,353::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-4660::DEBUG::2013-10-21 13:38:49,356::task::579::TaskManager.Task::(_updateState) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::moving from state preparing -> state aborting
Thread-4660::DEBUG::2013-10-21 13:38:49,359::task::534::TaskManager.Task::(__state_aborting) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::_aborting: recover policy none
Thread-4660::DEBUG::2013-10-21 13:38:49,361::task::579::TaskManager.Task::(_updateState) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::moving from state aborting -> state failed
Thread-4660::DEBUG::2013-10-21 13:38:49,363::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7': < ResourceRef 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7', isValid: 'True' obj: 'None'>, 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1': < ResourceRef 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1', isValid: 'True' obj: 'None'>}
Thread-4660::DEBUG::2013-10-21 13:38:49,366::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-4660::DEBUG::2013-10-21 13:38:49,370::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7'
Thread-4660::DEBUG::2013-10-21 13:38:49,374::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7' (0 active users)
Thread-4660::DEBUG::2013-10-21 13:38:49,376::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7' is free, finding out if anyone is waiting for it.
Thread-4660::DEBUG::2013-10-21 13:38:49,378::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7', Clearing records.
Thread-4660::DEBUG::2013-10-21 13:38:49,383::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1'
Thread-4660::DEBUG::2013-10-21 13:38:49,387::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1' (0 active users)
Thread-4660::DEBUG::2013-10-21 13:38:49,388::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1' is free, finding out if anyone is waiting for it.
Thread-4660::DEBUG::2013-10-21 13:38:49,391::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1', Clearing records.
Thread-4660::ERROR::2013-10-21 13:38:49,395::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))", 'code': 661}}
Created attachment 814617 [details]
Logs: rhevm, vdsm, libvirt, thread dump, superVdsm
Vlad, from the logs it seems as though the failure occurs when attaching the FIRST storage domain to a DC without any domains. Is this true? Does this reproduce when adding the second/third/Nth domain to an existing DC?
(In reply to Allon Mureinik from comment #3)
> Vlad, from the logs it seems as though the failure occurs when attaching
> the FIRST storage domain to a DC without any domains.
> Is this true?
> Does this reproduce when adding the second/third/Nth domain to an existing
> DC?

Vlad, from the logs it seems as though the failure occurs when attaching the FIRST storage domain to a DC without any domains. Is this true?
- Yes, it's true.

Does this reproduce when adding the second/third/Nth domain to an existing DC?
- No, I succeeded in attaching a second "Unattached" Storage Domain to an existing DC.
Lowering this to medium, as vvyazmin was not able to reproduce the issue after we spoke. I don't want to close this yet, though, since I saw a couple of glitches in the logs that I want to investigate.
Not a TestBlocker according to comment 5.
This error means that the host id was already acquired in the past:

AcquireHostIdFailure: Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

After investigating and successfully reproducing the issue, it seems that this is related to destroyStoragePool not stopping the relevant domain monitoring threads.

From the attached logs we can see the preparation to destroy the storage pool:

Thread-4199::INFO::2013-10-21 13:23:47,241::logUtils::44::dispatcher::(wrapper) Run and protect: deactivateStorageDomain(sdUUID='c7812c6a-5959-44ff-a990-df1713f2aef7', spUUID='445dea2d-511a-4529-a66a-670720e587de', msdUUID='00000000-0000-0000-0000-000000000000', masterVersion=1, options=None)
...
Thread-4201::INFO::2013-10-21 13:23:47,275::logUtils::44::dispatcher::(wrapper) Run and protect: spmStop(spUUID='445dea2d-511a-4529-a66a-670720e587de', options=None)
...
Thread-4204::INFO::2013-10-21 13:23:54,687::logUtils::44::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='445dea2d-511a-4529-a66a-670720e587de', hostID=1, scsiKey='445dea2d-511a-4529-a66a-670720e587de', remove=False, options=None)

and the destroy storage pool flow:

Thread-4256::INFO::2013-10-21 13:26:01,937::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='445dea2d-511a-4529-a66a-670720e587de', hostID=1, scsiKey='445dea2d-511a-4529-a66a-670720e587de', msdUUID='c7812c6a-5959-44ff-a990-df1713f2aef7', masterVersion=1, options=None)
...
Thread-4346::INFO::2013-10-21 13:26:07,967::logUtils::44::dispatcher::(wrapper) Run and protect: spmStart(spUUID='445dea2d-511a-4529-a66a-670720e587de', prevID=-1, prevLVER='0', recoveryMode=None, scsiFencing='false', maxHostID=250, domVersion='3', options=None)
...
f302e86a-7954-433a-adbd-328aed14f849::INFO::2013-10-21 13:26:28,282::clusterlock::225::SANLock::(acquire) Acquiring cluster lock for domain c7812c6a-5959-44ff-a990-df1713f2aef7 (id: 1)
f302e86a-7954-433a-adbd-328aed14f849::DEBUG::2013-10-21 13:26:28,292::clusterlock::251::SANLock::(acquire) Cluster lock for domain c7812c6a-5959-44ff-a990-df1713f2aef7 successfully acquired (id: 1)
...
Thread-4387::INFO::2013-10-21 13:26:31,515::logUtils::44::dispatcher::(wrapper) Run and protect: destroyStoragePool(spUUID='445dea2d-511a-4529-a66a-670720e587de', hostID=1, scsiKey='445dea2d-511a-4529-a66a-670720e587de', options=None)
...

As we can see, host id 1 is never released. Later on, when we try to use the same storage domain for a new pool, we get the EINVAL error: host id 1 was left acquired by the destroyed pool's monitoring thread, while the new pool attempts to acquire the lockspace with a different id (250):

Thread-4660::INFO::2013-10-21 13:38:47,355::logUtils::44::dispatcher::(wrapper) Run and protect: createStoragePool(poolType=None, spUUID='f6b336b6-08a2-4c6f-a063-a16ad7571bd1', poolName='DC-aaa', masterDom='c7812c6a-5959-44ff-a990-df1713f2aef7', domList=['c7812c6a-5959-44ff-a990-df1713f2aef7'], masterVersion=1, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
...
Thread-4660::INFO::2013-10-21 13:38:49,323::clusterlock::174::SANLock::(acquireHostId) Acquiring host id for domain c7812c6a-5959-44ff-a990-df1713f2aef7 (id: 250)
Thread-4660::ERROR::2013-10-21 13:38:49,324::task::850::TaskManager.Task::(_setError) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 981, in createStoragePool
    poolName, masterDom, domList, masterVersion, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 615, in create
    self._acquireTemporaryClusterLock(msdUUID, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 557, in _acquireTemporaryClusterLock
    msd.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

I've been able to identify that this issue was introduced as part of the monitoring implementation:

7b1cc6a Adding [start|stop]MonitoringDomain()

In fact, updateMonitoringThreads was modified to assume that the list of monitored domains is the same as the list of attached domains:

+        poolDoms = self.getDomains()
...
-        for sdUUID in monitoredDomains:
+        for sdUUID in poolDoms:
             if sdUUID not in activeDomains:
                 try:
                     self.domainMonitor.stopMonitoring(sdUUID)

This assumption is not valid when we are detaching a storage domain, as it won't be listed by getDomains. The resulting issue is that the domain monitoring thread is not stopped, so the host id is left acquired (the ids file/device is kept open by sanlock).

The steps to reproduce are:

1. Create a storage domain.
2. Attach the storage domain to an empty data center.
3. Once the SPM is acquired, deactivate the storage domain.
4. Remove the data center.
5. Create a new data center.
6. Attach the storage domain to the newly created data center.

Actual results:
Attaching the storage domain fails with SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument').

Expected results:
Attaching the storage domain should succeed.
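To illustrate the acquire/release asymmetry described above: vdsm's clusterlock acquires the host id through the sanlock python binding's add_lockspace() and releases it with rem_lockspace(). A minimal sketch of the failure mode, assuming the lockspace name is the storage domain UUID; IDS_PATH below is a hypothetical placeholder for the domain's ids volume:

    import sanlock

    SD_UUID = 'c7812c6a-5959-44ff-a990-df1713f2aef7'  # lockspace name == SD UUID
    IDS_PATH = '/path/to/ids'                         # hypothetical ids volume path

    # First pool: the domain monitor acquires host id 1.
    sanlock.add_lockspace(SD_UUID, 1, IDS_PATH)

    # The step skipped by the buggy destroyStoragePool flow -- the monitor
    # thread that would release the host id is never stopped:
    # sanlock.rem_lockspace(SD_UUID, 1, IDS_PATH)

    # Second pool: re-adding the same lockspace with a different host id (250)
    # while id 1 is still registered fails with errno 22 (EINVAL), which vdsm
    # surfaces as SanlockException(22, 'Sanlock lockspace add failure',
    # 'Invalid argument').
    sanlock.add_lockspace(SD_UUID, 250, IDS_PATH)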
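As for the fix direction implied by the diff above: iterating over the currently monitored domains, rather than only over getDomains(), would let updateMonitoringThreads stop the monitor of a just-detached domain. A rough sketch of that invariant, not the actual shipped patch; monitoredDomains is an assumed attribute exposing the running monitor threads, while stopMonitoring() and getDomains() are the calls shown above:

    def updateMonitoringThreads(self):
        # Domains that should currently be monitored (active in the pool).
        activeDomains = frozenset(self.getDomains(activeOnly=True))
        poolDoms = self.getDomains()

        # Consider both the pool's attached domains and any domain that still
        # has a running monitor thread; a just-detached domain is absent from
        # poolDoms, yet its monitor must be stopped so that sanlock releases
        # the host id for that lockspace.
        monitored = set(self.domainMonitor.monitoredDomains)  # assumed attribute
        for sdUUID in set(poolDoms) | monitored:
            if sdUUID not in activeDomains:
                try:
                    self.domainMonitor.stopMonitoring(sdUUID)
                except Exception:
                    self.log.error("Unable to stop monitoring %s", sdUUID)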
Verified using IS24.1 with FC, according to the steps in comment #9.

LIBVIRT Version: libvirt-0.10.2-29.el6
VDSM Version: vdsm-4.13.0-0.9.beta1.el6ev
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0040.html