Bug 1021557 - Failed to create StoragePool to FCP Data Center
Summary: Failed to create StoragePool to FCP Data Center
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.3.0
Assignee: Federico Simoncelli
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks: 1021374 3.3snap3
 
Reported: 2013-10-21 13:57 UTC by vvyazmin@redhat.com
Modified: 2016-02-10 20:23 UTC
CC List: 8 users

Fixed In Version: is24
Doc Type: Bug Fix
Doc Text:
The domain monitoring thread assumed that the list of monitored domains was the same as the list of attached domains, however this was not valid when any storage domain gets detached. In such cases, the thread was not stopped, and the host ID was still locked or in use. Consequently, users could not attach a previously-detached storage domain to a data center. This update adds a tag mechanism for pool-monitored domains, so now host IDs are correctly released, and unattached storage domains can be attached to data centers.
Clone Of:
Environment:
Last Closed: 2014-01-21 16:18:54 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
amureini: Triaged+


Attachments
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (10.17 MB, application/x-gzip)
2013-10-21 14:08 UTC, vvyazmin@redhat.com


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:0040 0 normal SHIPPED_LIVE vdsm bug fix and enhancement update 2014-01-21 20:26:21 UTC
oVirt gerrit 21165 0 None None None Never

Description vvyazmin@redhat.com 2013-10-21 13:57:47 UTC
Description of problem:
Failed to attach an “Unattached” Storage Domain to an FCP Data Center

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS19 environment:

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.27.beta1.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.16-1.el6ev.noarch
VDSM:  vdsm-4.13.0-0.3.beta1.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-29.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

How reproducible:
100 %

Steps to Reproduce:
1. Attach an “Unattached” Storage Domain to an FCP Data Center

Actual results:
Attaching the storage domain fails.

Expected results:
Attaching the storage domain succeeds.

Impact on user:
Users cannot attach an “Unattached” Storage Domain to an FCP Data Center.

Workaround:
None

Additional info:

[root@tigris01 ~]# service sanlock status
sanlock (pid 21652 21650) is running...

/var/log/ovirt-engine/engine.log

2013-10-21 15:38:49,122 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Failed in CreateStoragePoolVDS method
2013-10-21 15:38:49,123 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Error code AcquireHostIdFailure and error message VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2013-10-21 15:38:49,123 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Command org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=661, mMessage=Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))]]
2013-10-21 15:38:49,123 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-21 15:38:49,123 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Command CreateStoragePoolVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2013-10-21 15:38:49,123 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] FINISH, CreateStoragePoolVDSCommand, log id: 51eb0870
2013-10-21 15:38:49,124 ERROR [org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand] (pool-5-thread-48) [5694c659] Command org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')) (Failed with error AcquireHostIdFailure and code 661)



/var/log/vdsm/vdsm.log

Thread-4660::ERROR::2013-10-21 13:38:49,324::task::850::TaskManager.Task::(_setError) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 981, in createStoragePool
    poolName, masterDom, domList, masterVersion, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 615, in create
    self._acquireTemporaryClusterLock(msdUUID, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 557, in _acquireTemporaryClusterLock
    msd.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
Thread-4660::DEBUG::2013-10-21 13:38:49,342::task::869::TaskManager.Task::(_run) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Task._run: 36a06b9c-b911-47f9-bc7f-ca964f62eb23 (None, 'f6b336b6-08a2-4c6f-a063-a16ad7571bd1', 'DC-aaa', 'c7812
c6a-5959-44ff-a990-df1713f2aef7', ['c7812c6a-5959-44ff-a990-df1713f2aef7'], 1, None, 5, 60, 10, 3) {} failed - stopping task
Thread-4660::DEBUG::2013-10-21 13:38:49,342::task::1194::TaskManager.Task::(stop) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::stopping in state preparing (force False)
Thread-4660::DEBUG::2013-10-21 13:38:49,343::task::974::TaskManager.Task::(_decref) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::ref 1 aborting True
Thread-4660::INFO::2013-10-21 13:38:49,344::task::1151::TaskManager.Task::(prepare) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::aborting: Task is aborted: 'Cannot acquire host id' - code 661
Thread-4660::DEBUG::2013-10-21 13:38:49,346::task::1156::TaskManager.Task::(prepare) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Prepare: aborted: Cannot acquire host id
Thread-4660::DEBUG::2013-10-21 13:38:49,348::task::974::TaskManager.Task::(_decref) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::ref 0 aborting True
Thread-4660::DEBUG::2013-10-21 13:38:49,350::task::909::TaskManager.Task::(_doAbort) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Task._doAbort: force False
Thread-4660::DEBUG::2013-10-21 13:38:49,353::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-4660::DEBUG::2013-10-21 13:38:49,356::task::579::TaskManager.Task::(_updateState) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::moving from state preparing -> state aborting
Thread-4660::DEBUG::2013-10-21 13:38:49,359::task::534::TaskManager.Task::(__state_aborting) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::_aborting: recover policy none
Thread-4660::DEBUG::2013-10-21 13:38:49,361::task::579::TaskManager.Task::(_updateState) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::moving from state aborting -> state failed
Thread-4660::DEBUG::2013-10-21 13:38:49,363::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7': < ResourceRef 'Storage.c7812c6a-5959-44ff-a990
-df1713f2aef7', isValid: 'True' obj: 'None'>, 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1': < ResourceRef 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1', isValid: 'True' obj: 'None'>}
Thread-4660::DEBUG::2013-10-21 13:38:49,366::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-4660::DEBUG::2013-10-21 13:38:49,370::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7'
Thread-4660::DEBUG::2013-10-21 13:38:49,374::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7' (0 active users)
Thread-4660::DEBUG::2013-10-21 13:38:49,376::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7' is free, finding out if anyone is waiting for it.
Thread-4660::DEBUG::2013-10-21 13:38:49,378::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7', Clearing records.
Thread-4660::DEBUG::2013-10-21 13:38:49,383::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1'
Thread-4660::DEBUG::2013-10-21 13:38:49,387::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1' (0 active users)
Thread-4660::DEBUG::2013-10-21 13:38:49,388::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1' is free, finding out if anyone is waiting for it.
Thread-4660::DEBUG::2013-10-21 13:38:49,391::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1', Clearing records.
Thread-4660::ERROR::2013-10-21 13:38:49,395::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure',
 'Invalid argument'))", 'code': 661}}

Comment 2 vvyazmin@redhat.com 2013-10-21 14:08:05 UTC
Created attachment 814617 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Comment 3 Allon Mureinik 2013-10-22 08:29:18 UTC
Vlad, from the logs it seems as though the failure occurs when attaching the FIRST storage domain to a DC without any domains.
Is this true?
Does this reproduce when adding the second/third/Nth domain to an existing DC?

Comment 4 vvyazmin@redhat.com 2013-10-22 11:42:05 UTC
(In reply to Allon Mureinik from comment #3)
> Vlad, from the logs it seems as though the failure occurs when attaching the
> FIRST storage domain to a DC without any domains.
> Is this true?
> Does this reproduce when adding the second/third/Nth domain to an existing
> DC?

Vlad, from the logs it seems as though the failure occurs when attaching the FIRST storage domain to a DC without any domains.
Is this true? - Yes, it's true


Does this reproduce when adding the second/third/Nth domain to an existing DC? - No, attaching a second “Unattached” Storage Domain to an existing DC succeeds.

Comment 5 Federico Simoncelli 2013-10-29 16:35:25 UTC
Lowering this to medium: after speaking with vvyazmin, he wasn't able to reproduce the issue. Anyway, I don't want to close this yet, since I saw a couple of glitches in the logs that I want to investigate.

Comment 6 Federico Simoncelli 2013-10-29 16:36:14 UTC
Not a TestBlocker according to comment 5.

Comment 9 Federico Simoncelli 2013-11-08 10:54:28 UTC
This error means that the host id was already acquired in the past:

AcquireHostIdFailure: Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

After investigating it and successfully reproducing the issue, it seems that this is related to destroyStoragePool not stopping the relevant domain monitoring threads.

From the attached logs we can see the preparation to destroy the storage pool:

Thread-4199::INFO::2013-10-21 13:23:47,241::logUtils::44::dispatcher::(wrapper) Run and protect: deactivateStorageDomain(sdUUID='c7812c6a-5959-44ff-a990-df1713f2aef7', spUUID='445dea2d-511a-4529-a66a-670720e587de', msdUUID='00000000-0000-0000-0000-000000000000', masterVersion=1, options=None)
...
Thread-4201::INFO::2013-10-21 13:23:47,275::logUtils::44::dispatcher::(wrapper) Run and protect: spmStop(spUUID='445dea2d-511a-4529-a66a-670720e587de', options=None)
...
Thread-4204::INFO::2013-10-21 13:23:54,687::logUtils::44::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='445dea2d-511a-4529-a66a-670720e587de', hostID=1, scsiKey='445dea2d-511a-4529-a66a-670720e587de', remove=False, options=None)

and the destroy storage pool flow:

Thread-4256::INFO::2013-10-21 13:26:01,937::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='445dea2d-511a-4529-a66a-670720e587de', hostID=1, scsiKey='445dea2d-511a-4529-a66a-670720e587de', msdUUID='c7812c6a-5959-44ff-a990-df1713f2aef7', masterVersion=1, options=None)
...
Thread-4346::INFO::2013-10-21 13:26:07,967::logUtils::44::dispatcher::(wrapper) Run and protect: spmStart(spUUID='445dea2d-511a-4529-a66a-670720e587de', prevID=-1, prevLVER='0', recoveryMode=None, scsiFencing='false', maxHostID=250, domVersion='3', options=None)
...
f302e86a-7954-433a-adbd-328aed14f849::INFO::2013-10-21 13:26:28,282::clusterlock::225::SANLock::(acquire) Acquiring cluster lock for domain c7812c6a-5959-44ff-a990-df1713f2aef7 (id: 1)
f302e86a-7954-433a-adbd-328aed14f849::DEBUG::2013-10-21 13:26:28,292::clusterlock::251::SANLock::(acquire) Cluster lock for domain c7812c6a-5959-44ff-a990-df1713f2aef7 successfully acquired (id: 1)
...
Thread-4387::INFO::2013-10-21 13:26:31,515::logUtils::44::dispatcher::(wrapper) Run and protect: destroyStoragePool(spUUID='445dea2d-511a-4529-a66a-670720e587de', hostID=1, scsiKey='445dea2d-511a-4529-a66a-670720e587de', options=None)
...

and, as we can see, the host id (1) is never released.

Later on, when we try to use the same storage domain for a new pool, we get the EINVAL error because the host id was left acquired (with a different id) by the destroyStoragePool flow:

Thread-4660::INFO::2013-10-21 13:38:47,355::logUtils::44::dispatcher::(wrapper) Run and protect: createStoragePool(poolType=None, spUUID='f6b336b6-08a2-4c6f-a063-a16ad7571bd1', poolName='DC-aaa', masterDom='c7812c6a-5959-44ff-a990-df1713f2aef7', domList=['c7812c6a-5959-44ff-a990-df1713f2aef7'], masterVersion=1, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
...
Thread-4660::INFO::2013-10-21 13:38:49,323::clusterlock::174::SANLock::(acquireHostId) Acquiring host id for domain c7812c6a-5959-44ff-a990-df1713f2aef7 (id: 250)
Thread-4660::ERROR::2013-10-21 13:38:49,324::task::850::TaskManager.Task::(_setError) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 981, in createStoragePool
    poolName, masterDom, domList, masterVersion, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 615, in create
    self._acquireTemporaryClusterLock(msdUUID, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 557, in _acquireTemporaryClusterLock
    msd.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
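
To make the failure mode easier to follow outside of the logs, here is a tiny illustrative Python model of the behavior described in this comment. It is deliberately a toy (the ToyLockspaceRegistry class and its methods are assumptions for the example, not sanlock's real API): the lockspace for the domain was joined with host id 1 during the earlier pool flow and never left, so the later attempt to join the same lockspace with host id 250 fails with EINVAL.

import errno

class ToyLockspaceRegistry:
    """Toy per-host view of which lockspaces are joined and with which host id."""
    def __init__(self):
        self._joined = {}  # lockspace (sdUUID) -> host id

    def add_lockspace(self, lockspace, host_id):
        current = self._joined.get(lockspace)
        if current is not None and current != host_id:
            # Already joined with a different host id -> "Invalid argument"
            raise OSError(errno.EINVAL, "Sanlock lockspace add failure")
        self._joined[lockspace] = host_id

    def rem_lockspace(self, lockspace):
        self._joined.pop(lockspace, None)


sd = "c7812c6a-5959-44ff-a990-df1713f2aef7"
registry = ToyLockspaceRegistry()

registry.add_lockspace(sd, 1)    # joined with host id 1 in the old pool flow (see spmStart above)
# destroyStoragePool never stops the domain monitor, so the lockspace is never left.

try:
    registry.add_lockspace(sd, 250)   # createStoragePool tries again with host id 250
except OSError as exc:
    print("Cannot acquire host id:", (sd, exc))   # mirrors the AcquireHostIdFailure above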


I've been able to identify that this issue was introduced as part of the monitoring implementation:

 7b1cc6a Adding [start|stop]MonitoringDomain()

In fact, updateMonitoringThreads was modified to assume that the list of monitored domains is the same as the list of attached domains:

    +        poolDoms = self.getDomains()
    ...
    -        for sdUUID in monitoredDomains:
    +        for sdUUID in poolDoms:
                 if sdUUID not in activeDomains:
                     try:
                         self.domainMonitor.stopMonitoring(sdUUID)

This assumption is not valid when we are detaching a storage domain (as it won't be listed by getDomains). The resulting issue is that the domain monitoring thread is not stopped and the host id is left acquired (the ids file/device is kept open by sanlock).
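
A minimal, self-contained Python sketch of the logic above follows. It is illustrative only, not vdsm's actual code; the toy DomainMonitor class and the function names are assumptions for the example. The broken variant decides which monitors to stop by iterating the pool's currently attached domains (the getDomains() change in the diff), so a domain that was just detached is never visited and its monitor, and therefore its host id, is left behind. The fixed variant, in the spirit of the tag mechanism described in the Doc Text, asks the monitor which domains it is tracking for the pool.

class DomainMonitor:
    """Toy tracker of which storage domains have a monitoring thread (and host id)."""
    def __init__(self):
        self._monitored = set()

    def startMonitoring(self, sdUUID):
        self._monitored.add(sdUUID)       # in vdsm this also acquires the host id

    def stopMonitoring(self, sdUUID):
        self._monitored.discard(sdUUID)   # ...and releases the host id

    @property
    def monitoredDomains(self):
        return set(self._monitored)


def update_monitoring_threads_broken(monitor, attached_domains, active_domains):
    # Broken: iterates only the domains still attached to the pool, so a domain
    # that was detached is never visited and its monitor is never stopped.
    for sdUUID in attached_domains:
        if sdUUID not in active_domains:
            monitor.stopMonitoring(sdUUID)


def update_monitoring_threads_fixed(monitor, active_domains):
    # Fix direction: iterate what is actually being monitored for the pool,
    # so detached domains are stopped and their host ids released.
    for sdUUID in monitor.monitoredDomains:
        if sdUUID not in active_domains:
            monitor.stopMonitoring(sdUUID)


if __name__ == "__main__":
    sd = "c7812c6a-5959-44ff-a990-df1713f2aef7"

    m = DomainMonitor()
    m.startMonitoring(sd)
    # The domain is detached: it is no longer attached nor active.
    update_monitoring_threads_broken(m, attached_domains=set(), active_domains=set())
    print("broken variant still monitoring:", m.monitoredDomains)  # {sd} -> host id leaked

    m = DomainMonitor()
    m.startMonitoring(sd)
    update_monitoring_threads_fixed(m, active_domains=set())
    print("fixed variant still monitoring:", m.monitoredDomains)   # empty -> host id released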

The steps to reproduce are:

1. create a storage domain
2. attach the storage domain to an empty data center
3. once the spm is acquired, deactivate the storage domain
4. remove the data center
5. create a new data center
6. attach the storage domain to the newly created data center

Actual results:
Attaching the storage domain fails with SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument').

Expected results:
Attaching the storage domain should succeed.

Comment 10 Aharon Canan 2013-11-24 13:10:32 UTC
Verified using is24.1 with FC,

according to the steps in comment #9


LIBVIRT Version:
libvirt-0.10.2-29.el6
VDSM Version:
vdsm-4.13.0-0.9.beta1.el6ev

Comment 11 errata-xmlrpc 2014-01-21 16:18:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

