Bug 1021557

Summary: Failed to create StoragePool to FCP Data Center
Product: Red Hat Enterprise Virtualization Manager
Reporter: vvyazmin <vvyazmin>
Component: vdsm
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA
QA Contact: Aharon Canan <acanan>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: abaron, acanan, amureini, bazulay, iheim, lpeer, scohen, yeylon
Target Milestone: ---
Keywords: Regression
Target Release: 3.3.0
Flags: amureini: Triaged+
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: is24
Doc Type: Bug Fix
Doc Text:
The domain monitoring thread assumed that the list of monitored domains was the same as the list of attached domains; however, this was not valid when a storage domain was detached. In such cases the thread was not stopped, and the host ID remained locked or in use. Consequently, users could not attach a previously detached storage domain to a data center. This update adds a tag mechanism for pool-monitored domains, so host IDs are now correctly released and unattached storage domains can be attached to data centers.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 16:18:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1021374, 1038284    
Attachments:
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (flags: none)

Description vvyazmin@redhat.com 2013-10-21 13:57:47 UTC
Description of problem:
Failed to attach an “Unattached” Storage Domain to an FCP Data Center.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS19 environment:

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.27.beta1.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.16-1.el6ev.noarch
VDSM:  vdsm-4.13.0-0.3.beta1.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-29.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

How reproducible:
100 %

Steps to Reproduce:
1. Attach an “Unattached” Storage Domain to an FCP Data Center

Actual results:
The attach operation fails.

Expected results:
The attach operation succeeds.

Impact on user:
Users cannot attach an “Unattached” Storage Domain to an FCP Data Center.

Workaround:
None

Additional info:

[root@tigris01 ~]# service sanlock status
sanlock (pid 21652 21650) is running...

/var/log/ovirt-engine/engine.log

2013-10-21 15:38:49,122 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Failed in CreateStoragePoolVDS method
2013-10-21 15:38:49,123 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Error code AcquireHostIdFailure and error message VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2013-10-21 15:38:49,123 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Command org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=661, mMessage=Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))]]
2013-10-21 15:38:49,123 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-21 15:38:49,123 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] Command CreateStoragePoolVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2013-10-21 15:38:49,123 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (pool-5-thread-48) [5694c659] FINISH, CreateStoragePoolVDSCommand, log id: 51eb0870
2013-10-21 15:38:49,124 ERROR [org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand] (pool-5-thread-48) [5694c659] Command org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')) (Failed with error AcquireHostIdFailure and code 661)



/var/log/vdsm/vdsm.log

Thread-4660::ERROR::2013-10-21 13:38:49,324::task::850::TaskManager.Task::(_setError) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 981, in createStoragePool
    poolName, masterDom, domList, masterVersion, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 615, in create
    self._acquireTemporaryClusterLock(msdUUID, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 557, in _acquireTemporaryClusterLock
    msd.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
Thread-4660::DEBUG::2013-10-21 13:38:49,342::task::869::TaskManager.Task::(_run) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Task._run: 36a06b9c-b911-47f9-bc7f-ca964f62eb23 (None, 'f6b336b6-08a2-4c6f-a063-a16ad7571bd1', 'DC-aaa', 'c7812
c6a-5959-44ff-a990-df1713f2aef7', ['c7812c6a-5959-44ff-a990-df1713f2aef7'], 1, None, 5, 60, 10, 3) {} failed - stopping task
Thread-4660::DEBUG::2013-10-21 13:38:49,342::task::1194::TaskManager.Task::(stop) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::stopping in state preparing (force False)
Thread-4660::DEBUG::2013-10-21 13:38:49,343::task::974::TaskManager.Task::(_decref) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::ref 1 aborting True
Thread-4660::INFO::2013-10-21 13:38:49,344::task::1151::TaskManager.Task::(prepare) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::aborting: Task is aborted: 'Cannot acquire host id' - code 661
Thread-4660::DEBUG::2013-10-21 13:38:49,346::task::1156::TaskManager.Task::(prepare) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Prepare: aborted: Cannot acquire host id
Thread-4660::DEBUG::2013-10-21 13:38:49,348::task::974::TaskManager.Task::(_decref) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::ref 0 aborting True
Thread-4660::DEBUG::2013-10-21 13:38:49,350::task::909::TaskManager.Task::(_doAbort) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Task._doAbort: force False
Thread-4660::DEBUG::2013-10-21 13:38:49,353::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-4660::DEBUG::2013-10-21 13:38:49,356::task::579::TaskManager.Task::(_updateState) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::moving from state preparing -> state aborting
Thread-4660::DEBUG::2013-10-21 13:38:49,359::task::534::TaskManager.Task::(__state_aborting) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::_aborting: recover policy none
Thread-4660::DEBUG::2013-10-21 13:38:49,361::task::579::TaskManager.Task::(_updateState) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::moving from state aborting -> state failed
Thread-4660::DEBUG::2013-10-21 13:38:49,363::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7': < ResourceRef 'Storage.c7812c6a-5959-44ff-a990
-df1713f2aef7', isValid: 'True' obj: 'None'>, 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1': < ResourceRef 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1', isValid: 'True' obj: 'None'>}
Thread-4660::DEBUG::2013-10-21 13:38:49,366::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-4660::DEBUG::2013-10-21 13:38:49,370::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7'
Thread-4660::DEBUG::2013-10-21 13:38:49,374::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7' (0 active users)
Thread-4660::DEBUG::2013-10-21 13:38:49,376::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7' is free, finding out if anyone is waiting for it.
Thread-4660::DEBUG::2013-10-21 13:38:49,378::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.c7812c6a-5959-44ff-a990-df1713f2aef7', Clearing records.
Thread-4660::DEBUG::2013-10-21 13:38:49,383::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1'
Thread-4660::DEBUG::2013-10-21 13:38:49,387::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1' (0 active users)
Thread-4660::DEBUG::2013-10-21 13:38:49,388::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1' is free, finding out if anyone is waiting for it.
Thread-4660::DEBUG::2013-10-21 13:38:49,391::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.f6b336b6-08a2-4c6f-a063-a16ad7571bd1', Clearing records.
Thread-4660::ERROR::2013-10-21 13:38:49,395::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure',
 'Invalid argument'))", 'code': 661}}

Comment 2 vvyazmin@redhat.com 2013-10-21 14:08:05 UTC
Created attachment 814617 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Comment 3 Allon Mureinik 2013-10-22 08:29:18 UTC
Vlad, from the logs it seems as though the failure occurs when attaching the FIRST storage domain to a DC without any domains.
Is this true?
Does this reproduce when adding the second/third/Nth domain to an existing DC?

Comment 4 vvyazmin@redhat.com 2013-10-22 11:42:05 UTC
(In reply to Allon Mureinik from comment #3)
> Vlad, from the logs it seems as though the failure occurs when attaching the
> FIRST storage domain to a DC without any domains.
> Is this true?
> Does this reproduce when adding the second/third/Nth domain to an existing
> DC?

Vlad, from the logs it seems as though the failure occurs when attaching the FIRST storage domain to a DC without any domains.
Is this true? - Yes, it's true


Does this reproduce when adding the second/third/Nth domain to an existing DC? - No, I succeeded in attaching a second “Unattached” Storage Domain to an existing DC.

Comment 5 Federico Simoncelli 2013-10-29 16:35:25 UTC
Lowering this to medium: after speaking with vvyazmin, he wasn't able to reproduce the issue. I don't want to close this yet, though, since I saw a couple of glitches in the logs that I want to investigate.

Comment 6 Federico Simoncelli 2013-10-29 16:36:14 UTC
Not a TestBlocker according to comment 5.

Comment 9 Federico Simoncelli 2013-11-08 10:54:28 UTC
This error means that the host id was already acquired in the past:

AcquireHostIdFailure: Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
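
To confirm this state, the sanlock Python binding that clusterlock.py uses can be queried for the domain's lockspace. This is only a rough diagnostic sketch; the ids path is a hypothetical example (for a block domain it is the "ids" LV of the domain's VG) and the inq_lockspace semantics are assumed to match what clusterlock.py relies on:

    import sanlock

    SD_UUID = 'c7812c6a-5959-44ff-a990-df1713f2aef7'
    OLD_HOST_ID = 1  # the id acquired by the previous (destroyed) pool
    IDS_PATH = '/dev/%s/ids' % SD_UUID  # hypothetical path, for illustration

    # inq_lockspace is assumed to return True when the lockspace is
    # currently joined with this host id on this host, False otherwise.
    if sanlock.inq_lockspace(SD_UUID, OLD_HOST_ID, IDS_PATH):
        print("host id %d is still held for domain %s" % (OLD_HOST_ID, SD_UUID))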

After investigating and successfully reproducing the issue, it appears to be related to destroyStoragePool not stopping the relevant domain monitoring threads.

From the attached logs we can see the preparation to destroy the storage pool:

Thread-4199::INFO::2013-10-21 13:23:47,241::logUtils::44::dispatcher::(wrapper) Run and protect: deactivateStorageDomain(sdUUID='c7812c6a-5959-44ff-a990-df1713f2aef7', spUUID='445dea2d-511a-4529-a66a-670720e587de', msdUUID='00000000-0000-0000-0000-000000000000', masterVersion=1, options=None)
...
Thread-4201::INFO::2013-10-21 13:23:47,275::logUtils::44::dispatcher::(wrapper) Run and protect: spmStop(spUUID='445dea2d-511a-4529-a66a-670720e587de', options=None)
...
Thread-4204::INFO::2013-10-21 13:23:54,687::logUtils::44::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='445dea2d-511a-4529-a66a-670720e587de', hostID=1, scsiKey='445dea2d-511a-4529-a66a-670720e587de', remove=False, options=None)

and the destroy storage pool flow:

Thread-4256::INFO::2013-10-21 13:26:01,937::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='445dea2d-511a-4529-a66a-670720e587de', hostID=1, scsiKey='445dea2d-511a-4529-a66a-670720e587de', msdUUID='c7812c6a-5959-44ff-a990-df1713f2aef7', masterVersion=1, options=None)
...
Thread-4346::INFO::2013-10-21 13:26:07,967::logUtils::44::dispatcher::(wrapper) Run and protect: spmStart(spUUID='445dea2d-511a-4529-a66a-670720e587de', prevID=-1, prevLVER='0', recoveryMode=None, scsiFencing='false', maxHostID=250, domVersion='3', options=None)
...
f302e86a-7954-433a-adbd-328aed14f849::INFO::2013-10-21 13:26:28,282::clusterlock::225::SANLock::(acquire) Acquiring cluster lock for domain c7812c6a-5959-44ff-a990-df1713f2aef7 (id: 1)
f302e86a-7954-433a-adbd-328aed14f849::DEBUG::2013-10-21 13:26:28,292::clusterlock::251::SANLock::(acquire) Cluster lock for domain c7812c6a-5959-44ff-a990-df1713f2aef7 successfully acquired (id: 1)
...
Thread-4387::INFO::2013-10-21 13:26:31,515::logUtils::44::dispatcher::(wrapper) Run and protect: destroyStoragePool(spUUID='445dea2d-511a-4529-a66a-670720e587de', hostID=1, scsiKey='445dea2d-511a-4529-a66a-670720e587de', options=None)
...

As we can see, the host id (1) is never released.
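
The release that should have happened at this point boils down, in vdsm terms, to something like the following simplified sketch (not the exact call site):

    # Simplified sketch: the domain monitor shutdown is what normally ends
    # up releasing the host id via sanlock.rem_lockspace().
    dom = sdCache.produce(sdUUID='c7812c6a-5959-44ff-a990-df1713f2aef7')
    dom.releaseHostId(1)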

Later on, when we try to use the same storage domain for a new pool, we get the EINVAL error because the host id was left acquired by the destroyStoragePool flow (while the new attempt uses a different id):

Thread-4660::INFO::2013-10-21 13:38:47,355::logUtils::44::dispatcher::(wrapper) Run and protect: createStoragePool(poolType=None, spUUID='f6b336b6-08a2-4c6f-a063-a16ad7571bd1', poolName='DC-aaa', masterDom='c7812c6a-5959-44ff-a990-df1713f2aef7', domList=['c7812c6a-5959-44ff-a990-df1713f2aef7'], masterVersion=1, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
...
Thread-4660::INFO::2013-10-21 13:38:49,323::clusterlock::174::SANLock::(acquireHostId) Acquiring host id for domain c7812c6a-5959-44ff-a990-df1713f2aef7 (id: 250)
Thread-4660::ERROR::2013-10-21 13:38:49,324::task::850::TaskManager.Task::(_setError) Task=`36a06b9c-b911-47f9-bc7f-ca964f62eb23`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 981, in createStoragePool
    poolName, masterDom, domList, masterVersion, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 615, in create
    self._acquireTemporaryClusterLock(msdUUID, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 557, in _acquireTemporaryClusterLock
    msd.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('c7812c6a-5959-44ff-a990-df1713f2aef7', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))


I've been able to identify that this issue was introduced as part of the monitoring implementation:

 7b1cc6a Adding [start|stop]MonitoringDomain()

In fact, updateMonitoringThreads was modified to assume that the list of monitored domains is the same as the list of attached domains:

    +        poolDoms = self.getDomains()
    ...
    -        for sdUUID in monitoredDomains:
    +        for sdUUID in poolDoms:
                 if sdUUID not in activeDomains:
                     try:
                         self.domainMonitor.stopMonitoring(sdUUID)

This assumption is not valid when we are detaching a storage domain (as it won't be listed by getDomains). The resulting issue is that the domain monitoring thread is not stopped and the host id is left acquired (the ids file/device is kept open by sanlock).
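
The fix direction (see the Doc Text above) is to track which domains are monitored on behalf of the pool and to drive the stop loop from that set instead of getDomains(). The snippet below only illustrates the idea; the attribute name poolMonitoredDomains is made up and is not the actual vdsm API:

    # Illustration only: stop the pool monitor for every domain that the
    # pool was monitoring but that is no longer active/attached, so that
    # detached domains also get their host id released.
    for sdUUID in self.domainMonitor.poolMonitoredDomains:  # hypothetical
        if sdUUID not in activeDomains:
            try:
                self.domainMonitor.stopMonitoring(sdUUID)
            except Exception:
                self.log.exception("Failed to stop monitoring domain %s",
                                   sdUUID)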

The steps to reproduce are:

1. create a storage domain
2. attach the storage domain to an empty data center
3. once the spm is acquired, deactivate the storage domain
4. remove the data center
5. create a new data center
6. attach the storage domain to the newly created data center
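
For reference, the steps above can be scripted against the engine; below is a rough sketch using the rhevm-sdk-python bindings listed in the description. The URL, credentials and domain name are placeholders, the storage domain is assumed to already exist as "Unattached", and status polling between steps is omitted:

    from ovirtsdk.api import API
    from ovirtsdk.xml import params

    api = API(url='https://rhevm.example.com/api',  # placeholder engine URL
              username='admin@internal', password='PASSWORD', insecure=True)

    # 2. attach the (pre-created) storage domain to an empty FCP data center
    dc = api.datacenters.add(params.DataCenter(
        name='DC-aaa', storage_type='fcp',
        version=params.Version(major=3, minor=3)))
    dc.storagedomains.add(api.storagedomains.get(name='fcp-domain-01'))

    # 3. once the SPM is acquired, deactivate the storage domain
    dc.storagedomains.get(name='fcp-domain-01').deactivate()

    # 4. remove the data center
    dc.delete()

    # 5. + 6. create a new data center and attach the same domain again
    dc2 = api.datacenters.add(params.DataCenter(
        name='DC-bbb', storage_type='fcp',
        version=params.Version(major=3, minor=3)))
    dc2.storagedomains.add(api.storagedomains.get(name='fcp-domain-01'))
    # without the fix, this last call fails with AcquireHostIdFailure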

Actual results:
Attaching the storage domain fails with SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument').

Expected results:
Attaching the storage domain should succeed.

Comment 10 Aharon Canan 2013-11-24 13:10:32 UTC
Verified using is24.1 with FC, according to the steps in comment #9.


LIBVIRT Version:
libvirt-0.10.2-29.el6
VDSM Version:
vdsm-4.13.0-0.9.beta1.el6ev

Comment 11 errata-xmlrpc 2014-01-21 16:18:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html