Bug 806774 - storage domain uuid can't be reused after storage pool destroyed
Summary: storage domain uuid can't be reused after storage pool destroyed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: unspecified
Hardware: x86_64
OS: Linux
Target Milestone: ---
Target Release: 3.2
Assignee: Royce Lv
QA Contact:
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2012-03-26 08:11 UTC by Royce Lv
Modified: 2016-02-10 16:45 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-15 06:46:19 UTC
oVirt Team: Storage
Embargoed:


Attachments
vdsm.log (150.03 KB, text/plain)
2012-03-26 09:00 UTC, Royce Lv

Description Royce Lv 2012-03-26 08:11:58 UTC
Description of problem:
storage domain uuid can't be reused after storage pool destroyed

Version-Release number of selected component (if applicable):
vdsm-4.9.4-0.18

How reproducible:
often

Steps to Reproduce:
(1) create storage domain sd1 with uuid '1be32ac7-1e12-4823-8e8c-8c887333fe37'
(2) create a storage pool with sd1 as its master domain
(3) connect the storage pool
(4) start the SPM (spmStart)
(5) destroy the storage pool
(6) format the storage domain
(7) create a storage domain with uuid '1be32ac7-1e12-4823-8e8c-8c887333fe37'
  
Actual results:
 raise se.StorageDomainAlreadyExists

Expected results:
 storage domain successfully created


Additional info (log):
CP Server Thread-23::ERROR::2012-03-26 15:45:38,961::task::855::TaskManager.Task::(_setError) Task=`b56f239f-72c6-4d83-b80f-7b00c8946221`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 863, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1912, in createStorageDomain
    self.validateNonDomain(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 167, in validateNonDomain
    raise se.StorageDomainAlreadyExists(sdUUID)
StorageDomainAlreadyExists: Storage domain already exists: ('1be32ac7-1e12-4823-8e8c-8c887333fe37',)

Comment 1 Dan Kenigsberg 2012-03-26 08:21:48 UTC
Could you attach a complete vdsm.log, including the part showing the removal of the storage domain (step 6 above)? I'd appreciate it if you also tried to reproduce with something closer to master/HEAD.

Comment 2 Royce Lv 2012-03-26 09:00:39 UTC
Created attachment 572709 [details]
vdsm.log

Comment 3 Royce Lv 2012-04-01 13:41:52 UTC
The storage pool's __init__ function introduces a circular reference to itself:
def __init__(self, spUUID, taskManager):
    self._domainsToUpgrade = []
    self.lock = threading.RLock()
    self._setUnsafe()
    self.spUUID = str(spUUID)
    self.poolPath = os.path.join(self.storage_repository, self.spUUID)
    self.id = None
    self.scsiKey = None
    self.taskMng = taskManager
    self._poolFile = os.path.join(self._poolsTmpDir, self.spUUID)
    self.hsmMailer = None
    self.spmMailer = None
    self.masterDomain = None
    self.spmRole = SPM_FREE
    self.domainMonitor = DomainMonitor(self.monitorInterval)
    self._upgradeCallback = partial(StoragePool._upgradePoolDomain, self)  # <-- circular reference here

As a result, the StoragePool object cannot be deleted after destroy, so its master domain remains referenced and cannot be deleted either.
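
To make the cycle concrete, here is a minimal, hedged sketch (not vdsm code; the class and attribute names only mimic the snippet above), together with one possible way to break the cycle using a weak reference:

import weakref
from functools import partial


class Pool(object):
    def __init__(self):
        # The partial stores a strong reference to self, and self stores the
        # partial: a reference cycle. The pool is no longer freed by simple
        # reference counting when the last external reference is dropped.
        self._upgradeCallback = partial(Pool._upgradePoolDomain, self)

    def _upgradePoolDomain(self, sdUUID):
        print("upgrading %s" % sdUUID)


class WeakPool(object):
    def __init__(self):
        # Keeping only a weak reference to self inside the callback avoids
        # the cycle, so the pool (and its master domain) can be released.
        ref = weakref.ref(self)

        def _upgradeCallback(sdUUID):
            pool = ref()
            if pool is not None:
                pool._upgradePoolDomain(sdUUID)

        self._upgradeCallback = _upgradeCallback

    def _upgradePoolDomain(self, sdUUID):
        print("upgrading %s" % sdUUID)

In the second variant the callback no longer keeps the pool alive, so dropping the last external reference frees it immediately.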

Comment 4 Royce Lv 2012-04-02 07:24:52 UTC
Also, when a job is added for spmStart,
in task.py, class Job:
    def __init__(self, name, cmd, *argslist, **argsdict):
        self.name = name
        self.cmd = cmd  # references pool.startSpm
For some reason the Job is not deleted after it has been run. The Job is not deleted, so the pool is not deleted, so the pool's master domain remains undeleted. I have provided a workaround and am still investigating why the Job is not deleted:
http://gerrit.ovirt.org/#change,3242
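
A minimal sketch (not vdsm code) of why this keeps the pool alive: a bound method such as pool.startSpm holds a strong reference to its instance, so the pool cannot be freed while the Job that stores it is still referenced.

import sys


class Pool(object):
    def startSpm(self):
        print("starting SPM")


class Job(object):
    def __init__(self, name, cmd):
        self.name = name
        self.cmd = cmd          # bound method -> strong reference to the pool


pool = Pool()
job = Job("spmStart", pool.startSpm)
print(sys.getrefcount(pool))    # counted via the local name and job.cmd.__self__
del pool
job.cmd()                       # still works: the Job keeps the pool alive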

Comment 5 Royce Lv 2012-04-10 07:44:35 UTC
Actually there are two bugs behind this:
1. The storage pool holds a circular reference to itself.
2. The task is not released after task clean.

http://gerrit.ovirt.org/#change,3242 corrects the first one.

The task is not released, so task.jobs is not released; each Job references pool.startSpm, so the pool is not released. It is a chain of references.
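
A hypothetical sketch of that chain (the Task, Job, and clean() below are illustrative stand-ins, not vdsm's real classes): dropping the jobs when the task is cleaned breaks Task -> Job -> pool.startSpm -> pool, so the pool can finally be released.

class Pool(object):
    def startSpm(self):
        print("starting SPM")


class Job(object):
    def __init__(self, name, cmd):
        self.name = name
        self.cmd = cmd                 # keeps the pool alive


class Task(object):
    def __init__(self):
        self.jobs = []

    def addJob(self, job):
        self.jobs.append(job)

    def run(self):
        for job in self.jobs:
            job.cmd()

    def clean(self):
        # Dropping the jobs breaks Task -> Job -> pool, so once the task
        # itself is released the pool can be collected as well.
        self.jobs = []


task = Task()
task.addJob(Job("spmStart", Pool().startSpm))
task.run()
task.clean()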

Comment 6 Zhou Zheng Sheng 2012-09-20 06:02:35 UTC
I have the same problem. I do not create a storage pool; I just create and format a storage domain, then create a storage domain with the same UUID, and it complains about "Storage domain already exists".


vdsm version:
vdsm-4.10.0


How reproducible:
Always


Steps to Reproduce:
# First, get a place to store local storage data
mkdir /teststorage
chown -R vdsm:kvm /teststorage

# Then create a domain and destroy it
vdsClient -s localhost connectStorageServer 4 '' 'connection=/teststorage,id=1'
vdsClient -s localhost createStorageDomain 4 1de3f4e2-a9f9-11e1-9956-00247edb4743 'test storage' /teststorage 1 0
vdsClient -s localhost formatStorageDomain 1de3f4e2-a9f9-11e1-9956-00247edb4743
# Now you will see that the data in /teststorage is deleted after we format the storage domain

# Create a storage domain using the same uuid
vdsClient -s localhost createStorageDomain 4 1de3f4e2-a9f9-11e1-9956-00247edb4743 'test storage' /teststorage 1 0
# It reports an error saying "Storage domain already exists: ('1de3f4e2-a9f9-11e1-9956-00247edb4743',)"
# But if we get the domains list, there is nothing in it
vdsClient -s localhost getStorageDomainsList


Actual results:
 raise se.StorageDomainAlreadyExists


Expected results:
 storage domain successfully created


My investigation:
When we call formatStorageDomain() in vdsm/storage/hsm.py, it calls _recycle(sd) at the end. _recycle() calls sdCache.manuallyRemoveDomain() in vdsm/storage/sdc.py, and StorageDomainCache.manuallyRemoveDomain() deletes the domain from self.__domainCache.

However, when we want to create a storage domain reusing a previous UUID, vdsm first checks whether this UUID is already in the StorageDomainCache by calling StorageDomainCache.produce(). This method calls _getDomainFromCache(), which gets the storage domain from self.__proxyCache. The earlier format action deleted the domain only from self.__domainCache, not from self.__proxyCache, so the UUID is still in self.__proxyCache. vdsm then thinks the domain already exists and reports an error.

So the root cause is an inconsistency between __domainCache and __proxyCache.
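
A simplified sketch of that inconsistency (the class below is only an illustration, not vdsm's StorageDomainCache): removing an entry from one internal dict but not the other leaves a lookup that consults the second dict believing the domain still exists.

class DomainCache(object):
    def __init__(self):
        self.__domainCache = {}
        self.__proxyCache = {}

    def add(self, sdUUID, domain):
        self.__domainCache[sdUUID] = domain
        self.__proxyCache[sdUUID] = "proxy-for-%s" % sdUUID

    def buggyRemove(self, sdUUID):
        # Mirrors the behaviour described above: only one dict is cleaned.
        del self.__domainCache[sdUUID]

    def contains(self, sdUUID):
        # A lookup that consults the proxy cache still sees the stale entry.
        return sdUUID in self.__proxyCache


cache = DomainCache()
cache.add("1de3f4e2-a9f9-11e1-9956-00247edb4743", object())
cache.buggyRemove("1de3f4e2-a9f9-11e1-9956-00247edb4743")
print(cache.contains("1de3f4e2-a9f9-11e1-9956-00247edb4743"))  # True -> "already exists"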

I'm doing some storage functional tests. When I try to run the tests twice, the second run fails because of this bug.

Comment 7 Ayal Baron 2012-09-23 07:42:42 UTC
Fede, is this issue taken care of with one of your latest patches?

Comment 8 Zhou Zheng Sheng 2012-09-24 03:47:15 UTC
Thanks Ayal Baron. I tested Federico's patch from http://gerrit.ovirt.org/#/c/7511/4, but it does not fix this bug. Then I investigated a little further.

The cause is in HSM._recycle(self, dom): the code first tries to delete the storage domain from the cache by calling "sdCache.manuallyRemoveDomain(dom.sdUUID)". This is fine. Then the code calls "dom.format(dom.sdUUID)". As we know, "dom" is a DomainProxy object defined in "vdsm/storage/sdc.py". This proxy forwards method calls to the real domain, but before forwarding a call it must invoke StorageDomainCache._realProduce() to get a reference to the real domain object, and _realProduce() adds the domain it finds to the cache.

So we remove the domain from the cache, but after formatting it, it is added to the cache again.
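
A simplified sketch of that behaviour (FakeCache and DomainProxy below are illustrative stand-ins, not vdsm's real classes): any call forwarded through the proxy first produces the real domain, and producing it puts the domain straight back into the cache that was just cleaned.

class FakeCache(object):
    def __init__(self):
        self.domains = {}

    def _realProduce(self, sdUUID):
        dom = "real-domain-%s" % sdUUID
        self.domains[sdUUID] = dom      # re-inserted into the cache here
        return dom

    def manuallyRemoveDomain(self, sdUUID):
        del self.domains[sdUUID]


class DomainProxy(object):
    def __init__(self, cache, sdUUID):
        self._cache = cache
        self.sdUUID = sdUUID

    def format(self):
        # Forwarding the call produces the real domain, which puts the
        # domain back into the cache even though it was just removed.
        dom = self._cache._realProduce(self.sdUUID)
        print("formatting %s" % dom)


cache = FakeCache()
proxy = DomainProxy(cache, "1de3f4e2-a9f9-11e1-9956-00247edb4743")
cache.domains[proxy.sdUUID] = proxy
cache.manuallyRemoveDomain(proxy.sdUUID)   # removed from the cache...
proxy.format()                             # ...but produced and re-added here
print(proxy.sdUUID in cache.domains)       # True: the stale entry is back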

Then I just tried to reverse the order of these actions as follows:
def _recycle(self, dom):
    dom.format(dom.sdUUID)

    try:
        sdCache.manuallyRemoveDomain(dom.sdUUID)
    except KeyError:
        self.log.warn("Storage domain %s doesn't exist in cache. Trying recycle leftovers ...", dom.sdUUID)


With this change the bug is fixed. However, I am not sure whether reversing the order of these actions could cause other problems.

Comment 9 Zhou Zheng Sheng 2012-09-24 04:55:11 UTC
Trying to fix this. Thanks in advance to anyone who can give some advice.

http://gerrit.ovirt.org/#/c/8144/

Comment 10 Zhou Zheng Sheng 2012-11-21 08:19:37 UTC
The patch has been accepted and the bug is fixed.

Comment 11 Itamar Heim 2012-11-21 21:08:29 UTC
If the patch is merged, the bug should be in MODIFIED. Changing bug status.

Comment 12 Itamar Heim 2013-01-16 16:08:29 UTC
3.2 beta built, moving to ON_QA status to allow testing

