Bug 806774 - storage domain uuid can't be reused after storage pool destroyed
Status: CLOSED CURRENTRELEASE
Product: oVirt
Classification: Community
Component: vdsm
Hardware: x86_64 Linux
Severity: unspecified
Target Release: 3.2
Assigned To: Royce Lv
Whiteboard: storage
Reported: 2012-03-26 04:11 EDT by Royce Lv
Modified: 2016-02-10 11:45 EST

Doc Type: Bug Fix
Last Closed: 2013-02-15 01:46:19 EST
oVirt Team: Storage

Attachments
vdsm.log (150.03 KB, text/plain)
2012-03-26 05:00 EDT, Royce Lv

Description Royce Lv 2012-03-26 04:11:58 EDT
Description of problem:
storage domain uuid can't be reused after storage pool destroyed

Version-Release number of selected component (if applicable):
vdsm-4.9.4-0.18

How reproducible:
often

Steps to Reproduce:
(1) create storage domain sd1 with uuid '1be32ac7-1e12-4823-8e8c-8c887333fe37'
(2) create storage pool with its master domain sd1
(3) connect storage pool
(4) spmStart
(5) destroy storage pool
(6) format storage domain
(7) create storage domain with uuid '1be32ac7-1e12-4823-8e8c-8c887333fe37'
  
Actual results:
 raise se.StorageDomainAlreadyExists

Expected results:
 storage domain successfully created


Additional info (log):
CP Server Thread-23::ERROR::2012-03-26 15:45:38,961::task::855::TaskManager.Task::(_setError) Task=`b56f239f-72c6-4d83-b80f-7b00c8946221`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 863, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1912, in createStorageDomain
    self.validateNonDomain(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 167, in validateNonDomain
    raise se.StorageDomainAlreadyExists(sdUUID)
StorageDomainAlreadyExists: Storage domain already exists: ('1be32ac7-1e12-4823-8e8c-8c887333fe37',)
Comment 1 Dan Kenigsberg 2012-03-26 04:21:48 EDT
Could you attach a complete vdsm.log (including the part showing the removal of the storage domain in step (4))? I'd appreciate it if you could also try to reproduce with something closer to master/HEAD.
Comment 2 Royce Lv 2012-03-26 05:00:39 EDT
Created attachment 572709 [details]
vdsm.log
Comment 3 Royce Lv 2012-04-01 09:41:52 EDT
The storage pool's __init__ function introduces a circular reference to itself:
def __init__(self, spUUID, taskManager):
    self._domainsToUpgrade = []
    self.lock = threading.RLock()
    self._setUnsafe()
    self.spUUID = str(spUUID)
    self.poolPath = os.path.join(self.storage_repository, self.spUUID)
    self.id = None
    self.scsiKey = None
    self.taskMng = taskManager
    self._poolFile = os.path.join(self._poolsTmpDir, self.spUUID)
    self.hsmMailer = None
    self.spmMailer = None
    self.masterDomain = None
    self.spmRole = SPM_FREE
    self.domainMonitor = DomainMonitor(self.monitorInterval)
    self._upgradeCallback = partial(StoragePool._upgradePoolDomain, self)  # <-- circular reference here

Because of that, the StoragePool object cannot be deleted after destroy, so its master domain remains referenced and cannot be deleted either.
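
A minimal standalone sketch (not vdsm code; the class and attribute names below are only illustrative) of why this matters: storing partial(Cls.method, self) on the instance makes the instance reference itself, so plain reference counting never frees it, and only a cyclic garbage-collection pass can reclaim it.

import gc
import weakref
from functools import partial

class Pool(object):
    def __init__(self):
        # self -> self._upgradeCallback -> partial -> self: a reference cycle
        self._upgradeCallback = partial(Pool._upgradePoolDomain, self)

    def _upgradePoolDomain(self, *args):
        pass

gc.disable()                  # rely on reference counting only
p = Pool()
alive = weakref.ref(p)
del p
print(alive() is not None)    # True: the cycle keeps the pool object alive
gc.collect()                  # an explicit cycle-collection pass can break it
print(alive())                # None: reclaimed only after the gc pass
gc.enable()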
Comment 4 Royce Lv 2012-04-02 03:24:52 EDT
And when a job is added for spmStart,
in task.py, class Job:

    def __init__(self, name, cmd, *argslist, **argsdict):
        self.name = name
        self.cmd = cmd  # <-- references pool.startSpm

For some reason the Job is not deleted after it has been run; because the Job is not deleted, the pool is not deleted, so the pool's master domain remains undeleted. I provide a workaround and am still investigating why the Job is not deleted:
http://gerrit.ovirt.org/#change,3242
Comment 5 Royce Lv 2012-04-10 03:44:35 EDT
Actually there are two bugs behind this:
1. The storage pool holds a circular reference to itself.
2. The Task is not released after the task is cleaned.

http://gerrit.ovirt.org/#change,3242 corrects the first one.

The task is not released, so task.jobs is not released; each Job references pool.startSpm, so the pool is not released. It is a chain reaction.
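
A standalone sketch (not vdsm code; Pool, Job and the list name below are illustrative) of that chain: a Job that is never released stores a bound method as cmd, and a bound method holds a strong reference to its instance, so the pool and everything it references stay alive.

import weakref

class Pool(object):
    def startSpm(self):
        pass

class Job(object):
    def __init__(self, name, cmd):
        self.name = name
        self.cmd = cmd        # bound method -> strong reference to the pool

retained_jobs = []            # stands in for the task/job list that is never cleaned

pool = Pool()
pool_ref = weakref.ref(pool)
retained_jobs.append(Job("spmStart", pool.startSpm))

del pool
print(pool_ref() is not None)  # True: the retained Job keeps the pool alive
retained_jobs.pop()            # releasing the Job finally releases the pool
print(pool_ref())              # None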
Comment 6 Zhou Zheng Sheng 2012-09-20 02:02:35 EDT
I have the same problem. I do not create a storage pool; I just create and format a storage domain, then create a storage domain with the same UUID, and it complains "Storage domain already exists".


vdsm version:
vdsm-4.10.0


How reproducible:
Always


Steps to Reproduce:
# Firstly get a place to save local storage data
mkdir /teststorage
chown -R vdsm:kvm /teststorage

# Then create a domain and destroy it
vdsClient -s localhost connectStorageServer 4 '' 'connection=/teststorage,id=1'
vdsClient -s localhost createStorageDomain 4 1de3f4e2-a9f9-11e1-9956-00247edb4743 'test storage' /teststorage 1 0
vdsClient -s localhost formatStorageDomain 1de3f4e2-a9f9-11e1-9956-00247edb4743
# Now you will see that the data in /teststorage is deleted after we format the storage domain

# Create a storage domain using the same uuid
vdsClient -s localhost createStorageDomain 4 1de3f4e2-a9f9-11e1-9956-00247edb4743 'test storage' /teststorage 1 0
# It reports an error saying "Storage domain already exists: ('1de3f4e2-a9f9-11e1-9956-00247edb4743',)"
# But if we list the storage domains, nothing is there
vdsClient -s localhost getStorageDomainsList


Actual results:
 raise se.StorageDomainAlreadyExists


Expected results:
 storage domain successfully created


My investigation:
When we call formatStorageDomain() in vdsm/storage/hsm.py, it calls _recycle(sd) at the end. _recycle() calls sdCache.manuallyRemoveDomain() in vdsm/storage/sdc.py, and StorageDomainCache.manuallyRemoveDomain deletes the domain from self.__domainCache.

However, when we want to create a storage domain reusing a previous UUID, vdsm first checks whether this UUID is already in the StorageDomainCache by calling StorageDomainCache.produce(). This method calls _getDomainFromCache, which gets the storage domain from self.__proxyCache. The previous format action only deleted the domain from self.__domainCache, not from self.__proxyCache, so the UUID is still in self.__proxyCache. vdsm then thinks the domain already exists and reports an error.

So the root cause is the inconsistency between __domainCache and __proxyCache.
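
A standalone sketch (not the real sdc.py; the class below is deliberately simplified) of that inconsistency: removal only touches one of the two dicts, so a later produce() still resolves the UUID and the caller concludes the domain already exists.

class TwoDictCache(object):
    def __init__(self):
        self.__domainCache = {}
        self.__proxyCache = {}

    def add(self, sdUUID, domain):
        self.__domainCache[sdUUID] = domain
        self.__proxyCache[sdUUID] = "proxy-for-" + sdUUID

    def manuallyRemoveDomain(self, sdUUID):
        # bug analogue: only one of the two caches is cleaned
        del self.__domainCache[sdUUID]

    def produce(self, sdUUID):
        # lookup analogue: goes through the proxy cache
        return self.__proxyCache.get(sdUUID)

cache = TwoDictCache()
cache.add("1de3f4e2-a9f9-11e1-9956-00247edb4743", object())
cache.manuallyRemoveDomain("1de3f4e2-a9f9-11e1-9956-00247edb4743")
# The UUID is gone from __domainCache but produce() still resolves it, which
# is what makes createStorageDomain raise StorageDomainAlreadyExists.
print(cache.produce("1de3f4e2-a9f9-11e1-9956-00247edb4743"))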

I'm doing some storage functional tests. When I try to run the test twice, the second run fails because of this bug.
Comment 7 Ayal Baron 2012-09-23 03:42:42 EDT
Fede, is this issue taken care of with one of your latest patches?
Comment 8 Zhou Zheng Sheng 2012-09-23 23:47:15 EDT
Thanks, Ayal Baron. I tested Federico's patch from http://gerrit.ovirt.org/#/c/7511/4; it does not fix this bug. Then I investigated a little further.

The cause is in HSM._recycle(self, dom): the code first tries to delete the storage domain from the cache by calling "sdCache.manuallyRemoveDomain(dom.sdUUID)". This is fine. Then the code calls "dom.format(dom.sdUUID)". As we know, "dom" is a DomainProxy object defined in "vdsm/storage/sdc.py". This proxy forwards method calls to the real domain, but before forwarding a call it must invoke StorageDomainCache._realProduce() to get a reference to the real domain object, and _realProduce() adds the domain it finds back to the cache.

So we remove the domain from the cache, but after formatting it, it is added to the cache again.
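
A standalone sketch (not the real DomainProxy/StorageDomainCache code; the names below are simplified) of that re-population: the proxy forwards every call through the cache, which re-inserts the domain, so removing first and formatting second leaves a stale entry behind.

class RealDomain(object):
    def __init__(self, sdUUID):
        self.sdUUID = sdUUID

    def format(self, sdUUID):
        pass

class Cache(object):
    def __init__(self):
        self.domains = {}

    def _realProduce(self, sdUUID):
        # looking up a domain re-adds it to the cache if it is missing
        return self.domains.setdefault(sdUUID, RealDomain(sdUUID))

    def manuallyRemoveDomain(self, sdUUID):
        del self.domains[sdUUID]

class DomainProxy(object):
    def __init__(self, cache, sdUUID):
        self._cache = cache
        self._sdUUID = sdUUID

    def __getattr__(self, name):
        # every forwarded call goes through the cache, repopulating it
        return getattr(self._cache._realProduce(self._sdUUID), name)

cache = Cache()
dom = DomainProxy(cache, "1de3f4e2-a9f9-11e1-9956-00247edb4743")
cache._realProduce(dom._sdUUID)           # domain is cached
cache.manuallyRemoveDomain(dom._sdUUID)   # removed from the cache ...
dom.format(dom._sdUUID)                   # ... and re-added by the proxy call
print(dom._sdUUID in cache.domains)       # True: the stale entry is back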

Then I just tried to reverse the order of these actions, as follows:
def _recycle(self, dom):
    dom.format(dom.sdUUID)

    try:
        sdCache.manuallyRemoveDomain(dom.sdUUID)
    except KeyError:
        self.log.warn("Storage domain %s doesn't exist in cache. Trying recycle leftovers ...", dom.sdUUID)


Then the bug is fixed. However, I am not sure whether reversing the order of these actions will cause other problems.
Comment 9 Zhou Zheng Sheng 2012-09-24 00:55:11 EDT
Trying to fix this. Thanks in advance to anyone who can give some advice.

http://gerrit.ovirt.org/#/c/8144/
Comment 10 Zhou Zheng Sheng 2012-11-21 03:19:37 EST
The patch has been accepted and the bug is fixed.
Comment 11 Itamar Heim 2012-11-21 16:08:29 EST
If the patch is merged, the bug should be in MODIFIED. Changing bug status.
Comment 12 Itamar Heim 2013-01-16 11:08:29 EST
3.2 beta built, moving to ON_QA status to allow testing
