Description of problem: storage domain uuid can't be reused after storage pool destroyed Version-Release number of selected component (if applicable): vdsm-4.9.4-0.18 How reproducible: often Steps to Reproduce: (1)create storagedomain sd1 with uuid '1be32ac7-1e12-4823-8e8c-8c887333fe37' (2)create storage pool with it's master domain sd1 (3)connect storage pool (4)spmstart (3)destroy storage pool (4)format storage domain (5)create storage domain with uuid '1be32ac7-1e12-4823-8e8c-8c887333fe37' Actual results: raise se.StorageDomainAlreadyExists Expected results: storage domain successfully created Additional info:log: CP Server Thread-23::ERROR::2012-03-26 15:45:38,961::task::855::TaskManager.Task::(_setError) Task=`b56f239f-72c6-4d83-b80f-7b00c8946221`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 863, in _run return fn(*args, **kargs) File "/usr/share/vdsm/logUtils.py", line 38, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/storage/hsm.py", line 1912, in createStorageDomain self.validateNonDomain(sdUUID) File "/usr/share/vdsm/storage/hsm.py", line 167, in validateNonDomain raise se.StorageDomainAlreadyExists(sdUUID) StorageDomainAlreadyExists: Storage domain already exists: ('1be32ac7-1e12-4823-8e8c-8c887333fe37',)
could you attach a complete vdsm.log (including the part showing the removal of the storage domain (4)). I'd appreciate if you also try to reproduce with something closer to master/HEAD.
Created attachment 572709 [details] vdsm.log
storage pool __init__ func introduce a circular reference of itself: def __init__(self, spUUID, taskManager): self._domainsToUpgrade = [] self.lock = threading.RLock() self._setUnsafe() self.spUUID = str(spUUID) self.poolPath = os.path.join(self.storage_repository, self.spUUID) self.id = None self.scsiKey = None self.taskMng = taskManager self._poolFile = os.path.join(self._poolsTmpDir, self.spUUID) self.hsmMailer = None self.spmMailer = None self.masterDomain = None self.spmRole = SPM_FREE self.domainMonitor = DomainMonitor(self.monitorInterval) self._upgradeCallback = partial(StoragePool._upgradePoolDomain, self)-->circular reference here So that Storage pool object can't been deleted after destroy,so it's master domain remains referenced,and can't be deleted,either.
and when addjob spmstart, In task.py,class Job: def __init__(self, name, cmd, *argslist, **argsdict): self.name = name self.cmd = cmd-->reference pool.startSpm For some reason Job is not deleted after it has been run,Job not deleted so pool not deleted, so pool s master domain remains undeleted. provide a workaround and still investigating why the Job is not deleted: http://gerrit.ovirt.org/#change,3242
Actually there are two bugs corresponding to this: 1.storage pool circular reference to itself 2.Task is not released after task clean http://gerrit.ovirt.org/#change,3242 correct the first one. task not released so task.jobs not released.job reference pool.startSpm so pool not released, it's a chain action.
I have the same problem. I do not create a storage pool, just create and format the storage domain, then create a storage domain with the same UUID, it will complain about "Storage domain already exists". vdsm version: vdsm-4.10.0 How reproducible: Always Steps to Reproduce: # Firstly get a place to save local storage data mkdir /teststorage chown -R vdsm:kvm /teststorage # Then create a domain and destroy it dsClient -s localhost connectStorageServer 4 '' 'connection=/teststorage,id=1' dsClient -s localhost createStorageDomain 4 1de3f4e2-a9f9-11e1-9956-00247edb4743 'test storage' /teststorage 1 0 vdsClient -s localhost formatStorageDomain 1de3f4e2-a9f9-11e1-9956-00247edb4743 # Now you will see the data in /teststorage are deleted after we formatting the storage domain # Create a storage domain using the same uuid vdsClient -s localhost createStorageDomain 4 1de3f4e2-a9f9-11e1-9956-00247edb4743 'test storage' /teststorage 1 0 # it will report an error saying "Storage domain already exists: ('1de3f4e2-a9f9-11e1-9956-00247edb4743',)" # But if we get domains list, there is nothing in it vdsClient -s localhost getStorageDomainsList Actual results: raise se.StorageDomainAlreadyExists Expected results: storage domain successfully created My investigation: When we call formatStorageDomain() in vdsm/storage/hsm.py, it calls _recycle(sd) at last. In _resycle(), it will call sdCache.manuallyRemoveDomain() in vdsm/storage/sdc.py. , then in StorageDomainCache.manuallyRemoveDomain, it del the domain from self.__domainCache. However, when we want to create a storage domain reusing a previous UUID, vdsm firstly checks if this UUID is already in StorageDomainCache by calling StorageDomainCache.produce(). In this method, it calls _getDomainFromCache, which gets storage domain from self.__proxyCache. In previous format action, it just del domain from self.__domainCache but not from self.__proxyCache, so the UUID is still in self.__proxyCache. Then vdsm thinks the domain already exists and report an error. So the root cause is inconsistency of __domainCache and __proxyCache. I'm doing some storage functional tests. When I try to run the test twice, the second time will fail because of this bug.
Fede, is this issue taken care of with one of your latest patches?
Thanks Ayal Baron. I test Federico's patch from http://gerrit.ovirt.org/#/c/7511/4, it does not fix this bug. Then I investigate a little further. The cause is in HSM._recycle(self, dom), the code firstly try to delete the storage domain from cache by calling "sdCache.manuallyRemoveDomain(dom.sdUUID)". This is fine. Then then the code calls "dom.format(dom.sdUUID)". As we know, "dom" is a DomainProxy objects defined in "vdsm/storage/sdc.py". This proxy will forward method calls to real domain, but before forwarding the calls, it must invoke StorageDomainCache._realProduce() firstly to get the reference to the real domain object, and __realProduce() will add the domain it finds to the cache. So we remove the domain from cache but after formatting it, it is added to the cache again. Then I just tried to reverse the order of these actions as follow. def _recycle(self, dom): dom.format(dom.sdUUID) try: sdCache.manuallyRemoveDomain(dom.sdUUID) except KeyError: self.log.warn("Storage domain %s doesn't exist in cache. Trying recycle leftovers ...", dom.sdUUID) Then the big is fixed. However I am not sure if there will cause other problems if we reverse these action orders.
Try to fix this. Thanks in advance for anyone who can give some advices. http://gerrit.ovirt.org/#/c/8144/
The patch is accepted, the bug is fixed.
if patch is merged, bug should be in MODIFIED. changing bug status
3.2 beta built, moving to ON_QA status to allow testing