Bug 709035

Summary: [vdsm] Domain is not unmounted if attach fails due to metadata failure
Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.1
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Target Milestone: rc
Reporter: Jakub Libosvar <jlibosva>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: yeylon <yeylon>
CC: abaron, bazulay, fsimonce, iheim, smizrahi, srevivo, ykaul
Doc Type: Bug Fix
Last Closed: 2011-06-20 16:13:38 UTC

Attachments: Backend + vdsm logs

Description Jakub Libosvar 2011-05-30 12:57:55 UTC
Created attachment 501798
Backend + vdsm logs

Description of problem:
If one attempts to attach a storage domain whose metadata is corrupted (i.e. it is already in use or the checksum is wrong), the relevant error is shown but the domain is kept attached to the host. Tested on NFS with an export domain. As a result, vdsm keeps trying to refresh the domain:
Thread-2915::ERROR::2011-05-30 14:49:16,309::sp::107::Storage.StatsThread::(run) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 104, in run
    self._domain = SDF.produce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdf.py", line 32, in produce
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('866d6426-f13a-4cfb-ace5-8ca74a8d477a',)
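
For illustration, a minimal sketch of the kind of monitoring loop the StatsThread error above implies; the names and calls here are assumptions for illustration, not vdsm's actual code:

import logging
import time

log = logging.getLogger("Storage.StatsThread")

def stats_loop(produce, sd_uuid, interval=10):
    # Hypothetical sketch: while the domain stays attached, the stats
    # thread keeps trying to produce it; with broken metadata every pass
    # fails with StorageDomainDoesNotExist and is logged as
    # "Unexpected error", matching the traceback above.
    while True:
        try:
            produce(sd_uuid)
        except Exception:
            log.exception("Unexpected error")
        time.sleep(interval)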

Version-Release number of selected component (if applicable):
vdsm-4.9-70.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Have an export domain and corrupt its metadata (I erased the last byte of SDUUID; see the sketch after this list)
2. Attach this domain to some data center
3. Wait until the error occurs
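
A minimal sketch of the corruption in step 1, assuming the export domain keeps its metadata in a dom_md/metadata file of KEY=VALUE lines on the NFS export; the path is a placeholder, not an exact location:

# Erase the last byte of the SDUUID value so the checksum stored in the
# metadata no longer matches the file contents.
md_path = "/srv/nfs/export/<sdUUID>/dom_md/metadata"  # hypothetical path

with open(md_path) as f:
    lines = f.read().splitlines()

lines = [l[:-1] if l.startswith("SDUUID=") else l for l in lines]

with open(md_path, "w") as f:
    f.write("\n".join(lines) + "\n")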
  
Actual results:
Domain is not unmounted from the host

Expected results:
Domain is unmounted

Additional info:
vdsm+backend log attached

Comment 1 Federico Simoncelli 2011-06-17 10:24:29 UTC
This is not reproducible on vdsm-4.9-75.el6.x86_64.
A MetaDataSealIsBroken exception is raised and StorageDomainDoesNotExist is correctly returned to rhev-m:

Thread-73::WARNING::2011-06-17 06:16:42,736::persistentDict::242::Storage.PersistentDict::(refresh) data seal is broken metadata declares `2da1e24ba793d69596096cbd21066960c28303a` should be `2da1e24ba793d69596096cbd21066960c28303ae` (lines={'VERSION': '0', 'LEASETIMESEC': '5', 'DESCRIPTION': 'domain 2', 'LOCKPOLICY': '', 'LEASERETRIES': '3', 'SDUUID': '7218f329-b9c1-44e3-a960-964fc89a3aff', 'REMOTE_PATH': 'vm-rhdev1:/srv/nfs/ruthexp1', 'MASTER_VERSION': '0', 'IOOPTIMEOUTSEC': '1', 'ROLE': 'Regular', 'LOCKRENEWALINTERVALSEC': '5', 'POOL_UUID': 'f5a10a36-525e-403d-8169-2ec82c1b4a56', 'TYPE': 'NFS', 'CLASS': 'Data'})
Thread-73::ERROR::2011-06-17 06:16:42,736::sdc::105::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `7218f329-b9c1-44e3-a960-964fc89a3aff`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 101, in _findDomain
    return mod.findDomain(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 130, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 77, in __init__
    sdUUID = metadata[sd.DMDK_SDUUID]
  File "/usr/share/vdsm/storage/persistentDict.py", line 63, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 171, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 125, in _accessWrapper
    self.refresh()
  File "/usr/share/vdsm/storage/persistentDict.py", line 243, in refresh
    raise se.MetaDataSealIsBroken(declaredChecksum, computedChecksum)
MetaDataSealIsBroken: Meta Data seal is broken (checksum mismatch): 'cksum = 2da1e24ba793d69596096cbd21066960c28303a, computed_cksum = 2da1e24ba793d69596096cbd21066960c28303ae'
Thread-73::ERROR::2011-06-17 06:16:42,740::task::865::TaskManager.Task::(_setError) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/spm.py", line 115, in run
    return self.func(*args, **kwargs)
  File "/usr/share/vdsm/storage/spm.py", line 1128, in public_attachStorageDomain
    hsm.HSM.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 98, in validateSdUUID
    SDF.produce(sdUUID=sdUUID).validate()
  File "/usr/share/vdsm/storage/sdf.py", line 30, in produce
    newSD = cls.__sdc.lookup(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 83, in lookup
    dom = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 107, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('7218f329-b9c1-44e3-a960-964fc89a3aff',)
Thread-73::DEBUG::2011-06-17 06:16:42,741::task::492::TaskManager.Task::(_debug) Task 7eb64e10-fe84-4e25-be32-81844757ad79: Task._run: 7eb64e10-fe84-4e25-be32-81844757ad79 ('7218f329-b9c1-44e3-a960-964fc89a3aff', 'f5a10a36-525e-403d-8169-2ec82c1b4a56') {} failed - stopping task
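
For context, a minimal sketch of the seal check the log above shows: a checksum is computed over the metadata lines and compared with the one declared in the file. The function below assumes SHA-1 and is illustrative, not vdsm's actual implementation:

import hashlib

def verify_seal(lines, declared_cksum):
    # Join the metadata lines and compute their SHA-1 digest; a mismatch
    # with the checksum declared in the file means the seal is broken
    # (here, because a byte of SDUUID was erased).
    computed = hashlib.sha1("".join(lines).encode("utf-8")).hexdigest()
    if computed != declared_cksum:
        raise ValueError("Meta Data seal is broken (checksum mismatch): "
                         "'cksum = %s, computed_cksum = %s'"
                         % (declared_cksum, computed))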

There are no additional "Storage domain does not exist" looping messages of the kind that might have been caused by bug 705058, and vdsm is returning the correct error message to rhev-m:

{'status': {'message': "Storage domain does not exist: ('7218f329-b9c1-44e3-a960-964fc89a3aff',)", 'code': 358}}

Since vdsm reports the failure correctly, if we expect the storage domain to be unmounted I suggest moving this bug to the backend.
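
For illustration, a minimal sketch of how the backend could react to that status dict if the unmount is expected to happen there; the error-code constant and the cleanup callback are assumptions, not actual rhev-m code:

STORAGE_DOMAIN_DOES_NOT_EXIST = 358  # code seen in the response above

def handle_attach_status(response, unmount):
    # On a metadata failure the attach is rejected; the backend could
    # then trigger its own cleanup (e.g. unmount the NFS export).
    status = response["status"]
    if status["code"] == STORAGE_DOMAIN_DOES_NOT_EXIST:
        unmount()
        raise RuntimeError(status["message"])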

Comment 2 Dan Kenigsberg 2011-06-17 20:39:10 UTC
If this issue is to be solved in rhev-m, it should take bug 694408 into account, as the returned error code may change.

Comment 3 Saggi Mizrahi 2011-06-20 16:13:38 UTC
We shouldn't umount, by design; we never did umount. Unmounting could also cause issues if the domain in question were the master domain.

The metadata file has gone corrupt, but the lease file is still in use. And what about running VMs?