Bug 825732 - 3.1 - [ovirt] [storage] unable to recover pool in case meta-data is corrupted
3.1 - [ovirt] [storage] unable to recover pool in case meta-data is corrupted
Status: CLOSED CURRENTRELEASE
Product: oVirt
Classification: Community
Component: vdsm (Show other bugs)
unspecified
x86_64 Linux
unspecified Severity high
: ---
: 3.2
Assigned To: Eduardo Warszawski
storage
: TestBlocker
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-28 06:02 EDT by Haim
Modified: 2016-02-10 11:34 EST (History)
10 users (show)

See Also:
Fixed In Version: v4.9.6-26.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-15 01:46:47 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
vdsm log (512.42 KB, application/x-tar)
2012-05-28 06:05 EDT, Haim
no flags Details

  None (edit)
Description Haim 2012-05-28 06:02:21 EDT
Description of problem:

story: 

- 1 host 
- 2 storage domains, data, 1 is master
- master storage domain meta-data is corrupted
- engine tries to reconstruct master domain to second domain 
  * engine send spmStop command to vdsm, vdsm catches exception and returns None
  * engine send getSpmStatus command to vdsm, vdsm fails with meta-data seal is broken 
  * engine send disconnectStoragePool to vdsm, vdsm fails again with same error as above

this flow continues on a loop for ever. 


Thread-202130::INFO::2012-05-28 12:01:29,487::logUtils::37::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='c504b880-23a2-4503-aa11-6cfdf50c98f1', options=None)
Thread-202130::DEBUG::2012-05-28 12:01:29,488::lvm::464::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-202130::DEBUG::2012-05-28 12:01:29,488::lvm::466::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-202130::DEBUG::2012-05-28 12:01:29,488::lvm::475::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-202130::DEBUG::2012-05-28 12:01:29,489::lvm::487::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-202130::DEBUG::2012-05-28 12:01:29,489::lvm::349::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-202130::DEBUG::2012-05-28 12:01:29,490::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%1hateya-upstream-lun101|360014059a3d79162a7c45c8ac96537ba%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 795a9f86-d1ef-46e1-b338-288cc725d59f' (cwd None)
Thread-202130::DEBUG::2012-05-28 12:01:29,759::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-202130::DEBUG::2012-05-28 12:01:29,762::lvm::376::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-202130::DEBUG::2012-05-28 12:01:29,762::persistentDict::226::Storage.PersistentDict::(refresh) read lines (VGTagMetadataRW)=[u'PV0=pv:mpathaj,uuid:5rDU3L-gJ0X-bIl3-RNzZ-A3oW-dLeY-nP2qNQ,pestart:0,pecount:397,mapoffset:0', 'DESCRIPTION=lion-pool', 'TYPE=ISCSI', 'VGUUID=J71m7n-32EB-2dXo-CEWu-2FHV-3e21-V8vKKG', 'LOGBLKSIZE=512', 'LEASERETRIES=3', 'SDUUID=795a9f86-d1ef-46e1-b338-288cc725d59f', 'LOCKRENEWALINTERVALSEC=5', 'PHYBLKSIZE=512', 'VERSION=2', 'CLASS=Data', 'LOCKPOLICY=', 'LEASETIMESEC=60', 'IOOPTIMEOUTSEC=10', 'POOL_DESCRIPTION=new_pool', 'ROLE=Master', 'POOL_UUID=c504b880-23a2-4503-aa11-6cfdf50c98f1', 'MASTER_VERSION=1', u'POOL_DOMAINS=795a9f86-d1ef-46e1-b338-288cc725d59f:Active,86473814-49c4-4259-b5f2-37f1d72f5e90:Active', 'POOL_SPM_ID=1', 'POOL_SPM_LVER=3', '_SHA_CKSUM=e3a9cafa572573a6122352c5as']
Thread-202130::WARNING::2012-05-28 12:01:29,763::persistentDict::264::Storage.PersistentDict::(refresh) data seal is broken metadata declares `e3a9cafa572573a6122352c5as` should be `e3a9cafa572573a6122352c563947a6659f5bd46` (lines={u'PV0': u'pv:mpathaj,uuid:5rDU3L-gJ0X-bIl3-RNzZ-A3oW-dLeY-nP2qNQ,pestart:0,pecount:397,mapoffset:0', 'VERSION': '2', 'LEASETIMESEC': '60', 'DESCRIPTION': 'lion-pool', 'MASTER_VERSION': '1', 'LOCKRENEWALINTERVALSEC': '5', 'VGUUID': 'J71m7n-32EB-2dXo-CEWu-2FHV-3e21-V8vKKG', 'LEASERETRIES': '3', 'POOL_DESCRIPTION': 'new_pool', 'POOL_SPM_LVER': '3', 'SDUUID': '795a9f86-d1ef-46e1-b338-288cc725d59f', 'PHYBLKSIZE': '512', 'ROLE': 'Master', 'LOGBLKSIZE': '512', 'LOCKPOLICY': '', u'POOL_DOMAINS': u'795a9f86-d1ef-46e1-b338-288cc725d59f:Active,86473814-49c4-4259-b5f2-37f1d72f5e90:Active', 'POOL_UUID': 'c504b880-23a2-4503-aa11-6cfdf50c98f1', 'TYPE': 'ISCSI', 'CLASS': 'Data', 'IOOPTIMEOUTSEC': '10', 'POOL_SPM_ID': '1'})
Thread-202130::ERROR::2012-05-28 12:01:29,764::hsm::534::Storage.HSM::(getSpmStatus) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 529, in getSpmStatus
    status = {'spmStatus':pool.getSpmRole(), 'spmLver': pool.getSpmLver(), 'spmId':pool.getSpmId()}
  File "/usr/share/vdsm/storage/sp.py", line 133, in getSpmLver
    return self.getMetaParam(PMDK_LVER)
  File "/usr/share/vdsm/storage/sp.py", line 1434, in getMetaParam
    return self._metadata[key]
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 193, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 147, in _accessWrapper
    self.refresh()
  File "/usr/share/vdsm/storage/persistentDict.py", line 265, in refresh
    raise se.MetaDataSealIsBroken(declaredChecksum, computedChecksum)
MetaDataSealIsBroken: Meta Data seal is broken (checksum mismatch): 'cksum = e3a9cafa572573a6122352c5as, computed_cksum = e3a9cafa572573a6122352c563947a6659f5bd46'
Thread-202130::ERROR::2012-05-28 12:01:29,764::task::853::TaskManager.Task::(_setError) Task=`3dbd6b05-d21a-457d-ac9e-deb8d78ae94d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 529, in getSpmStatus
    status = {'spmStatus':pool.getSpmRole(), 'spmLver': pool.getSpmLver(), 'spmId':pool.getSpmId()}
  File "/usr/share/vdsm/storage/sp.py", line 133, in getSpmLver
    return self.getMetaParam(PMDK_LVER)
  File "/usr/share/vdsm/storage/sp.py", line 1434, in getMetaParam
    return self._metadata[key]
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 193, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 147, in _accessWrapper
    self.refresh()
  File "/usr/share/vdsm/storage/persistentDict.py", line 265, in refresh
    raise se.MetaDataSealIsBroken(declaredChecksum, computedChecksum)
MetaDataSealIsBroken: Meta Data seal is broken (checksum mismatch): 'cksum = e3a9cafa572573a6122352c5as, computed_cksum = e3a9cafa572573a6122352c563947a6659f5bd46'
Comment 1 Haim 2012-05-28 06:05:44 EDT
Created attachment 587199 [details]
vdsm log
Comment 2 Haim 2012-05-28 06:16:37 EDT
BTW - if i restart vdsm service on host, everything goes back to normal and reconstruct master succeeds
Comment 3 Eduardo Warszawski 2012-07-22 13:56:53 EDT
As said on BZ#835824:
https://bugzilla.redhat.com/show_bug.cgi?id=825732

The getDomains() call during the disconnectStoragePool() was removed in
Change-Id: I7c33ecdc2cd21c5afb445a339f400bc990b93998
http://gerrit.ovirt.org/#/c/3929/

Since the MSD domain MD was corrupted the disconnect failed. disconnectStoragePool should success with the actual code.

getSpmStatus should induce reconstruct if the MSD MD is broken:
http://gerrit.ovirt.org/#/c/6083/3

Note You need to log in before you can comment on or make changes to this bug.