Bug 997162

Summary: Failure in run getStorageDomainInfo command with corrupted metadata on NFS Storage Domain
Product: Red Hat Enterprise Virtualization Manager
Reporter: vvyazmin <vvyazmin>
Component: vdsm
Assignee: Nobody's working on this, feel free to take it <nobody>
Status: CLOSED NOTABUG
QA Contact: vvyazmin <vvyazmin>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: abaron, acanan, amureini, bazulay, hateya, iheim, lpeer, scohen, yeylon
Target Milestone: ---
Keywords: Triaged
Target Release: 3.4.0
Flags: scohen: needinfo+
Hardware: x86_64
OS: All
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-08-21 13:23:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: Logs rhevm, vdsm, libvirt, thread dump, superVdsm
Flags: none

Description vvyazmin@redhat.com 2013-08-14 19:41:22 UTC
Created attachment 786684: Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
The getStorageDomainInfo command fails when the metadata of an NFS Storage Domain is corrupted.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS9.1 environment:

RHEVM:  rhevm-3.3.0-0.14.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.8-1.el6ev.noarch
VDSM:  vdsm-4.12.0-52.gitce029ba.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK:  sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an NFS Data Center with 3 Storage Domains (SD).
2. Corrupt the metadata: remove the POOL_DESCRIPTION field from the master domain's metadata.
3. Run the getStorageDomainInfo command.

Actual results:
Failed to get SD information from the corrupted metadata.
Incorrect information is returned from the metadata.

Expected results:
Some information is still returned from the metadata.
The information returned from the metadata is correct.

Impact on user:
The user cannot get SD information when the metadata is corrupted.

Workaround:
none

Additional info:
[root@tigris02 ~]# vdsClient -s 0 getConnectedStoragePoolsList
1cdf279f-b9dc-44a5-b11b-4778c592dae7

[root@tigris02 ~]# vdsClient -s 0 getStoragePoolInfo 1cdf279f-b9dc-44a5-b11b-4778c592dae7
	name = DC_NFSVer30
	isoprefix = 
	pool_status = connected
	lver = 1
	domains = 232c87e5-41e0-4e04-a56c-bea730282fdc:Attached,0275b79b-515d-45a4-9ba5-3a3f93f34e86:Active,5803df7f-1c80-4a04-ac77-25fa5e9b6e60:Attached
	master_uuid = 0275b79b-515d-45a4-9ba5-3a3f93f34e86
	version = 3
	spm_id = 2
	type = NFS
	master_ver = 2
	232c87e5-41e0-4e04-a56c-bea730282fdc = {'status': 'Attached', 'alerts': []}
	0275b79b-515d-45a4-9ba5-3a3f93f34e86 = {'status': 'Active', 'diskfree': '7302823280640', 'alerts': [], 'version': 3, 'disktotal': '7317305163776'}
	5803df7f-1c80-4a04-ac77-25fa5e9b6e60 = {'status': 'Attached', 'alerts': []}

[root@tigris02 ~]# vdsClient -s 0 getStorageDomainsList
e5ff6fb2-bbda-4675-b3b7-8f464561aeed
f29f1f15-944f-4f49-a59d-41e41ff57e8a
561ce535-9830-49a3-975b-ac5fa2915cce
bc87fdd4-cc13-4b54-82af-14d67ec1dea4
232c87e5-41e0-4e04-a56c-bea730282fdc
ace596c0-af32-4174-ad3a-fd0c56e10369
ffa1ead8-ba90-4824-8a24-5f5a854ff2aa
0275b79b-515d-45a4-9ba5-3a3f93f34e86

[root@tigris02 ~]# vdsClient -s 0 getStorageDomainInfo 0275b79b-515d-45a4-9ba5-3a3f93f34e86
	uuid = 0275b79b-515d-45a4-9ba5-3a3f93f34e86
	pool = ['1cdf279f-b9dc-44a5-b11b-4778c592dae7']
	lver = 1
	version = 3
	role = Master
	remotePath = wolf.qa.lab.tlv.redhat.com:/volumes/wolf/kipi-09
	spm_id = 2
	type = NFS
	class = Data
	master_ver = 2
	name = none_NFS_004

[root@tigris02 ~]# vdsClient -s 0 getStorageDomainInfo 232c87e5-41e0-4e04-a56c-bea730282fdc
Meta Data seal is broken (checksum mismatch): 'cksum = dda28782ca7d12775d2e6aba6f650bff39c39178, computed_cksum = 740fb0a047962732b397ba4e88b8dbe66fe27f2e'
[root@tigris02 ~]# vdsClient -s 0 getStorageDomainInfo 5803df7f-1c80-4a04-ac77-25fa5e9b6e60
Storage domain does not exist: ('5803df7f-1c80-4a04-ac77-25fa5e9b6e60',)


/var/log/ovirt-engine/engine.log

/var/log/vdsm/vdsm.log
------------------------------------------------------------------------------
Thread-15636::ERROR::2013-08-14 20:03:14,903::task::850::TaskManager.Task::(_setError) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2725, in getStorageDomainInfo
    dom = self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 264, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 317, in validate
    if not len(self.getMetadata()):
  File "/usr/share/vdsm/storage/sd.py", line 701, in getMetadata
    return self._metadata.copy()
  File "/usr/share/vdsm/storage/persistentDict.py", line 129, in copy
    md = self._dict.copy()
  File "/usr/share/vdsm/storage/persistentDict.py", line 318, in copy
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 152, in _accessWrapper
    self.refresh()
  File "/usr/share/vdsm/storage/persistentDict.py", line 276, in refresh
    computedChecksum)
MetaDataSealIsBroken: Meta Data seal is broken (checksum mismatch): 'cksum = dda28782ca7d12775d2e6aba6f650bff39c39178, computed_cksum = 740fb0a047962732b397ba4e88b8dbe66fe27f2e'
Thread-15636::DEBUG::2013-08-14 20:03:14,905::task::869::TaskManager.Task::(_run) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::Task._run: 75ffc6f7-4392-4ad4-a465-eb6cfd01c824 ('232c87e5-41e0-4e04-a56c-bea730282fdc',) {} failed - stopping task
Thread-15636::DEBUG::2013-08-14 20:03:14,906::task::1194::TaskManager.Task::(stop) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::stopping in state preparing (force False)
Thread-15636::DEBUG::2013-08-14 20:03:14,906::task::974::TaskManager.Task::(_decref) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::ref 1 aborting True
Thread-15636::INFO::2013-08-14 20:03:14,906::task::1151::TaskManager.Task::(prepare) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::aborting: Task is aborted: 'Meta Data seal is broken (checksum mismatch)' - code 752
Thread-15636::DEBUG::2013-08-14 20:03:14,906::task::1156::TaskManager.Task::(prepare) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::Prepare: aborted: Meta Data seal is broken (checksum mismatch)
Thread-15636::DEBUG::2013-08-14 20:03:14,906::task::974::TaskManager.Task::(_decref) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::ref 0 aborting True
Thread-15636::DEBUG::2013-08-14 20:03:14,907::task::909::TaskManager.Task::(_doAbort) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::Task._doAbort: force False
Thread-15636::DEBUG::2013-08-14 20:03:14,907::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-15636::DEBUG::2013-08-14 20:03:14,907::task::579::TaskManager.Task::(_updateState) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::moving from state preparing -> state aborting
Thread-15636::DEBUG::2013-08-14 20:03:14,907::task::534::TaskManager.Task::(__state_aborting) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::_aborting: recover policy none
Thread-15636::DEBUG::2013-08-14 20:03:14,908::task::579::TaskManager.Task::(_updateState) Task=`75ffc6f7-4392-4ad4-a465-eb6cfd01c824`::moving from state aborting -> state failed
Thread-15636::DEBUG::2013-08-14 20:03:14,908::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-15636::DEBUG::2013-08-14 20:03:14,908::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-15636::ERROR::2013-08-14 20:03:14,908::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Meta Data seal is broken (checksum mismatch): 'cksum = dda28782ca7d12775d2e6aba6f650bff39c39178, computed_cksum = 740fb0a047962732b397ba4e88b8dbe66fe27f2e'", 'code': 752}}

------------------------------------------------------------------------------
Thread-15642::ERROR::2013-08-14 20:03:25,801::sdc::143::Storage.StorageDomainCache::(_findDomain) domain 5803df7f-1c80-4a04-ac77-25fa5e9b6e60 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('5803df7f-1c80-4a04-ac77-25fa5e9b6e60',)
Thread-15642::ERROR::2013-08-14 20:03:25,802::task::850::TaskManager.Task::(_setError) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2725, in getStorageDomainInfo
    dom = self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 263, in validateSdUUID
    sdDom = sdCache.produce(sdUUID=sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('5803df7f-1c80-4a04-ac77-25fa5e9b6e60',)
Thread-15642::DEBUG::2013-08-14 20:03:25,802::task::869::TaskManager.Task::(_run) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::Task._run: 4ce1e2c2-0084-49c0-bc24-774d2bba166d ('5803df7f-1c80-4a04-ac77-25fa5e9b6e60',) {} failed - stopping task
Thread-15642::DEBUG::2013-08-14 20:03:25,803::task::1194::TaskManager.Task::(stop) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::stopping in state preparing (force False)
Thread-15642::DEBUG::2013-08-14 20:03:25,803::task::974::TaskManager.Task::(_decref) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::ref 1 aborting True
Thread-15642::INFO::2013-08-14 20:03:25,803::task::1151::TaskManager.Task::(prepare) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::aborting: Task is aborted: 'Storage domain does not exist' - code 358
Thread-15642::DEBUG::2013-08-14 20:03:25,804::task::1156::TaskManager.Task::(prepare) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::Prepare: aborted: Storage domain does not exist
Thread-15642::DEBUG::2013-08-14 20:03:25,804::task::974::TaskManager.Task::(_decref) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::ref 0 aborting True
Thread-15642::DEBUG::2013-08-14 20:03:25,804::task::909::TaskManager.Task::(_doAbort) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::Task._doAbort: force False
Thread-15642::DEBUG::2013-08-14 20:03:25,804::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-15642::DEBUG::2013-08-14 20:03:25,805::task::579::TaskManager.Task::(_updateState) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::moving from state preparing -> state aborting
Thread-15642::DEBUG::2013-08-14 20:03:25,805::task::534::TaskManager.Task::(__state_aborting) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::_aborting: recover policy none
Thread-15642::DEBUG::2013-08-14 20:03:25,805::task::579::TaskManager.Task::(_updateState) Task=`4ce1e2c2-0084-49c0-bc24-774d2bba166d`::moving from state aborting -> state failed
Thread-15642::DEBUG::2013-08-14 20:03:25,805::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-15642::DEBUG::2013-08-14 20:03:25,806::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-15642::ERROR::2013-08-14 20:03:25,806::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Storage domain does not exist: ('5803df7f-1c80-4a04-ac77-25fa5e9b6e60',)", 'code': 358}}
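
The two failures above end with different status codes in the dispatcher line (752 for the broken seal, 358 for the missing domain). A minimal sketch of how such a line could be parsed to tell them apart is shown here. The helper names (parse_status, classify, KNOWN_CODES) are hypothetical and not part of vdsm; the sketch only assumes that the text after "(run) " is a Python dict literal, as it is in the excerpts above.

```python
import ast

# Codes and meanings exactly as they appear in the log excerpts of this report.
KNOWN_CODES = {
    752: "metadata seal broken (checksum mismatch)",
    358: "storage domain does not exist",
}


def parse_status(log_line):
    """Return the 'status' dict that follows '(run) ' in a dispatcher log line."""
    marker = "(run) "
    payload = log_line[log_line.index(marker) + len(marker):]
    # literal_eval only accepts Python literals, so no code from the log is executed.
    return ast.literal_eval(payload)["status"]


def classify(log_line):
    """Return (code, short meaning) for a dispatcher error line."""
    status = parse_status(log_line)
    return status["code"], KNOWN_CODES.get(status["code"], "unknown")


line = (
    "Thread-15642::ERROR::2013-08-14 20:03:25,806::dispatcher::67::"
    "Storage.Dispatcher.Protect::(run) {'status': {'message': "
    "\"Storage domain does not exist: ('5803df7f-1c80-4a04-ac77-25fa5e9b6e60',)\", "
    "'code': 358}}"
)
code, meaning = classify(line)
print(code, meaning)
```

Any code not listed in KNOWN_CODES is reported as "unknown" rather than guessed.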

Comment 4 Allon Mureinik 2013-08-21 13:23:09 UTC
The metadata is corrupted:
20:03:14,908::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Meta Data seal is broken (checksum mismatch): 'cksum = dda28782ca7d12775d2e6aba6f650bff39c39178, computed_cksum = 740fb0a047962732b397ba4e88b8dbe66fe27f2e'", 'code': 752}}

This is the designed behavior.
The correct approach is to reconstruct the master, get a valid revision of the MD, and continue working with it (e.g., query the info).
If THAT flow does not work, please file a bug on that.
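
To illustrate why a broken seal is reported as an error rather than returning partial information, here is a minimal sketch of a checksum-sealed metadata dictionary. It is not the vdsm implementation: the line format, the helper names (compute_seal, verify, SealBroken) and the sample fields other than POOL_DESCRIPTION are assumptions made for the example. SHA-1 is used only because the 40-character hex checksums in the log are the length of a SHA-1 digest.

```python
import hashlib


class SealBroken(Exception):
    """Hypothetical stand-in for the MetaDataSealIsBroken error seen in the log."""


def compute_seal(fields):
    """Return a SHA-1 hex digest over sorted KEY=VALUE lines (illustrative format)."""
    digest = hashlib.sha1()
    for key in sorted(fields):
        digest.update(("%s=%s\n" % (key, fields[key])).encode("utf-8"))
    return digest.hexdigest()


def verify(fields, stored_cksum):
    """Raise SealBroken if the stored checksum does not match the current fields."""
    computed = compute_seal(fields)
    if computed != stored_cksum:
        raise SealBroken(
            "cksum = %s, computed_cksum = %s" % (stored_cksum, computed)
        )
    return True


# Sample metadata; only POOL_DESCRIPTION is taken from the report.
metadata = {"POOL_DESCRIPTION": "DC_NFSVer30", "ROLE": "Master", "VERSION": "3"}
stored = compute_seal(metadata)
verify(metadata, stored)  # intact metadata passes

# Removing a field, as in the reproduction steps, changes the computed digest.
corrupted = dict(metadata)
del corrupted["POOL_DESCRIPTION"]
try:
    verify(corrupted, stored)
    seal_ok = True
except SealBroken:
    seal_ok = False
```

Under this scheme a reader cannot tell which field was altered, only that the content no longer matches the seal, which is consistent with refusing the query and recovering a valid revision of the metadata instead.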