Bug 1528391 - Following update to 4.2, hosts stuck in non-operational state
Summary: Following update to 4.2, hosts stuck in non-operational state
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.20.9.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Denis Chaplygin
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-12-21 17:40 UTC by Jason Brooks
Modified: 2018-01-31 13:33 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-31 13:33:59 UTC
oVirt Team: Gluster
Embargoed:


Attachments
gluster log for data domain (209.43 KB, application/x-gzip)
2018-01-02 19:15 UTC, Jason Brooks

Description Jason Brooks 2017-12-21 17:40:45 UTC
Description of problem:

I upgraded my 4 host converged gluster/ovirt lab setup from 4.1 to 4.2
yesterday, and now 3 of my hosts won't connect to my main data domain,
so they're non-operational when I try to activate them.

The hosts can mount the gluster storage just fine, I can mount to a
test location on the hosts, and I can see that the hosts are mounting
the storage in the usual place when they attempt to activate.
Permissions look normal, too.

I've undeployed and redeployed the hosted engine on the three problem machines, in case that was causing an issue. I'm able to start the hosted engine from one of the problematic hosts, but when I access the engine, the hosts' state is still non-operational.

My three hosts are marked with "Host has no default route," but they do have a default route. I don't know if this is related or not.

Here's where the error happens in vdsm.log:

2017-12-20 22:10:20,156-0500 INFO  (jsonrpc/7) [vdsm.api] START connectStoragePool(spUUID=u'00000001-0001-0001-0001-00000000025e', hostID=5, msdUUID=u'616be2b6-71db-4f54-befd-be6a444775d7', masterVersion=5, domainsMap={u'1988cb7d-9c66-4434-ae0e-6b7b4546b12c': u'attached', u'efa23102-3be1-42bb-abe6-fb1f53af70a2': u'attached', u'616be2b6-71db-4f54-befd-be6a444775d7': u'active', u'9fb70b99-2e09-4923-8f48-24a73017aba8': u'active', u'978b957e-9a49-421a-a10a-1d8445b704a6': u'active'}, options=None) from=::ffff:10.10.171.33,52592, flow_id=7f33e075, task_id=4c10f825-20f6-4595-8f4b-92f7b570c70f (api:46)
2017-12-20 22:10:20,156-0500 INFO  (jsonrpc/7) [storage.StoragePoolMemoryBackend] new storage pool master version 5 and domains map {u'978b957e-9a49-421a-a10a-1d8445b704a6': u'Active', u'616be2b6-71db-4f54-befd-be6a444775d7': u'Active', u'efa23102-3be1-42bb-abe6-fb1f53af70a2': u'Attached', u'9fb70b99-2e09-4923-8f48-24a73017aba8': u'Active', u'1988cb7d-9c66-4434-ae0e-6b7b4546b12c': u'Attached'} (spbackends:449)
2017-12-20 22:10:20,157-0500 INFO  (jsonrpc/7) [storage.StoragePool] updating pool 00000001-0001-0001-0001-00000000025e backend from type NoneType instance 0x7f417a515f20 to type StoragePoolMemoryBackend instance 0x3f98cb0 (sp:157)
2017-12-20 22:10:20,157-0500 INFO  (jsonrpc/7) [storage.StoragePool] Connect host #5 to the storage pool 00000001-0001-0001-0001-00000000025e with master domain: 616be2b6-71db-4f54-befd-be6a444775d7 (ver = 5) (sp:692)
2017-12-20 22:10:20,278-0500 INFO  (jsonrpc/7) [IOProcessClient] Starting client ioprocess-13 (__init__:330)
2017-12-20 22:10:20,286-0500 INFO  (ioprocess/42758) [IOProcess] Starting ioprocess (__init__:452)
2017-12-20 22:10:20,289-0500 INFO  (jsonrpc/7) [vdsm.api] FINISH connectStoragePool error=[Errno 13] Permission denied from=::ffff:10.10.171.33,52592, flow_id=7f33e075, task_id=4c10f825-20f6-4595-8f4b-92f7b570c70f (api:50)
2017-12-20 22:10:20,290-0500 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='4c10f825-20f6-4595-8f4b-92f7b570c70f') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in connectStoragePool
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1028, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1090, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 704, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1275, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1485, in setMasterDomain
    domain = sdCache.produce(msdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 201, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 154, in _accessWrapper
    self.refresh()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 232, in refresh
    lines = self._metaRW.readlines()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 132, in readlines
    return stripNewLines(self._oop.directReadLines(self._metafile))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 323, in directReadLines
    fileStr = ioproc.readfile(path, direct=True)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 557, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 466, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 13] Permission denied
2017-12-20 22:10:20,290-0500 INFO  (jsonrpc/7) [storage.TaskManager.Task] (Task='4c10f825-20f6-4595-8f4b-92f7b570c70f') aborting: Task is aborted: u'[Errno 13] Permission denied' - code 100 (task:1181)
2017-12-20 22:10:20,291-0500 ERROR (jsonrpc/7) [storage.Dispatcher] FINISH connectStoragePool error=[Errno 13] Permission denied (dispatcher:86)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error
OSError: [Errno 13] Permission denied
2017-12-20 22:10:20,291-0500 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.connect failed (error 302) in 0.13 seconds (__init__:573)
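
The traceback above ends in ioproc.readfile(path, direct=True), so the Errno 13 comes from a direct read of the domain's dom_md/metadata file rather than from the mount itself. A minimal sketch for trying a similar read outside vdsm follows. It is not vdsm or ioprocess code: the helper name probe_direct_read is hypothetical, it is Linux-only, and it assumes that a 4096-byte aligned read is acceptable to the filesystem.

```python
import errno
import mmap
import os


def probe_direct_read(path, size=4096):
    """Try an O_DIRECT read of `path` and report what happened.

    Returns "ok: <n> bytes" on success, or the symbolic errno name
    (for example "EACCES" for Errno 13) on failure.
    """
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    try:
        # An anonymous mmap is page-aligned, which O_DIRECT reads require.
        buf = mmap.mmap(-1, size)
        try:
            n = os.readv(fd, [buf])
            return "ok: %d bytes" % n
        except OSError as e:
            return errno.errorcode.get(e.errno, str(e.errno))
        finally:
            buf.close()
    finally:
        os.close(fd)
```

Running this as the same user vdsm runs as, against the mounted metadata path, should show whether the failure is reproducible without vdsm. A plain (non-direct) read that succeeds while this one fails would point at the storage layer rather than at file permissions.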

Comment 1 Jason Brooks 2017-12-21 20:16:49 UTC
I was able to activate my hosts, but my master storage domain was corrupted. I added another gluster-based storage domain and reinitialized the data center with it, but now the hosts are in a cycle of "contending."

Comment 2 Jason Brooks 2017-12-21 20:24:56 UTC
More from vdsm.log:

2017-12-21 15:23:01,461-0500 INFO  (jsonrpc/1) [vdsm.api] START spmStart(spUUID=u'00000001-0001-0001-0001-00000000025e', prevID=-1, prevLVER=u'-1', maxHostID=250, domVersion=u'4', options=None) from=::ffff:10.10.171.33,38244, flow_id=41d4c825, task_id=df32dac5-c49f-4980-a8c5-fa8d3ca36afe (api:46)
2017-12-21 15:23:01,462-0500 INFO  (jsonrpc/1) [vdsm.api] FINISH spmStart return=None from=::ffff:10.10.171.33,38244, flow_id=41d4c825, task_id=df32dac5-c49f-4980-a8c5-fa8d3ca36afe (api:52)
2017-12-21 15:23:01,462-0500 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call StoragePool.spmStart succeeded in 0.00 seconds (__init__:573)
2017-12-21 15:23:01,463-0500 INFO  (tasks/0) [storage.ThreadPool.WorkerThread] START task df32dac5-c49f-4980-a8c5-fa8d3ca36afe (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x495cf80>>, args=None) (threadPool:208)
2017-12-21 15:23:01,466-0500 INFO  (tasks/0) [storage.SANLock] Acquiring host id for domain 733b23d8-482d-4b0d-af84-4791d1285f8e (id=5, async=False) (clusterlock:284)
2017-12-21 15:23:01,466-0500 INFO  (tasks/0) [storage.SANLock] Host id for domain 733b23d8-482d-4b0d-af84-4791d1285f8e already acquired (id=5, async=False) (clusterlock:312)
2017-12-21 15:23:01,466-0500 INFO  (tasks/0) [storage.SANLock] Acquiring Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) for host id 5 (clusterlock:377)
2017-12-21 15:23:01,605-0500 INFO  (tasks/0) [storage.SANLock] Successfully acquired Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) for host id 5 (clusterlock:415)
2017-12-21 15:23:01,621-0500 WARN  (upgrade/616be2b) [storage.PersistentDict] Could not parse line ``. (persistent:244)
2017-12-21 15:23:01,622-0500 ERROR (upgrade/616be2b) [storage.StoragePool] FINISH thread <Thread(upgrade/616be2b, started daemon 140001144719104)> failed (concurrent:198)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 191, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 189, in _upgradePoolDomain
    domain = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 203, in __getitem__
    raise KeyError(key)
KeyError: 'VERSION'
2017-12-21 15:23:01,628-0500 WARN  (tasks/0) [storage.PersistentDict] Could not parse line ``. (persistent:244)
2017-12-21 15:23:01,629-0500 ERROR (tasks/0) [storage.StoragePool] Backup domain validation failed (sp:353)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 350, in startSpm
    self.checkBackupDomain(__securityOverride=True)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1532, in checkBackupDomain
    dom = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 203, in __getitem__
    raise KeyError(key)
KeyError: 'VERSION'
2017-12-21 15:23:01,635-0500 WARN  (tasks/0) [storage.PersistentDict] Could not parse line ``. (persistent:244)
2017-12-21 15:23:01,636-0500 ERROR (tasks/0) [storage.StoragePool] Unexpected error (sp:389)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 361, in startSpm
    self._updateDomainsRole()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 254, in _updateDomainsRole
    domain = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 203, in __getitem__
    raise KeyError(key)
KeyError: 'VERSION'
2017-12-21 15:23:01,636-0500 ERROR (tasks/0) [storage.StoragePool] failed: 'VERSION' (sp:390)
2017-12-21 15:23:01,636-0500 INFO  (tasks/0) [storage.SANLock] Releasing Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) (clusterlock:435)
2017-12-21 15:23:01,710-0500 INFO  (tasks/0) [storage.SANLock] Successfully released Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) (clusterlock:444)
2017-12-21 15:23:01,710-0500 ERROR (tasks/0) [storage.TaskManager.Task] (Task='df32dac5-c49f-4980-a8c5-fa8d3ca36afe') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 361, in startSpm
    self._updateDomainsRole()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 254, in _updateDomainsRole
    domain = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 203, in __getitem__
    raise KeyError(key)
KeyError: 'VERSION'
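
The pairing of "Could not parse line ``" with KeyError: 'VERSION' is what one would expect if the metadata file yields no usable key/value pairs. The sketch below illustrates that failure mode only. It is not vdsm's PersistentDict, and it assumes the metadata is a list of KEY=VALUE lines; the function names are hypothetical.

```python
def parse_metadata(lines):
    """Parse KEY=VALUE lines; collect lines that do not parse."""
    parsed = {}
    skipped = []
    for line in lines:
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            # Empty or malformed line: nothing to store.
            skipped.append(line)
            continue
        parsed[key.strip()] = value.strip()
    return parsed, skipped


def get_version(parsed):
    """Look up the domain version; raises KeyError if it is missing."""
    return parsed["VERSION"]
```

With a file that contains only an empty line, parse_metadata([""]) returns an empty dict plus one skipped line, and get_version then raises KeyError: 'VERSION', matching the shape of the errors in the log.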

Comment 3 Jason Brooks 2017-12-22 21:30:27 UTC
My master domain's metadata file was corrupted. I remade a metadata file and my hosts were able to come up. Closing this issue.

Comment 4 Sahina Bose 2017-12-28 06:01:42 UTC
Do you have any logs to indicate how the metadata file got corrupted? vdsm/gluster mount logs?

Comment 5 Sahina Bose 2018-01-02 08:49:39 UTC
Re-opening this bug to understand the issue.

Comment 6 Martin Perina 2018-01-02 10:08:07 UTC
How exactly have you performed upgrade of those hosts?

1. Manually from command line using yum update? If so were hosts in Maintenance during that operation?

2. Or using Upgrade option in webadmin?

Also have you upgraded engine to 4.2 before upgrading hosts?

Comment 7 Jason Brooks 2018-01-02 19:07:18 UTC
(In reply to Martin Perina from comment #6)
> How exactly have you performed upgrade of those hosts?
> 
> 1. Manually from command line using yum update? If so were hosts in
> Maintenance during that operation?
> 
> 2. Or using Upgrade option in webadmin?
> 
> Also have you upgraded engine to 4.2 before upgrading hosts?

I upgraded the engine first, and then upgraded the hosts from the webadmin. The hosts were in maintenance mode before upgrading.

Comment 8 Jason Brooks 2018-01-02 19:15:14 UTC
Created attachment 1375886 [details]
gluster log for data domain

Comment 9 Sahina Bose 2018-01-03 05:19:43 UTC
Ravi, can you take a look? The logs indicate split-brain.

Comment 10 Ravishankar N 2018-01-08 07:01:01 UTC
Hi Jason, 
1. When you say you re-created the metadata file in comment #3, how did you do it? Did you remove the old file from the mount and create a new one?

2. Do you happen to know the gfid and the filename of the old metadata file? That would be helpful in looking at the logs. I do see split-brain messages in the logs but these can also occur if there was only one good copy of the file and brick that had that copy went down.  When that brick comes up, I/O will be allowed on the file again.

Comment 11 Ravishankar N 2018-01-08 07:08:36 UTC
Notes to self:
split-brain messages in the logs for the following gfids:
00000000-0000-0000-0000-000000000001
33173ce2-b91e-441c-91a5-e65eba02a6eb
69b1370d-bc6f-4def-8896-4352ee8862c8
979f2db9-da06-4123-b873-4b9199346537

Multiple mounts/umounts seem to have happened:
grep -rne "Started running" rhev-data-center-mnt-glusterSD-10.0.20.1__data.log-20171224

Multiple disconnects to bricks seem to have happened:
grep -nE "Connected to|disconnected from" rhev-data-center-mnt-glusterSD-10.0.20.1__data.log-20171224
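
For comparing several mount logs, the second grep above can be turned into a tally. This is a rough sketch that reuses the same pattern; the helper name count_events is hypothetical, and it assumes nothing about the log line format beyond the two matched phrases.

```python
import re
from collections import Counter

# Same alternation as the grep -nE command above.
PATTERN = re.compile(r"Connected to|disconnected from")


def count_events(lines):
    """Count how many lines mention a brick connect or disconnect."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts
```

An open file object can be passed directly, for example count_events(open("rhev-data-center-mnt-glusterSD-10.0.20.1__data.log-20171224")).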

Comment 12 Jason Brooks 2018-01-08 17:45:54 UTC
(In reply to Ravishankar N from comment #10)
> Hi Jason, 
> 1. When you say you re-created the metadata file in comment #3, how did you
> do it? Did you remove the old file from the mount and create a new one?

Yes, deleted it and created a new one.


> 
> 2. Do you happen to know the gfid and the filename of the old metadata file?
> That would be helpful in looking at the logs. I do see split-brain messages
> in the logs but these can also occur if there was only one good copy of the
> file and brick that had that copy went down.  When that brick comes up, I/O
> will be allowed on the file again.

The (mounted) filename/path is
/rhev/data-center/mnt/glusterSD/10.0.20.1\:_data/616be2b6-71db-4f54-befd-be6a444775d7/dom_md/metadata

I don't know how to get the gfid.

Comment 13 Ravishankar N 2018-01-09 05:29:13 UTC
The gfid (gluster's equivalent of an inode number) is stored on the file as the extended attribute trusted.gfid.

`getfattr -d -m . -e hex /path/to/backend-brick/616be2b6-71db-4f54-befd-be6a444775d7/dom_md/metadata` should give you the gfid amongst other attributes. But since you deleted and re-created the file, it would have gotten a different gfid, so it won't be of much help.
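
As a sketch of the same lookup in Python: the helper names below are hypothetical, and the code assumes Linux, root privileges (trusted.* attributes are only visible to privileged processes), and a path on the backend brick rather than on the client mount.

```python
import os
import uuid


def format_gfid(raw):
    """Render the 16 raw bytes of trusted.gfid in the usual UUID form."""
    return str(uuid.UUID(bytes=raw))


def read_gfid(brick_path):
    """Read trusted.gfid from a file on a brick (needs root, Linux only)."""
    return format_gfid(os.getxattr(brick_path, "trusted.gfid"))
```

The result uses the same dashed notation as the gfids listed in comment 11, so it can be searched for in the gluster logs directly.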

Comment 14 Ravishankar N 2018-01-31 13:33:59 UTC
Closing this bug as we don't have the AFR extended attributes of the file in question to ascertain whether it was indeed a split-brain. Note that we are in the process of fixing BZ 1384983 and BZ 1537480 to prevent a few known cases of split-brain in replica 3/arbiter volumes.

