Description of problem:

I upgraded my 4 host converged gluster/ovirt lab setup from 4.1 to 4.2 yesterday, and now 3 of my hosts won't connect to my main data domain, so they're non-operational when I try to activate them.

The hosts can mount the gluster storage just fine, I can mount to a test location on the hosts, and I can see that the hosts are mounting the storage in the usual place when they attempt to activate. Permissions look normal, too.

I've undeployed and redeployed the hosted engine from the three problem machines, in case that was causing an issue. I'm able to start the hosted engine from one of the problematic hosts, but when I access the engine, the state is still non-operational.

My three hosts are marked with "Host has no default route," but they do have a default route. I don't know if this is related or not.

Here's where the error happens in vdsm.log:

2017-12-20 22:10:20,156-0500 INFO (jsonrpc/7) [vdsm.api] START connectStoragePool(spUUID=u'00000001-0001-0001-0001-00000000025e', hostID=5, msdUUID=u'616be2b6-71db-4f54-befd-be6a444775d7', masterVersion=5, domainsMap={u'1988cb7d-9c66-4434-ae0e-6b7b4546b12c': u'attached', u'efa23102-3be1-42bb-abe6-fb1f53af70a2': u'attached', u'616be2b6-71db-4f54-befd-be6a444775d7': u'active', u'9fb70b99-2e09-4923-8f48-24a73017aba8': u'active', u'978b957e-9a49-421a-a10a-1d8445b704a6': u'active'}, options=None) from=::ffff:10.10.171.33,52592, flow_id=7f33e075, task_id=4c10f825-20f6-4595-8f4b-92f7b570c70f (api:46)
2017-12-20 22:10:20,156-0500 INFO (jsonrpc/7) [storage.StoragePoolMemoryBackend] new storage pool master version 5 and domains map {u'978b957e-9a49-421a-a10a-1d8445b704a6': u'Active', u'616be2b6-71db-4f54-befd-be6a444775d7': u'Active', u'efa23102-3be1-42bb-abe6-fb1f53af70a2': u'Attached', u'9fb70b99-2e09-4923-8f48-24a73017aba8': u'Active', u'1988cb7d-9c66-4434-ae0e-6b7b4546b12c': u'Attached'} (spbackends:449)
2017-12-20 22:10:20,157-0500 INFO (jsonrpc/7) [storage.StoragePool] updating pool 00000001-0001-0001-0001-00000000025e backend from type NoneType instance 0x7f417a515f20 to type StoragePoolMemoryBackend instance 0x3f98cb0 (sp:157)
2017-12-20 22:10:20,157-0500 INFO (jsonrpc/7) [storage.StoragePool] Connect host #5 to the storage pool 00000001-0001-0001-0001-00000000025e with master domain: 616be2b6-71db-4f54-befd-be6a444775d7 (ver = 5) (sp:692)
2017-12-20 22:10:20,278-0500 INFO (jsonrpc/7) [IOProcessClient] Starting client ioprocess-13 (__init__:330)
2017-12-20 22:10:20,286-0500 INFO (ioprocess/42758) [IOProcess] Starting ioprocess (__init__:452)
2017-12-20 22:10:20,289-0500 INFO (jsonrpc/7) [vdsm.api] FINISH connectStoragePool error=[Errno 13] Permission denied from=::ffff:10.10.171.33,52592, flow_id=7f33e075, task_id=4c10f825-20f6-4595-8f4b-92f7b570c70f (api:50)
2017-12-20 22:10:20,290-0500 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='4c10f825-20f6-4595-8f4b-92f7b570c70f') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in connectStoragePool
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1028, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1090, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 704, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1275, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1485, in setMasterDomain
    domain = sdCache.produce(msdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 201, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 154, in _accessWrapper
    self.refresh()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 232, in refresh
    lines = self._metaRW.readlines()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 132, in readlines
    return stripNewLines(self._oop.directReadLines(self._metafile))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 323, in directReadLines
    fileStr = ioproc.readfile(path, direct=True)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 557, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 466, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 13] Permission denied
2017-12-20 22:10:20,290-0500 INFO (jsonrpc/7) [storage.TaskManager.Task] (Task='4c10f825-20f6-4595-8f4b-92f7b570c70f') aborting: Task is aborted: u'[Errno 13] Permission denied' - code 100 (task:1181)
2017-12-20 22:10:20,291-0500 ERROR (jsonrpc/7) [storage.Dispatcher] FINISH connectStoragePool error=[Errno 13] Permission denied (dispatcher:86)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error
OSError: [Errno 13] Permission denied
2017-12-20 22:10:20,291-0500 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.connect failed (error 302) in 0.13 seconds (__init__:573)
I was able to activate my hosts, but my master storage domain was corrupted. I added another gluster-based storage domain and reinitialized the data center with it, but now the hosts are in a cycle of "contending."
More from vdsm.log:

2017-12-21 15:23:01,461-0500 INFO (jsonrpc/1) [vdsm.api] START spmStart(spUUID=u'00000001-0001-0001-0001-00000000025e', prevID=-1, prevLVER=u'-1', maxHostID=250, domVersion=u'4', options=None) from=::ffff:10.10.171.33,38244, flow_id=41d4c825, task_id=df32dac5-c49f-4980-a8c5-fa8d3ca36afe (api:46)
2017-12-21 15:23:01,462-0500 INFO (jsonrpc/1) [vdsm.api] FINISH spmStart return=None from=::ffff:10.10.171.33,38244, flow_id=41d4c825, task_id=df32dac5-c49f-4980-a8c5-fa8d3ca36afe (api:52)
2017-12-21 15:23:01,462-0500 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call StoragePool.spmStart succeeded in 0.00 seconds (__init__:573)
2017-12-21 15:23:01,463-0500 INFO (tasks/0) [storage.ThreadPool.WorkerThread] START task df32dac5-c49f-4980-a8c5-fa8d3ca36afe (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x495cf80>>, args=None) (threadPool:208)
2017-12-21 15:23:01,466-0500 INFO (tasks/0) [storage.SANLock] Acquiring host id for domain 733b23d8-482d-4b0d-af84-4791d1285f8e (id=5, async=False) (clusterlock:284)
2017-12-21 15:23:01,466-0500 INFO (tasks/0) [storage.SANLock] Host id for domain 733b23d8-482d-4b0d-af84-4791d1285f8e already acquired (id=5, async=False) (clusterlock:312)
2017-12-21 15:23:01,466-0500 INFO (tasks/0) [storage.SANLock] Acquiring Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) for host id 5 (clusterlock:377)
2017-12-21 15:23:01,605-0500 INFO (tasks/0) [storage.SANLock] Successfully acquired Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) for host id 5 (clusterlock:415)
2017-12-21 15:23:01,621-0500 WARN (upgrade/616be2b) [storage.PersistentDict] Could not parse line ``. (persistent:244)
2017-12-21 15:23:01,622-0500 ERROR (upgrade/616be2b) [storage.StoragePool] FINISH thread <Thread(upgrade/616be2b, started daemon 140001144719104)> failed (concurrent:198)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 191, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 189, in _upgradePoolDomain
    domain = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 203, in __getitem__
    raise KeyError(key)
KeyError: 'VERSION'
2017-12-21 15:23:01,628-0500 WARN (tasks/0) [storage.PersistentDict] Could not parse line ``. (persistent:244)
2017-12-21 15:23:01,629-0500 ERROR (tasks/0) [storage.StoragePool] Backup domain validation failed (sp:353)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 350, in startSpm
    self.checkBackupDomain(__securityOverride=True)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1532, in checkBackupDomain
    dom = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 203, in __getitem__
    raise KeyError(key)
KeyError: 'VERSION'
2017-12-21 15:23:01,635-0500 WARN (tasks/0) [storage.PersistentDict] Could not parse line ``. (persistent:244)
2017-12-21 15:23:01,636-0500 ERROR (tasks/0) [storage.StoragePool] Unexpected error (sp:389)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 361, in startSpm
    self._updateDomainsRole()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 254, in _updateDomainsRole
    domain = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 203, in __getitem__
    raise KeyError(key)
KeyError: 'VERSION'
2017-12-21 15:23:01,636-0500 ERROR (tasks/0) [storage.StoragePool] failed: 'VERSION' (sp:390)
2017-12-21 15:23:01,636-0500 INFO (tasks/0) [storage.SANLock] Releasing Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) (clusterlock:435)
2017-12-21 15:23:01,710-0500 INFO (tasks/0) [storage.SANLock] Successfully released Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) (clusterlock:444)
2017-12-21 15:23:01,710-0500 ERROR (tasks/0) [storage.TaskManager.Task] (Task='df32dac5-c49f-4980-a8c5-fa8d3ca36afe') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 361, in startSpm
    self._updateDomainsRole()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 254, in _updateDomainsRole
    domain = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 203, in __getitem__
    raise KeyError(key)
KeyError: 'VERSION'
My master domain's metadata file was corrupted. I remade a metadata file and my hosts were able to come up. Closing this issue.
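The `Could not parse line` warning followed by `KeyError: 'VERSION'` in the log above suggests the metadata file contained at least one empty line and no usable VERSION entry. A minimal sketch of a pre-check for that condition follows. It assumes the file is a plain list of KEY=VALUE lines; the function names and the sample content are hypothetical, and this is not vdsm's own parser.

```python
def parse_metadata(lines):
    """Split KEY=VALUE lines into a dict; collect lines that do not fit.

    Assumption: one KEY=VALUE pair per line. This is an illustration,
    not the parser vdsm uses.
    """
    parsed = {}
    unparseable = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            # Empty lines or lines without '=' end up here.
            unparseable.append(line)
            continue
        parsed[key.strip()] = value.strip()
    return parsed, unparseable


def missing_keys(parsed, required=("VERSION",)):
    """Return the required keys that are absent from the parsed metadata."""
    return [key for key in required if key not in parsed]


# Hypothetical file content: one valid pair, one empty line, no VERSION.
sample = ["EXAMPLE_KEY=example\n", "\n"]
parsed, unparseable = parse_metadata(sample)
print(missing_keys(parsed))
print(unparseable)
```

Running a check like this against a copy of dom_md/metadata before deleting it would also preserve evidence of what the damaged file looked like.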
Do you have any logs to indicate how the metadata file got corrupted? vdsm/gluster mount logs?
Re-opening this bug to understand the issue.
How exactly did you perform the upgrade of those hosts?

1. Manually from the command line using yum update? If so, were the hosts in Maintenance during that operation?

2. Or using the Upgrade option in webadmin?

Also, did you upgrade the engine to 4.2 before upgrading the hosts?
(In reply to Martin Perina from comment #6)
> How exactly have you performed upgrade of those hosts?
>
> 1. Manually from command line using yum update? If so were hosts in
> Maintenance during that operation?
>
> 2. Or using Upgrade option in webadmin?
>
> Also have you upgraded engine to 4.2 before upgrading hosts?

I upgraded the engine first and then upgraded the hosts from the webadmin. The hosts were in maintenance mode before upgrading.
Created attachment 1375886: gluster log for data domain
Ravi, can you take a look? The logs indicate split-brain.
Hi Jason,

1. When you say you re-created the metadata file in comment #3, how did you do it? Did you remove the old file from the mount and create a new one?

2. Do you happen to know the gfid and the filename of the old metadata file? That would be helpful in looking at the logs.

I do see split-brain messages in the logs, but these can also occur if there was only one good copy of the file and the brick that had that copy went down. When that brick comes up, I/O will be allowed on the file again.
Notes to self:

Split-brain messages in the logs for the following gfids:
00000000-0000-0000-0000-000000000001
33173ce2-b91e-441c-91a5-e65eba02a6eb
69b1370d-bc6f-4def-8896-4352ee8862c8
979f2db9-da06-4123-b873-4b9199346537

Multiple mounts/umounts seem to have happened:
grep -rne "Started running" rhev-data-center-mnt-glusterSD-10.0.20.1__data.log-20171224

Multiple disconnects to bricks seem to have happened:
grep -nE "Connected to|disconnected from" rhev-data-center-mnt-glusterSD-10.0.20.1__data.log-20171224
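The two grep passes above can be combined into one tally per event type, which makes it easier to see how often the client remounted or lost bricks. The sketch below is only an illustration: the three search strings come from the grep commands, while the function names and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Search strings taken from the grep commands in the notes above.
PATTERNS = {
    "mount_start": re.compile(r"Started running"),
    "connected": re.compile(r"Connected to"),
    "disconnected": re.compile(r"disconnected from"),
}


def count_events(lines):
    """Count how many lines match each pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def count_events_in_file(path):
    """Same tally, reading the log from disk and tolerating bad bytes."""
    with open(path, errors="replace") as handle:
        return count_events(handle)


# Hypothetical log lines, only to show the shape of the result.
sample = [
    "example: Started running the client",
    "example: Connected to brick-0",
    "example: disconnected from brick-0",
    "example: disconnected from brick-1",
]
print(dict(count_events(sample)))
```

A disconnect count much higher than the connect count, or many mount starts in a short window, would point at the bricks or the network rather than at the file itself.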
(In reply to Ravishankar N from comment #10)
> Hi Jason,
> 1. When you say you re-created the metadata file in comment #3, how did you
> do it? Did you remove the old file from the mount and create a new one?

Yes, deleted it and created a new one.

> 2. Do you happen to know the gfid and the filename of the old metadata file?
> That would be helpful in looking at the logs. I do see split-brain messages
> in the logs but these can also occur if there was only one good copy of the
> file and brick that had that copy went down. When that brick comes up, I/O
> will be allowed on the file again.

The (mounted) filename/path is
/rhev/data-center/mnt/glusterSD/10.0.20.1\:_data/616be2b6-71db-4f54-befd-be6a444775d7/dom_md/metadata

I don't know how to get the gfid.
The gfid (gluster's equivalent of an inode number) is stored as the extended attribute trusted.gfid on the file. `getfattr -d -m . -e hex /path/to/backend-brick/616be2b6-71db-4f54-befd-be6a444775d7/dom_md/metadata` should give you the gfid among other attributes. But since you deleted and re-created the file, it would have gotten a different gfid, so it won't be of much help.
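With `-e hex`, getfattr prints the attribute as one hex string, while the gluster logs show gfids in dashed UUID form. A small sketch for converting between the two follows, assuming the output contains a line of the form `trusted.gfid=0x<32 hex digits>`. The function name and the sample output are hypothetical; the gfid in the sample is borrowed from the split-brain list above only to show the format and is not known to belong to the metadata file.

```python
import re
import uuid

# Matches a full 16-byte gfid printed by `getfattr -e hex`.
GFID_LINE = re.compile(r"^trusted\.gfid=0x([0-9a-fA-F]{32})\s*$", re.MULTILINE)


def parse_gfid(getfattr_output):
    """Return the gfid as a dashed UUID string, or None if it is absent."""
    match = GFID_LINE.search(getfattr_output)
    if match is None:
        return None
    return str(uuid.UUID(hex=match.group(1)))


# Hypothetical getfattr output, trimmed to the relevant attribute.
sample = (
    "# file: path/to/backend-brick/dom_md/metadata\n"
    "trusted.gfid=0x33173ce2b91e441c91a5e65eba02a6eb\n"
)
print(parse_gfid(sample))
```

The dashed form can then be searched for directly in the mount log.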
Closing this bug as we don't have the afr extended attributes of the file in question to ascertain whether it was indeed a split-brain. Note that we are in the process of fixing BZ 1384983 and BZ 1537480 to prevent a few known cases of split-brain in replica 3/arbiter volumes.