Bug 1528391
Summary: Following update to 4.2, hosts stuck in non-operational state

Product: [oVirt] vdsm
Component: General
Version: 4.20.9.3
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Reporter: Jason Brooks <jbrooks>
Assignee: Denis Chaplygin <dchaplyg>
QA Contact: Raz Tamir <ratamir>
CC: amureini, bugs, jbrooks, mperina, ravishankar, sabose
Keywords: Reopened
Target Milestone: ---
Target Release: ---
oVirt Team: Gluster
Type: Bug
Last Closed: 2018-01-31 13:33:59 UTC
Description (Jason Brooks, 2017-12-21 17:40:45 UTC):
I was able to activate my hosts, but my master storage domain was corrupted. I added another gluster-based storage domain and reinitialized the data center with it, but now the hosts are in a cycle of "contending."

More from vdsm.log:

2017-12-21 15:23:01,461-0500 INFO (jsonrpc/1) [vdsm.api] START spmStart(spUUID=u'00000001-0001-0001-0001-00000000025e', prevID=-1, prevLVER=u'-1', maxHostID=250, domVersion=u'4', options=None) from=::ffff:10.10.171.33,38244, flow_id=41d4c825, task_id=df32dac5-c49f-4980-a8c5-fa8d3ca36afe (api:46)
2017-12-21 15:23:01,462-0500 INFO (jsonrpc/1) [vdsm.api] FINISH spmStart return=None from=::ffff:10.10.171.33,38244, flow_id=41d4c825, task_id=df32dac5-c49f-4980-a8c5-fa8d3ca36afe (api:52)
2017-12-21 15:23:01,462-0500 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call StoragePool.spmStart succeeded in 0.00 seconds (__init__:573)
2017-12-21 15:23:01,463-0500 INFO (tasks/0) [storage.ThreadPool.WorkerThread] START task df32dac5-c49f-4980-a8c5-fa8d3ca36afe (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x495cf80>>, args=None) (threadPool:208)
2017-12-21 15:23:01,466-0500 INFO (tasks/0) [storage.SANLock] Acquiring host id for domain 733b23d8-482d-4b0d-af84-4791d1285f8e (id=5, async=False) (clusterlock:284)
2017-12-21 15:23:01,466-0500 INFO (tasks/0) [storage.SANLock] Host id for domain 733b23d8-482d-4b0d-af84-4791d1285f8e already acquired (id=5, async=False) (clusterlock:312)
2017-12-21 15:23:01,466-0500 INFO (tasks/0) [storage.SANLock] Acquiring Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) for host id 5 (clusterlock:377)
2017-12-21 15:23:01,605-0500 INFO (tasks/0) [storage.SANLock] Successfully acquired Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) for host id 5 (clusterlock:415)
2017-12-21 15:23:01,621-0500 WARN (upgrade/616be2b) [storage.PersistentDict] Could not parse line ``. (persistent:244)
2017-12-21 15:23:01,622-0500 ERROR (upgrade/616be2b) [storage.StoragePool] FINISH thread <Thread(upgrade/616be2b, started daemon 140001144719104)> failed (concurrent:198)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 191, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 189, in _upgradePoolDomain
    domain = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 366, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 173, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 203, in __getitem__
    raise KeyError(key)
KeyError: 'VERSION'
2017-12-21 15:23:01,628-0500 WARN (tasks/0) [storage.PersistentDict] Could not parse line ``. (persistent:244)
2017-12-21 15:23:01,629-0500 ERROR (tasks/0) [storage.StoragePool] Backup domain validation failed (sp:353)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 350, in startSpm
    self.checkBackupDomain(__securityOverride=True)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1532, in checkBackupDomain
    dom = sdCache.produce(sdUUID)
  [... identical call chain through sdc.py, glusterSD.py, fileSD.py, sd.py and persistent.py as in the traceback above ...]
KeyError: 'VERSION'
2017-12-21 15:23:01,635-0500 WARN (tasks/0) [storage.PersistentDict] Could not parse line ``. (persistent:244)
2017-12-21 15:23:01,636-0500 ERROR (tasks/0) [storage.StoragePool] Unexpected error (sp:389)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 361, in startSpm
    self._updateDomainsRole()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 254, in _updateDomainsRole
    domain = sdCache.produce(sdUUID)
  [... identical call chain through sdc.py, glusterSD.py, fileSD.py, sd.py and persistent.py as in the first traceback ...]
KeyError: 'VERSION'
2017-12-21 15:23:01,636-0500 ERROR (tasks/0) [storage.StoragePool] failed: 'VERSION' (sp:390)
2017-12-21 15:23:01,636-0500 INFO (tasks/0) [storage.SANLock] Releasing Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) (clusterlock:435)
2017-12-21 15:23:01,710-0500 INFO (tasks/0) [storage.SANLock] Successfully released Lease(name='SDM', path=u'/rhev/data-center/mnt/glusterSD/10.0.20.5:tmp2/733b23d8-482d-4b0d-af84-4791d1285f8e/dom_md/leases', offset=1048576) (clusterlock:444)
2017-12-21 15:23:01,710-0500 ERROR (tasks/0) [storage.TaskManager.Task] (Task='df32dac5-c49f-4980-a8c5-fa8d3ca36afe') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 361, in startSpm
    self._updateDomainsRole()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 254, in _updateDomainsRole
    domain = sdCache.produce(sdUUID)
  [... identical call chain through sdc.py, glusterSD.py, fileSD.py, sd.py and persistent.py as in the first traceback ...]
KeyError: 'VERSION'

My master domain's metadata file was corrupted.
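The repeated KeyError: 'VERSION' fits the corruption described above: vdsm's PersistentDict stores storage-domain metadata as key=value lines, and the "Could not parse line ``" warning shows the file no longer yields any keys, so the later VERSION lookup fails. A minimal sketch of that failure mode (illustrative only, not vdsm's actual implementation; the field names are examples):

```python
# Sketch of key=value metadata parsing, similar in spirit to vdsm's
# PersistentDict (hypothetical simplification, not the real code).
def parse_metadata(text):
    """Parse 'KEY=value' lines into a dict; unparsable lines are the
    ones vdsm reports as "Could not parse line ``"."""
    meta = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # corrupted/empty line: contributes no key
        key, _, value = line.partition("=")
        meta[key] = value
    return meta

# A healthy domain yields a VERSION key.
healthy = parse_metadata("ROLE=Master\nVERSION=4\nTYPE=GLUSTERFS\n")
assert healthy["VERSION"] == "4"

# A truncated/corrupted metadata file yields no keys, so the VERSION
# lookup raises exactly the KeyError seen in the tracebacks.
corrupted = parse_metadata("")
try:
    corrupted["VERSION"]
except KeyError:
    pass
```

This is why every code path that needed the domain version (upgrade, backup-domain validation, SPM start) failed with the same error until the metadata file was recreated.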
Jason Brooks:
I remade a metadata file and my hosts were able to come up. Closing this issue.

Do you have any logs to indicate how the metadata file got corrupted? vdsm/gluster mount logs? Re-opening this bug to understand the issue.

Martin Perina (comment #6):
How exactly did you perform the upgrade of those hosts?

1. Manually from the command line using yum update? If so, were the hosts in Maintenance during that operation?
2. Or using the Upgrade option in webadmin?

Also, did you upgrade the engine to 4.2 before upgrading the hosts?

Jason Brooks:
(In reply to Martin Perina from comment #6)
> How exactly did you perform the upgrade of those hosts? [...]

I upgraded the engine first and then upgraded the hosts from the webadmin; the hosts were in maintenance mode before upgrading.

Created attachment 1375886 [details]
gluster log for data domain
Ravi, can you take a look? The logs indicate split-brain.

Ravishankar N (comment #10):
Hi Jason,

1. When you say you re-created the metadata file in comment #3, how did you do it? Did you remove the old file from the mount and create a new one?

2. Do you happen to know the gfid and the filename of the old metadata file? That would be helpful in looking at the logs. I do see split-brain messages in the logs, but these can also occur if there was only one good copy of the file and the brick that had that copy went down. When that brick comes up, I/O will be allowed on the file again.

Notes to self: split-brain messages in the logs for the following gfids:

00000000-0000-0000-0000-000000000001
33173ce2-b91e-441c-91a5-e65eba02a6eb
69b1370d-bc6f-4def-8896-4352ee8862c8
979f2db9-da06-4123-b873-4b9199346537

Multiple mounts/umounts seem to have happened:
grep -rne "Started running" rhev-data-center-mnt-glusterSD-10.0.20.1__data.log-20171224

Multiple disconnects from bricks seem to have happened:
grep -nE "Connected to|disconnected from" rhev-data-center-mnt-glusterSD-10.0.20.1__data.log-20171224

Jason Brooks:
(In reply to Ravishankar N from comment #10)
> 1. When you say you re-created the metadata file in comment #3, how did you do it? Did you remove the old file from the mount and create a new one?

Yes, deleted it and created a new one.

> 2. Do you happen to know the gfid and the filename of the old metadata file? [...]

The (mounted) filename/path is /rhev/data-center/mnt/glusterSD/10.0.20.1\:_data/616be2b6-71db-4f54-befd-be6a444775d7/dom_md/metadata

I don't know how to get the gfid.

Ravishankar N:
The gfid (which is the inode number in the gluster world) is stored as an extended attribute, trusted.gfid, on the file. `getfattr -d -m . -e hex /path/to/backend-brick/616be2b6-71db-4f54-befd-be6a444775d7/dom_md/metadata` should give you the gfid amongst other attributes. But since you deleted and re-created the file, it would have gotten a different gfid, so it won't be of much help.

Closing this bug, as we don't have the afr extended attributes of the file in question to ascertain whether it was indeed a split-brain. Note that we are in the process of fixing BZ 1384983 and BZ 1537480 to prevent a few known cases of split-brains in replica 3 / arbiter volumes.
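For background on the afr attributes mentioned in the closing comment: on each replica brick, gluster keeps trusted.afr.<volname>-client-N extended attributes whose value packs three big-endian 32-bit pending-operation counters (data, metadata, entry); a data split-brain is the case where each brick carries a non-zero data counter blaming the other. The sketch below decodes such values and applies that heuristic on synthetic data. It is illustrative only (the real check is `gluster volume heal <volname> info split-brain`), and the sample byte strings are made up, not taken from this bug's bricks:

```python
import struct

def afr_counters(xattr_value):
    """Decode a trusted.afr.* xattr value: three big-endian 32-bit
    counters for pending data, metadata and entry operations."""
    data, metadata, entry = struct.unpack(">III", xattr_value[:12])
    return {"data": data, "metadata": metadata, "entry": entry}

def is_data_split_brain(brick0_blames_1, brick1_blames_0):
    """Heuristic: data split-brain when each replica brick has a
    non-zero data-pending counter accusing the other brick."""
    c0 = afr_counters(brick0_blames_1)
    c1 = afr_counters(brick1_blames_0)
    return c0["data"] > 0 and c1["data"] > 0

# Synthetic values: brick0's trusted.afr.<vol>-client-1 and
# brick1's trusted.afr.<vol>-client-0 (hypothetical, for illustration).
both_blame = (struct.pack(">III", 2, 0, 0), struct.pack(">III", 1, 0, 0))
one_good   = (struct.pack(">III", 0, 0, 0), struct.pack(">III", 3, 0, 0))

assert is_data_split_brain(*both_blame) is True   # mutual accusation
assert is_data_split_brain(*one_good) is False    # one clean source remains
```

The second case illustrates Ravi's point above: if only one brick accuses the other, there is still a single good copy and heal can proceed once the bad brick returns, which is why the raw xattrs (now lost with the deleted file) were needed to confirm a genuine split-brain.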