Bug 859401
| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Summary | Self-heal failed when one of the replica pair was killed and restarted | | |
| Product | [Red Hat Storage] Red Hat Gluster Storage | Reporter | shylesh <shmohan> |
| Component | glusterfs | Assignee | Pranith Kumar K <pkarampu> |
| Status | CLOSED NOTABUG | QA Contact | shylesh <shmohan> |
| Severity | high | Docs Contact | |
| Priority | high | CC | grajaiya, iheim, rhinduja, rhs-bugs, shaines, spandura, vbellur, vinaraya |
| Version | unspecified | Target Milestone | --- |
| Target Release | --- | Hardware | x86_64 |
| OS | Linux | Whiteboard | |
| Fixed In Version | | Doc Type | Bug Fix |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2012-11-26 08:42:47 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Attachments | attachment 617160 [details]: rhev-data-center-mnt-rhs-gp-srv11.lab.eng.blr.redhat.com_distrep2.log | | |
Description shylesh 2012-09-21 13:04:45 UTC
Version info
============

[root@rhs-gp-srv4 ~]# rpm -qa | grep glus
glusterfs-server-3.3.0rhsvirt1-5.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-fuse-3.3.0rhsvirt1-5.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-5.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-3.3.0rhsvirt1-5.el6rhs.x86_64
glusterfs-geo-replication-3.3.0rhsvirt1-5.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch

[root@rhs-gp-srv9 ~]# rpm -qa | grep rhev
qemu-kvm-rhev-tools-0.12.1.2-2.295.el6_3.2.x86_64
qemu-img-rhev-0.12.1.2-2.295.el6_3.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64

I am unable to attach the storage back to this host using RHEV-M.

vdsm.log:

Thread-35::DEBUG::2012-09-22 16:02:40,646::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Volume group "9c17fd91-2e28-463d-b9b3-93fcd9a77679" not found\n'; <rc> = 5
Thread-35::WARNING::2012-09-22 16:02:40,649::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "9c17fd91-2e28-463d-b9b3-93fcd9a77679" not found']
Thread-35::DEBUG::2012-09-22 16:02:40,649::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-35::DEBUG::2012-09-22 16:02:40,652::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1'
Thread-35::DEBUG::2012-09-22 16:02:40,653::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1' (0 active users)
Thread-35::DEBUG::2012-09-22 16:02:40,653::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1' is free, finding out if anyone is waiting for it.
Thread-35::DEBUG::2012-09-22 16:02:40,653::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1', Clearing records.
Thread-35::ERROR::2012-09-22 16:02:40,654::task::853::TaskManager.Task::(_setError) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 811, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 853, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 647, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1172, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1515, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=609c0436-4b5c-4639-89ed-1fc60ccaa9a1, msdUUID=9c17fd91-2e28-463d-b9b3-93fcd9a77679'
Thread-35::DEBUG::2012-09-22 16:02:40,654::task::872::TaskManager.Task::(_run) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Task._run: 153517cd-8fe7-4dea-88a4-4d3e9e84833c ('609c0436-4b5c-4639-89ed-1fc60ccaa9a1', 1, '609c0436-4b5c-4639-89ed-1fc60ccaa9a1', '9c17fd91-2e28-463d-b9b3-93fcd9a77679', 1) {} failed - stopping task
Thread-35::DEBUG::2012-09-22 16:02:40,655::task::1199::TaskManager.Task::(stop) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::stopping in state preparing (force False)
Thread-35::DEBUG::2012-09-22 16:02:40,655::task::978::TaskManager.Task::(_decref) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::ref 1 aborting True
Thread-35::INFO::2012-09-22 16:02:40,655::task::1157::TaskManager.Task::(prepare) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::aborting: Task is aborted: 'Cannot find master domain' - code 304
Thread-35::DEBUG::2012-09-22 16:02:40,656::task::1162::TaskManager.Task::(prepare) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Prepare: aborted: Cannot find master domain
Thread-35::DEBUG::2012-09-22 16:02:40,656::task::978::TaskManager.Task::(_decref) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::ref 0 aborting True
Thread-35::DEBUG::2012-09-22 16:02:40,656::task::913::TaskManager.Task::(_doAbort) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Task._doAbort: force False
Thread-35::DEBUG::2012-09-22 16:02:40,657::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-35::DEBUG::2012-09-22 16:02:40,657::task::588::TaskManager.Task::(_updateState) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::moving from state preparing -> state aborting
Thread-35::DEBUG::2012-09-22 16:02:40,657::task::537::TaskManager.Task::(__state_aborting) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::_aborting: recover policy none
Thread-35::DEBUG::2012-09-22 16:02:40,658::task::588::TaskManager.Task::(_updateState) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::moving from state aborting -> state failed
Thread-35::DEBUG::2012-09-22 16:02:40,658::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-35::DEBUG::2012-09-22 16:02:40,658::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-35::ERROR::2012-09-22 16:02:40,658::dispatcher::66::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot find master domain: 'spUUID=609c0436-4b5c-4639-89ed-1fc60ccaa9a1, msdUUID=9c17fd91-2e28-463d-b9b3-93fcd9a77679'", 'code': 304}}

I see a split-brain log message in the logs you pasted, but the steps don't suggest the possibility of a split-brain. Do you have the full log I can inspect?

Created attachment 617160 [details]
rhev-data-center-mnt-rhs-gp-srv11.lab.eng.blr.redhat.com_distrep2.log
Attaching all available relevant log files.
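
For reference, the "Cannot find master domain" failure above can be checked by hand from the hypervisor. The sketch below is a minimal, illustrative sequence assuming the usual VDSM file-domain layout; the mount point is inferred from the attached log's name and the msdUUID is taken from the traceback, so both may need adjusting on the actual host.

```
# Hypothetical paths, for illustration only.
MNT=/rhev/data-center/mnt/rhs-gp-srv11.lab.eng.blr.redhat.com:_distrep2
MSD=9c17fd91-2e28-463d-b9b3-93fcd9a77679   # msdUUID from the traceback

# 1. Is the gluster volume actually mounted where VDSM expects it?
mount | grep distrep2

# 2. Does the master domain directory exist on the mount, and is its
#    metadata readable? While the replica pair is split-brained, reads
#    here can fail with EIO even though the mount itself looks healthy.
ls -l "$MNT/$MSD/dom_md/"
cat "$MNT/$MSD/dom_md/metadata"
```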
[2012-09-21 16:07:12.053257] I [client.c:2090:client_rpc_notify] 3-distrep2-client-3: disconnected
[2012-09-21 16:07:22.631713] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:22.354632] I [client.c:2090:client_rpc_notify] 3-distrep2-client-2: disconnected
[2012-09-21 16:10:25.636697] I [client.c:2090:client_rpc_notify] 3-distrep2-client-3: disconnected
[2012-09-21 16:10:32.672928] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:35.677000] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:43.132422] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 3-distrep2-replicate-1: Unable to self-heal contents of '<gfid:48930e98-69c5-45d3-9875-ffda39031120>' (possible split-brain). Please delete the file from all but the preferred subvolume.

This seems like a legitimate split-brain. When the bricks go down, file operations (fops) happen on the last brick standing, here client-3 alone; when the bricks come back up, fops happen on the first one alone, here client-2, so the two copies diverge. I also see disconnects in a loop, just like the ones we observed in bug 865406. Closing this bug as it is a legitimate split-brain caused by a flaky network.
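
The resolution the self-heal log asks for ("delete the file from all but the preferred subvolume") has to be done by hand on this release; later releases grew CLI support for resolving split-brain, but that does not apply to this build. A minimal sketch of the usual manual procedure, assuming a hypothetical brick path (substitute the real brick directories of distrep2):

```
GFID=48930e98-69c5-45d3-9875-ffda39031120
BRICK=/bricks/distrep2   # hypothetical brick path

# On each brick of the replica pair, resolve the gfid to a path: every
# regular file has a hard link under .glusterfs/<aa>/<bb>/<gfid>.
find "$BRICK" -samefile "$BRICK/.glusterfs/48/93/$GFID" ! -path '*/.glusterfs/*'

# Compare the AFR changelog xattrs on the two copies; non-zero pending
# counters that accuse each other indicate split-brain.
getfattr -d -m trusted.afr -e hex "$BRICK/path/to/file"

# Remove the stale copy (the data file AND its .glusterfs hard link)
# from all but the preferred brick, then let self-heal re-copy it.
rm "$BRICK/path/to/file" "$BRICK/.glusterfs/48/93/$GFID"
```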