Description of problem:
Created a distributed-replicated volume in a 2x2 configuration and used it as a VM store. Added 2 more bricks and started rebalance, brought down one of the bricks, and after some time brought it back. Self-heal failed.

Version-Release number of selected component (if applicable):
RHS2.0zz

How reproducible:

Steps to Reproduce:
1. Created a distributed-replicated volume in a 2x2 configuration
2. Created some VMs on this volume
3. Added 2 more bricks
4. Started rebalance and brought down one of the bricks from the newly added pair
5. After some time brought the brick back

Actual results:
The VM is down, and self-heal failure messages can be seen on the client.

Expected results:

Additional info:

[root@rhs-gp-srv9 ~]# tail -n 100 /var/log/glusterfs/rhev-data-center-mnt-rhs-gp-srv11.lab.eng.blr.redhat.com\:_distrep2.log
48: subvolumes distrep2-dht
49: end-volume
50:
51: volume distrep2
52: type debug/io-stats
53: option latency-measurement off
54: option count-fop-hits off
55: subvolumes distrep2-write-behind
56: end-volume
+------------------------------------------------------------------------------+
[2012-09-21 16:06:23.701143] I [rpc-clnt.c:1659:rpc_clnt_reconfig] 3-distrep2-client-1: changing port to 24009 (from 0)
[2012-09-21 16:06:23.701212] I [rpc-clnt.c:1659:rpc_clnt_reconfig] 3-distrep2-client-3: changing port to 24009 (from 0)
[2012-09-21 16:06:23.701251] I [rpc-clnt.c:1659:rpc_clnt_reconfig] 3-distrep2-client-2: changing port to 24009 (from 0)
[2012-09-21 16:06:23.701316] I [rpc-clnt.c:1659:rpc_clnt_reconfig] 3-distrep2-client-0: changing port to 24009 (from 0)
[2012-09-21 16:06:27.591460] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-1: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:06:27.591807] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-1: Connected to 10.70.36.15:24009, attached to remote volume '/disk2'.
[2012-09-21 16:06:27.591841] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:06:27.591909] I [afr-common.c:3631:afr_notify] 3-distrep2-replicate-0: Subvolume 'distrep2-client-1' came back up; going online.
[2012-09-21 16:06:27.592072] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-1: Server lk version = 1
[2012-09-21 16:06:27.595178] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:06:27.595497] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-3: Connected to 10.70.36.19:24009, attached to remote volume '/disk2'.
[2012-09-21 16:06:27.595520] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:06:27.595596] I [afr-common.c:3631:afr_notify] 3-distrep2-replicate-1: Subvolume 'distrep2-client-3' came back up; going online.
[2012-09-21 16:06:27.595784] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-3: Server lk version = 1
[2012-09-21 16:06:27.598992] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:06:27.599358] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-2: Connected to 10.70.36.16:24009, attached to remote volume '/disk2'.
[2012-09-21 16:06:27.599399] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:06:27.599656] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-2: Server lk version = 1
[2012-09-21 16:06:27.602948] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:06:27.603405] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-0: Connected to 10.70.36.8:24009, attached to remote volume '/disk2'.
[2012-09-21 16:06:27.603437] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:06:27.608195] I [fuse-bridge.c:4222:fuse_graph_setup] 0-fuse: switched to graph 3
[2012-09-21 16:06:27.608242] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-0: Server lk version = 1
[2012-09-21 16:06:32.614502] E [socket.c:1715:socket_connect_finish] 2-distrep2-client-5: connection to 10.70.36.6:24010 failed (Connection refused)
[2012-09-21 16:06:36.913296] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 3-distrep2-replicate-0: added root inode
[2012-09-21 16:06:36.913982] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 3-distrep2-replicate-1: added root inode
[2012-09-21 16:06:36.917080] I [client.c:2151:notify] 2-distrep2-client-0: current graph is no longer active, destroying rpc_client
[2012-09-21 16:06:36.917149] I [client.c:2151:notify] 2-distrep2-client-1: current graph is no longer active, destroying rpc_client
[2012-09-21 16:06:36.917176] I [client.c:2151:notify] 2-distrep2-client-2: current graph is no longer active, destroying rpc_client
[2012-09-21 16:06:36.917199] I [client.c:2151:notify] 2-distrep2-client-3: current graph is no longer active, destroying rpc_client
[2012-09-21 16:06:36.917228] I [client.c:2151:notify] 2-distrep2-client-4: current graph is no longer active, destroying rpc_client
[2012-09-21 16:06:36.917233] I [client.c:2090:client_rpc_notify] 2-distrep2-client-0: disconnected
[2012-09-21 16:06:36.917295] I [client.c:2090:client_rpc_notify] 2-distrep2-client-1: disconnected
[2012-09-21 16:06:36.917243] I [client.c:2151:notify] 2-distrep2-client-5: current graph is no longer active, destroying rpc_client
[2012-09-21 16:06:36.917313] E [afr-common.c:3668:afr_notify] 2-distrep2-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2012-09-21 16:06:36.917412] I [client.c:2090:client_rpc_notify] 2-distrep2-client-2: disconnected
[2012-09-21 16:06:36.917445] I [client.c:2090:client_rpc_notify] 2-distrep2-client-3: disconnected
[2012-09-21 16:06:36.917459] E [afr-common.c:3668:afr_notify] 2-distrep2-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2012-09-21 16:07:12.053161] W [socket.c:1512:__socket_proto_state_machine] 3-distrep2-client-3: reading from socket failed. Error (Transport endpoint is not connected), peer (10.70.36.19:24009)
[2012-09-21 16:07:12.053257] I [client.c:2090:client_rpc_notify] 3-distrep2-client-3: disconnected
[2012-09-21 16:07:22.631713] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:07:22.632134] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-3: Connected to 10.70.36.19:24009, attached to remote volume '/disk2'.
[2012-09-21 16:07:22.632157] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:07:22.632170] I [client-handshake.c:1282:client_post_handshake] 3-distrep2-client-3: 1 fds open - Delaying child_up until they are re-opened
[2012-09-21 16:07:22.632560] I [client-lk.c:601:decrement_reopen_fd_count] 3-distrep2-client-3: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2012-09-21 16:07:22.632812] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-3: Server lk version = 1
[2012-09-21 16:09:55.281775] W [socket.c:1512:__socket_proto_state_machine] 3-distrep2-client-0: reading from socket failed. Error (Transport endpoint is not connected), peer (10.70.36.8:24009)
[2012-09-21 16:09:55.281854] I [client.c:2090:client_rpc_notify] 3-distrep2-client-0: disconnected
[2012-09-21 16:10:05.657960] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:05.658418] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-0: Connected to 10.70.36.8:24009, attached to remote volume '/disk2'.
[2012-09-21 16:10:05.658451] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:10:05.658784] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-0: Server lk version = 1
[2012-09-21 16:10:15.796330] W [socket.c:1512:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (Transport endpoint is not connected), peer (10.70.36.15:24007)
[2012-09-21 16:10:16.001301] W [socket.c:1512:__socket_proto_state_machine] 3-distrep2-client-1: reading from socket failed. Error (Transport endpoint is not connected), peer (10.70.36.15:24009)
[2012-09-21 16:10:16.001368] I [client.c:2090:client_rpc_notify] 3-distrep2-client-1: disconnected
[2012-09-21 16:10:22.354566] W [socket.c:1512:__socket_proto_state_machine] 3-distrep2-client-2: reading from socket failed. Error (Transport endpoint is not connected), peer (10.70.36.16:24009)
[2012-09-21 16:10:22.354632] I [client.c:2090:client_rpc_notify] 3-distrep2-client-2: disconnected
[2012-09-21 16:10:25.636637] W [socket.c:1512:__socket_proto_state_machine] 3-distrep2-client-3: reading from socket failed. Error (Transport endpoint is not connected), peer (10.70.36.19:24009)
[2012-09-21 16:10:25.636697] I [client.c:2090:client_rpc_notify] 3-distrep2-client-3: disconnected
[2012-09-21 16:10:25.636716] E [afr-common.c:3668:afr_notify] 3-distrep2-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2012-09-21 16:10:26.664948] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-09-21 16:10:26.668378] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-1: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:26.668743] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-1: Connected to 10.70.36.15:24009, attached to remote volume '/disk2'.
[2012-09-21 16:10:26.668765] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:10:26.669003] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-1: Server lk version = 1
[2012-09-21 16:10:30.251134] W [socket.c:1512:__socket_proto_state_machine] 3-distrep2-client-0: reading from socket failed. Error (Transport endpoint is not connected), peer (10.70.36.8:24009)
[2012-09-21 16:10:30.251196] I [client.c:2090:client_rpc_notify] 3-distrep2-client-0: disconnected
[2012-09-21 16:10:32.672928] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:32.673194] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-2: Connected to 10.70.36.16:24009, attached to remote volume '/disk2'.
[2012-09-21 16:10:32.673217] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:10:32.673241] I [client-handshake.c:1282:client_post_handshake] 3-distrep2-client-2: 1 fds open - Delaying child_up until they are re-opened
[2012-09-21 16:10:32.673634] I [client-lk.c:601:decrement_reopen_fd_count] 3-distrep2-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2012-09-21 16:10:32.673677] I [afr-common.c:3631:afr_notify] 3-distrep2-replicate-1: Subvolume 'distrep2-client-2' came back up; going online.
[2012-09-21 16:10:32.673842] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-2: Server lk version = 1
[2012-09-21 16:10:35.677000] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:35.677349] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-3: Connected to 10.70.36.19:24009, attached to remote volume '/disk2'.
[2012-09-21 16:10:35.677373] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:10:35.677386] I [client-handshake.c:1282:client_post_handshake] 3-distrep2-client-3: 1 fds open - Delaying child_up until they are re-opened
[2012-09-21 16:10:35.677860] I [client-lk.c:601:decrement_reopen_fd_count] 3-distrep2-client-3: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2012-09-21 16:10:35.678131] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-3: Server lk version = 1
[2012-09-21 16:10:40.681323] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:40.681722] I [client-handshake.c:1433:client_setvolume_cbk] 3-distrep2-client-0: Connected to 10.70.36.8:24009, attached to remote volume '/disk2'.
[2012-09-21 16:10:40.681755] I [client-handshake.c:1445:client_setvolume_cbk] 3-distrep2-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-09-21 16:10:40.682067] I [client-handshake.c:453:client_set_lk_version_cbk] 3-distrep2-client-0: Server lk version = 1
[2012-09-21 16:10:43.132422] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 3-distrep2-replicate-1: Unable to self-heal contents of '<gfid:48930e98-69c5-45d3-9875-ffda39031120>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-09-21 16:10:43.132923] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 3-distrep2-replicate-1: background data self-heal failed on <gfid:48930e98-69c5-45d3-9875-ffda39031120>
[2012-09-21 16:13:51.928306] I [fuse-bridge.c:4122:fuse_thread_proc] 0-fuse: unmounting /rhev/data-center/mnt/rhs-gp-srv11.lab.eng.blr.redhat.com:_distrep2
[2012-09-21 16:13:51.931002] W [glusterfsd.c:906:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x7f4e08d7f11d] (-->/lib64/libpthread.so.0(+0x7851) [0x7f4e093cb851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd) [0x405d2d]))) 0-: received signum (15), shutting down
[2012-09-21 16:13:51.931149] I [fuse-bridge.c:4707:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/rhs-gp-srv11.lab.eng.blr.redhat.com:_distrep2'.
Version info
==========
[root@rhs-gp-srv4 ~]# rpm -qa | grep glus
glusterfs-server-3.3.0rhsvirt1-5.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-fuse-3.3.0rhsvirt1-5.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-5.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-3.3.0rhsvirt1-5.el6rhs.x86_64
glusterfs-geo-replication-3.3.0rhsvirt1-5.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch
[root@rhs-gp-srv9 ~]# rpm -qa | grep rhev
qemu-kvm-rhev-tools-0.12.1.2-2.295.el6_3.2.x86_64
qemu-img-rhev-0.12.1.2-2.295.el6_3.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
I am unable to attach the storage back to this host using RHEV-M.

vdsm.log:

Thread-35::DEBUG::2012-09-22 16:02:40,646::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Volume group "9c17fd91-2e28-463d-b9b3-93fcd9a77679" not found\n'; <rc> = 5
Thread-35::WARNING::2012-09-22 16:02:40,649::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "9c17fd91-2e28-463d-b9b3-93fcd9a77679" not found']
Thread-35::DEBUG::2012-09-22 16:02:40,649::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-35::DEBUG::2012-09-22 16:02:40,652::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1'
Thread-35::DEBUG::2012-09-22 16:02:40,653::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1' (0 active users)
Thread-35::DEBUG::2012-09-22 16:02:40,653::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1' is free, finding out if anyone is waiting for it.
Thread-35::DEBUG::2012-09-22 16:02:40,653::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1', Clearing records.
Thread-35::ERROR::2012-09-22 16:02:40,654::task::853::TaskManager.Task::(_setError) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 811, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 853, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 647, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1172, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1515, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=609c0436-4b5c-4639-89ed-1fc60ccaa9a1, msdUUID=9c17fd91-2e28-463d-b9b3-93fcd9a77679'
Thread-35::DEBUG::2012-09-22 16:02:40,654::task::872::TaskManager.Task::(_run) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Task._run: 153517cd-8fe7-4dea-88a4-4d3e9e84833c ('609c0436-4b5c-4639-89ed-1fc60ccaa9a1', 1, '609c0436-4b5c-4639-89ed-1fc60ccaa9a1', '9c17fd91-2e28-463d-b9b3-93fcd9a77679', 1) {} failed - stopping task
Thread-35::DEBUG::2012-09-22 16:02:40,655::task::1199::TaskManager.Task::(stop) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::stopping in state preparing (force False)
Thread-35::DEBUG::2012-09-22 16:02:40,655::task::978::TaskManager.Task::(_decref) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::ref 1 aborting True
Thread-35::INFO::2012-09-22 16:02:40,655::task::1157::TaskManager.Task::(prepare) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::aborting: Task is aborted: 'Cannot find master domain' - code 304
Thread-35::DEBUG::2012-09-22 16:02:40,656::task::1162::TaskManager.Task::(prepare) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Prepare: aborted: Cannot find master domain
Thread-35::DEBUG::2012-09-22 16:02:40,656::task::978::TaskManager.Task::(_decref) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::ref 0 aborting True
Thread-35::DEBUG::2012-09-22 16:02:40,656::task::913::TaskManager.Task::(_doAbort) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Task._doAbort: force False
Thread-35::DEBUG::2012-09-22 16:02:40,657::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-35::DEBUG::2012-09-22 16:02:40,657::task::588::TaskManager.Task::(_updateState) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::moving from state preparing -> state aborting
Thread-35::DEBUG::2012-09-22 16:02:40,657::task::537::TaskManager.Task::(__state_aborting) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::_aborting: recover policy none
Thread-35::DEBUG::2012-09-22 16:02:40,658::task::588::TaskManager.Task::(_updateState) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::moving from state aborting -> state failed
Thread-35::DEBUG::2012-09-22 16:02:40,658::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-35::DEBUG::2012-09-22 16:02:40,658::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-35::ERROR::2012-09-22 16:02:40,658::dispatcher::66::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot find master domain: 'spUUID=609c0436-4b5c-4639-89ed-1fc60ccaa9a1, msdUUID=9c17fd91-2e28-463d-b9b3-93fcd9a77679'", 'code': 304}}
I see a split-brain log message in the logs you pasted, but the steps don't indicate the possibility of a split-brain. Do you have the full log I can inspect?
Created attachment 617160 [details]
rhev-data-center-mnt-rhs-gp-srv11.lab.eng.blr.redhat.com_distrep2.log

Attaching all available relevant log files.
[2012-09-21 16:07:12.053257] I [client.c:2090:client_rpc_notify] 3-distrep2-client-3: disconnected
[2012-09-21 16:07:22.631713] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:22.354632] I [client.c:2090:client_rpc_notify] 3-distrep2-client-2: disconnected
[2012-09-21 16:10:25.636697] I [client.c:2090:client_rpc_notify] 3-distrep2-client-3: disconnected
[2012-09-21 16:10:32.672928] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:35.677000] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:43.132422] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 3-distrep2-replicate-1: Unable to self-heal contents of '<gfid:48930e98-69c5-45d3-9875-ffda39031120>' (possible split-brain). Please delete the file from all but the preferred subvolume.

This looks like a legitimate split-brain: when the bricks flap, FOPs land on only the last brick standing (here client-3), and when the bricks come back up they land on only the first one (here client-2). I also see disconnects in a loop, just like the ones we observed in bug 865406.
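The mutual-accusation pattern can be confirmed on the brick backends: each replica keeps a trusted.afr.<volname>-client-<N> extended attribute for its peer, whose first twelve bytes are three big-endian 32-bit counters of pending data, metadata, and entry operations. A minimal sketch of the check, assuming that xattr layout (the function names and example values here are illustrative, not taken from this bug's bricks):

```python
def parse_afr_changelog(hexval):
    """Decode a trusted.afr.* value as dumped by `getfattr -e hex`.

    The first 12 bytes are three big-endian 32-bit counters of
    pending data, metadata, and entry operations for the peer brick.
    """
    raw = bytes.fromhex(hexval[2:] if hexval.startswith("0x") else hexval)
    data = int.from_bytes(raw[0:4], "big")
    metadata = int.from_bytes(raw[4:8], "big")
    entry = int.from_bytes(raw[8:12], "big")
    return data, metadata, entry

def is_data_split_brain(xattr_on_a_for_b, xattr_on_b_for_a):
    """Data split-brain: each replica holds nonzero pending data ops
    blaming the other, so neither copy can be picked as the heal source."""
    return (parse_afr_changelog(xattr_on_a_for_b)[0] > 0
            and parse_afr_changelog(xattr_on_b_for_a)[0] > 0)
```

On a live system the input values would come from running `getfattr -d -m trusted.afr -e hex <file>` against the file's backend path on each brick of the replica pair.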
Closing this bug as it is a legitimate split-brain caused by a flaky network.
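For anyone hitting the same error: the log's advice ("delete the file from all but the preferred subvolume") means removing the file on the non-preferred bricks together with its gfid hard link under the brick's .glusterfs directory, then letting self-heal copy from the surviving replica. A sketch of the gfid-to-backend-path mapping, using the gfid from this bug's log (the brick root in the example is hypothetical):

```python
def gfid_backend_path(brick_root, gfid):
    """Map a gfid (as printed in the client log) to the hard link
    GlusterFS keeps under <brick>/.glusterfs/<aa>/<bb>/<full-gfid>,
    where aa and bb are the first two hex-character pairs of the gfid."""
    g = gfid.strip("<>")
    if g.startswith("gfid:"):
        g = g[len("gfid:"):]
    return "%s/.glusterfs/%s/%s/%s" % (brick_root, g[0:2], g[2:4], g)
```

Deleting only the regular file but not this hard link leaves a stale gfid entry behind, which is why both must go on the bad brick before triggering a heal.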