Bug 859401
| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Summary | Self-heal failed when one of the replica pair was killed and restarted | | |
| Product | [Red Hat Storage] Red Hat Gluster Storage | Reporter | shylesh <shmohan> |
| Component | glusterfs | Assignee | Pranith Kumar K <pkarampu> |
| Status | CLOSED NOTABUG | QA Contact | shylesh <shmohan> |
| Severity | high | Docs Contact | |
| Priority | high | CC | grajaiya, iheim, rhinduja, rhs-bugs, shaines, spandura, vbellur, vinaraya |
| Version | unspecified | Target Milestone | --- |
| Target Release | --- | Hardware | x86_64 |
| OS | Linux | Whiteboard | |
| Fixed In Version | | Doc Type | Bug Fix |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2012-11-26 08:42:47 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Attachments | attachment 617160 [details]: rhev-data-center-mnt-rhs-gp-srv11.lab.eng.blr.redhat.com_distrep2.log | | |
Description shylesh 2012-09-21 13:04:45 UTC
Version info
============

[root@rhs-gp-srv4 ~]# rpm -qa | grep glus
glusterfs-server-3.3.0rhsvirt1-5.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-fuse-3.3.0rhsvirt1-5.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-5.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-3.3.0rhsvirt1-5.el6rhs.x86_64
glusterfs-geo-replication-3.3.0rhsvirt1-5.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch

[root@rhs-gp-srv9 ~]# rpm -qa | grep rhev
qemu-kvm-rhev-tools-0.12.1.2-2.295.el6_3.2.x86_64
qemu-img-rhev-0.12.1.2-2.295.el6_3.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64

I am unable to attach the storage back to this host using RHEV-M.

vdsm.log:

Thread-35::DEBUG::2012-09-22 16:02:40,646::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Volume group "9c17fd91-2e28-463d-b9b3-93fcd9a77679" not found\n'; <rc> = 5
Thread-35::WARNING::2012-09-22 16:02:40,649::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "9c17fd91-2e28-463d-b9b3-93fcd9a77679" not found']
Thread-35::DEBUG::2012-09-22 16:02:40,649::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-35::DEBUG::2012-09-22 16:02:40,652::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1'
Thread-35::DEBUG::2012-09-22 16:02:40,653::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1' (0 active users)
Thread-35::DEBUG::2012-09-22 16:02:40,653::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1' is free, finding out if anyone is waiting for it.
Thread-35::DEBUG::2012-09-22 16:02:40,653::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.609c0436-4b5c-4639-89ed-1fc60ccaa9a1', Clearing records.
Thread-35::ERROR::2012-09-22 16:02:40,654::task::853::TaskManager.Task::(_setError) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 811, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 853, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 647, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1172, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1515, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=609c0436-4b5c-4639-89ed-1fc60ccaa9a1, msdUUID=9c17fd91-2e28-463d-b9b3-93fcd9a77679'
Thread-35::DEBUG::2012-09-22 16:02:40,654::task::872::TaskManager.Task::(_run) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Task._run: 153517cd-8fe7-4dea-88a4-4d3e9e84833c ('609c0436-4b5c-4639-89ed-1fc60ccaa9a1', 1, '609c0436-4b5c-4639-89ed-1fc60ccaa9a1', '9c17fd91-2e28-463d-b9b3-93fcd9a77679', 1) {} failed - stopping task
Thread-35::DEBUG::2012-09-22 16:02:40,655::task::1199::TaskManager.Task::(stop) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::stopping in state preparing (force False)
Thread-35::DEBUG::2012-09-22 16:02:40,655::task::978::TaskManager.Task::(_decref) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::ref 1 aborting True
Thread-35::INFO::2012-09-22 16:02:40,655::task::1157::TaskManager.Task::(prepare) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::aborting: Task is aborted: 'Cannot find master domain' - code 304
Thread-35::DEBUG::2012-09-22 16:02:40,656::task::1162::TaskManager.Task::(prepare) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Prepare: aborted: Cannot find master domain
Thread-35::DEBUG::2012-09-22 16:02:40,656::task::978::TaskManager.Task::(_decref) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::ref 0 aborting True
Thread-35::DEBUG::2012-09-22 16:02:40,656::task::913::TaskManager.Task::(_doAbort) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::Task._doAbort: force False
Thread-35::DEBUG::2012-09-22 16:02:40,657::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-35::DEBUG::2012-09-22 16:02:40,657::task::588::TaskManager.Task::(_updateState) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::moving from state preparing -> state aborting
Thread-35::DEBUG::2012-09-22 16:02:40,657::task::537::TaskManager.Task::(__state_aborting) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::_aborting: recover policy none
Thread-35::DEBUG::2012-09-22 16:02:40,658::task::588::TaskManager.Task::(_updateState) Task=`153517cd-8fe7-4dea-88a4-4d3e9e84833c`::moving from state aborting -> state failed
Thread-35::DEBUG::2012-09-22 16:02:40,658::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-35::DEBUG::2012-09-22 16:02:40,658::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-35::ERROR::2012-09-22 16:02:40,658::dispatcher::66::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot find master domain: 'spUUID=609c0436-4b5c-4639-89ed-1fc60ccaa9a1, msdUUID=9c17fd91-2e28-463d-b9b3-93fcd9a77679'", 'code': 304}}

I see a split-brain log message in the logs you pasted, but the steps don't suggest the possibility of a split-brain. Do you have the full log I can inspect?

Created attachment 617160 [details]
rhev-data-center-mnt-rhs-gp-srv11.lab.eng.blr.redhat.com_distrep2.log
Attaching all available relevant log files.
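
For reference, the "Cannot find master domain" failure above can be checked by hand from the hypervisor. The sketch below is a minimal, illustrative sequence assuming the usual VDSM file-domain layout; the mount point is inferred from the attached log's name and the msdUUID is taken from the traceback, so both may need adjusting on the actual host.

```
# Hypothetical paths, for illustration only.
MNT=/rhev/data-center/mnt/rhs-gp-srv11.lab.eng.blr.redhat.com:_distrep2
MSD=9c17fd91-2e28-463d-b9b3-93fcd9a77679   # msdUUID from the traceback

# 1. Is the gluster volume actually mounted where VDSM expects it?
mount | grep distrep2

# 2. Does the master domain directory exist on the mount, and is its
#    metadata readable? While the replica pair is split-brained, reads
#    here can fail with EIO even though the mount itself looks healthy.
ls -l "$MNT/$MSD/dom_md/"
cat "$MNT/$MSD/dom_md/metadata"
```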
[2012-09-21 16:07:12.053257] I [client.c:2090:client_rpc_notify] 3-distrep2-client-3: disconnected
[2012-09-21 16:07:22.631713] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:22.354632] I [client.c:2090:client_rpc_notify] 3-distrep2-client-2: disconnected
[2012-09-21 16:10:25.636697] I [client.c:2090:client_rpc_notify] 3-distrep2-client-3: disconnected
[2012-09-21 16:10:32.672928] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:35.677000] I [client-handshake.c:1636:select_server_supported_programs] 3-distrep2-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-09-21 16:10:43.132422] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 3-distrep2-replicate-1: Unable to self-heal contents of '<gfid:48930e98-69c5-45d3-9875-ffda39031120>' (possible split-brain). Please delete the file from all but the preferred subvolume.

This seems like a legitimate split-brain. When the bricks go down, file operations (fops) happen on the last brick standing, here client-3 alone; when the bricks come back up, fops happen on the first one alone, here client-2, so the two copies diverge. I also see disconnects in a loop, just like the ones we observed in bug 865406. Closing this bug as it is a legitimate split-brain caused by a flaky network.
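
The resolution the self-heal log asks for ("delete the file from all but the preferred subvolume") has to be done by hand on this release; later releases grew CLI support for resolving split-brain, but that does not apply to this build. A minimal sketch of the usual manual procedure, assuming a hypothetical brick path (substitute the real brick directories of distrep2):

```
GFID=48930e98-69c5-45d3-9875-ffda39031120
BRICK=/bricks/distrep2   # hypothetical brick path

# On each brick of the replica pair, resolve the gfid to a path: every
# regular file has a hard link under .glusterfs/<aa>/<bb>/<gfid>.
find "$BRICK" -samefile "$BRICK/.glusterfs/48/93/$GFID" ! -path '*/.glusterfs/*'

# Compare the AFR changelog xattrs on the two copies; non-zero pending
# counters that accuse each other indicate split-brain.
getfattr -d -m trusted.afr -e hex "$BRICK/path/to/file"

# Remove the stale copy (the data file AND its .glusterfs hard link)
# from all but the preferred brick, then let self-heal re-copy it.
rm "$BRICK/path/to/file" "$BRICK/.glusterfs/48/93/$GFID"
```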