Description of problem:

When we tried migrating VMs after executing self-heal, the migration failed and the VMs went into a non-responsive state. The VMs were hosted on a 2x2 distributed-replicate setup. One brick in each replica pair was brought down and brought back up after a while. While those bricks were down, snapshots of the running VMs were created. After bringing all the bricks up using "volume start force", we tried migrating the VMs to another host in the datacenter. The migration failed and the VMs became non-responsive.

Version-Release number of selected component (if applicable):

RHEV-M: rhevm-webadmin-portal-3.1.0-20.el6ev.noarch

RHS:
# rpm -qa | grep glusterfs
glusterfs-fuse-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-devel-3.3.0rhsvirt1-7.el6rhs.x86_64
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-geo-replication-3.3.0rhsvirt1-7.el6rhs.x86_64

How reproducible:
The migration failures were consistent.

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume and add it as a storage domain on RHEV-M.
2. Bring down one of the bricks in each replica pair.
3. Bring the bricks back up: gluster volume start dist-replica force
4. Start self-heal: gluster volume heal dist-replica full
5. Migrate VMs from RHEV-M.
(A shell sketch of this flow is included at the end of this report.)

Actual results:
VM is not responding. Migration failed.

Expected results:
Migration should succeed.

Additional info:

Volume Name: dist-replica
Type: Distributed-Replicate
Volume ID: 39e0c10c-12d8-4484-b21d-a3be0cd0b7aa
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-client36.lab.eng.blr.redhat.com:/dist-replica1
Brick2: rhs-client37.lab.eng.blr.redhat.com:/dist-replica1
Brick3: rhs-client43.lab.eng.blr.redhat.com:/dist-replica1
Brick4: rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
Options Reconfigured:
cluster.data-self-heal-algorithm: full
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
storage.linux-aio: disable
cluster.eager-lock: enable

# gluster volume heal dist-replica info heal-failed
Heal operation on volume dist-replica has been successful

Brick rhs-client36.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client37.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client43.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 1
at                    path on brick
-----------------------------------
2012-10-12 10:18:49 <gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>

# gluster volume heal dist-replica info split-brain
Heal operation on volume dist-replica has been successful

Brick rhs-client36.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client37.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client43.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 7
at                    path on brick
-----------------------------------
2012-10-12 11:18:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 11:08:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:58:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:48:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:38:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:28:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:24:50 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a

Brick rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 5
at                    path on brick
-----------------------------------
2012-10-12 11:08:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:58:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:48:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:38:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:28:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a

glustershd.log on the brick which was brought down:

[2012-10-12 10:18:49.471799] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-12 10:18:49.472358] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-0: Server lk version = 1
[2012-10-12 10:18:49.477200] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-12 10:18:49.477467] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-3: Connected to 10.70.36.68:24010, attached to remote volume '/dist-replica1'.
[2012-10-12 10:18:49.477493] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-12 10:18:49.477559] I [afr-common.c:3631:afr_notify] 0-dist-replica-replicate-1: Subvolume 'dist-replica-client-3' came back up; going online.
[2012-10-12 10:18:49.477689] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-3: Server lk version = 1
[2012-10-12 10:18:49.480999] E [afr-self-heal-data.c:1311:afr_sh_data_open_cbk] 0-dist-replica-replicate-1: open of <gfid:da8b8529-9238-46d8-950b-addd9b72b3a1> failed on child dist-replica-client-2 (Transport endpoint is not connected)
[2012-10-12 10:18:49.481526] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-12 10:18:49.481890] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-2: Connected to 10.70.36.67:24010, attached to remote volume '/dist-replica1'.
[2012-10-12 10:18:49.481921] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-12 10:18:49.483603] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-2: Server lk version = 1
[2012-10-12 10:22:27.858673] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-12 10:22:28.892947] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-12 10:22:28.944479] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-10-12 10:24:58.848994] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:07.743106] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:10.991522] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:13.453255] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:13.454103] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:35.624170] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:50.308879] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:50.310607] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:52.155138] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:52.156739] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:53.918361] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:28:49.541026] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 10:38:49.601079] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 10:42:41.652959] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:48:49.673890] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 10:57:03.012556] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:58:49.744783] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 11:08:49.813066] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 11:10:32.980028] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)

vdsm log:

Thread-60009::DEBUG::2012-10-12 10:58:38,747::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60009::DEBUG::2012-10-12 10:58:38,747::BindingXMLRPC::880::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '10513', 'displayIp': '0', 'displayPort': u'5910', 'session': 'Unknown', 'displaySecurePort': u'5911', 'timeOffset': '-43200', 'hash': '3659303903263255516', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet5': {'macAddr': '52:54:00:35:c9:c3', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet5'}}, 'vmId': 'f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'displayType': 'qxl', 'cpuUser': '99.84', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '851968', 'writeLatency': '0', 'imageID': 'a1342311-2a59-410a-b4b0-d6592338d8ce', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '851968', 'writeRate': '0.00'}, u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}}, 'monitorResponse': '-1', 'statsAge': '55.72', 'cpuIdle': '0.16', 'elapsedTime': '229432', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
Thread-60011::DEBUG::2012-10-12 10:58:40,806::BindingXMLRPC::874::vds::(wrapper) client [10.70.34.101]::call vmGetStats with ('f954ee50-9aa0-41a3-922d-6ceec1c483c8',) {} flowID [2ca2152f]
Thread-60011::DEBUG::2012-10-12 10:58:40,806::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60011::DEBUG::2012-10-12 10:58:40,807::BindingXMLRPC::880::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '10513', 'displayIp': '0', 'displayPort': u'5910', 'session': 'Unknown', 'displaySecurePort': u'5911', 'timeOffset': '-43200', 'hash': '3659303903263255516', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet5': {'macAddr': '52:54:00:35:c9:c3', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet5'}}, 'vmId': 'f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'displayType': 'qxl', 'cpuUser': '99.84', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '851968', 'writeLatency': '0', 'imageID': 'a1342311-2a59-410a-b4b0-d6592338d8ce', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '851968', 'writeRate': '0.00'}, u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}}, 'monitorResponse': '-1', 'statsAge': '57.78', 'cpuIdle': '0.16', 'elapsedTime': '229434', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
Thread-60013::DEBUG::2012-10-12 10:58:42,869::BindingXMLRPC::874::vds::(wrapper) client [10.70.34.101]::call vmGetStats with ('f954ee50-9aa0-41a3-922d-6ceec1c483c8',) {} flowID [f8b96b8]
Thread-60013::DEBUG::2012-10-12 10:58:42,870::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60013::DEBUG::2012-10-12 10:58:42,870::BindingXMLRPC::880::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '10513', 'displayIp': '0', 'displayPort': u'5910', 'session': 'Unknown', 'displaySecurePort': u'5911', 'timeOffset': '-43200', 'hash': '3659303903263255516', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet5': {'macAddr': '52:54:00:35:c9:c3', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet5'}}, 'vmId': 'f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'displayType': 'qxl', 'cpuUser': '99.84', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '851968', 'writeLatency': '0', 'imageID': 'a1342311-2a59-410a-b4b0-d6592338d8ce', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '851968', 'writeRate': '0.00'}, u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}}, 'monitorResponse': '-1', 'statsAge': '59.84', 'cpuIdle': '0.16', 'elapsedTime': '229436', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
Thread-35::ERROR::2012-10-12 10:58:43,001::utils::399::vm.Vm::(collect) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Stats function failed: <AdvancedStatsFunction _sampleDiskLatency at 0x248d6b0>
Thread-60015::DEBUG::2012-10-12 10:58:44,937::BindingXMLRPC::874::vds::(wrapper) client [10.70.34.101]::call vmGetStats with ('f954ee50-9aa0-41a3-922d-6ceec1c483c8',) {} flowID [1579a788]
Thread-60015::DEBUG::2012-10-12 10:58:44,937::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60015::DEBUG::2012-10-12 10:58:44,938::BindingXMLRPC::880::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '10513', 'displayIp': '0', 'displayPort': u'5910', 'session': 'Unknown', 'displaySecurePort': u'5911', 'timeOffset': '-43200', 'hash': '3659303903263255516', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet5': {'macAddr': '52:54:00:35:c9:c3', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet5'}}, 'vmId': 'f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'displayType': 'qxl', 'cpuUser': '99.84', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '851968', 'writeLatency': '0', 'imageID': 'a1342311-2a59-410a-b4b0-d6592338d8ce', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '851968', 'writeRate': '0.00'}, u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}}, 'monitorResponse': '-1', 'statsAge': '61.91', 'cpuIdle': '0.16', 'elapsedTime': '229438', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
Thread-60017::DEBUG::2012-10-12 10:58:47,021::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60032::DEBUG::2012-10-12 10:58:57,331::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60040::DEBUG::2012-10-12 10:59:07,604::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60048::DEBUG::2012-10-12 10:59:17,878::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60056::DEBUG::2012-10-12 10:59:28,152::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60061::DEBUG::2012-10-12 10:59:32,893::BindingXMLRPC::874::vds::(wrapper) client [10.70.34.101]::call vmSetTicket with ('f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'EzadNkg5n49j', '120', 'disconnect', {'userName': 'admin@internal', 'userId': 'fdfc627c-d875-11e0-90f0-83df133b58cc'}) {} flowID [38a214c6]
Thread-60065::DEBUG::2012-10-12 10:59:38,440::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available

vm details:

id                        : f954ee50-9aa0-41a3-922d-6ceec1c483c8
name                      : vm_autotest_7
cluster-id                : 29c0e55c-094b-11e2-a118-52540035ed48
cpu-topology-cores        : 1
cpu-topology-sockets      : 2
creation_time             : 2012-10-09T09:42:02.927-04:00
display-address           : 10.70.36.5
display-monitors          : 1
display-port              : 5910
display-secure_port       : 5911
display-type              : spice
high_availability-enabled : False
high_availability-priority: 1
host-id                   : 9b5dcde2-094b-11e2-8d8c-52540035ed48
memory                    : 4404019200
memory_policy-guaranteed  : 4404019200
origin                    : rhev
os-boot-dev               : hd
os-type                   : rhel_6x64
placement_policy-affinity : migratable
quota-id                  : 00000000-0000-0000-0000-000000000000
start_time                : 2012-10-09T09:44:41.555-04:00
stateless                 : False
status-state              : up
template-id               : 29e6cf38-c00c-4273-bfc1-3e851af255a7
type                      : server
usb-enabled               : False
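For reference, a rough shell sketch of the flow described in the Steps to Reproduce above. The volume name, host names, and brick paths are taken from the volume info; the way the bricks were brought down (killing the brick process is shown here) and the snapshot step via RHEV-M are assumptions, so treat this as an outline rather than the exact commands used:

# 1. Create and start the 2x2 distributed-replicate volume, then add it as a
#    storage domain in RHEV-M.
gluster volume create dist-replica replica 2 \
    rhs-client36.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client37.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client43.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
gluster volume start dist-replica

# 2. Bring one brick in each replica pair down (assumed here: kill the brick
#    process on one node of each pair, e.g. rhs-client36 and rhs-client43).
pkill -f 'glusterfsd.*dist-replica1'

# ...while those bricks are down, create snapshots of the running VMs from RHEV-M...

# 3. Bring the downed bricks back up.
gluster volume start dist-replica force

# 4. Trigger a full self-heal and watch its progress.
gluster volume heal dist-replica full
gluster volume heal dist-replica info
gluster volume heal dist-replica info split-brain

# 5. Migrate the VMs to another host from the RHEV-M web admin portal.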
Anush,

Do you have the sos-reports of the bricks and the client logs?

Pranith
Created attachment 627297 [details] brick log
Created attachment 627298 [details] Brick 4
Created attachment 627299 [details] Brick 3
Created attachment 627300 [details] Brick 2
Created attachment 627301 [details] hypervisor log
This seems like a legitimate split-brain:

- There is a 2-second delay between the disconnects of client-0 and client-1, so some writes will have happened only on client-1.
- There is a 2-second delay between the reconnects of client-0 and client-1, so some writes will have happened only on client-0.

That leads to a legitimate split-brain. Why the disconnects happened on both subvolumes is something we still need to investigate, so I will be moving the bug to NEEDINFO to figure that out.

[2012-10-11 15:09:20.722501] I [client.c:2090:client_rpc_notify] 0-dist-replica-client-0: disconnected
[2012-10-11 15:09:22.854799] I [client.c:2090:client_rpc_notify] 0-dist-replica-client-1: disconnected
[2012-10-11 15:09:31.409719] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-11 15:09:33.413786] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-1: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-11 15:09:36.024280] I [afr-self-heal-common.c:1189:afr_sh_missing_entry_call_impunge_recreate] 0-dist-replica-replicate-1: no missing files - /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a. proceeding to metadata check
[2012-10-11 15:09:37.418613] E [afr-self-heal-data.c:1047:afr_sh_data_fxattrop_cbk] 0-dist-replica-replicate-1: /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a, inspecting change log succeeded on < 2 children
[2012-10-11 15:09:37.419020] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-dist-replica-replicate-1: background data missing-entry gfid self-heal failed on /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
[2012-10-11 15:09:39.049557] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '/7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a' (possible split-brain). Please delete the file from all but the preferred subvolume.
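To confirm the split-brain on the back end, the AFR changelog extended attributes of the affected image file can be compared on the two bricks of replicate-1. A minimal sketch, run on rhs-client43 and rhs-client44 (brick path from the volume info, file path from the heal output); the interpretation in the comment describes the usual AFR convention and is not taken from this bug's data:

FILE=/dist-replica1/7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a

# Dump the trusted.* xattrs in hex. If both bricks carry non-zero pending
# counters in trusted.afr.dist-replica-client-2 / trusted.afr.dist-replica-client-3,
# each copy blames the other and AFR cannot pick a source, i.e. a data split-brain.
getfattr -d -m . -e hex "$FILE"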
Closing this bug as it is a legitimate split-brain caused by a flaky network.
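For anyone who ends up in the same state, the manual recovery on glusterfs 3.3 is essentially what the log message suggests: decide which copy of the image is good, remove the stale copy (and its gfid hard link under .glusterfs) from the other brick, and re-run the heal. This is only a sketch, with the stale copy assumed to be on rhs-client44 and the gfid assumed to be the one shown in the heal-failed output; verify both on the live system before deleting anything, since removing the wrong copy destroys data.

# On the brick that holds the stale copy ONLY (assumed: rhs-client44).
BRICK=/dist-replica1
FILE=$BRICK/7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a

# Confirm the file's gfid (prints something like trusted.gfid=0xda8b8529...).
getfattr -n trusted.gfid -e hex "$FILE"

# Remove the stale copy and its hard link under .glusterfs
# (path layout: .glusterfs/<first 2 hex chars>/<next 2 hex chars>/<full gfid>).
rm -f "$FILE"
rm -f "$BRICK/.glusterfs/da/8b/da8b8529-9238-46d8-950b-addd9b72b3a1"

# Re-trigger self-heal so the surviving copy is replicated back.
gluster volume heal dist-replica full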