Bug 865674

Summary: [RHEV-RHS] VM migration fails after self-heal
Product: Red Hat Gluster Storage (Red Hat Storage)
Component: glusterfs
Version: 2.0
Priority: high
Severity: unspecified
Status: CLOSED NOTABUG
Type: Bug
Reporter: Anush Shetty <ashetty>
Assignee: Pranith Kumar K <pkarampu>
QA Contact: Anush Shetty <ashetty>
CC: grajaiya, pkarampu, rhs-bugs, shaines, spandura, vbellur
Doc Type: Bug Fix
Last Closed: 2012-11-29 13:39:41 UTC
Attachments: brick log, Brick 4, Brick 3, Brick 2, hypervisor log (see comments 3-7)

Description Anush Shetty 2012-10-12 06:01:01 UTC
Description of problem: After running self-heal, migration of the VMs failed and the VMs went into a non-responsive state.

The VMs were hosted on a 2x2 distributed-replicate volume. One brick in each replica pair was brought down and brought back up after a while. While those bricks were down, snapshots of the running VMs were created. After bringing all the bricks back up with "gluster volume start force", we tried migrating the VMs to another host in the datacenter. The migration failed and the VMs went into a non-responsive state.

Version-Release number of selected component (if applicable):

RHEV-M: rhevm-webadmin-portal-3.1.0-20.el6ev.noarch

RHS:
# rpm -qa | grep glusterfs
glusterfs-fuse-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-devel-3.3.0rhsvirt1-7.el6rhs.x86_64
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-geo-replication-3.3.0rhsvirt1-7.el6rhs.x86_64


How reproducible: The migration failures were consistent


Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume and add it as a storage domain in RHEV-M
2. Bring down one brick in each replica pair; while they are down, create snapshots of the running VMs from RHEV-M
3. Bring the bricks back up: gluster volume start dist-replica force
4. Start a full self-heal: gluster volume heal dist-replica full
5. Migrate the VMs from RHEV-M (a command-line sketch of these steps follows below)
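
For reference, a minimal command-line sketch of steps 1-4, run from one of the RHS nodes. The brick layout matches the volume info under "Additional info"; killing the glusterfsd brick processes is only one assumed way of bringing the bricks down (the report does not state how they were taken offline), and the RHEV-M side (adding the storage domain, snapshotting the running VMs while the bricks are down, and triggering the migration) is done from the admin portal and is not shown:

# 1. Create the 2x2 distributed-replicate volume and start it
gluster volume create dist-replica replica 2 \
    rhs-client36.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client37.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client43.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
gluster volume start dist-replica

# 2. Bring one brick of each replica pair down, e.g. by killing its
#    glusterfsd process (brick PIDs from "gluster volume status")
gluster volume status dist-replica
kill <pid-of-brick-glusterfsd>   # e.g. rhs-client37:/dist-replica1
kill <pid-of-brick-glusterfsd>   # e.g. rhs-client44:/dist-replica1

# (snapshots of the running VMs are created from RHEV-M at this point)

# 3. Bring the downed bricks back up
gluster volume start dist-replica force

# 4. Trigger a full self-heal
gluster volume heal dist-replica full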
  
Actual results:

The VMs became non-responsive and the migration failed.

Expected results:

Migration should succeed.

Additional info:

Volume Name: dist-replica
Type: Distributed-Replicate
Volume ID: 39e0c10c-12d8-4484-b21d-a3be0cd0b7aa
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-client36.lab.eng.blr.redhat.com:/dist-replica1
Brick2: rhs-client37.lab.eng.blr.redhat.com:/dist-replica1
Brick3: rhs-client43.lab.eng.blr.redhat.com:/dist-replica1
Brick4: rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
Options Reconfigured:
cluster.data-self-heal-algorithm: full
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
storage.linux-aio: disable
cluster.eager-lock: enable
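
For reference, the reconfigured options above correspond to the standard "gluster volume set" invocations sketched here (the report does not show how or when they were applied):

gluster volume set dist-replica cluster.data-self-heal-algorithm full
gluster volume set dist-replica performance.quick-read disable
gluster volume set dist-replica performance.io-cache disable
gluster volume set dist-replica performance.stat-prefetch disable
gluster volume set dist-replica performance.read-ahead disable
gluster volume set dist-replica storage.linux-aio disable
gluster volume set dist-replica cluster.eager-lock enable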


# gluster volume heal dist-replica info heal-failed
Heal operation on volume dist-replica has been successful

Brick rhs-client36.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client37.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client43.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 1
at                    path on brick
-----------------------------------
2012-10-12 10:18:49 <gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>



# gluster volume heal dist-replica info split-brain
Heal operation on volume dist-replica has been successful

Brick rhs-client36.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client37.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 0

Brick rhs-client43.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 7
at                    path on brick
-----------------------------------
2012-10-12 11:18:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 11:08:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:58:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:48:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:38:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:28:47 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:24:50 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a

Brick rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
Number of entries: 5
at                    path on brick
-----------------------------------
2012-10-12 11:08:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:58:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:48:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:38:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
2012-10-12 10:28:49 /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a



glustershd.log from the node whose brick was brought down:

[2012-10-12 10:18:49.471799] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-12 10:18:49.472358] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-0: Server lk version = 1
[2012-10-12 10:18:49.477200] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-3: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-12 10:18:49.477467] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-3: Connected to 10.70.36.68:24010, attached to remote volume '/dist-replica1'.
[2012-10-12 10:18:49.477493] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-12 10:18:49.477559] I [afr-common.c:3631:afr_notify] 0-dist-replica-replicate-1: Subvolume 'dist-replica-client-3' came back up; going online.
[2012-10-12 10:18:49.477689] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-3: Server lk version = 1
[2012-10-12 10:18:49.480999] E [afr-self-heal-data.c:1311:afr_sh_data_open_cbk] 0-dist-replica-replicate-1: open of <gfid:da8b8529-9238-46d8-950b-addd9b72b3a1> failed on child dist-replica-client-2 (Transport endpoint is not connected)
[2012-10-12 10:18:49.481526] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-12 10:18:49.481890] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-2: Connected to 10.70.36.67:24010, attached to remote volume '/dist-replica1'.
[2012-10-12 10:18:49.481921] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-12 10:18:49.483603] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-2: Server lk version = 1
[2012-10-12 10:22:27.858673] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-12 10:22:28.892947] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-12 10:22:28.944479] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-10-12 10:24:58.848994] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:07.743106] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:10.991522] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:13.453255] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:13.454103] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:25:35.624170] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:50.308879] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:50.310607] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:52.155138] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:52.156739] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:26:53.918361] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:28:49.541026] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 10:38:49.601079] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 10:42:41.652959] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:48:49.673890] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 10:57:03.012556] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)
[2012-10-12 10:58:49.744783] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 11:08:49.813066] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:da8b8529-9238-46d8-950b-addd9b72b3a1>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-12 11:10:32.980028] E [afr-self-heald.c:685:_link_inode_update_loc] 0-dist-replica-replicate-1: inode link failed on the inode (00000000-0000-0000-0000-000000000000)


vdsm log:
Thread-60009::DEBUG::2012-10-12 10:58:38,747::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60009::DEBUG::2012-10-12 10:58:38,747::BindingXMLRPC::880::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '10513', 'displayIp': '0', 'displayPort': u'5910', 'session': 'Unknown', 'displaySecurePort': u'5911', 'timeOffset': '-43200', 'hash': '3659303903263255516', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet5': {'macAddr': '52:54:00:35:c9:c3', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet5'}}, 'vmId': 'f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'displayType': 'qxl', 'cpuUser': '99.84', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '851968', 'writeLatency': '0', 'imageID': 'a1342311-2a59-410a-b4b0-d6592338d8ce', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '851968', 'writeRate': '0.00'}, u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}}, 'monitorResponse': '-1', 'statsAge': '55.72', 'cpuIdle': '0.16', 'elapsedTime': '229432', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
Thread-60011::DEBUG::2012-10-12 10:58:40,806::BindingXMLRPC::874::vds::(wrapper) client [10.70.34.101]::call vmGetStats with ('f954ee50-9aa0-41a3-922d-6ceec1c483c8',) {} flowID [2ca2152f]
Thread-60011::DEBUG::2012-10-12 10:58:40,806::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60011::DEBUG::2012-10-12 10:58:40,807::BindingXMLRPC::880::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '10513', 'displayIp': '0', 'displayPort': u'5910', 'session': 'Unknown', 'displaySecurePort': u'5911', 'timeOffset': '-43200', 'hash': '3659303903263255516', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet5': {'macAddr': '52:54:00:35:c9:c3', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet5'}}, 'vmId': 'f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'displayType': 'qxl', 'cpuUser': '99.84', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '851968', 'writeLatency': '0', 'imageID': 'a1342311-2a59-410a-b4b0-d6592338d8ce', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '851968', 'writeRate': '0.00'}, u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}}, 'monitorResponse': '-1', 'statsAge': '57.78', 'cpuIdle': '0.16', 'elapsedTime': '229434', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
Thread-60013::DEBUG::2012-10-12 10:58:42,869::BindingXMLRPC::874::vds::(wrapper) client [10.70.34.101]::call vmGetStats with ('f954ee50-9aa0-41a3-922d-6ceec1c483c8',) {} flowID [f8b96b8]
Thread-60013::DEBUG::2012-10-12 10:58:42,870::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60013::DEBUG::2012-10-12 10:58:42,870::BindingXMLRPC::880::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '10513', 'displayIp': '0', 'displayPort': u'5910', 'session': 'Unknown', 'displaySecurePort': u'5911', 'timeOffset': '-43200', 'hash': '3659303903263255516', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet5': {'macAddr': '52:54:00:35:c9:c3', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet5'}}, 'vmId': 'f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'displayType': 'qxl', 'cpuUser': '99.84', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '851968', 'writeLatency': '0', 'imageID': 'a1342311-2a59-410a-b4b0-d6592338d8ce', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '851968', 'writeRate': '0.00'}, u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}}, 'monitorResponse': '-1', 'statsAge': '59.84', 'cpuIdle': '0.16', 'elapsedTime': '229436', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
Thread-35::ERROR::2012-10-12 10:58:43,001::utils::399::vm.Vm::(collect) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Stats function failed: <AdvancedStatsFunction _sampleDiskLatency at 0x248d6b0>
Thread-60015::DEBUG::2012-10-12 10:58:44,937::BindingXMLRPC::874::vds::(wrapper) client [10.70.34.101]::call vmGetStats with ('f954ee50-9aa0-41a3-922d-6ceec1c483c8',) {} flowID [1579a788]
Thread-60015::DEBUG::2012-10-12 10:58:44,937::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60015::DEBUG::2012-10-12 10:58:44,938::BindingXMLRPC::880::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '10513', 'displayIp': '0', 'displayPort': u'5910', 'session': 'Unknown', 'displaySecurePort': u'5911', 'timeOffset': '-43200', 'hash': '3659303903263255516', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet5': {'macAddr': '52:54:00:35:c9:c3', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet5'}}, 'vmId': 'f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'displayType': 'qxl', 'cpuUser': '99.84', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '851968', 'writeLatency': '0', 'imageID': 'a1342311-2a59-410a-b4b0-d6592338d8ce', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '851968', 'writeRate': '0.00'}, u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}}, 'monitorResponse': '-1', 'statsAge': '61.91', 'cpuIdle': '0.16', 'elapsedTime': '229438', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
Thread-60017::DEBUG::2012-10-12 10:58:47,021::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60032::DEBUG::2012-10-12 10:58:57,331::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60040::DEBUG::2012-10-12 10:59:07,604::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60048::DEBUG::2012-10-12 10:59:17,878::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60056::DEBUG::2012-10-12 10:59:28,152::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available
Thread-60061::DEBUG::2012-10-12 10:59:32,893::BindingXMLRPC::874::vds::(wrapper) client [10.70.34.101]::call vmSetTicket with ('f954ee50-9aa0-41a3-922d-6ceec1c483c8', 'EzadNkg5n49j', '120', 'disconnect', {'userName': 'admin@internal', 'userId': 'fdfc627c-d875-11e0-90f0-83df133b58cc'}) {} flowID [38a214c6]
Thread-60065::DEBUG::2012-10-12 10:59:38,440::libvirtvm::243::vm.Vm::(_getDiskStats) vmId=`f954ee50-9aa0-41a3-922d-6ceec1c483c8`::Disk hdc stats not available

vm details:
id                        : f954ee50-9aa0-41a3-922d-6ceec1c483c8
name                      : vm_autotest_7
cluster-id                : 29c0e55c-094b-11e2-a118-52540035ed48
cpu-topology-cores        : 1
cpu-topology-sockets      : 2
creation_time             : 2012-10-09T09:42:02.927-04:00
display-address           : 10.70.36.5
display-monitors          : 1
display-port              : 5910
display-secure_port       : 5911
display-type              : spice
high_availability-enabled : False
high_availability-priority: 1
host-id                   : 9b5dcde2-094b-11e2-8d8c-52540035ed48
memory                    : 4404019200
memory_policy-guaranteed  : 4404019200
origin                    : rhev
os-boot-dev               : hd
os-type                   : rhel_6x64
placement_policy-affinity : migratable
quota-id                  : 00000000-0000-0000-0000-000000000000
start_time                : 2012-10-09T09:44:41.555-04:00
stateless                 : False
status-state              : up
template-id               : 29e6cf38-c00c-4273-bfc1-3e851af255a7
type                      : server
usb-enabled               : False

Comment 2 Pranith Kumar K 2012-10-15 05:31:04 UTC
Anush,
     Do you have the sos-reports of the bricks and the client logs?

Pranith

Comment 3 Anush Shetty 2012-10-15 10:54:27 UTC
Created attachment 627297 [details]
brick log

Comment 4 Anush Shetty 2012-10-15 10:56:32 UTC
Created attachment 627298 [details]
Brick 4

Comment 5 Anush Shetty 2012-10-15 10:57:31 UTC
Created attachment 627299 [details]
Brick 3

Comment 6 Anush Shetty 2012-10-15 10:59:46 UTC
Created attachment 627300 [details]
Brick 2

Comment 7 Anush Shetty 2012-10-15 11:00:17 UTC
Created attachment 627301 [details]
hypervisor log

Comment 8 Pranith Kumar K 2012-10-15 12:03:58 UTC
This looks like a legitimate split-brain:

There is a 2-second gap between the disconnects of client-0 and client-1, so some writes happen only on client-1 (which stayed connected slightly longer).

There is also a 2-second gap between the reconnects of client-0 and client-1, so some writes happen only on client-0 (which came back first).

Together these produce a legitimate split-brain. Why the disconnects happened on both subvolumes of the replica pair is something we still need to investigate.

Moving the bug to the NEEDINFO state to figure that out.

[2012-10-11 15:09:20.722501] I [client.c:2090:client_rpc_notify] 0-dist-replica-client-0: disconnected
[2012-10-11 15:09:22.854799] I [client.c:2090:client_rpc_notify] 0-dist-replica-client-1: disconnected
[2012-10-11 15:09:31.409719] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-11 15:09:33.413786] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-1: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-11 15:09:36.024280] I [afr-self-heal-common.c:1189:afr_sh_missing_entry_call_impunge_recreate] 0-dist-replica-replicate-1: no missing files - /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a. proceeding to metadata check
[2012-10-11 15:09:37.418613] E [afr-self-heal-data.c:1047:afr_sh_data_fxattrop_cbk] 0-dist-replica-replicate-1: /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a, inspecting change log succeeded on < 2 children
[2012-10-11 15:09:37.419020] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-dist-replica-replicate-1: background  data missing-entry gfid self-heal failed on /7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
[2012-10-11 15:09:39.049557] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '/7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a' (possible split-brain). Please delete the file from all but the preferred subvolume.
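
For completeness, a sketch of the manual resolution the log message asks for: inspect the AFR changelog xattrs of the file on both bricks of replicate-1, delete the stale copy, and let self-heal recreate it. The choice of rhs-client44 as the brick holding the stale copy, and the assumption that the <gfid:da8b8529-...> entry reported earlier refers to this image file, are illustrative only; verify both from the xattr output before deleting anything.

# Run on both bricks of replicate-1 (rhs-client43 and rhs-client44); non-zero
# trusted.afr.* pending counters show which copy each brick blames
BRICK=/dist-replica1
IMG=$BRICK/7746e77b-7475-4fb8-ab7f-fd85773c5762/images/5e36cc52-acf4-4db8-83b0-cba9d68867af/f913caef-ca44-416b-832c-25a233cb1b3a
getfattr -d -m . -e hex "$IMG"

# On the brick holding the stale copy only (assumed here to be rhs-client44),
# remove the file and its GFID hard link under .glusterfs, then re-trigger heal
rm -f "$IMG"
rm -f "$BRICK/.glusterfs/da/8b/da8b8529-9238-46d8-950b-addd9b72b3a1"
gluster volume heal dist-replica full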

Comment 11 Pranith Kumar K 2012-11-29 13:39:41 UTC
Closing this bug as it is a legitimate split-brain caused by a flaky network.