Bug 1024313

Summary: self-heal happening from sink to source in 3 way replica
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: spandura
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED EOL
QA Contact: spandura
Severity: high
Priority: high
Docs Contact:
Version: 2.1
CC: nsathyan, ravishankar, rhs-bugs, rmekala, sdharane, storage-qa-internal, vagarwal, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-03 17:19:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description spandura 2013-10-29 11:27:17 UTC
Description of problem:
=======================
In a 3-way replica setup, self-heal happens from the sink node to the source when nodes go offline and come back online.

Version-Release number of selected component (if applicable):
===========================================================
glusterfs 3.4.0.36rhs built on Oct 22 2013 10:56:18

How reproducible:
=====================
Executed the test case only once.

Steps to Reproduce:
=====================
1. Create a 1 x 3 replicate volume. Start the volume. Create a FUSE mount.

2. Kill all gluster processes on node3: { killall glusterfs glusterfsd glusterd }

3. From the mount point, create a file: "touch test_file"

4. Kill all gluster processes on node1: { killall glusterfs glusterfsd glusterd }

5. From the mount point, execute: "touch test_file"

6. Restart glusterd on all the nodes.

Actual results:
================
1. Entry self-heal happened from node2. [ As expected ]

2. Metadata and data self-heal happened from node1. [ Not expected ]

[2013-10-29 09:14:25.720195] I [afr-self-heal-common.c:2840:afr_log_self_heal_completion_status] 0-vol_rep-replicate-0:  metadata self heal  is successfully completed, foreground data self heal  is successfully completed,  from vol_rep-client-0 with 0 0 0  sizes - Pending matrix:  [ [ 0 0 2 ] [ 0 0 1 ] [ 0 0 0 ] ] on <gfid:b8309224-af45-440a-980e-aa588cbeeb8b>

3. In the glustershd.log file, the pending matrix shows wrong data.

foreground data self heal  is successfully completed,  from vol_rep-client-0 with 0 0 0  sizes - Pending matrix:  [ [ 0 0 2 ] [ 0 0 1 ] [ 0 0 0 ] ] on <gfid:b8309224-af45-440a-980e-aa588cbeeb8b>

Extended attributes of file on Brick1 before self-heal
========================================================
root@rhs-client11 [Oct-29-2013- 9:13:48] >getfattr -d -e hex -m . /rhs/bricks/b1/test_file
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/test_file
trusted.afr.vol_rep-client-0=0x000000000000000000000000
trusted.afr.vol_rep-client-1=0x000000000000000000000000
trusted.afr.vol_rep-client-2=0x000000010000000200000000
trusted.gfid=0xb8309224af45440a980eaa588cbeeb8b

Extended attributes of file on Brick2 before self-heal
========================================================
root@rhs-client12 [Oct-29-2013- 9:13:41] >getfattr -d -e hex -m . /rhs/bricks/b2/testfile
getfattr: /rhs/bricks/b2/testfile: No such file or directory
root@rhs-client12 [Oct-29-2013- 9:13:48] >getfattr -d -e hex -m . /rhs/bricks/b2/test_file
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b2/test_file
trusted.afr.vol_rep-client-0=0x000000000000000100000000
trusted.afr.vol_rep-client-1=0x000000000000000000000000
trusted.afr.vol_rep-client-2=0x000000010000000300000000
trusted.gfid=0xb8309224af45440a980eaa588cbeeb8b

Extended attributes of file on Brick3 before self-heal
========================================================
root@rhs-client13 [Oct-29-2013- 9:13:47] >getfattr -d -e hex -m . /rhs/bricks/b3/test_file
getfattr: /rhs/bricks/b3/test_file: No such file or directory
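The trusted.afr.* changelog values shown above are 12-byte counters: three big-endian 32-bit integers recording pending data, metadata, and entry operations, in that order. A minimal decoding sketch (the helper name is mine, not part of GlusterFS):

```python
import struct

def decode_afr_pending(hex_value: str):
    """Decode a trusted.afr.* xattr value (as printed by getfattr -e hex)
    into (data, metadata, entry) pending-operation counters."""
    raw = bytes.fromhex(hex_value.removeprefix("0x"))
    # Three network-order (big-endian) unsigned 32-bit counters.
    data, metadata, entry = struct.unpack(">III", raw)
    return data, metadata, entry

# Brick1's counters against client-2, taken from the dump above:
print(decode_afr_pending("0x000000010000000200000000"))  # (1, 2, 0)
```

Decoded this way, brick1 blames client-2 for 1 data and 2 metadata operations, and brick2 additionally blames client-0 for 1 metadata operation.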

4. Output of the "gluster volume heal <volume_name> info healed" command:
root@rhs-client11 [Oct-29-2013- 9:15:04] >gluster v heal vol_rep info healed
Gathering list of healed entries on volume vol_rep has been successful 

Brick rhs-client11:/rhs/bricks/b1
Number of entries: 1
at                    path on brick
-----------------------------------
2013-10-29 09:14:25 /test_file

Brick rhs-client12:/rhs/bricks/b2
Number of entries: 1
at                    path on brick
-----------------------------------
2013-10-29 09:14:26 /

Brick rhs-client13:/rhs/bricks/b3
Number of entries: 0


Expected results:
==================
Metadata self-heal and data self-heal should have happened from node2 (the source) to node1 and node3.
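Given the pre-heal xattrs above, one can sketch why node2 is the only valid metadata source: a brick is a sink if any other brick holds a non-zero pending counter against it, and a candidate source otherwise. This is a deliberately simplified model of AFR's accounting, not the actual implementation:

```python
# Simplified source/sink selection from per-brick pending counters.
# Rows: the observing brick; columns: the brick it blames.
# Metadata counters are taken from the xattr dumps in the description
# (brick3 had no copy of the file, so its row is all zeros).
metadata_pending = [
    [0, 0, 2],  # brick1 blames client-2
    [1, 0, 3],  # brick2 blames client-0 and client-2
    [0, 0, 0],
]

def split_sources_sinks(pending):
    """Return (sources, sinks): brick j is a sink if some other
    brick i holds a non-zero pending counter against it."""
    n = len(pending)
    sinks = {j for j in range(n)
             for i in range(n) if i != j and pending[i][j] != 0}
    sources = set(range(n)) - sinks
    return sources, sinks

sources, sinks = split_sources_sinks(metadata_pending)
print(sources, sinks)  # {1} {0, 2} -> only client-1 (node2) may be source
```

Under this model client-0 (node1) is itself a metadata sink, so healing metadata from node1, as the log reports, goes in the wrong direction.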

Additional info:
=================

root@rhs-client11 [Oct-29-2013-11:17:19] >gluster v info
 
Volume Name: vol_rep
Type: Replicate
Volume ID: 8e0dcde2-c326-492d-99fd-2421e951ec3c
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhs-client11:/rhs/bricks/b1
Brick2: rhs-client12:/rhs/bricks/b2
Brick3: rhs-client13:/rhs/bricks/b3

root@rhs-client11 [Oct-29-2013-11:18:21] >gluster v status
Status of volume: vol_rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick rhs-client11:/rhs/bricks/b1			49155	Y	8036
Brick rhs-client12:/rhs/bricks/b2			49155	Y	7695
Brick rhs-client13:/rhs/bricks/b3			49155	Y	26693
NFS Server on localhost					2049	Y	8045
Self-heal Daemon on localhost				N/A	Y	8049
NFS Server on rhs-client13				2049	Y	26702
Self-heal Daemon on rhs-client13			N/A	Y	26706
NFS Server on rhs-client12				2049	Y	9005
Self-heal Daemon on rhs-client12			N/A	Y	9012
 
There are no active volume tasks

Comment 3 spandura 2014-01-02 10:24:20 UTC
Verified the fix on the build "glusterfs 3.4.0.52rhs built on Dec 19 2013 12:20:16" using the same steps as mentioned in the description of the bug. The bug is not yet fixed: the INFO message still reports an invalid pending matrix. Moving the bug back to the "ASSIGNED" state.

Info message about heal in glustershd.log
===========================================
[2014-01-02 09:00:58.035489] I [afr-self-heal-common.c:2877:afr_log_self_heal_completion_status] 0-vol_rep-replicate-0:  metadata self heal  is successfully completed, foreground data self heal  is successfully completed,  data self heal from vol_rep-client-0  to sinks  vol_rep-client-2, with 0 bytes on vol_rep-client-0, 0 bytes on vol_rep-client-1, 0 bytes on vol_rep-client-2,  data - Pending matrix:  [ [ 0 0 2 ] [ 0 0 1 ] [ 0 0 0 ] ]  metadata self heal from source vol_rep-client-1 to vol_rep-client-0,  vol_rep-client-2,  metadata - Pending matrix:  [ [ 0 0 4 ] [ 1 0 4 ] [ 0 0 0 ] ], on <gfid:817340e2-9b15-4c18-813d-33d952839f06>

Extended attributes of file on brick1 before self-heal:
=======================================================
root@rhs-client11 [Jan-02-2014- 9:00:25] >gattr /rhs/bricks/b1/test_file
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/test_file
trusted.afr.vol_rep-client-0=0x000000000000000000000000
trusted.afr.vol_rep-client-1=0x000000000000000000000000
trusted.afr.vol_rep-client-2=0x000000010000000200000000
trusted.gfid=0x817340e29b154c18813d33d952839f06


Extended attributes of file on brick2 before self-heal:
=======================================================
root@rhs-client12 [Jan-02-2014- 9:00:25] >gattr /rhs/bricks/b1-rep1/test_file
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1-rep1/test_file
trusted.afr.vol_rep-client-0=0x000000000000000100000000
trusted.afr.vol_rep-client-1=0x000000000000000000000000
trusted.afr.vol_rep-client-2=0x000000010000000300000000
trusted.gfid=0x817340e29b154c18813d33d952839f06


Extended attributes of file on brick3 before self-heal:
=======================================================
root@rhs-client13 [Jan-02-2014- 9:00:25] >gattr /rhs/bricks/b1-rep2/test_file
getfattr: /rhs/bricks/b1-rep2/test_file: No such file or directory
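The <gfid:...> strings in the glustershd.log entries are simply the trusted.gfid bytes rendered as a standard UUID, so a getfattr dump can be cross-checked against a log line. A quick illustration (purely for verification, not part of any GlusterFS tooling):

```python
import uuid

# trusted.gfid from brick1's dump in this comment:
gfid_xattr = "0x817340e29b154c18813d33d952839f06"
gfid = uuid.UUID(hex=gfid_xattr.removeprefix("0x"))
print(gfid)  # 817340e2-9b15-4c18-813d-33d952839f06
```

This matches the <gfid:817340e2-9b15-4c18-813d-33d952839f06> reference in the self-heal completion message above.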

Comment 4 Vivek Agarwal 2014-01-03 05:33:50 UTC
Per tiger team call, removing it from corbett

Comment 9 RajeshReddy 2015-11-23 10:29:57 UTC
Tested with 3.1.2 (AFR v2.0) and was not able to reproduce the reported problem. As per Dev, this was fixed as part of the AFR v2 implementation, so marking this bug as verified.

Comment 10 Vivek Agarwal 2015-12-03 17:19:13 UTC
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.