Bug 1030207 - AFR : self-heal always happens from node which has wise change-logs even though it is not the longest lived node
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Anuradha
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-14 07:24 UTC by spandura
Modified: 2016-09-20 02:00 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-03 17:21:33 UTC
Embargoed:



Description spandura 2013-11-14 07:24:49 UTC
Description of problem:
========================
In the existing self-heal algorithm, self-heal happens from the node whose change-logs mark it as "wise".

Say there are 3 bricks in the cluster (a 1 x 3 replicate volume). Writes are happening on a file when brick1 crashes; writes continue from the mount point. Brick1 is brought back online, but before self-heal can complete, brick2 crashes. At this point brick2 holds the change-log [ 265 0 0 ] for the file. Writes continue, and then brick3 crashes (brick3 is the longest-lived brick). When brick2 and brick3 are brought back online, self-heal happens from brick2 to brick3 and brick1 even though brick3 has the latest data, because the change-log of the file on brick2 marks it as wise.
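
For illustration, below is a minimal Python sketch of the source-selection behaviour described above. The classification rules and names are my own approximation of AFR-v1, not the actual GlusterFS code:

# Minimal sketch (not GlusterFS code) of AFR-v1-style heal-source selection.
# pending[i][j] is the count that brick i holds against brick j in its
# change-log (trusted.afr.<vol>-client-j). A brick that blames itself is a
# "fool", a brick that blames nobody is "innocent", and a brick that blames
# only others is "wise". The heal source is picked among the wise bricks,
# with no regard for which brick actually lived the longest.

def classify(pending):
    roles = []
    for i, row in enumerate(pending):
        if row[i] > 0:
            roles.append("fool")        # blames itself
        elif all(count == 0 for count in row):
            roles.append("innocent")    # blames nobody
        else:
            roles.append("wise")        # blames others but not itself
    return roles

def pick_source(pending):
    # Return the index of the first wise brick, or None if there is none.
    # (Wise bricks accusing each other would be split-brain; ignored here.)
    wise = [i for i, role in enumerate(classify(pending)) if role == "wise"]
    return wise[0] if wise else None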

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.4.0.35.1u2rhs built on Oct 21 2013 14:00:58

How reproducible:
===================

Steps to Reproduce:
======================
1. On a 1 x 3 replicate volume, opened fds on 10 files and started writing data to all of them. Writes to the files were in progress the whole time (periodic writes).

2. Brought down brick1 (xfs_progs -> godown)

3. After some time, brought brick1 back online.

4. Before the self-heal could complete, brick2 crashed (xfs_progs -> godown).

At this point, the extended attributes of one of the files on brick2 were [ 265 0 0 ].

5. After some time, brick3 crashed (xfs_progs -> godown).

At this point, the extended attributes of the same file on brick3 were [ 266 140 1 ].

6. Brought brick2 and brick3 back online at the same time.

Actual results:
==================
Self-heal happened from brick2 to brick1 and brick3 on these files.

[2013-11-06 08:57:38.623632] I [afr-self-heal-common.c:2840:afr_log_self_heal_completion_status] 0-vol_rep-replicate-0:  foreground data self heal  is successfully completed,  from vol_rep-client-1 with 1077862400 1075886080 1077309440  sizes - Pending matrix:  [ [ 2 195 56 ] [ 265 0 0 ] [ 266 140 1 ] ] on <gfid:16e4f46a-f1bb-4ab8-a8eb-36479a83fc82>
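
Feeding the pending matrix from this log line into the sketch from the description (again an approximation, not the real AFR code) shows why vol_rep-client-1 (brick2) was picked as the source:

# Rows are the change-logs held by brick1, brick2 and brick3 respectively.
pending = [
    [2, 195, 56],    # brick1 blames itself (2)   -> fool
    [265, 0, 0],     # brick2 blames only brick1  -> the lone wise brick
    [266, 140, 1],   # brick3 blames itself (1)   -> fool
]
print(classify(pending))     # ['fool', 'wise', 'fool']
print(pick_source(pending))  # 1 -> brick2 becomes the heal source,
                             #      even though brick3 has the newest data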

Expected results:
==================
Self-heal should have happened from brick3 to brick1 and brick2.

Comment 2 Pranith Kumar K 2015-03-18 11:08:52 UTC
This will never happen with afr-v2. Just test it and close it.

Comment 3 RajeshReddy 2015-11-23 10:31:38 UTC
Tested with 3.1.2 (afrv2.0) and was not able to reproduce the reported problem. As per the developer, this is fixed as part of the v2 implementation, so marking this bug as verified.

Comment 4 Vivek Agarwal 2015-12-03 17:21:33 UTC
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release for which you requested a review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.

