Bug 1109692

Summary: Dist-geo-rep : snapshot with geo-rep, while creating files on master, resulted in failure to capture few files entry in changelogs.
Product: [Community] GlusterFS Reporter: Kotresh HR <khiremat>
Component: geo-replication Assignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: unspecified    
Version: mainline CC: aavati, bugs, csaba, david.macdonald, gluster-bugs, khiremat, nlevinki, ssamanta, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.6.0beta1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1109149 Environment:
Last Closed: 2014-11-11 08:35:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1109149    
Bug Blocks:    

Comment 1 Kotresh HR 2014-06-16 07:23:49 UTC
Description of problem: Taking a snapshot with geo-rep while files were being created on the master resulted in entries for a few files not being captured in the changelogs. In one case, the active replica has no changelog entry for the missing file, but the passive replica does.
File in question and its gfid:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#getfattr -n glusterfs.gfid.string  /mnt/master/thread0/level05/level15/539ab42c%%MFQLJBDHI3
getfattr: Removing leading '/' from absolute path names
# file: mnt/master/thread0/level05/level15/539ab42c%%MFQLJBDHI3
glusterfs.gfid.string="cc9ccc81-9af6-4ca4-8e21-4226f08543cf"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

On the active replica:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# find /bricks/ | grep "539ab42c%%MFQLJBDHI3"
/bricks/brick2/master_b7/thread0/level05/level15/539ab42c%%MFQLJBDHI3
[root@redcell ~]# grep "MFQLJBDHI3" /bricks/brick
brick0/ brick1/ brick2/ brick3/ 
[root@redcell ~]# grep "MFQLJBDHI3" /bricks/brick2/master_b7/.glusterfs/changelogs/*
[root@redcell ~]# 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
On the passive replica:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# find /bricks/ | grep "539ab42c%%MFQLJBDHI3"
/bricks/brick2/master_b8/thread0/level05/level15/539ab42c%%MFQLJBDHI3
[root@redeye ~]# grep "MFQLJBDHI3" /bricks/brick2/master_b8/.glusterfs/changelogs/*
Binary file /bricks/brick2/master_b8/.glusterfs/changelogs/CHANGELOG.1402647607 matches
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The debugging is explained in the additional info.
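
The grep check above can be repeated on every brick with a small script. The following is a minimal sketch, assuming only the `.glusterfs/changelogs/` layout visible in the transcript; the function name `changelogs_mentioning` is hypothetical and not part of GlusterFS.

```python
import os


def changelogs_mentioning(brick_root, needle):
    """Return changelog files under brick_root whose raw bytes contain needle."""
    changelog_dir = os.path.join(brick_root, ".glusterfs", "changelogs")
    if not os.path.isdir(changelog_dir):
        return []
    hits = []
    for name in sorted(os.listdir(changelog_dir)):
        path = os.path.join(changelog_dir, name)
        if not os.path.isfile(path):
            continue
        # grep reports the changelogs as binary files, so compare bytes
        # instead of decoding the content as text.
        with open(path, "rb") as fh:
            if needle.encode() in fh.read():
                hits.append(path)
    return hits
```

Calling `changelogs_mentioning("/bricks/brick2/master_b7", "MFQLJBDHI3")` on the active replica and the same with `master_b8` on the passive replica would correspond to the two grep commands shown above; an empty list means no changelog on that brick mentions the file.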

Version-Release number of selected component (if applicable): glusterfs-3.6.0.16-1.el6rhs


How reproducible: Does not happen every time.


Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Start creating data on the master using the command "crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=10K --min=1K /mnt/master".
3. While data is being created, pause geo-rep.
4. Create a snapshot of the slave.
5. Create a snapshot of the master.
6. Resume geo-rep.
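
Steps 2 to 6 can be written down as a command plan. This is only a sketch under assumptions: the volume names, the slave host name, the snapshot names `snap_slave` and `snap_master`, and the function `repro_plan` are all hypothetical, and step 1 (an existing, started geo-rep session) is taken as given. Nothing is executed; the script only prints the commands and where each would be run.

```python
import shlex


def repro_plan(master_vol, slave_host, slave_vol, mount="/mnt/master"):
    """Return (where, argv) pairs for steps 2-6; nothing is executed."""
    georep = ["gluster", "volume", "geo-replication",
              master_vol, f"{slave_host}::{slave_vol}"]
    return [
        # Step 2: must keep running in the background during steps 3-6.
        ("client", ["crefi", "-T", "10", "-n", "10", "--multi", "-d", "10",
                    "-b", "10", "--random", "--max=10K", "--min=1K", mount]),
        # Step 3: pause the geo-rep session.
        ("master", georep + ["pause"]),
        # Steps 4 and 5: snapshot names are placeholders.
        ("slave", ["gluster", "snapshot", "create", "snap_slave", slave_vol]),
        ("master", ["gluster", "snapshot", "create", "snap_master", master_vol]),
        # Step 6: resume the geo-rep session.
        ("master", georep + ["resume"]),
    ]


if __name__ == "__main__":
    for where, argv in repro_plan("master", "slavehost", "slave"):
        print(f"[{where}] {shlex.join(argv)}")
```

Since the problem does not reproduce every time, the plan may need to be repeated several times before a file goes missing from the changelogs.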

Actual results: A few of the files fail to get captured in the changelog.


Expected results: No file should be missing from the changelog.

Comment 2 Anand Avati 2014-06-16 07:25:04 UTC
REVIEW: http://review.gluster.org/8070 (features/changelog: Do not ignore self-heal fops in changelog) posted (#1) for review on master by Kotresh HR (khiremat)

Comment 3 Anand Avati 2014-06-16 11:23:50 UTC
COMMIT: http://review.gluster.org/8070 committed in master by Vijay Bellur (vbellur) 
------
commit 62265f40d7201854dbf33d59a74286dda671a129
Author: Kotresh H R <khiremat>
Date:   Mon Jun 16 12:30:39 2014 +0530

    features/changelog: Do not ignore self-heal fops in changelog
    
    Problem: Geo-rep fails to sync some files to slave as the
    changelog entries are missing for those files.
    
    Cause: Fops happened while the active brick was down and
    were self-healed later when it came up.
    
    Solution: Capture self-heal fops as well in changelog so
    those entries are not missed.
    
    Change-Id: Ibc288779421b5156dd1695e529aba0b602a530e0
    BUG: 1109692
    Signed-off-by: Kotresh H R <khiremat>
    Reviewed-on: http://review.gluster.org/8070
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
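
The cause and solution in this commit can be illustrated with a toy model. This is a sketch only, not the changelog translator's code: the `Brick` class, the `log_self_heal` flag and the `scenario` function are hypothetical names invented for the illustration.

```python
class Brick:
    """Toy brick holding a set of files and a list of changelog entries."""

    def __init__(self, name, log_self_heal):
        self.name = name
        self.log_self_heal = log_self_heal
        self.up = True
        self.files = set()
        self.changelog = []

    def create(self, fname, from_self_heal=False):
        self.files.add(fname)
        # Behaviour before the fix: fops replayed by self-heal are skipped.
        if from_self_heal and not self.log_self_heal:
            return
        self.changelog.append(("CREATE", fname))


def scenario(log_self_heal):
    active = Brick("active", log_self_heal)
    passive = Brick("passive", log_self_heal)
    active.up = False  # active brick goes down
    for brick in (active, passive):
        if brick.up:  # the client create reaches only the bricks that are up
            brick.create("539ab42c%%MFQLJBDHI3")
    active.up = True  # active brick comes back
    for fname in sorted(passive.files - active.files):
        active.create(fname, from_self_heal=True)  # self-heal replays the create
    return active, passive
```

With `scenario(False)` the file exists on the active brick but its changelog stays empty, which matches the symptom in the description; with `scenario(True)` the self-healed create is recorded on the active brick as well.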

Comment 4 Anand Avati 2014-06-28 07:07:56 UTC
REVIEW: http://review.gluster.org/8196 (feature/changelog: Fix for missing changelogs at backend.) posted (#1) for review on master by Kotresh HR (khiremat)

Comment 5 Anand Avati 2014-06-30 06:15:40 UTC
REVIEW: http://review.gluster.org/8196 (feature/changelog: Fix for missing changelogs at backend.) posted (#2) for review on master by Kotresh HR (khiremat)

Comment 6 Anand Avati 2014-06-30 11:25:57 UTC
COMMIT: http://review.gluster.org/8196 committed in master by Venky Shankar (vshankar) 
------
commit 2417de9c37d83e36567551dc682bb23f851fd2d7
Author: Kotresh H R <khiremat>
Date:   Sat Jun 28 12:18:52 2014 +0530

    feature/changelog: Fix for missing changelogs at backend.
    
    Problem:
           A few changelog files are missing at the backend
           during snapshot with changelog enabled.
    
    Cause:
           Race between actual rollover and explicit rollover.
    
           Changelog rollover can happen either as an actual
           rollover or as an explicit rollover due to snapshot. Actual
           rollover is controlled by a tunable called rollover-time.
           The minimum granularity for rollover-time is 1 second.
           Explicit rollover is asynchronous in nature and happens
           during snapshot.
    
           Basically, rollover renames the current CHANGELOG file
           to CHANGELOG.TIMESTAMP after rollover-time. Let's assume
           that, at time 't1', actual and explicit rollover raced
           against each other and actual rollover won the race,
           renaming the CHANGELOG file to CHANGELOG.t1 and opening
           a new CHANGELOG file. An immediate explicit rollover
           within the same second 't1' then renamed the new
           CHANGELOG file to CHANGELOG.t1 as well, purging the
           earlier CHANGELOG.t1 file created by actual rollover.
    
    Solution:
           Adding a delay of 1 sec guarantees unique CHANGELOG.TIMESTAMP
           during explicit rollover.
    
    Thanks, Venky, for all the help in root-causing the issue.
    
    Change-Id: I8958824e107e16f61be9f09a11d95f8645ecf34d
    BUG: 1109692
    Signed-off-by: Kotresh H R <khiremat>
    Reviewed-on: http://review.gluster.org/8196
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
    Tested-by: Venky Shankar <vshankar>
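
The race described in this commit can be reproduced in miniature. The following is a sketch under assumptions, not the changelog translator's code: the helper names are hypothetical, timestamps are passed in explicitly instead of being read from the clock, and the 1-second delay from the fix is modelled as a timestamp offset rather than an actual sleep.

```python
import os
import tempfile


def append(changelog_dir, record):
    """Append a record to the live CHANGELOG file."""
    with open(os.path.join(changelog_dir, "CHANGELOG"), "ab") as fh:
        fh.write(record)


def rollover(changelog_dir, timestamp):
    """Rename CHANGELOG to CHANGELOG.<timestamp> and open a fresh CHANGELOG."""
    current = os.path.join(changelog_dir, "CHANGELOG")
    rolled = os.path.join(changelog_dir, f"CHANGELOG.{timestamp}")
    os.replace(current, rolled)  # silently overwrites an existing target
    open(current, "wb").close()


def simulate(explicit_delay):
    """Run an actual rollover at t1, then an explicit one at t1 + delay."""
    t1 = 1402647607
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "CHANGELOG"), "wb").close()
        append(d, b"create A\n")
        rollover(d, t1)                   # actual rollover wins the race
        append(d, b"create B\n")
        rollover(d, t1 + explicit_delay)  # explicit rollover from snapshot
        result = {}
        for name in sorted(os.listdir(d)):
            with open(os.path.join(d, name), "rb") as fh:
                result[name] = fh.read()
        return result
```

`simulate(0)` leaves a single `CHANGELOG.1402647607` containing only `create B`, so the record for A is lost; `simulate(1)` leaves both `CHANGELOG.1402647607` and `CHANGELOG.1402647608`, each with its own record.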

Comment 7 Niels de Vos 2014-09-22 12:42:56 UTC
A beta release for GlusterFS 3.6.0 has been made available [1]. Please verify whether the release solves this bug report for you. If the glusterfs-3.6.0beta1 release does not resolve this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure (possibly an "updates-testing" repository) for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/

Comment 8 Niels de Vos 2014-11-11 08:35:07 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users