Bug 1239044

Summary: [geo-rep]: killing brick from replica pair makes geo-rep session faulty with Traceback "ChangelogException"
Product: [Community] GlusterFS
Reporter: Kotresh HR <khiremat>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Version: mainline
CC: bugs, chrisw, csaba, gluster-bugs, nlevinki, nsathyan, rcyriac, rhinduja
Keywords: Reopened, ZStream
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Type: Bug
Clone Of: 1236546
Clones: 1247882 (view as bug list)
Last Closed: 2016-06-16 13:19:46 UTC
Bug Depends On: 1236546
Bug Blocks: 1236554, 1247882

Description Kotresh HR 2015-07-03 11:01:33 UTC
+++ This bug was initially created as a clone of Bug #1236546 +++

Description of problem:
=======================
Even when NTP is configured, the systems are in sync, and they are in the same timezone, killing the active bricks makes the passive brick faulty too, with the history crawl failing.

[2015-07-01 15:31:06.146286] I [master(/rhs/brick1/b1):1123:crawl] _GMaster: starting history crawl... turns: 1, stime: (1435744752, 0)
[2015-07-01 15:31:06.147336] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-07-01 15:31:06.149779]

It fails the first time and succeeds later.

Version-Release number of selected component (if applicable):
=============================================================

mainline


How reproducible:
=================

Always


Steps Carried:
==============

1. Create Master and Slave Cluster
2. Create and Start Master volume (4x2) from four nodes (node1..node4)
3. Create and Start slave volume (2x2)
4. Create Meta volume (1x3) (node1..node3)
5. Create geo-rep session between master and slave volume
6. Set the config use_meta_volume to true
7. Start the geo-rep session
8. Mount the volume on Fuse
9. Start creating data from fuse client
10. While data creation is in progress, kill a few active bricks (kill -9 pid), making sure that the corresponding replica brick is UP
11. Check the geo-rep status and log.

Comment 1 Kotresh HR 2015-07-03 11:02:44 UTC
I found the reason for the first-time failure. The register time is the end time we pass to the history API. The PASSIVE worker registers much earlier, along with the ACTIVE worker, and the start time it passes is the stime, i.e. register time < stime.

For the history API this means start time > end time, which obviously fails.

When it registers for the second time, register time > stime, and hence it passes.

There are no side effects with respect to DATA sync; it is just the worker going down and coming back. We will fix this, but it is definitely not a BLOCKER.
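
To illustrate the timing issue (a minimal sketch with hypothetical names, not the actual syncdaemon code): the history call effectively gets the brick's stime as the start and the worker's register time as the end, so a register time that predates the stime produces an invalid window and the call raises, while a later retry with a fresh register time succeeds.

import time

def history(start, end):
    # Stand-in for the libgfchangelog history API: a window whose end
    # precedes its start cannot yield any changelogs.
    if end < start:
        raise OSError(2, "No such file or directory")  # surfaces as ChangelogException
    return "changelogs in [%d, %d]" % (start, end)

stime = 1435744752               # last synced time recorded for the brick
register_time = stime - 300      # PASSIVE worker registered before this stime was reached

try:
    history(start=stime, end=register_time)    # first attempt: end < start -> fails
except OSError as e:
    print("history failed: %s" % e)

register_time = int(time.time())                # worker restarts and re-registers later
print(history(start=stime, end=register_time))  # now register_time > stime -> succeeds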

Comment 2 Anand Avati 2015-07-03 11:28:02 UTC
REVIEW: http://review.gluster.org/11524 (geo-rep: Fix history failure) posted (#1) for review on master by Kotresh HR (khiremat)

Comment 3 Anand Avati 2015-07-24 05:44:02 UTC
REVIEW: http://review.gluster.org/11524 (geo-rep: Fix history failure) posted (#2) for review on master by Kotresh HR (khiremat)

Comment 4 Anand Avati 2015-07-29 07:00:52 UTC
COMMIT: http://review.gluster.org/11524 committed in master by Venky Shankar (vshankar) 
------
commit 62c2e7f8b9211ba149368d26f772f175fe51b43b
Author: Kotresh HR <khiremat>
Date:   Fri Jul 3 16:32:56 2015 +0530

    geo-rep: Fix history failure
    
    Both ACTIVE and PASSIVE workers register to changelog
    at almost same time. When PASSIVE worker becomes ACTIVE,
    the start and end time would be current stime and register_time
    respectively for history API. Hence register_time would be less
    than stime, for which history obviously fails. But it will
    be successful for the next restart as new register_time > stime.
    
    Fix is to pass current time as the end time to history call
    instead of the register_time.
    
    Also improved the logging for ACTIVE/PASSIVE switching.
    
    Change-Id: Idc08b4b55c7a4c575ba44918a98389164ccbee8f
    BUG: 1239044
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/11524
    Tested-by: Gluster Build System <jenkins.com>
    Tested-by: NetBSD Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
    Reviewed-by: Venky Shankar <vshankar>
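
Conceptually, the fix boils down to something like the following (a sketch with a hypothetical helper name, not the actual patch): the end time handed to the history API becomes "now" rather than the worker's register time, so it can never fall behind the stime recorded for the brick.

import time

def history_end_time():
    # Before the fix the worker passed its register time as the history end
    # time, which can be smaller than the stime when a PASSIVE worker that
    # registered early turns ACTIVE. After the fix it passes the current
    # time, which is always >= the brick's stime.
    return int(time.time())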

Comment 5 Nagaprasad Sathyanarayana 2015-10-25 15:11:29 UTC
The fix for this BZ is already present in a GlusterFS release. You can find a clone of this BZ, fixed in a GlusterFS release and closed. Hence closing this mainline BZ as well.

Comment 6 Niels de Vos 2016-06-16 13:19:46 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user