Bug 1415053

Summary: geo-rep session faulty with ChangelogException "No such file or directory"
Product: [Community] GlusterFS
Reporter: Kotresh HR <khiremat>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Version: 3.8
CC: avishwan, bugs, csaba, khiremat, olim, pdhange, rcyriac, rhinduja, rhs-bugs, storage-qa-internal
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.8.9
Clone Of: 1413967
Last Closed: 2017-02-20 12:34:24 UTC
Type: Bug
Bug Depends On: 1412883, 1413967
Bug Blocks: 1415065

Description Kotresh HR 2017-01-20 06:15:39 UTC
+++ This bug was initially created as a clone of Bug #1413967 +++

+++ This bug was initially created as a clone of Bug #1412883 +++


Description:

The geo-replication sessions for most of the volumes are going Faulty. Most of the faulty sessions show the changelog exception below:

[2017-01-13 01:16:27.825808] I [master(/rhs/master/prd/soa/shared01/brick):519:crawlwrap] _GMaster: crawl interval: 1 seconds
[2017-01-13 01:16:27.834862] I [master(/rhs/master/prd/soa/shared01/brick):1163:crawl] _GMaster: starting history crawl... turns: 1, stime: (1484261733, 0), etime: 1484270187
[2017-01-13 01:16:27.836390] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2017-01-13 01:16:27.837673] E [repce(/rhs/master/prd/soa/shared01/brick):207:__call__] RepceClient: call 8225:140388583266112:1484270187.84 (history) failed on peer with ChangelogException
[2017-01-13 01:16:27.837953] E [resource(/rhs/master/prd/soa/shared01/brick):1506:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 2] No such file or directory

All bricks were online on both the master and the slave volume.
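For reference, the traceback above bottoms out in libgfchangelog.py, which translates the errno reported by the changelog shared library into a Python-level exception. A minimal sketch of that pattern (simplified; not the full syncdaemon module):

```python
import os


class ChangelogException(OSError):
    """Raised when the libgfchangelog C API reports a failure."""


def raise_changelog_err(errn):
    # Wrap the errno from the shared library in a Python exception,
    # as libgfchangelog.py does; errno 2 yields the
    # "No such file or directory" seen in the geo-rep log.
    raise ChangelogException(errn, os.strerror(errn))
```

The repce agent then propagates this exception back to the worker, which is why the same error surfaces in both the agent and the brick worker logs.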

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Rarely

Steps to Reproduce:
1. Create Master and Slave Cluster
2. Create 1x2 volume on master and slave cluster
3. Create geo-rep session between master and slave volume
4. Start the geo-rep session
5. Mount the volume over Fuse
6. Check the geo-rep status and log.

Actual results:
Workers hit the changelog exception and the nodes go Faulty.

Expected results:
The geo-replication session should be in Active/Passive state and the workers should not hit the exception.

--- Additional comment from Kotresh HR on 2017-01-17 08:17:05 EST ---

Patch posted:
http://review.gluster.org/#/c/16420/

--- Additional comment from Worker Ant on 2017-01-19 04:39:46 EST ---

COMMIT: http://review.gluster.org/16420 committed in master by Aravinda VK (avishwan) 
------
commit 6f4811ca9331eee8c00861446f74ebe23626bbf8
Author: Kotresh HR <khiremat>
Date:   Tue Jan 17 06:39:25 2017 -0500

    features/changelog: Fix htime xattr during brick crash
    
    The htime file contains the paths of all the changelogs
    rolled over so far. It also maintains an xattr which
    tracks the latest changelog file rolled over and the
    number of changelogs. The path and xattr updates happen
    in two different system calls. If the brick crashes
    between them, the xattr value becomes stale and can lead
    to the failure of gf_history_changelog. To detect this,
    the total number of changelogs is calculated from the
    htime file size and the record length, and that value is
    used in case of a mismatch.
    
    Change-Id: Ia1c3efcfda7b74227805bb2eb933c9bd4305000b
    BUG: 1413967
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/16420
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>

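The fix described in the commit message can be illustrated with a small sketch: derive the changelog count from the htime file size and a fixed record length, and prefer that value when it disagrees with the possibly stale xattr. The record length and function names here are hypothetical; the actual implementation lives in the changelog translator in C.

```python
import os

# Hypothetical fixed record length. The htime file stores fixed-size
# changelog path records, so file_size // record_len gives the count.
RECORD_LEN = 37


def count_from_htime(htime_path, record_len=RECORD_LEN):
    """Number of changelog records implied by the htime file size."""
    return os.path.getsize(htime_path) // record_len


def effective_count(xattr_count, htime_path, record_len=RECORD_LEN):
    """Cross-check the xattr count against the file-derived count.

    If the brick crashed between appending a changelog path and
    updating the xattr, the xattr is stale; on a mismatch, trust
    the value derived from the file itself."""
    computed = count_from_htime(htime_path, record_len)
    return computed if computed != xattr_count else xattr_count
```

Under this scheme, gf_history_changelog no longer fails when the xattr undercounts the records actually present in the htime file.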
Comment 1 Worker Ant 2017-01-20 06:57:57 UTC
REVIEW: http://review.gluster.org/16437 (features/changelog: Fix htime xattr during brick crash) posted (#1) for review on release-3.8 by Kotresh HR (khiremat)

Comment 2 Worker Ant 2017-01-31 09:31:18 UTC
COMMIT: https://review.gluster.org/16437 committed in release-3.8 by Aravinda VK (avishwan) 
------
commit c2d854e4f6d8735a5669a4da90b7042418898bc7
Author: Kotresh HR <khiremat>
Date:   Tue Jan 17 06:39:25 2017 -0500

    features/changelog: Fix htime xattr during brick crash
    
    The htime file contains the paths of all the changelogs
    rolled over so far. It also maintains an xattr which
    tracks the latest changelog file rolled over and the
    number of changelogs. The path and xattr updates happen
    in two different system calls. If the brick crashes
    between them, the xattr value becomes stale and can lead
    to the failure of gf_history_changelog. To detect this,
    the total number of changelogs is calculated from the
    htime file size and the record length, and that value is
    used in case of a mismatch.
    
    > Change-Id: Ia1c3efcfda7b74227805bb2eb933c9bd4305000b
    > BUG: 1413967
    > Signed-off-by: Kotresh HR <khiremat>
    > Reviewed-on: http://review.gluster.org/16420
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Aravinda VK <avishwan>
    
    Change-Id: Ia1c3efcfda7b74227805bb2eb933c9bd4305000b
    BUG: 1415053
    Signed-off-by: Kotresh HR <khiremat>
    (cherry picked from commit 6f4811ca9331eee8c00861446f74ebe23626bbf8)
    Reviewed-on: https://review.gluster.org/16437
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>

Comment 3 Niels de Vos 2017-02-20 12:34:24 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.9, please open a new bug report.

glusterfs-3.8.9 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2017-February/000066.html
[2] https://www.gluster.org/pipermail/gluster-users/