Bug 1225543

Summary: [geo-rep]: snapshot creation times out even if geo-replication is in pause/stop/delete state
Product: [Community] GlusterFS
Reporter: Aravinda VK <avishwan>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.7.0
CC: aavati, avishwan, bugs, csaba, gluster-bugs, khiremat, nlevinki, rhinduja, storage-qa-internal
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1225542
Environment:
Last Closed: 2015-06-02 06:20:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1225542
Bug Blocks: 1219955, 1223636, 1225338

Description Aravinda VK 2015-05-27 16:14:42 UTC
+++ This bug was initially created as a clone of Bug #1225542 +++

+++ This bug was initially created as a clone of Bug #1225338 +++

Description of problem:
=======================

From a use-case point of view: created a geo-rep session, paused it, and tried to create a snapshot. The snapshot hangs and times out after the 2-minute CLI/barrier timeout.

The problem is with the changelog.changelog option being on. Tried the following on a cleaned-up system:

1. Create a volume
2. Set changelog.changelog to on
3. Create a snapshot; it times out as follows:

[root@georep1 scripts]# gluster snapshot create snapa master
Error : Request timed out
Snapshot command failed
[root@georep1 scripts]# 

Brick Log snippet:
===================

[2015-05-26 17:34:59.595211] I [changelog.c:2043:notify] 0-master-changelog: Barrier on notification
[2015-05-26 17:34:59.595394] I [changelog-helpers.c:838:changelog_snap_logging_start] 0-master-changelog: Now starting to log in call path
[2015-05-26 17:34:59.595410] E [changelog.c:2064:notify] 0-master-changelog: Received another barrier on notification when last one is not served yet
[2015-05-26 17:34:59.595434] I [socket.c:3432:socket_submit_reply] 0-socket.glusterfsd: not connected (priv->connected = -1)
[2015-05-26 17:34:59.595464] E [rpcsvc.c:1312:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: Gluster Brick operations, ProgVers: 2, Proc: 10) to rpc-transport (socket.glusterfsd)
[2015-05-26 17:34:59.595480] E [glusterfsd-mgmt.c:149:glusterfs_submit_reply] 0-glusterfs: Reply submission failed
[2015-05-26 17:34:59.595501] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2015-05-26 17:34:59.596373] E [socket.c:3421:socket_submit_reply] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x31c5624fb0] (--> /usr/lib64/glusterfs/3.7.0/rpc-transport/socket.so(+0x6f2f)[0x7fa8cdec7f2f] (--> /usr/lib64/libgfrpc.so.0(rpcsvc_transport_submit+0x76)[0x31c5a089a6] (--> /usr/lib64/libgfrpc.so.0(rpcsvc_submit_generic+0x1c8)[0x31c5a091f8] (--> /usr/lib64/libgfrpc.so.0(rpcsvc_error_reply+0x66)[0x31c5a09726] ))))) 0-socket: invalid argument: this->private
[2015-05-26 17:34:59.596394] E [rpcsvc.c:1312:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: Gluster Brick operations, ProgVers: 2, Proc: 10) to rpc-transport (socket.glusterfsd)
[2015-05-26 17:34:59.596568] C [mem-pool.c:560:mem_put] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x31c5624fb0] (--> /usr/lib64/libglusterfs.so.0(mem_put+0x105)[0x31c5655895] (--> /usr/lib64/libgfrpc.so.0(rpcsvc_submit_generic+0x256)[0x31c5a09286] (--> /usr/lib64/libgfrpc.so.0(rpcsvc_error_reply+0x66)[0x31c5a09726] (--> /usr/lib64/libgfrpc.so.0(rpcsvc_check_and_reply_error+0x6b)[0x31c5a0979b] ))))) 0-mem-pool: mem_put called on freed ptr 0x6d2d84 of mem pool 0x6d1610
[2015-05-26 17:34:59.597962] W [rpcsvc.c:571:rpcsvc_check_and_reply_error] 0-rpcsvc: failed to queue error reply
[2015-05-26 17:34:59.598024] E [barrier.c:522:notify] 0-master-barrier: Already enabled
[2015-05-26 17:34:59.598381] I [changelog.c:1989:notify] 0-master-changelog: Barrier off notification
[2015-05-26 17:34:59.598688] I [changelog-helpers.c:860:changelog_snap_logging_stop] 0-master-changelog: Stopped to log in call path
[2015-05-26 17:34:59.598713] E [changelog.c:2030:notify] 0-master-changelog: Changelog barrier already disabled
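
To pick the error and critical entries out of a brick log like the snippet above, a small parser can help. The following is a minimal sketch, assuming every entry follows the `[timestamp] LEVEL [file:line:function] message` layout seen in the snippet; the names `LogEntry`, `parse_line` and `errors_only` are hypothetical helpers and not part of GlusterFS.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Layout assumed from the snippet:
# [timestamp] LEVEL [file:line:function] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<src>[^:\]]+):(?P<lineno>\d+):(?P<func>[^\]]+)\] (?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    source: str
    lineno: int
    function: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the layout."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["ts"], m["level"], m["src"],
                    int(m["lineno"]), m["func"], m["msg"])


def errors_only(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only entries logged at level E (error) or C (critical)."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.level in ("E", "C")]


# Three lines copied from the brick log snippet in this report.
SAMPLE = [
    "[2015-05-26 17:34:59.595211] I [changelog.c:2043:notify] "
    "0-master-changelog: Barrier on notification",
    "[2015-05-26 17:34:59.595410] E [changelog.c:2064:notify] "
    "0-master-changelog: Received another barrier on notification "
    "when last one is not served yet",
    "[2015-05-26 17:34:59.598024] E [barrier.c:522:notify] "
    "0-master-barrier: Already enabled",
]

if __name__ == "__main__":
    for entry in errors_only(SAMPLE):
        print(entry.timestamp, entry.source, entry.lineno, entry.message)
```

Run against the full snippet, this would leave only the E and C lines, which makes the repeated barrier notification and the failed reply submission easier to spot.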



Version-Release number of selected component (if applicable):
=============================================================



How reproducible:
=================

always


Steps to Reproduce:

Way1:
=====
1. Create master and slave volume
2. Create geo-replication between them
3. Start and Pause the geo-rep session
4. Try to create the snapshot. It fails

Way2:
=====
1. Create a volume
2. Set the volume option changelog.changelog on
3. Try to create the snapshot. It fails
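
The Way2 steps can be scripted roughly as follows. This is only a sketch: the brick path is a made-up placeholder, the explicit `volume start` step is an assumption not listed in the steps above, and `way2_commands` and `run` are hypothetical helper names. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def way2_commands(volume: str, brick: str, snapshot: str) -> List[List[str]]:
    """Build the gluster CLI invocations for the Way2 reproduction."""
    return [
        ["gluster", "volume", "create", volume, brick],
        # Assumption: the volume is started before the snapshot is taken.
        ["gluster", "volume", "start", volume],
        ["gluster", "volume", "set", volume, "changelog.changelog", "on"],
        ["gluster", "snapshot", "create", snapshot, volume],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("+", " ".join(argv))
        if not dry_run:
            # Raises CalledProcessError if a command exits non-zero.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # "georep1:/bricks/master_b1" is a hypothetical brick path.
    run(way2_commands("master", "georep1:/bricks/master_b1", "snapa"))
```

On an affected build, the last command is the one expected to end with "Error : Request timed out".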

Actual results:
===============

Snapshot creation fails with timeout


Expected results:
=================

Snapshot creation should succeed


Additional info:
================

Comment 1 Anand Avati 2015-05-31 14:51:51 UTC
REVIEW: http://review.gluster.org/10988 (featuress/changelog: On snapshot, notify irrespective of failures) posted (#2) for review on release-3.7 by Venky Shankar (vshankar)

Comment 2 Anand Avati 2015-05-31 17:41:28 UTC
REVIEW: http://review.gluster.org/10988 (featuress/changelog: On snapshot, notify irrespective of failures) posted (#3) for review on release-3.7 by Venky Shankar (vshankar)

Comment 3 Anand Avati 2015-06-01 03:21:47 UTC
COMMIT: http://review.gluster.org/10988 committed in release-3.7 by Venky Shankar (vshankar) 
------
commit 8a1e0e2d535f42bf76384d81a2e6dbd0364adea5
Author: Kotresh HR <khiremat>
Date:   Wed May 27 16:27:25 2015 +0530

    featuress/changelog: On snapshot, notify irrespective of failures
    
    During snapshot, the changelog barrier is enabled and an
    explicit rollover of the changelog is initiated. During
    rollover of the changelog, if any error occurred or the
    changelog was empty, the notification was not sent to
    reconfigure and hence the snapshot failed with a timeout.
    This patch addresses it by sending the notification
    irrespective of failures and sending the error, if any,
    back to barrier.
    
    BUG: 1225543
    Change-Id: I971ef3bdc63bb50bda0b655e55cd814e44254ba9
    Reviewed-On: http://review.gluster.org/10951
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/10988
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
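
The idea described in the commit message, notifying the waiter whether or not the rollover succeeded, can be illustrated with a toy model. This is not the GlusterFS code (the real fix is in the C changelog translator); it is a sketch using Python threads, and `BarrierNotifier`, `rollover`, `snapshot` and `failing_rollover` are hypothetical names. An empty changelog is modelled as an exception purely for simplicity.

```python
import threading


class BarrierNotifier:
    """Toy stand-in for the changelog-to-barrier notification."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False
        self._error = None

    def notify(self, error=None):
        """Wake the waiter and hand over the error, if any."""
        with self._cond:
            self._done = True
            self._error = error
            self._cond.notify_all()

    def wait(self, timeout):
        """Return (notified, error); notified is False on timeout."""
        with self._cond:
            notified = self._cond.wait_for(lambda: self._done, timeout)
            return notified, self._error


def rollover(notifier, do_rollover, notify_on_failure):
    try:
        do_rollover()
    except Exception as exc:
        if notify_on_failure:
            # Patched behaviour: report the failure to the waiter.
            notifier.notify(error=exc)
        # Old behaviour: no notification, so the waiter times out.
        return
    notifier.notify()


def snapshot(do_rollover, notify_on_failure, timeout=0.2):
    """Start the rollover in a worker thread and wait for its notification."""
    notifier = BarrierNotifier()
    worker = threading.Thread(
        target=rollover, args=(notifier, do_rollover, notify_on_failure)
    )
    worker.start()
    notified, error = notifier.wait(timeout)
    worker.join()
    if not notified:
        return "timed out"
    return "failed: %s" % error if error else "ok"


def failing_rollover():
    raise OSError("changelog is empty")


if __name__ == "__main__":
    print(snapshot(failing_rollover, notify_on_failure=False))
    print(snapshot(failing_rollover, notify_on_failure=True))
```

With `notify_on_failure=False` the waiter only returns once the timeout expires, which mirrors the reported symptom; with `notify_on_failure=True` it returns promptly and carries the error back to the caller.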