Bug 1445213 - Unable to take snapshot on a geo-replicated volume, even after stopping the session
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: snapshot
Version: 3.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On: 1416024 1443977
Blocks: glusterfs-3.8.12 1445209
 
Reported: 2017-04-25 09:23 UTC by Kotresh HR
Modified: 2017-05-29 04:59 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.8.12
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1443977
Environment:
Last Closed: 2017-05-29 04:59:32 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Kotresh HR 2017-04-25 09:24:28 UTC
Description of problem:
========================
Had two 4-node clusters, one acting as master and the other as slave. Both were part of RHGS-Console. Two geo-rep sessions had been created on a 3.7.9-12 build. Upgraded the RHGS bits to 3.8.4-12 by following the procedure mentioned in the guide.

Tried to take a snapshot of the master volume, and it failed with: 'the geo-rep session is running. Please stop before taking a snapshot.' Stopped the geo-rep session and tried to take a snapshot again. It failed with the same error, reporting a running geo-rep session, even though the session was stopped.

Found a way to reproduce it consistently:

1. Have a geo-rep session in 'started' state between 'master' and 'slave' volumes
2. Restart glusterd on one of the master nodes
3. Stop the session between 'master' and 'slave' volumes
4. Take a snapshot on 'master'

Expected result: Snapshot creation should succeed.
Actual result: Snapshot creation fails with the error - 'found a running geo-rep session'
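
A minimal command sequence for the steps above; the volume name 'master', the slave 'slavehost::slavevol', and the snapshot name are placeholders, not the exact names from this setup:

# 1. Geo-rep session in 'Started' state between 'master' and 'slavevol'
gluster volume geo-replication master slavehost::slavevol start

# 2. Restart glusterd on one of the master nodes
systemctl restart glusterd

# 3. Stop the geo-rep session
gluster volume geo-replication master slavehost::slavevol stop

# 4. Take a snapshot on 'master' (fails with 'geo-replication session is running' before the fix)
gluster snapshot create master_snap master no-timestamp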

Version-Release number of selected component (if applicable):
============================================================
mainline

How reproducible:
================
Seeing it on 2 of my geo-rep sessions.


Additional info:
=================
[root@dhcp47-26 ~]# gluster v geo-rep status
 
MASTER NODE                         MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                                  SLAVE NODE                           STATUS     CRAWL STATUS       LAST_SYNCED                  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26                         masterB       /bricks/brick1/masterB_1    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB        dhcp35-100.lab.eng.blr.redhat.com    Active     Changelog Crawl    2017-01-12 11:56:35          
10.70.47.26                         masterD       /bricks/brick0/masterD_2    us2           ssh://us2.eng.blr.redhat.com::slaveD    10.70.35.101                         Active     Changelog Crawl    2017-01-24 11:21:10          
10.70.47.26                         mm            /bricks/brick0/mm2          geo           ssh://geo.35.115::ss                             10.70.35.101                         Active     Changelog Crawl    2017-01-17 11:21:46          
10.70.47.60                         masterB       /bricks/brick1/masterB_3    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB        10.70.35.101                         Active     Changelog Crawl    2017-01-12 11:56:43          
10.70.47.60                         masterD       /bricks/brick0/masterD_0    us2           ssh://us2.eng.blr.redhat.com::slaveD    10.70.35.115                         Active     Changelog Crawl    2017-01-24 11:21:14          
10.70.47.60                         mm            /bricks/brick0/mm0          geo           ssh://geo.35.115::ss                             10.70.35.115                         Active     Changelog Crawl    2017-01-17 11:21:33          
dhcp47-27.lab.eng.blr.redhat.com    masterB       /bricks/brick1/masterB_0    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB        10.70.35.115                         Active     Changelog Crawl    2017-01-12 11:56:35          
10.70.47.27                         masterD       /bricks/brick0/masterD_3    us2           ssh://us2.eng.blr.redhat.com::slaveD    10.70.35.100                         Passive    N/A                N/A                          
10.70.47.27                         mm            /bricks/brick0/mm3          geo           ssh://geo.35.115::ss                             10.70.35.100                         Passive    N/A                N/A                          
10.70.47.61                         masterB       /bricks/brick1/masterB_2    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB        10.70.35.104                         Active     Changelog Crawl    2017-01-12 11:56:35          
10.70.47.61                         masterD       /bricks/brick0/masterD_1    us2           ssh://us2.eng.blr.redhat.com::slaveD    10.70.35.104                         Passive    N/A                N/A                          
10.70.47.61                         mm            /bricks/brick0/mm1          geo           ssh://geo.35.115::ss                             10.70.35.104                         Passive    N/A                N/A                          
[root@dhcp47-26 ~]# gluster v geo-rep masterB dhcp35-100.lab.eng.blr.redhat.com::slaveB status
 
MASTER NODE                         MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                        SLAVE NODE                           STATUS    CRAWL STATUS       LAST_SYNCED                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26                         masterB       /bricks/brick1/masterB_1    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    dhcp35-100.lab.eng.blr.redhat.com    Active    Changelog Crawl    2017-01-12 11:56:35          
10.70.47.61                         masterB       /bricks/brick1/masterB_2    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.104                         Active    Changelog Crawl    2017-01-12 11:56:35          
dhcp47-27.lab.eng.blr.redhat.com    masterB       /bricks/brick1/masterB_0    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.115                         Active    Changelog Crawl    2017-01-12 11:56:35          
10.70.47.60                         masterB       /bricks/brick1/masterB_3    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.101                         Active    Changelog Crawl    2017-01-12 11:56:43          
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# gluster v geo-rep masterB dhcp35-100.lab.eng.blr.redhat.com::slaveB status
 
MASTER NODE                         MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                        SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED          
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26                         masterB       /bricks/brick1/masterB_1    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A                  
10.70.47.60                         masterB       /bricks/brick1/masterB_3    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A                  
10.70.47.61                         masterB       /bricks/brick1/masterB_2    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A                  
dhcp47-27.lab.eng.blr.redhat.com    masterB       /bricks/brick1/masterB_0    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A                  
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# gluster snap create masterB_snap1 
Invalid Syntax.
Usage: snapshot create <snapname> <volname> [no-timestamp] [description <description>] [force]
[root@dhcp47-26 ~]# gluster snap create masterB_snap1 masterB no-timestamp
snapshot create: failed: geo-replication session is running for the volume masterB. Session needs to be stopped before taking a snapshot.
Snapshot command failed
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# gluster v geo-rep mm geo.35.115::ss status
 
MASTER NODE    MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                   SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED                  
---------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26    mm            /bricks/brick0/mm2    geo           geo.35.115::ss    10.70.35.101    Active     Changelog Crawl    2017-01-17 11:21:46          
10.70.47.27    mm            /bricks/brick0/mm3    geo           geo.35.115::ss    10.70.35.100    Passive    N/A                N/A                          
10.70.47.60    mm            /bricks/brick0/mm0    geo           geo.35.115::ss    10.70.35.115    Active     Changelog Crawl    2017-01-17 11:21:33          
10.70.47.61    mm            /bricks/brick0/mm1    geo           geo.35.115::ss    10.70.35.104    Passive    N/A                N/A                          
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# gluster v geo-rep mm geo.35.115::ss stop
Stopping geo-replication session between mm & geo.35.115::ss has been successful
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# gluster v geo-rep mm geo.35.115::ss status
 
MASTER NODE    MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                   SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED          
--------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26    mm            /bricks/brick0/mm2    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A                  
10.70.47.61    mm            /bricks/brick0/mm1    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A                  
10.70.47.27    mm            /bricks/brick0/mm3    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A                  
10.70.47.60    mm            /bricks/brick0/mm0    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A                  
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# gluster snap create mm_snap mm
snapshot create: failed: geo-replication session is running for the volume mm. Session needs to be stopped before taking a snapshot.
Snapshot command failed
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# vim /var/log/glusterfs/geo-replication/mm/ssh%3A%2F%2Fgeo%4010.70.35.115%3Agluster%3A%2F%2F127.0.0.1%3Ass.log
[root@dhcp47-26 ~]# 
[root@dhcp47-26 ~]# 
[root@dhcp47-60 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp47-27.lab.eng.blr.redhat.com
Uuid: 6eb0185c-cc76-4bd1-a691-2ecb6a652901
State: Peer in Cluster (Connected)

Hostname: 10.70.47.61
Uuid: 3f350e37-69aa-4fc3-b9af-70c4db688721
State: Peer in Cluster (Connected)

Hostname: 10.70.47.26
Uuid: 53883823-cb8e-4da1-b6ee-a53e0ef7cd9a
State: Peer in Cluster (Connected)
[root@dhcp47-60 ~]# 
[root@dhcp47-60 ~]# 
[root@dhcp47-60 ~]# rpm -qa | grep gluster
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-api-3.8.4-12.el7rhgs.x86_64
glusterfs-libs-3.8.4-12.el7rhgs.x86_64
python-gluster-3.8.4-12.el7rhgs.noarch
glusterfs-3.8.4-12.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-12.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-3.8.4-12.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-12.el7rhgs.x86_64
glusterfs-server-3.8.4-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-12.el7rhgs.x86_64
[root@dhcp47-60 ~]# 
[root@dhcp47-60 ~]# 
[root@dhcp47-60 ~]# gluster v list
gluster_shared_storage
masterA
masterB
masterD
mm
[root@dhcp47-60 ~]# 
[root@dhcp47-60 ~]# 
[root@dhcp47-60 ~]# gluster v info mm
 
Volume Name: mm
Type: Distributed-Replicate
Volume ID: 4c435eff-24de-4030-a8dc-769bbaf292a4
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.60:/bricks/brick0/mm0
Brick2: 10.70.47.61:/bricks/brick0/mm1
Brick3: 10.70.47.26:/bricks/brick0/mm2
Brick4: 10.70.47.27:/bricks/brick0/mm3
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
performance.readdir-ahead: on
nfs.disable: off
transport.address-family: inet
cluster.enable-shared-storage: enable
[root@dhcp47-60 ~]# 
[root@dhcp47-60 ~]#

Comment 2 Worker Ant 2017-04-25 09:26:19 UTC
REVIEW: https://review.gluster.org/17109 (glusterd/geo-rep: Fix snapshot create in geo-rep setup) posted (#1) for review on release-3.8 by Kotresh HR (khiremat)

Comment 3 Worker Ant 2017-04-28 09:39:23 UTC
COMMIT: https://review.gluster.org/17109 committed in release-3.8 by Aravinda VK (avishwan) 
------
commit 57b481a071c13078c603cf2d96f9a04b9ebc39b4
Author: Kotresh HR <khiremat>
Date:   Thu Apr 20 07:18:52 2017 -0400

    glusterd/geo-rep: Fix snapshot create in geo-rep setup
    
    glusterd persists geo-rep sessions in the glusterd
    info file, which is represented in memory by the
    dictionary 'volinfo->gsync_slaves'. glusterd also
    maintains the active geo-rep sessions in the in-memory
    dictionary 'volinfo->gsync_active_slaves', whose key
    is "<slave_url>::<slavehost>".
    
    When glusterd is restarted while geo-rep sessions
    are active, it rebuilds 'volinfo->gsync_active_slaves'
    from the persisted glusterd info file. Since the slave
    volume uuid was added to 'volinfo->gsync_slaves' by
    commit "http://review.gluster.org/13111", the rebuilt
    key becomes "<slave_url>::<slavehost>:<slavevol_uuid>",
    which is wrong. So snapshot pre-validation, which checks
    whether geo-rep is active or not, always reports it as
    ACTIVE, because geo-rep stop does not delete this key.
    Fixed the same in this patch.
    
    
    > BUG: 1443977
    > Signed-off-by: Kotresh HR <khiremat>
    > Reviewed-on: https://review.gluster.org/17093
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Atin Mukherjee <amukherj>
    (cherry picked from commit f071d2a285ea4802fe8f328f9f275180983fbbba)
    
    Change-Id: I185178910b4b8a62e66aba406d88d12fabc5c122
    BUG: 1445213
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/17109
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
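
To illustrate the mismatch described in the commit message above: the geo-rep session entry persisted under the glusterd working directory carries the slave volume uuid, and before this fix the in-memory active-session key rebuilt after a glusterd restart kept that suffix. The entry shown below is an assumed example based on the commit message, not captured output from this setup.

# Persisted geo-rep session entry on a master node (illustrative, assumed format)
grep '^slave' /var/lib/glusterd/vols/mm/info
# slave1=<host_uuid>:ssh://geo@geo.35.115::ss:<slavevol_uuid>
#
# After a glusterd restart, the active-session key was rebuilt from this entry
# as "<slave_url>::<slavehost>:<slavevol_uuid>", while 'geo-rep stop' removes
# the key without the uuid suffix, so snapshot pre-validation kept finding a
# stale "active" entry and rejected the snapshot.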

Comment 4 Niels de Vos 2017-05-29 04:59:32 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.12, please open a new bug report.

glusterfs-3.8.12 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2017-May/000072.html
[2] https://www.gluster.org/pipermail/gluster-users/

