Description of problem:
=======================
Had two 4-node clusters, one acting as master and the other as slave. Both were part of RHGS-Console. Two geo-rep sessions had been created on the 3.7.9-12 build. Upgraded the RHGS bits to 3.8.4-12 by following the procedure mentioned in the guide. Tried to take a snapshot of the master volume, and it complained that the geo-rep session is running and must be stopped before taking a snapshot. Stopped the geo-rep session and tried the snapshot again. It failed with the same error ('found a running geo-rep session') even though the session was stopped.

Found a way to reproduce it consistently:
1. Have a geo-rep session in 'Started' state between the 'master' and 'slave' volumes
2. Restart glusterd on one of the master nodes
3. Stop the session between the 'master' and 'slave' volumes
4. Take a snapshot of 'master'

Expected result:
Snapshot creation should succeed.

Actual result:
Snapshot creation fails with the error 'found a running geo-rep session'.

Version-Release number of selected component (if applicable):
=============================================================
mainline

How reproducible:
=================
Seeing it on 2 of my geo-rep sessions.
Additional info:
================

[root@dhcp47-26 ~]# gluster v geo-rep status

MASTER NODE                         MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                              SLAVE NODE                           STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26                         masterB       /bricks/brick1/masterB_1    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB    dhcp35-100.lab.eng.blr.redhat.com    Active     Changelog Crawl    2017-01-12 11:56:35
10.70.47.26                         masterD       /bricks/brick0/masterD_2    us2           ssh://us2.eng.blr.redhat.com::slaveD               10.70.35.101                         Active     Changelog Crawl    2017-01-24 11:21:10
10.70.47.26                         mm            /bricks/brick0/mm2          geo           ssh://geo.35.115::ss                               10.70.35.101                         Active     Changelog Crawl    2017-01-17 11:21:46
10.70.47.60                         masterB       /bricks/brick1/masterB_3    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.101                         Active     Changelog Crawl    2017-01-12 11:56:43
10.70.47.60                         masterD       /bricks/brick0/masterD_0    us2           ssh://us2.eng.blr.redhat.com::slaveD               10.70.35.115                         Active     Changelog Crawl    2017-01-24 11:21:14
10.70.47.60                         mm            /bricks/brick0/mm0          geo           ssh://geo.35.115::ss                               10.70.35.115                         Active     Changelog Crawl    2017-01-17 11:21:33
dhcp47-27.lab.eng.blr.redhat.com    masterB       /bricks/brick1/masterB_0    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.115                         Active     Changelog Crawl    2017-01-12 11:56:35
10.70.47.27                         masterD       /bricks/brick0/masterD_3    us2           ssh://us2.eng.blr.redhat.com::slaveD               10.70.35.100                         Passive    N/A                N/A
10.70.47.27                         mm            /bricks/brick0/mm3          geo           ssh://geo.35.115::ss                               10.70.35.100                         Passive    N/A                N/A
10.70.47.61                         masterB       /bricks/brick1/masterB_2    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.104                         Active     Changelog Crawl    2017-01-12 11:56:35
10.70.47.61                         masterD       /bricks/brick0/masterD_1    us2           ssh://us2.eng.blr.redhat.com::slaveD               10.70.35.104                         Passive    N/A                N/A
10.70.47.61                         mm            /bricks/brick0/mm1          geo           ssh://geo.35.115::ss                               10.70.35.104                         Passive    N/A                N/A
[root@dhcp47-26 ~]# gluster v geo-rep masterB dhcp35-100.lab.eng.blr.redhat.com::slaveB status

MASTER NODE                         MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                        SLAVE NODE                           STATUS    CRAWL STATUS       LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26                         masterB       /bricks/brick1/masterB_1    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    dhcp35-100.lab.eng.blr.redhat.com    Active    Changelog Crawl    2017-01-12 11:56:35
10.70.47.61                         masterB       /bricks/brick1/masterB_2    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.104                         Active    Changelog Crawl    2017-01-12 11:56:35
dhcp47-27.lab.eng.blr.redhat.com    masterB       /bricks/brick1/masterB_0    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.115                         Active    Changelog Crawl    2017-01-12 11:56:35
10.70.47.60                         masterB       /bricks/brick1/masterB_3    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.101                         Active    Changelog Crawl    2017-01-12 11:56:43
[root@dhcp47-26 ~]#

[root@dhcp47-26 ~]# gluster v geo-rep masterB dhcp35-100.lab.eng.blr.redhat.com::slaveB status

MASTER NODE                         MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                        SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26                         masterB       /bricks/brick1/masterB_1    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A
10.70.47.60                         masterB       /bricks/brick1/masterB_3    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A
10.70.47.61                         masterB       /bricks/brick1/masterB_2    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A
dhcp47-27.lab.eng.blr.redhat.com    masterB       /bricks/brick1/masterB_0    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A
[root@dhcp47-26 ~]#
[root@dhcp47-26 ~]# gluster snap create masterB_snap1
Invalid Syntax.
Usage: snapshot create <snapname> <volname> [no-timestamp] [description <description>] [force]

[root@dhcp47-26 ~]# gluster snap create masterB_snap1 masterB no-timestamp
snapshot create: failed: geo-replication session is running for the volume masterB. Session needs to be stopped before taking a snapshot.
Snapshot command failed

[root@dhcp47-26 ~]# gluster v geo-rep mm geo.35.115::ss status

MASTER NODE    MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE             SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26    mm            /bricks/brick0/mm2    geo           geo.35.115::ss    10.70.35.101    Active     Changelog Crawl    2017-01-17 11:21:46
10.70.47.27    mm            /bricks/brick0/mm3    geo           geo.35.115::ss    10.70.35.100    Passive    N/A                N/A
10.70.47.60    mm            /bricks/brick0/mm0    geo           geo.35.115::ss    10.70.35.115    Active     Changelog Crawl    2017-01-17 11:21:33
10.70.47.61    mm            /bricks/brick0/mm1    geo           geo.35.115::ss    10.70.35.104    Passive    N/A                N/A

[root@dhcp47-26 ~]# gluster v geo-rep mm geo.35.115::ss stop
Stopping geo-replication session between mm & geo.35.115::ss has been successful

[root@dhcp47-26 ~]# gluster v geo-rep mm geo.35.115::ss status

MASTER NODE    MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE             SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26    mm            /bricks/brick0/mm2    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A
10.70.47.61    mm            /bricks/brick0/mm1    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A
10.70.47.27    mm            /bricks/brick0/mm3    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A
10.70.47.60    mm            /bricks/brick0/mm0    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A

[root@dhcp47-26 ~]# gluster snap create mm_snap mm
snapshot create: failed: geo-replication session is running for the volume mm. Session needs to be stopped before taking a snapshot.
Snapshot command failed

[root@dhcp47-26 ~]# vim /var/log/glusterfs/geo-replication/mm/ssh%3A%2F%2Fgeo%4010.70.35.115%3Agluster%3A%2F%2F127.0.0.1%3Ass.log

[root@dhcp47-60 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp47-27.lab.eng.blr.redhat.com
Uuid: 6eb0185c-cc76-4bd1-a691-2ecb6a652901
State: Peer in Cluster (Connected)

Hostname: 10.70.47.61
Uuid: 3f350e37-69aa-4fc3-b9af-70c4db688721
State: Peer in Cluster (Connected)

Hostname: 10.70.47.26
Uuid: 53883823-cb8e-4da1-b6ee-a53e0ef7cd9a
State: Peer in Cluster (Connected)

[root@dhcp47-60 ~]# rpm -qa | grep gluster
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-api-3.8.4-12.el7rhgs.x86_64
glusterfs-libs-3.8.4-12.el7rhgs.x86_64
python-gluster-3.8.4-12.el7rhgs.noarch
glusterfs-3.8.4-12.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-12.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-3.8.4-12.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-12.el7rhgs.x86_64
glusterfs-server-3.8.4-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-12.el7rhgs.x86_64

[root@dhcp47-60 ~]# gluster v list
gluster_shared_storage
masterA
masterB
masterD
mm

[root@dhcp47-60 ~]# gluster v info mm

Volume Name: mm
Type: Distributed-Replicate
Volume ID: 4c435eff-24de-4030-a8dc-769bbaf292a4
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.60:/bricks/brick0/mm0
Brick2: 10.70.47.61:/bricks/brick0/mm1
Brick3: 10.70.47.26:/bricks/brick0/mm2
Brick4: 10.70.47.27:/bricks/brick0/mm3
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
performance.readdir-ahead: on
nfs.disable: off
transport.address-family: inet
cluster.enable-shared-storage: enable
[root@dhcp47-60 ~]#
REVIEW: https://review.gluster.org/17109 (glusterd/geo-rep: Fix snapshot create in geo-rep setup) posted (#1) for review on release-3.8 by Kotresh HR (khiremat)
COMMIT: https://review.gluster.org/17109 committed in release-3.8 by Aravinda VK (avishwan)

------

commit 57b481a071c13078c603cf2d96f9a04b9ebc39b4
Author: Kotresh HR <khiremat>
Date:   Thu Apr 20 07:18:52 2017 -0400

    glusterd/geo-rep: Fix snapshot create in geo-rep setup

    glusterd persists geo-rep sessions in the glusterd info file, which is
    represented in memory by the dictionary 'volinfo->gsync_slaves'. glusterd
    also maintains the active geo-rep sessions in the in-memory dictionary
    'volinfo->gsync_active_slaves', whose key is "<slave_url>::<slavehost>".

    When glusterd is restarted while geo-rep sessions are active, it rebuilds
    'volinfo->gsync_active_slaves' from the persisted glusterd info file.
    Since the slave volume uuid is added to 'volinfo->gsync_slaves' by the
    commit http://review.gluster.org/13111, the rebuild uses the key
    "<slave_url>::<slavehost>:<slavevol_uuid>", which is wrong. So snapshot
    pre-validation, which checks whether geo-rep is active or not, always
    reports the session as ACTIVE, because geo-rep stop does not delete this
    wrongly keyed entry. Fixed the same in this patch.

    > BUG: 1443977
    > Signed-off-by: Kotresh HR <khiremat>
    > Reviewed-on: https://review.gluster.org/17093
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Atin Mukherjee <amukherj>

    (cherry picked from commit f071d2a285ea4802fe8f328f9f275180983fbbba)

    Change-Id: I185178910b4b8a62e66aba406d88d12fabc5c122
    BUG: 1445213
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/17109
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
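The key mismatch described in the commit message can be illustrated with a small Python sketch. This is a hypothetical simplification, not the actual glusterd code (which is C): the function and variable names are invented, and only the key formats mirror what the commit describes for 'volinfo->gsync_slaves' and 'volinfo->gsync_active_slaves'.

```python
# Illustrative model of the glusterd key-mismatch bug (names are invented).
# Active sessions are keyed "<slave_url>::<slavehost>", but the persisted
# entry carries a trailing ":<slavevol_uuid>" since review 13111.

CANONICAL_KEY = "ssh://geo@10.70.35.115::ss"   # key used by geo-rep start/stop
SLAVEVOL_UUID = "4c435eff-24de-4030-a8dc-769bbaf292a4"
persisted = [f"{CANONICAL_KEY}:{SLAVEVOL_UUID}"]   # as stored in the info file

def rebuild_active_buggy(entries):
    """Pre-fix restart path: reuses the persisted string verbatim,
    so the rebuilt key wrongly keeps the ':<slavevol_uuid>' suffix."""
    return {e: True for e in entries}

def rebuild_active_fixed(entries):
    """Post-fix restart path: strips the slave volume uuid so the key
    matches what geo-rep stop later deletes."""
    return {e.rsplit(":", 1)[0]: True for e in entries}

def georep_stop(active):
    # 'geo-rep stop' removes the session under the canonical key only.
    active.pop(CANONICAL_KEY, None)

buggy = rebuild_active_buggy(persisted)
georep_stop(buggy)    # key mismatch: the entry survives the stop
fixed = rebuild_active_fixed(persisted)
georep_stop(fixed)    # key matches: the entry is removed

# Snapshot pre-validation treats a non-empty dict as "session running",
# which is why snapshot create kept failing after a glusterd restart.
print(bool(buggy), bool(fixed))  # -> True False
```

Under this model, the fix is simply to normalize the key when rebuilding the in-memory dictionary after a restart, so that stop and pre-validation agree on the same key.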
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.12, please open a new bug report.

glusterfs-3.8.12 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2017-May/000072.html
[2] https://www.gluster.org/pipermail/gluster-users/