Description of problem:
========================
Had two 4-node clusters, with one acting as master and the other as slave. Both were part of RHGS-Console. Had 2 geo-rep sessions created on the 3.7.9-12 build. Upgraded the RHGS bits to 3.8.4-12 by following the procedure mentioned in the guide.

Tried to take a snapshot on the master volume, and it complained: 'the geo-rep session is running. Please stop before taking a snapshot.' Stopped the geo-rep session and tried the snapshot again. It failed with the same error - that it found a running geo-rep session - even though the session was stopped.

Found a way to reproduce it consistently (thanks to Avra and Kotresh); a consolidated command sequence is given after the transcript below:

1. Have a geo-rep session in 'started' state between the 'master' and 'slave' volumes
2. Restart glusterd on one of the master nodes
3. Stop the session between the 'master' and 'slave' volumes
4. Take a snapshot on 'master'

Expected result:
Snapshot creation should succeed.

Actual result:
Snapshot creation fails with the error 'found a running geo-rep session'.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-12

How reproducible:
=================
Seeing it on 2 of my geo-rep sessions.

Additional info:
================
[root@dhcp47-26 ~]# gluster v geo-rep status
MASTER NODE                         MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                              SLAVE NODE                           STATUS     CRAWL STATUS       LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26                         masterB       /bricks/brick1/masterB_1    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB    dhcp35-100.lab.eng.blr.redhat.com    Active     Changelog Crawl    2017-01-12 11:56:35
10.70.47.26                         masterD       /bricks/brick0/masterD_2    us2           ssh://us2.eng.blr.redhat.com::slaveD               10.70.35.101                         Active     Changelog Crawl    2017-01-24 11:21:10
10.70.47.26                         mm            /bricks/brick0/mm2          geo           ssh://geo.35.115::ss                               10.70.35.101                         Active     Changelog Crawl    2017-01-17 11:21:46
10.70.47.60                         masterB       /bricks/brick1/masterB_3    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.101                         Active     Changelog Crawl    2017-01-12 11:56:43
10.70.47.60                         masterD       /bricks/brick0/masterD_0    us2           ssh://us2.eng.blr.redhat.com::slaveD               10.70.35.115                         Active     Changelog Crawl    2017-01-24 11:21:14
10.70.47.60                         mm            /bricks/brick0/mm0          geo           ssh://geo.35.115::ss                               10.70.35.115                         Active     Changelog Crawl    2017-01-17 11:21:33
dhcp47-27.lab.eng.blr.redhat.com    masterB       /bricks/brick1/masterB_0    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.115                         Active     Changelog Crawl    2017-01-12 11:56:35
10.70.47.27                         masterD       /bricks/brick0/masterD_3    us2           ssh://us2.eng.blr.redhat.com::slaveD               10.70.35.100                         Passive    N/A                N/A
10.70.47.27                         mm            /bricks/brick0/mm3          geo           ssh://geo.35.115::ss                               10.70.35.100                         Passive    N/A                N/A
10.70.47.61                         masterB       /bricks/brick1/masterB_2    root          ssh://dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.104                         Active     Changelog Crawl    2017-01-12 11:56:35
10.70.47.61                         masterD       /bricks/brick0/masterD_1    us2           ssh://us2.eng.blr.redhat.com::slaveD               10.70.35.104                         Passive    N/A                N/A
10.70.47.61                         mm            /bricks/brick0/mm1          geo           ssh://geo.35.115::ss                               10.70.35.104                         Passive    N/A                N/A

[root@dhcp47-26 ~]# gluster v geo-rep masterB dhcp35-100.lab.eng.blr.redhat.com::slaveB status
MASTER NODE                         MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                        SLAVE NODE                           STATUS    CRAWL STATUS       LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26                         masterB       /bricks/brick1/masterB_1    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    dhcp35-100.lab.eng.blr.redhat.com    Active    Changelog Crawl    2017-01-12 11:56:35
10.70.47.61                         masterB       /bricks/brick1/masterB_2    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.104                         Active    Changelog Crawl    2017-01-12 11:56:35
dhcp47-27.lab.eng.blr.redhat.com    masterB       /bricks/brick1/masterB_0    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.115                         Active    Changelog Crawl    2017-01-12 11:56:35
10.70.47.60                         masterB       /bricks/brick1/masterB_3    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    10.70.35.101                         Active    Changelog Crawl    2017-01-12 11:56:43

[root@dhcp47-26 ~]#
[root@dhcp47-26 ~]# gluster v geo-rep masterB dhcp35-100.lab.eng.blr.redhat.com::slaveB status
MASTER NODE                         MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                        SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26                         masterB       /bricks/brick1/masterB_1    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A
10.70.47.60                         masterB       /bricks/brick1/masterB_3    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A
10.70.47.61                         masterB       /bricks/brick1/masterB_2    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A
dhcp47-27.lab.eng.blr.redhat.com    masterB       /bricks/brick1/masterB_0    root          dhcp35-100.lab.eng.blr.redhat.com::slaveB    N/A           Stopped    N/A             N/A

[root@dhcp47-26 ~]#
[root@dhcp47-26 ~]# gluster snap create masterB_snap1
Invalid Syntax.

Usage: snapshot create <snapname> <volname> [no-timestamp] [description <description>] [force]

[root@dhcp47-26 ~]# gluster snap create masterB_snap1 masterB no-timestamp
snapshot create: failed: geo-replication session is running for the volume masterB. Session needs to be stopped before taking a snapshot.
Snapshot command failed
[root@dhcp47-26 ~]#
[root@dhcp47-26 ~]# gluster v geo-rep mm geo.35.115::ss status
MASTER NODE    MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE             SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26    mm            /bricks/brick0/mm2    geo           geo.35.115::ss    10.70.35.101    Active     Changelog Crawl    2017-01-17 11:21:46
10.70.47.27    mm            /bricks/brick0/mm3    geo           geo.35.115::ss    10.70.35.100    Passive    N/A                N/A
10.70.47.60    mm            /bricks/brick0/mm0    geo           geo.35.115::ss    10.70.35.115    Active     Changelog Crawl    2017-01-17 11:21:33
10.70.47.61    mm            /bricks/brick0/mm1    geo           geo.35.115::ss    10.70.35.104    Passive    N/A                N/A

[root@dhcp47-26 ~]#
[root@dhcp47-26 ~]# gluster v geo-rep mm geo.35.115::ss stop
Stopping geo-replication session between mm & geo.35.115::ss has been successful
[root@dhcp47-26 ~]#
[root@dhcp47-26 ~]# gluster v geo-rep mm geo.35.115::ss status
MASTER NODE    MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE             SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------
10.70.47.26    mm            /bricks/brick0/mm2    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A
10.70.47.61    mm            /bricks/brick0/mm1    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A
10.70.47.27    mm            /bricks/brick0/mm3    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A
10.70.47.60    mm            /bricks/brick0/mm0    geo           geo.35.115::ss    N/A           Stopped    N/A             N/A

[root@dhcp47-26 ~]#
[root@dhcp47-26 ~]# gluster snap create mm_snap mm
snapshot create: failed: geo-replication session is running for the volume mm. Session needs to be stopped before taking a snapshot.
Snapshot command failed
[root@dhcp47-26 ~]#
[root@dhcp47-26 ~]# vim /var/log/glusterfs/geo-replication/mm/ssh%3A%2F%2Fgeo%4010.70.35.115%3Agluster%3A%2F%2F127.0.0.1%3Ass.log
[root@dhcp47-26 ~]#

[root@dhcp47-60 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp47-27.lab.eng.blr.redhat.com
Uuid: 6eb0185c-cc76-4bd1-a691-2ecb6a652901
State: Peer in Cluster (Connected)

Hostname: 10.70.47.61
Uuid: 3f350e37-69aa-4fc3-b9af-70c4db688721
State: Peer in Cluster (Connected)

Hostname: 10.70.47.26
Uuid: 53883823-cb8e-4da1-b6ee-a53e0ef7cd9a
State: Peer in Cluster (Connected)
[root@dhcp47-60 ~]#
[root@dhcp47-60 ~]# rpm -qa | grep gluster
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-api-3.8.4-12.el7rhgs.x86_64
glusterfs-libs-3.8.4-12.el7rhgs.x86_64
python-gluster-3.8.4-12.el7rhgs.noarch
glusterfs-3.8.4-12.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-12.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-3.8.4-12.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-12.el7rhgs.x86_64
glusterfs-server-3.8.4-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-12.el7rhgs.x86_64
[root@dhcp47-60 ~]#
[root@dhcp47-60 ~]# gluster v list
gluster_shared_storage
masterA
masterB
masterD
mm
[root@dhcp47-60 ~]#
[root@dhcp47-60 ~]# gluster v info mm

Volume Name: mm
Type: Distributed-Replicate
Volume ID: 4c435eff-24de-4030-a8dc-769bbaf292a4
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.60:/bricks/brick0/mm0
Brick2: 10.70.47.61:/bricks/brick0/mm1
Brick3: 10.70.47.26:/bricks/brick0/mm2
Brick4: 10.70.47.27:/bricks/brick0/mm3
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
performance.readdir-ahead: on
nfs.disable: off
transport.address-family: inet
cluster.enable-shared-storage: enable
[root@dhcp47-60 ~]#
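For quick reference, here are the reproduction steps from the description condensed into one command sequence, run from any one master node. This is a minimal sketch: 'mastervol' and 'slavehost::slavevol' are placeholders for an existing geo-rep session already in 'started' state, and 'repro_snap' is an arbitrary snapshot name.

# All commands on one master node; the session mastervol -> slavehost::slavevol
# must already exist and be started.
gluster volume geo-replication mastervol slavehost::slavevol status
systemctl restart glusterd
gluster volume geo-replication mastervol slavehost::slavevol stop
# On affected builds the next command fails with "geo-replication session is
# running for the volume mastervol" even though the session is stopped:
gluster snapshot create repro_snap mastervol no-timestamp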
Sosreport at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1416024/

[qe@rhsqe-repo 1416024]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1416024]$ pwd
/home/repo/sosreports/1416024
[qe@rhsqe-repo 1416024]$ ll
total 205720
-rwxr-xr-x. 1 qe qe 65690780 Jan 24 17:57 sosreport-dhcp47-26.lab.eng.blr.redhat.com-20170124171546.tar.xz
-rwxr-xr-x. 1 qe qe 46635984 Jan 24 17:57 sosreport-dhcp47-27.lab.eng.blr.redhat.com-20170124171556.tar.xz
-rwxr-xr-x. 1 qe qe 51462404 Jan 24 17:57 sosreport-dhcp47-60.lab.eng.blr.redhat.com-20170124171526.tar.xz
-rwxr-xr-x. 1 qe qe 46860840 Jan 24 17:57 sosreport-dhcp47-61.lab.eng.blr.redhat.com-20170124171538.tar.xz
[qe@rhsqe-repo 1416024]$
WORKAROUND:
1. Stop geo-replication:
   gluster vol geo-rep <mastervol> <user@slavehost>::<slavevol> stop
2. Restart the glusterd service
3. Start geo-replication:
   gluster vol geo-rep <mastervol> <user@slavehost>::<slavevol> start

The snapshot can then be taken as usual (a concrete run-through follows below):
1. Pause/stop geo-replication
2. Take the snapshot
3. Resume/start geo-replication
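As a concrete run-through of the above, assuming a hypothetical root-user session between a master volume 'mastervol' and a slave volume 'slavevol' on host 'slavehost':

# Clear glusterd's stale view of the session
gluster vol geo-rep mastervol slavehost::slavevol stop
systemctl restart glusterd
gluster vol geo-rep mastervol slavehost::slavevol start
# Snapshot window: pause, snapshot, resume
gluster vol geo-rep mastervol slavehost::slavevol pause
gluster snapshot create mastervol_snap mastervol no-timestamp
gluster vol geo-rep mastervol slavehost::slavevol resume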
Upstream Patch: https://review.gluster.org/#/c/17093/
Downstream Patch: https://code.engineering.redhat.com/gerrit/#/c/104414/
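For anyone triaging a similar report before picking up a fixed build, one way to check whether glusterd's view of the session has gone stale is to compare what the CLI reports against what gsyncd has persisted on disk. This is an illustrative sketch only: the paths reflect the usual 3.8.x layout (session directories named <mastervol>_<slavehost>_<slavevol> under /var/lib/glusterd/geo-replication/, each typically holding a monitor.status file); verify the exact names on your own nodes, and treat the volume/host names as placeholders.

# CLI view of the session:
gluster volume geo-replication mastervol slavehost::slavevol status
# Persisted view; after a clean stop, monitor.status is expected to read "Stopped"
# (directory and file names assume the typical 3.8.x layout - verify locally):
ls /var/lib/glusterd/geo-replication/
cat /var/lib/glusterd/geo-replication/mastervol_slavehost_slavevol/monitor.status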
Validated with build: glusterfs-geo-replication-3.8.4-25.el7rhgs.x86_64

With stop and restart of glusterd
==================================
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave status
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.160    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.52     Active     Changelog Crawl    2017-05-24 11:53:14
10.70.41.160    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.52     Active     Changelog Crawl    2017-05-24 11:53:15
10.70.41.155    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.233    Passive    N/A                N/A
10.70.41.155    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.233    Active     Changelog Crawl    2017-05-24 11:53:03
10.70.41.156    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.150    Passive    N/A                N/A
10.70.41.156    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.150    Passive    N/A                N/A

[root@dhcp41-160 ~]# service glusterd restart
Redirecting to /bin/systemctl restart glusterd.service
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave stop
Stopping geo-replication session between master & 10.70.43.52::slave has been successful
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave status
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------
10.70.41.160    master        /rhs/brick1/b1    root          10.70.43.52::slave    N/A           Stopped    N/A             N/A
10.70.41.160    master        /rhs/brick2/b1    root          10.70.43.52::slave    N/A           Stopped    N/A             N/A
10.70.41.155    master        /rhs/brick1/b1    root          10.70.43.52::slave    N/A           Stopped    N/A             N/A
10.70.41.155    master        /rhs/brick2/b1    root          10.70.43.52::slave    N/A           Stopped    N/A             N/A
10.70.41.156    master        /rhs/brick1/b1    root          10.70.43.52::slave    N/A           Stopped    N/A             N/A
10.70.41.156    master        /rhs/brick2/b1    root          10.70.43.52::slave    N/A           Stopped    N/A             N/A

[root@dhcp41-160 ~]# gluster snapshot create SNAP1 master
snapshot create: success: Snap SNAP1_GMT-2017.05.24-11.56.44 created successfully
[root@dhcp41-160 ~]# gluster snapshot list
SNAP1_GMT-2017.05.24-11.56.44
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave start
Starting geo-replication session between master & 10.70.43.52::slave has been successful
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave status
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.160    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.52     Active     Changelog Crawl    2017-05-24 11:54:59
10.70.41.160    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.52     Active     Changelog Crawl    2017-05-24 11:54:45
10.70.41.156    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.150    Passive    N/A                N/A
10.70.41.156    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.150    Active     Changelog Crawl    2017-05-24 11:54:48
10.70.41.155    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.233    Passive    N/A                N/A
10.70.41.155    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.233    Passive    N/A                N/A

With pause and restart of glusterd
===================================
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave status
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.160    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.52     Active     Changelog Crawl    2017-05-24 11:54:59
10.70.41.160    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.52     Active     Changelog Crawl    2017-05-24 11:54:45
10.70.41.156    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.150    Passive    N/A                N/A
10.70.41.156    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.150    Active     Changelog Crawl    2017-05-24 11:54:48
10.70.41.155    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.233    Passive    N/A                N/A
10.70.41.155    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.233    Passive    N/A                N/A

[root@dhcp41-160 ~]# service glusterd restart
Redirecting to /bin/systemctl restart glusterd.service
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave pause
Pausing geo-replication session between master & 10.70.43.52::slave has been successful
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave status
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------
10.70.41.160    master        /rhs/brick1/b1    root          10.70.43.52::slave    N/A           Paused    N/A             N/A
10.70.41.160    master        /rhs/brick2/b1    root          10.70.43.52::slave    N/A           Paused    N/A             N/A
10.70.41.156    master        /rhs/brick1/b1    root          10.70.43.52::slave    N/A           Paused    N/A             N/A
10.70.41.156    master        /rhs/brick2/b1    root          10.70.43.52::slave    N/A           Paused    N/A             N/A
10.70.41.155    master        /rhs/brick1/b1    root          10.70.43.52::slave    N/A           Paused    N/A             N/A
10.70.41.155    master        /rhs/brick2/b1    root          10.70.43.52::slave    N/A           Paused    N/A             N/A

[root@dhcp41-160 ~]# gluster snapshot create SNAP2 master
snapshot create: success: Snap SNAP2_GMT-2017.05.24-11.59.29 created successfully
[root@dhcp41-160 ~]# gluster snapshot list
SNAP1_GMT-2017.05.24-11.56.44
SNAP2_GMT-2017.05.24-11.59.29
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave resume
Resuming geo-replication session between master & 10.70.43.52::slave has been successful
[root@dhcp41-160 ~]# gluster volume geo-replication master 10.70.43.52::slave status
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.160    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.52     Active     Changelog Crawl    2017-05-24 11:54:59
10.70.41.160    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.52     Active     Changelog Crawl    2017-05-24 11:58:30
10.70.41.156    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.150    Passive    N/A                N/A
10.70.41.156    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.150    Active     Changelog Crawl    2017-05-24 11:58:33
10.70.41.155    master        /rhs/brick1/b1    root          10.70.43.52::slave    10.70.43.233    Passive    N/A                N/A
10.70.41.155    master        /rhs/brick2/b1    root          10.70.43.52::slave    10.70.43.233    Passive    N/A                N/A

Basic validation is done. Moving this bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774