Description of problem:
=======================

Ran the script while no IO was in progress. The checkpoint was never reached for a few of the active workers, and the script never completed. The cause is that the mount point is not touched in every iteration.

The modified script provided by dev works:

[root@dhcp37-182 ~]# diff /usr/share/glusterfs/scripts/schedule_georep.py /tmp/schedule_georep.py
134d133
< "--xlator-option=\"*dht.lookup-unhashed=off\"",
138d136
< "--client-pid=-1",
142d139
<
148c145
< #cleanup(hostname, volname, mnt)
---
> cleanup(hostname, volname, mnt)
416,422d412
< if not summary["checkpoints_ok"]:
< # If Checkpoint is not complete after a iteration means brick
< # was down and came online now. SETATTR on mount is not
< # recorded, So again issue touch on mount root So that
< # Stime will increase and Checkpoint will complete.
< touch_mount_root(args.mastervol)
<
432a423,428
> else:
> # If Checkpoint is not complete after a iteration means brick
> # was down and came online now. SETATTR on mount is not
> # recorded, So again issue touch on mount root So that
> # Stime will increase and Checkpoint will complete.
> touch_mount_root(args.mastervol)
[root@dhcp37-182 ~]#

Version-Release number of selected component (if applicable):
==============================================================

glusterfs-3.7.9-1.el7rhgs.x86_64

How reproducible:
=================

1/1

Steps to Reproduce:
===================

1. Create data on master volume (6x2)
2. Create geo-rep session
3. Run the script
Upstream patch sent: http://review.gluster.org/14029

As a workaround, touch the master mount once the script has set the checkpoint.
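The workaround can be sketched in Python as follows. This is only an illustration, not the code from the upstream patch: `touch_dir` and `wait_for_checkpoint` are hypothetical helpers, the mount path is assumed to be an already-mounted master volume, and `checkpoint_done` is an assumed callable standing in for whatever check reports checkpoint completion. The real script's `touch_mount_root` takes the volume name instead of a path.

```python
import os
import time


def touch_dir(mount_path):
    """Set atime/mtime of mount_path to the current time.

    Hypothetical helper: mount_path is assumed to be the root of an
    already-mounted master volume.
    """
    os.utime(mount_path, None)


def wait_for_checkpoint(mount_path, checkpoint_done, interval=1.0, max_turns=10):
    """Poll checkpoint_done() and touch the mount root on every turn
    in which the checkpoint is still incomplete.

    checkpoint_done is an assumed callable returning True once all
    workers report the checkpoint as complete.
    Returns the turn on which it completed, or None if max_turns ran out.
    """
    for turn in range(1, max_turns + 1):
        if checkpoint_done():
            return turn
        # Checkpoint not complete yet: touch again before the next poll.
        touch_dir(mount_path)
        time.sleep(interval)
    return None
```

Touching on every incomplete turn, rather than once, follows the reasoning in the script's own comment: a brick that was down and came back online may not have recorded the earlier SETATTR.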
Downstream Patch: https://code.engineering.redhat.com/gerrit/#/c/73033/
Verified with the build:

glusterfs-3.7.9-3.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-3.el7rhgs.x86_64

Ran the script when no IO was in progress. The script successfully stopped geo-rep, set the checkpoint, started geo-rep, and stopped it again before exiting.

[root@dhcp37-182 scripts]# python /usr/share/glusterfs/scripts/schedule_georep.py Tom 10.70.37.122 Jerry
[ OK] Stopped Geo-replication
[ OK] Set Checkpoint
[ OK] Started Geo-replication and watching Status for Checkpoint completion
[ OK] All Checkpoints NOT COMPLETE, All status OK (Turns 1)
[ OK] All Checkpoints COMPLETE, All status OK (Turns 2)
[ OK] Stopping Geo-replication session now
[root@dhcp37-182 scripts]#

Moving the bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240