Description of problem:
=======================

What's broken:
--------------
schedule_georep.py fails to complete the transition (it never moves from
"touch mount" to "status complete").

What this tool does:
--------------------
schedule_georep.py is a tool to run Geo-replication when required. It can be
used to schedule Geo-replication to run once a day with a cron entry such as:

# Run daily at 08:30pm
30 20 * * * root python /usr/share/glusterfs/scripts/schedule_georep.py \
    --no-color gv1 fvm1 gv2 >> /var/log/glusterfs/schedule_georep.log 2>&1

The tool does the following:

1. Stop Geo-replication if Started
2. Start Geo-replication
3. Set Checkpoint
4. Check the Status and see whether the Checkpoint is complete (loop)
5. If the Checkpoint is complete, stop Geo-replication

Actual error:
-------------
[root@dhcp41-227 ~]# python /usr/share/glusterfs/scripts/schedule_georep.py vol0 10.70.42.9 vol1
[ OK] Stopped Geo-replication
[ OK] Set Checkpoint
[ OK] Started Geo-replication and watching Status for Checkpoint completion
[NOT OK] Unable to Remove temp directory /tmp/georepsetup_xO3cms
rmdir: failed to remove ‘/tmp/georepsetup_xO3cms’: Device or resource busy
[root@dhcp41-227 ~]#

df: doesn't show the mount
--------------------------
[root@dhcp41-227 ~]# df
Filesystem                            1K-blocks    Used Available Use% Mounted on
/dev/mapper/rhgs-root                  17811456 2572524  15238932  15% /
devtmpfs                                3992712       0   3992712   0% /dev
tmpfs                                   4004780       0   4004780   0% /dev/shm
tmpfs                                   4004780   17124   3987656   1% /run
tmpfs                                   4004780       0   4004780   0% /sys/fs/cgroup
/dev/sda1                               1038336  163040    875296  16% /boot
tmpfs                                    800956       0    800956   0% /run/user/0
/dev/mapper/RHS_vg1-RHS_lv1             8330240   33524   8296716   1% /rhs/brick1
/dev/mapper/RHS_vg2-RHS_lv2             8330240   33524   8296716   1% /rhs/brick2
/dev/mapper/RHS_vg3-RHS_lv3             8330240   33524   8296716   1% /rhs/brick3
10.70.41.227:/gluster_shared_storage   17811456 2756096  15055360  16% /run/gluster/shared_storage
[root@dhcp41-227 ~]#

Process still lists as mounted
------------------------------
[root@dhcp41-227 ~]# ps -eaf | grep glusterfs | grep tmp
root     21976     1  0 11:23 ?        00:00:00 /usr/sbin/glusterfs --volfile-server localhost --volfile-id vol0 -l /var/log/glusterfs/geo-replication/schedule_georep.mount.log /tmp/georepsetup_xO3cms
root     22096     1  0 11:23 ?        00:00:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/vol0/ssh%3A%2F%2Froot%4010.70.42.9%3Agluster%3A%2F%2F127.0.0.1%3Avol1.%2Frhs%2Fbrick3%2Fb8.gluster.log --volfile-server=localhost --volfile-id=vol0 --client-pid=-1 /tmp/gsyncd-aux-mount-gnFTXn
root     22098     1  0 11:23 ?        00:00:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/vol0/ssh%3A%2F%2Froot%4010.70.42.9%3Agluster%3A%2F%2F127.0.0.1%3Avol1.%2Frhs%2Fbrick2%2Fb5.gluster.log --volfile-server=localhost --volfile-id=vol0 --client-pid=-1 /tmp/gsyncd-aux-mount-vce8hM
root     22112     1  0 11:23 ?        00:00:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/vol0/ssh%3A%2F%2Froot%4010.70.42.9%3Agluster%3A%2F%2F127.0.0.1%3Avol1.%2Frhs%2Fbrick1%2Fb2.gluster.log --volfile-server=localhost --volfile-id=vol0 --client-pid=-1 /tmp/gsyncd-aux-mount-CX9Ct9
[root@dhcp41-227 ~]#

Manually umount also fails
--------------------------
[root@dhcp41-227 ~]# umount /tmp/georepsetup_xO3cms
umount: /tmp/georepsetup_xO3cms: not mounted
[root@dhcp41-227 ~]# rmdir /tmp/georepsetup_xO3cms
rmdir: failed to remove ‘/tmp/georepsetup_xO3cms’: Device or resource busy
[root@dhcp41-227 ~]# umount /tmp/georepsetup_xO3cms
umount: /tmp/georepsetup_xO3cms: not mounted
[root@dhcp41-227 ~]#
[root@dhcp41-227 ~]# echo $?
32
[root@dhcp41-227 ~]#

Additional information:
-----------------------
1. The script does check the umount for failure before running rmdir, and the
   umount reports success (the check passes).
2. A manual umount of the directory also fails; however, if the script is
   re-executed, the earlier directory is removed successfully, while the new
   mount directory fails in the same way.

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-geo-replication-3.12.2-7.el7rhgs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Set up geo-replication between master and slave
2. Run the tool with the master volume, slave host and slave volume as parameters

Actual results:
===============
The tool doesn't complete the transition from "touch mount" to "status complete".
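For reference, the steps the tool automates correspond roughly to the regular
geo-replication CLI. The sketch below is a hedged, manual equivalent using the
volume and host names from the error output above (vol0, 10.70.42.9, vol1); it
is an illustration of the workflow, not a command list taken from the script:

# Roughly what schedule_georep.py automates, expressed with the standard
# gluster geo-replication CLI (volume/host names from the report above).
gluster volume geo-replication vol0 10.70.42.9::vol1 stop
gluster volume geo-replication vol0 10.70.42.9::vol1 config checkpoint now
gluster volume geo-replication vol0 10.70.42.9::vol1 start

# Poll the status until the checkpoint columns report completion, then stop.
gluster volume geo-replication vol0 10.70.42.9::vol1 status detail
gluster volume geo-replication vol0 10.70.42.9::vol1 stop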
I don't know of any recent changes in the mount code. I wonder what the
following commands give (where $georep_mount is the stuck mount, i.e. in the
case above it would be /tmp/georepsetup_xO3cms):

ls -l /etc/mtab
grep -n ^ /etc/mtab /proc/self/mounts
namei -x $georep_mount
fuser -m $georep_mount
fuser -m $georep_mount | xargs -r ps ww
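If it helps, the requested output can be gathered in one go and attached here.
This is just a convenience wrapper around the commands listed above, with
georep_mount set to the stuck mount path from the original report (adjust it
for your own run):

#!/bin/sh
# Collect the requested diagnostics into a single file for attaching to the bug.
# georep_mount is the stuck scheduler mount; the path below is from the
# original report and changes on every run of the tool.
georep_mount=/tmp/georepsetup_xO3cms
{
  echo "== ls -l /etc/mtab =="
  ls -l /etc/mtab
  echo "== grep -n ^ /etc/mtab /proc/self/mounts =="
  grep -n ^ /etc/mtab /proc/self/mounts
  echo "== namei -x $georep_mount =="
  namei -x "$georep_mount"
  echo "== fuser -m $georep_mount =="
  fuser -m "$georep_mount"
  echo "== processes reported by fuser =="
  fuser -m "$georep_mount" | xargs -r ps ww
} > /tmp/georep_mount_debug.txt 2>&1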
I'm facing the same issue.

[root@gluster-1 ~]# python /usr/share/glusterfs/scripts/schedule_georep.py glustervol dr-gluster-1.dc2.example.com dr-vol
[ OK] Stopped Geo-replication
[ OK] Set Checkpoint
[ OK] Started Geo-replication and watching Status for Checkpoint completion
[NOT OK] Unable to Remove temp directory /tmp/georepsetup_GnvrcD
rmdir: failed to remove ‘/tmp/georepsetup_GnvrcD’: Device or resource busy

[root@gluster-1 ~]# ps -eaf | grep glusterfs | grep tmp
root      6975     1  0 19:04 ?        00:00:00 /usr/sbin/glusterfs --volfile-server localhost --volfile-id glustervol -l /var/log/glusterfs/geo-replication/schedule_georep.mount.log /tmp/georepsetup_GnvrcD
root      7026     1  5 19:04 ?        00:00:06 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/glustervol/ssh%3A%2F%2Froot%4010.1.0.31%3Agluster%3A%2F%2F127.0.0.1%3Adr-vol.%2Fgluster%2Fbrick1%2Fglusterbrick1.gluster.log --volfile-server=localhost --volfile-id=glustervol --client-pid=-1 /tmp/gsyncd-aux-mount-A4yHfm

[root@gluster-1 ~]# rpm -q glusterfs
glusterfs-3.8.4-54.8.el7rhgs.x86_64

It's failing every time. I don't know if it matters, but I've already executed
"failover and failback" a couple of times before, waiting for checkpoints to
complete instead of using this tool.
A bit of information that may be useful in debugging. I manually executed the
commands that I think the python code[0] is trying to execute:

[root@gluster-1 ~]# mkdir /tmp/test_aaa && /usr/sbin/glusterfs --volfile-server localhost --volfile-id glustervol -l /tmp/mylog /tmp/test_aaa && touch /tmp/test_aaa && umount /tmp/test_aaa && rmdir /tmp/test_aaa
[root@gluster-1 ~]# echo $?
0
[root@gluster-1 ~]# ps -eaf | grep glusterfs | grep test_aaa
[root@gluster-1 ~]#

In my logfile I get this:

~~~
[2018-05-16 00:24:37.060416] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2018-05-16 00:24:37.062851] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-glustervol-replicate-0: selecting local read_child glustervol-client-0
[2018-05-16 00:24:37.076260] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: initating unmount of /tmp/test_aaa
[2018-05-16 00:24:37.076840] W [glusterfsd.c:1300:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f60f45f0dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55a134b880c5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55a134b87eeb] ) 0-: received signum (15), shutting down
[2018-05-16 00:24:37.076897] I [fuse-bridge.c:5834:fini] 0-fuse: Unmounting '/tmp/test_aaa'.
[2018-05-16 00:24:37.076919] I [fuse-bridge.c:5839:fini] 0-fuse: Closing fuse connection to '/tmp/test_aaa'.
~~~

Note the lines showing that the directory is being unmounted and the fuse
connection to the directory is being closed.

Now, look at the relevant part of the log when I run the python code:

~~~
[2018-05-16 00:17:55.593833] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2018-05-16 00:17:55.595851] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-glustervol-replicate-0: selecting local read_child glustervol-client-0
~~~

Note that there is nothing in the log file showing that the directory is being
unmounted, although it is actually unmounted when the python program exits.
But even though the directory is unmounted, the glusterfs process is still
running and the directory cannot be removed.

I'm running RHGS on RHEL 7.5, if that matters.

[root@gluster-1 ~]# rpm -q glusterfs kernel python
glusterfs-3.8.4-54.8.el7rhgs.x86_64
kernel-3.10.0-693.17.1.el7.x86_64
kernel-3.10.0-862.2.3.el7.x86_64
python-2.7.5-68.el7.x86_64

[0]
<..snip..>
def execute(cmd, success_msg="", failure_msg="", exitcode=-1):
    """
    Generic wrapper to execute the CLI commands. Returns Output if success.
    On success it can print message in stdout if specified.
    On failure, exits after writing to stderr.
    """
    p = subprocess.Popen(cmd,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    out, err = p.communicate()
    if p.returncode == 0:
        if success_msg:
            output_ok(success_msg)
        return out
    else:
        err_msg = err if err else out
        output_notok(failure_msg, err=err_msg, exitcode=exitcode)

<..snip..>
def cleanup(hostname, volname, mnt):
    """
    Unmount the Volume and Remove the temporary directory
    """
    execute(["/usr/bin/umount", mnt],
            failure_msg="Unable to Unmount Gluster Volume "
            "{0}:{1}(Mounted at {2})".format(hostname, volname, mnt))
    execute(["rmdir", mnt],
            failure_msg="Unable to Remove temp directory "
            "{0}".format(mnt))


@contextmanager
def glustermount(hostname, volname):
    """
    Context manager for Mounting Gluster Volume
    Use as
        with glustermount(HOSTNAME, VOLNAME) as MNT:
            # Do your stuff
    Automatically unmounts it in case of Exceptions/out of context
    """
    mnt = tempfile.mkdtemp(prefix="georepsetup_")
    execute(["/usr/sbin/glusterfs",
             "--volfile-server", hostname,
             "--volfile-id", volname,
             "-l", SESSION_MOUNT_LOG_FILE,
             mnt],
            failure_msg="Unable to Mount Gluster Volume "
            "{0}:{1}".format(hostname, volname))
    if os.path.ismount(mnt):
        yield mnt
    else:
        output_notok("Unable to Mount Gluster Volume "
                     "{0}:{1}".format(hostname, volname))
    cleanup(hostname, volname, mnt)

<..snip..>
def touch_mount_root(mastervol):
    # Create a Mount and Touch the Mount point root,
    # Hack to make sure some event available after
    # setting Checkpoint. Without this their is a chance of
    # Checkpoint never completes.
    with glustermount("localhost", mastervol) as mnt:
        execute(["touch", mnt])
Since the problem described in this bug report should be resolved in a recent
advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow
the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607