Description of problem:
=======================
What's broken: schedule_georep.py fails to complete the transition.

What this tool does:
--------------------
schedule_georep.py is a tool to run Geo-replication when required. It can be
used to schedule Geo-replication to run once a day, for example via cron:

# Run daily at 08:30pm
30 20 * * * root python /usr/share/glusterfs/scripts/schedule_georep.py \
    --no-color gv1 fvm1 gv2 >> /var/log/glusterfs/schedule_georep.log 2>&1

This tool does the following:
1. Stop Geo-replication if Started
2. Start Geo-replication
3. Set Checkpoint
4. Check the Status and see whether the Checkpoint is Complete (LOOP)
5. If the Checkpoint is complete, Stop Geo-replication

Actual error:
-------------
[root@dhcp41-227 ~]# python /usr/share/glusterfs/scripts/schedule_georep.py vol0 10.70.42.9 vol1
[     OK] Stopped Geo-replication
[     OK] Set Checkpoint
[     OK] Started Geo-replication and watching Status for Checkpoint completion
[NOT OK] Unable to Remove temp directory /tmp/georepsetup_xO3cms
rmdir: failed to remove ‘/tmp/georepsetup_xO3cms’: Device or resource busy
[root@dhcp41-227 ~]#

df doesn't show the mount:
--------------------------
[root@dhcp41-227 ~]# df
Filesystem                            1K-blocks    Used Available Use% Mounted on
/dev/mapper/rhgs-root                  17811456 2572524  15238932  15% /
devtmpfs                                3992712       0   3992712   0% /dev
tmpfs                                   4004780       0   4004780   0% /dev/shm
tmpfs                                   4004780   17124   3987656   1% /run
tmpfs                                   4004780       0   4004780   0% /sys/fs/cgroup
/dev/sda1                               1038336  163040    875296  16% /boot
tmpfs                                    800956       0    800956   0% /run/user/0
/dev/mapper/RHS_vg1-RHS_lv1             8330240   33524   8296716   1% /rhs/brick1
/dev/mapper/RHS_vg2-RHS_lv2             8330240   33524   8296716   1% /rhs/brick2
/dev/mapper/RHS_vg3-RHS_lv3             8330240   33524   8296716   1% /rhs/brick3
10.70.41.227:/gluster_shared_storage   17811456 2756096  15055360  16% /run/gluster/shared_storage
[root@dhcp41-227 ~]#

The glusterfs process still lists it as mounted:
------------------------------------------------
[root@dhcp41-227 ~]# ps -eaf | grep glusterfs | grep tmp
root 21976 1 0 11:23 ? 00:00:00 /usr/sbin/glusterfs --volfile-server localhost --volfile-id vol0 -l /var/log/glusterfs/geo-replication/schedule_georep.mount.log /tmp/georepsetup_xO3cms
root 22096 1 0 11:23 ? 00:00:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/vol0/ssh%3A%2F%2Froot%4010.70.42.9%3Agluster%3A%2F%2F127.0.0.1%3Avol1.%2Frhs%2Fbrick3%2Fb8.gluster.log --volfile-server=localhost --volfile-id=vol0 --client-pid=-1 /tmp/gsyncd-aux-mount-gnFTXn
root 22098 1 0 11:23 ? 00:00:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/vol0/ssh%3A%2F%2Froot%4010.70.42.9%3Agluster%3A%2F%2F127.0.0.1%3Avol1.%2Frhs%2Fbrick2%2Fb5.gluster.log --volfile-server=localhost --volfile-id=vol0 --client-pid=-1 /tmp/gsyncd-aux-mount-vce8hM
root 22112 1 0 11:23 ? 00:00:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/vol0/ssh%3A%2F%2Froot%4010.70.42.9%3Agluster%3A%2F%2F127.0.0.1%3Avol1.%2Frhs%2Fbrick1%2Fb2.gluster.log --volfile-server=localhost --volfile-id=vol0 --client-pid=-1 /tmp/gsyncd-aux-mount-CX9Ct9
[root@dhcp41-227 ~]#

Manual umount also fails:
-------------------------
[root@dhcp41-227 ~]# umount /tmp/georepsetup_xO3cms
umount: /tmp/georepsetup_xO3cms: not mounted
[root@dhcp41-227 ~]# rmdir /tmp/georepsetup_xO3cms
rmdir: failed to remove ‘/tmp/georepsetup_xO3cms’: Device or resource busy
[root@dhcp41-227 ~]# umount /tmp/georepsetup_xO3cms
umount: /tmp/georepsetup_xO3cms: not mounted
[root@dhcp41-227 ~]# echo $?
32
[root@dhcp41-227 ~]#

Additional information:
-----------------------
1. The script checks for umount failure before rmdir, and that check passes.
2. Manual umount also fails for the directory in question; however, if the
   script is re-executed, the earlier directory is removed successfully but
   the same failure occurs for the new mount directory.

Version-Release number of selected component (if applicable):
==============================================================
mainline

How reproducible:
=================
Always, on CentOS 7

Steps to Reproduce:
===================
1. Set up geo-replication between master and slave
2. Run the tool with master, slave host and slave vol parameters

Actual results:
===============
The tool doesn't complete the transition from "touch mount" to "status complete".
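The five steps above can be expressed as a small sketch. This is a hypothetical outline of the flow, not the actual schedule_georep.py code; `run` and `checkpoint_complete` are placeholder callables injected so the flow can be exercised without a gluster installation:

```python
import time

def schedule_georep(mastervol, slavehost, slavevol, run, checkpoint_complete,
                    interval=1, sleep=time.sleep):
    """Sketch of the scheduler's five steps.

    run                 -- executes a gluster CLI command given as a tuple
    checkpoint_complete -- polls geo-rep status, True when checkpoint done
    """
    session = ("volume", "geo-replication", mastervol,
               "%s::%s" % (slavehost, slavevol))
    run(session + ("stop",))                         # 1. Stop if Started
    run(session + ("start",))                        # 2. Start Geo-replication
    run(session + ("config", "checkpoint", "now"))   # 3. Set Checkpoint
    while not checkpoint_complete():                 # 4. Watch Status (LOOP)
        sleep(interval)
    run(session + ("stop",))                         # 5. Checkpoint done: Stop
```

The real script additionally does the temporary "touch the root" mount while waiting for the checkpoint, which is the part that fails in this bug.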
Analysis:
Thanks Csaba (csaba) for the detailed analysis and help. The following is the
analysis done by Csaba.
=================================================================================
When you run the schedule script and it ends up in a mess, there are three
bogus ex-mountpoints:

[root@dhcp42-119 ~]# rmdir -v /tmp/g* 2>&1 | grep -B1 busy
rmdir: removing directory, ‘/tmp/georepsetup_qKu_20’
rmdir: failed to remove ‘/tmp/georepsetup_qKu_20’: Device or resource busy
rmdir: removing directory, ‘/tmp/gsyncd-aux-mount-1rWuvM’
rmdir: failed to remove ‘/tmp/gsyncd-aux-mount-1rWuvM’: Device or resource busy
rmdir: removing directory, ‘/tmp/gsyncd-aux-mount-SWi0_m’
rmdir: failed to remove ‘/tmp/gsyncd-aux-mount-SWi0_m’: Device or resource busy

Let me tell you about two kernel inspection features. Each mount has a
“device id” associated with it (unique at any given moment). For any file,
the stat(3) result carries the device id of the mount it is within in the
st_dev field; stat(1) shows this info as “Device” (see also sys_stat(3p)).
You can see mounts with their associated device ids in /proc/self/mountinfo.

Also, fuse comes with a fusectl virtual fs (commonly mounted on
/sys/fs/fuse/connections/) that collects metadata and some control hooks for
the existing fuse mounts, in directories named with the device ids of the
respective mounts. So the following command lists those fuse connections
which outlived their mounts, i.e. they are listed in fusectl but not in
mountinfo:

for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c /proc/self/mountinfo || echo BAD $c; done

In this situation you’ll get:

[root@dhcp42-119 ~]# for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c /proc/self/mountinfo || echo BAD $c; done
BAD 41
BAD 42
BAD 43

(Actually these are not necessarily bad — this discrepancy is expected when a
lazy umount occurs and the fs survives it because a process is still using it.
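The shell loop above can also be written as a small Python check with the inputs made explicit, so it is clear what is being compared. This is a sketch; on a live system `conn_ids` would come from listing /sys/fs/fuse/connections/ and `mountinfo` from reading /proc/self/mountinfo:

```python
def stale_fuse_connections(conn_ids, mountinfo):
    """Return connection ids listed in fusectl but absent from mountinfo.

    Fuse mounts appear in /proc/self/mountinfo with a device id of the
    form "0:<minor>" in the third field; fusectl names its per-connection
    directories under /sys/fs/fuse/connections/ with that same <minor>.
    """
    mounted = set()
    for line in mountinfo.splitlines():
        fields = line.split()
        if len(fields) > 2 and fields[2].startswith("0:"):
            mounted.add(fields[2][2:])
    return [c for c in conn_ids if c not in mounted]
```

Unlike the `grep -q 0:$c` one-liner, this parses the device-id field explicitly, so "0:4" cannot accidentally match "0:41".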
But in this case, AFAIK, no lazy umount occurred for either of them; and even
if I’m wrong about that, after a lazy umount the directory should not be busy.)

Let’s see which fuse servers are behind these entries, by checking the
processes that are using /dev/fuse:

[root@dhcp42-119 ~]# lsof /dev/fuse | awk '{print $2}' | grep -v PID | xargs ps ww | grep /tmp
13769 ? Ssl 0:00 /usr/sbin/glusterfs --volfile-server localhost --volfile-id master -l /var/log/glusterfs/geo-replication/schedule_georep.mount.log /tmp/georepsetup_qKu_20
13863 ? Ssl 0:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.42.88%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick0%2Fb1.gluster.log --volfile-server=localhost --volfile-id=master --client-pid=-1 /tmp/gsyncd-aux-mount-SWi0_m
13864 ? Ssl 0:01 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.42.88%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick1%2Fb4.gluster.log --volfile-server=localhost --volfile-id=master --client-pid=-1 /tmp/gsyncd-aux-mount-1rWuvM

(We can grep for /tmp because we know the bogus mounts are in /tmp.)
Continuing the investigation, we can see they are still sitting in a read on
/dev/fuse:

[root@dhcp42-119 ~]# lsof /dev/fuse | awk '{print $2}' | grep -v PID | xargs ps ww | grep /tmp | \
    awk '{print $1}' | while read l; do head /proc/$l/task/*/stack; done | grep -B1 fuse

==> /proc/13769/task/13846/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
--
==> /proc/13863/task/13884/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
--
==> /proc/13864/task/13888/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]

When a fuse mount is decommissioned, the fuse server’s read call on /dev/fuse
should return with ENODEV. That apparently didn’t happen here.

[root@dhcp42-119 ~]# lsof | grep '0,4[123]' | awk '{print $1" "$2" "$NF }' | uniq
python 13800 /tmp/gsyncd-aux-mount-SWi0_m
python 13802 /tmp/gsyncd-aux-mount-1rWuvM

Curiously, there is no reference to one of the bogus dirs, the
/tmp/georepsetup* one. Let’s see who the culprits are:

[root@dhcp42-119 ~]# lsof | grep '0,4[123]' | awk '{print $2}' | uniq | xargs ps ww
  PID TTY      STAT   TIME COMMAND
13800 ?        Sl     0:09 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/bricks/brick0/b1 --path=/bricks/brick1/b4 -c /var/lib/glusterd/geo-replication/master_10.70.42.88_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=161b39fa-8c35-4cf0-b837-1af3d5b87b6f 10.70.42.88::slave -N -p --slave-id 7b190f51-42bf-4378-b6d2-e76b422c448d --feedback-fd 15 --local-path /bricks/brick0/b1 --local-node 10.70.42.119 --local-node-id 161b39fa-8c35-4cf0-b837-1af3d5b87b6f --local-id .%2Fbricks%2Fbrick0%2Fb1 --rpc-fd 11,9,8,13 --subvol-num 1 --resource-remote ssh://root.42.88:gluster://localhost:slave
13802 ?        Sl     0:06 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/bricks/brick0/b1 --path=/bricks/brick1/b4 -c /var/lib/glusterd/geo-replication/master_10.70.42.88_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=161b39fa-8c35-4cf0-b837-1af3d5b87b6f 10.70.42.88::slave -N -p --slave-id 7b190f51-42bf-4378-b6d2-e76b422c448d --feedback-fd 15 --local-path /bricks/brick1/b4 --local-node 10.70.42.119 --local-node-id 161b39fa-8c35-4cf0-b837-1af3d5b87b6f --local-id .%2Fbricks%2Fbrick1%2Fb4 --rpc-fd 11,9,8,12 --subvol-num 2 --resource-remote ssh://root.42.88:gluster://localhost:slave

That is, these are the gsyncd workers which have a --resource-remote option:

[root@dhcp42-119 ~]# pgrep -f resource-remote
13800
13802

I found that all this badness goes away with stopping geo-rep, so as a more
targeted action we can just try to kill the above processes.

[root@dhcp42-119 ~]# pkill -f resource-remote
[root@dhcp42-119 ~]# for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c /proc/self/mountinfo || echo BAD $c; done
[root@dhcp42-119 ~]#

All the badness is gone!
================================================================================
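The per-thread stack check in the transcript above can likewise be scripted. This sketch classifies the text of one /proc/&lt;pid&gt;/task/&lt;tid&gt;/stack dump (reading those files on a live system requires root); the frame names come from the output shown in the analysis:

```python
def blocked_in_fuse_read(stack_text):
    """True if a /proc/<pid>/task/<tid>/stack dump shows a fuse device
    read frame, i.e. the fuse server thread is parked in read() on
    /dev/fuse waiting for requests from the kernel mount."""
    return any("fuse_dev_read" in line for line in stack_text.splitlines())
```

A server whose mount has been decommissioned should leave this read with ENODEV, so a thread that still matches here after the mount is gone is a candidate culprit.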
From Csaba's analysis in comment 3, the master volume mount done by the
scheduler script is being used by the geo-rep worker. That's really strange.
On deeper analysis, it turns out to be related to the private mount namespace
used by the worker: the monitor spawns the worker in a private mount
namespace. Once the workers are up, mounts done outside the worker are not
propagated into the worker, and vice versa. But global mounts done prior to
the worker spawn are propagated into the worker. Even though such a mount is
propagated, that should not block the original process which created the
global mount from unmounting it and cleaning up the directory. Yet that is
the behaviour on the system where this is reproduced (CentOS 7). So we
mitigate the issue by doing the mount after the worker process is up.

[root@dhcp42-119 ~]# uname -a
Linux dhcp42-119.lab.eng.blr.redhat.com 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
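The mitigation described above amounts to a reordering: perform the temporary "touch the root" mount only after geo-rep has started and its workers are up, so the mount is never propagated into a worker's private mount namespace. A hypothetical sketch of that ordering (not the actual patch; `start_georep`, `workers_up` and `touch_mount_root` are placeholder callables):

```python
import time

def checkpoint_touch(start_georep, workers_up, touch_mount_root,
                     poll_interval=1, sleep=time.sleep):
    """Start geo-rep first, wait until its workers are up, and only then
    perform the temporary mount used to touch the master volume root.
    Mounting after the workers exist keeps the mount out of their private
    mount namespaces, so the later umount/rmdir cannot hit EBUSY because
    of a lingering worker reference."""
    start_georep()
    while not workers_up():
        sleep(poll_interval)
    touch_mount_root()
```

The key property is simply that the mount happens strictly after worker startup, never before.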
REVIEW: https://review.gluster.org/20507 (geo-rep/scheduler: Fix EBUSY trace back) posted (#1) for review on release-4.1 by Kotresh HR
COMMIT: https://review.gluster.org/20507 committed in release-4.1 by "Shyamsundar Ranganathan" <srangana> with a commit message-

geo-rep/scheduler: Fix EBUSY trace back

Fix the trace back during temporary mount cleanup.
Temporary mount is done to touch the root required
for checkpoint to complete.

Backport of:
> Patch: https://review.gluster.org/#/c/20476/
> BUG: 1598977
> Change-Id: I97fea538e92c4ef0747747e981ef98499504e336
> Signed-off-by: Kotresh HR <khiremat>

(cherry picked from commit 271510b09d0613887f983b1e3dad5c55cdc6cd07)

fixes: bz#1600941
Change-Id: I97fea538e92c4ef0747747e981ef98499504e336
Signed-off-by: Kotresh HR <khiremat>
This bug is being closed because a release has been made available that
should address the reported issue. If the problem is still not fixed with
glusterfs-4.1.2, please open a new bug report.

glusterfs-4.1.2 has been announced on the Gluster mailing lists [1]; packages
for several distributions should become available in the near future. Keep an
eye on the Gluster Users mailing list [2] and the update infrastructure for
your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-July/000106.html
[2] https://www.gluster.org/pipermail/gluster-users/