Bug 1598977
| Summary: | [geo-rep]: geo-replication scheduler is failing due to unsuccessful umount | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Kotresh HR <khiremat> |
| Component: | geo-replication | Assignee: | Kotresh HR <khiremat> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | mainline | CC: | amukherj, bugs, csaba, khiremat, rallan, rhinduja, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vdas, vfarias |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-5.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1568896 | | |
| | 1600941 (view as bug list) | Environment: | |
| Last Closed: | 2018-10-23 15:13:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1568896 | | |
| Bug Blocks: | 1600941 | | |
Description
Kotresh HR
2018-07-07 13:15:14 UTC
Upstream Patch: https://review.gluster.org/20476

REVIEW: https://review.gluster.org/20476 (geo-rep/scheduler: Fix EBUSY trace back) posted (#1) for review on master by Kotresh HR

Analysis:

Thanks Csaba (csaba) for the detailed analysis and help. The following is the analysis done by Csaba.

=================================================================================

When you run the schedule script and it ends up in a mess, there are three bogus ex-mountpoints:

```
[root@dhcp42-119 ~]# rmdir -v /tmp/g* 2>&1 | grep -B1 busy
rmdir: removing directory, ‘/tmp/georepsetup_qKu_20’
rmdir: failed to remove ‘/tmp/georepsetup_qKu_20’: Device or resource busy
rmdir: removing directory, ‘/tmp/gsyncd-aux-mount-1rWuvM’
rmdir: failed to remove ‘/tmp/gsyncd-aux-mount-1rWuvM’: Device or resource busy
rmdir: removing directory, ‘/tmp/gsyncd-aux-mount-SWi0_m’
rmdir: failed to remove ‘/tmp/gsyncd-aux-mount-SWi0_m’: Device or resource busy
```

Let me tell you about two kernel inspection features. Each mount has a "device id" associated with it (unique at any given moment). For any file, the stat(3) result carries the device id of the mount it is within in the st_dev field; stat(1) shows this info as "Device" (see also sys_stat(3p)). You can see mounts with their associated device ids in /proc/self/mountinfo. Also, fuse comes with a fusectl virtual fs (commonly mounted on /sys/fs/fuse/connections/) that collects metadata and some control hooks for the existing fuse mounts, in directories named with the device ids of the respective mounts.

So the following command lists those fuse connections which outlived their mounts, i.e. they are listed in fusectl but not in mountinfo:

```
for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c /proc/self/mountinfo || echo BAD $c; done
```

In this situation you'll get:

```
[root@dhcp42-119 ~]# for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c /proc/self/mountinfo || echo BAD $c; done
BAD 41
BAD 42
BAD 43
```

(Actually these are not necessarily bad: this discrepancy is expected when a lazy umount occurs and the fs survives it because a process is still using it. But in this case, AFAIK, no lazy unmount occurred for any of them; and even if I'm wrong about that, after a lazy unmount the directory should not be busy.)

Let's see which fuse servers are behind these entries, by checking the processes that are using /dev/fuse:

```
[root@dhcp42-119 ~]# lsof /dev/fuse | awk '{print $2}' | grep -v PID | xargs ps ww | grep /tmp
13769 ?  Ssl  0:00 /usr/sbin/glusterfs --volfile-server localhost --volfile-id master -l /var/log/glusterfs/geo-replication/schedule_georep.mount.log /tmp/georepsetup_qKu_20
13863 ?  Ssl  0:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.42.88%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick0%2Fb1.gluster.log --volfile-server=localhost --volfile-id=master --client-pid=-1 /tmp/gsyncd-aux-mount-SWi0_m
13864 ?  Ssl  0:01 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.42.88%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick1%2Fb4.gluster.log --volfile-server=localhost --volfile-id=master --client-pid=-1 /tmp/gsyncd-aux-mount-1rWuvM
```

(We can grep for /tmp because we know the bogus mounts are in /tmp.)
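For reference, the same stale-connection check can be written as a short standalone script. This is a rough Python equivalent of the shell one-liner above, not part of any Gluster tooling; it assumes fusectl is mounted at /sys/fs/fuse/connections/:

```python
#!/usr/bin/env python
# Rough Python equivalent of the shell check above: list fuse connections
# that still have a fusectl entry but no longer back any mount in mountinfo.
# Hypothetical helper, not part of gsyncd or schedule_georep.py.
import os

FUSECTL = "/sys/fs/fuse/connections"


def mounted_anon_minors():
    """Minor device ids of all mounts on the anonymous major 0 (fuse lives here)."""
    minors = set()
    with open("/proc/self/mountinfo") as f:
        for line in f:
            # Field 3 of mountinfo is the "major:minor" device id of the mount.
            major, minor = line.split()[2].split(":")
            if major == "0":
                minors.add(minor)
    return minors


def stale_fuse_connections():
    live = mounted_anon_minors()
    return [c for c in sorted(os.listdir(FUSECTL)) if c not in live]


if __name__ == "__main__":
    for conn in stale_fuse_connections():
        print("BAD %s" % conn)
```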
Continuing the investigation, we can see they are still sitting in a read on /dev/fuse:

```
[root@dhcp42-119 ~]# lsof /dev/fuse | awk '{print $2}' | grep -v PID | xargs ps ww | grep /tmp | \
    awk '{print $1}' | while read l; do head /proc/$l/task/*/stack; done | grep -B1 fuse
==> /proc/13769/task/13846/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
--
==> /proc/13863/task/13884/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
--
==> /proc/13864/task/13888/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
```

When a fuse mount is decommissioned, the fuse server's read call on /dev/fuse should return with ENODEV. That apparently didn't happen here.

```
[root@dhcp42-119 ~]# lsof | grep '0,4[123]' | awk '{print $1" "$2" "$NF }' | uniq
python 13800 /tmp/gsyncd-aux-mount-SWi0_m
python 13802 /tmp/gsyncd-aux-mount-1rWuvM
```

Curiously, there is no reference to one of the bogus dirs, the /tmp/georepsetup* one. Let's see who the culprits are:

```
[root@dhcp42-119 ~]# lsof | grep '0,4[123]' | awk '{print $2}' | uniq | xargs ps ww
  PID TTY      STAT   TIME COMMAND
13800 ?        Sl     0:09 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/bricks/brick0/b1 --path=/bricks/brick1/b4 -c /var/lib/glusterd/geo-replication/master_10.70.42.88_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=161b39fa-8c35-4cf0-b837-1af3d5b87b6f 10.70.42.88::slave -N -p --slave-id 7b190f51-42bf-4378-b6d2-e76b422c448d --feedback-fd 15 --local-path /bricks/brick0/b1 --local-node 10.70.42.119 --local-node-id 161b39fa-8c35-4cf0-b837-1af3d5b87b6f --local-id .%2Fbricks%2Fbrick0%2Fb1 --rpc-fd 11,9,8,13 --subvol-num 1 --resource-remote ssh://root.42.88:gluster://localhost:slave
13802 ?        Sl     0:06 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/bricks/brick0/b1 --path=/bricks/brick1/b4 -c /var/lib/glusterd/geo-replication/master_10.70.42.88_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=161b39fa-8c35-4cf0-b837-1af3d5b87b6f 10.70.42.88::slave -N -p --slave-id 7b190f51-42bf-4378-b6d2-e76b422c448d --feedback-fd 15 --local-path /bricks/brick1/b4 --local-node 10.70.42.119 --local-node-id 161b39fa-8c35-4cf0-b837-1af3d5b87b6f --local-id .%2Fbricks%2Fbrick1%2Fb4 --rpc-fd 11,9,8,12 --subvol-num 2 --resource-remote ssh://root.42.88:gluster://localhost:slave
```

That is, the gsyncd workers which have a --resource-remote option:

```
[root@dhcp42-119 ~]# pgrep -f resource-remote
13800
13802
```

I found that all this badness goes away when geo-rep is stopped, so as a more targeted action we can just try to kill the above processes:

```
[root@dhcp42-119 ~]# pkill -f resource-remote
[root@dhcp42-119 ~]# for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c /proc/self/mountinfo || echo BAD $c; done
[root@dhcp42-119 ~]#
```

All the badness is gone!

================================================================================

Based on Csaba's analysis in comment 3, the master volume mount done by the scheduler script is being used by the geo-rep worker. That is really strange. On deeper analysis, it turned out to be related to the private mount namespace used by the worker: the monitor spawns the worker in a private mount namespace.
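As a side note, what a private mount namespace means for the worker can be shown with a small standalone demo. This is hypothetical illustration code, not gsyncd code; it needs root (CAP_SYS_ADMIN), assumes glibc is reachable as libc.so.6, and simply shows why mounting after the workers are up avoids the problem: a tmpfs mounted by the parent after the child has unshared (and made its mounts private) is not visible inside the child, so it cannot get pinned there.

```python
# Hypothetical demo of mount-namespace isolation; run as root on Linux.
import ctypes
import os
import subprocess
import tempfile
import time

CLONE_NEWNS = 0x00020000      # new mount namespace
MS_REC = 0x4000
MS_PRIVATE = 1 << 18

libc = ctypes.CDLL("libc.so.6", use_errno=True)

target = tempfile.mkdtemp(prefix="propagation-demo-")

pid = os.fork()
if pid == 0:
    # "worker" side: detach into a private mount namespace, conceptually the
    # way the gsyncd monitor isolates its workers.
    if libc.unshare(CLONE_NEWNS) != 0:
        os._exit(1)           # needs root / CAP_SYS_ADMIN
    # Make the copied mounts private so events from the parent namespace
    # (which systemd marks as shared) no longer propagate here.
    libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None)
    time.sleep(2)             # parent mounts a tmpfs meanwhile
    print("worker sees the mount: %s" % os.path.ismount(target))   # False
    os._exit(0)

time.sleep(1)                 # let the child unshare first
subprocess.check_call(["mount", "-t", "tmpfs", "tmpfs", target])
print("parent sees the mount: %s" % os.path.ismount(target))       # True
os.waitpid(pid, 0)
subprocess.check_call(["umount", target])
os.rmdir(target)
```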
Once the workers are up, mounts done outside the worker are not propagated to the worker, and vice versa. However, global mounts done prior to the worker spawn are propagated to the worker. Even though such a mount is propagated, it should not prevent the original process that created it from unmounting it and cleaning up the directory, yet that is the behaviour on the system where the issue was reproduced (CentOS 7). So the issue is mitigated by doing the mount only after the worker process is up.

```
[root@dhcp42-119 ~]# uname -a
Linux dhcp42-119.lab.eng.blr.redhat.com 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
```

COMMIT: https://review.gluster.org/20476 committed in master by "Aravinda VK" <avishwan> with the following commit message:

```
geo-rep/scheduler: Fix EBUSY trace back

Fix the trace back during temporary mount cleanup.
Temporary mount is done to touch the root required
for checkpoint to complete.

fixes: bz#1598977
Change-Id: I97fea538e92c4ef0747747e981ef98499504e336
Signed-off-by: Kotresh HR <khiremat>
```

This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/
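For completeness, here is a minimal sketch of the mitigation pattern described above: perform the temporary master mount (needed to touch the root so the checkpoint can complete) only after the geo-rep workers are already up, then unmount and remove the directory. This is hypothetical illustration code, not the actual schedule_georep.py change; the glusterfs invocation mirrors the one visible in the ps output earlier in this bug.

```python
# Hypothetical helper sketching the mitigation; volume name is a placeholder.
import os
import subprocess
import tempfile
import time


def touch_master_root(volume, volfile_server="localhost", rmdir_retries=5):
    """Mount <volume> on a temporary dir, touch its root, then clean up."""
    mnt = tempfile.mkdtemp(prefix="georep-touch-")
    try:
        subprocess.check_call([
            "/usr/sbin/glusterfs",
            "--volfile-server", volfile_server,
            "--volfile-id", volume,
            mnt,
        ])
        try:
            # Updating the root's times is what the checkpoint logic waits for.
            os.utime(mnt, None)
        finally:
            subprocess.check_call(["umount", mnt])
    finally:
        # The umount can take a moment to settle; retry the rmdir briefly
        # instead of tracing back with EBUSY.
        for _ in range(rmdir_retries):
            try:
                os.rmdir(mnt)
                break
            except OSError:
                time.sleep(1)


# Intended usage (hypothetical): call touch_master_root() only once the
# geo-replication status reports the workers as up, so the worker mount
# namespaces never inherit the temporary mount.
```

The ordering is the important part: because the workers are already running in their own private mount namespaces when the temporary mount is made, the mount is never copied into them, and the final rmdir no longer fails with EBUSY.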