Bug 1598977
| Summary: | [geo-rep]: geo-replication scheduler is failing due to unsuccessful umount | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Kotresh HR <khiremat> |
| Component: | geo-replication | Assignee: | Kotresh HR <khiremat> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | mainline | CC: | amukherj, bugs, csaba, khiremat, rallan, rhinduja, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vdas, vfarias |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-5.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1568896 | | |
| | 1600941 (view as bug list) | Environment: | |
| Last Closed: | 2018-10-23 15:13:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1568896 | | |
| Bug Blocks: | 1600941 | | |
Description
Kotresh HR
2018-07-07 13:15:14 UTC
Upstream Patch: https://review.gluster.org/20476

REVIEW: https://review.gluster.org/20476 (geo-rep/scheduler: Fix EBUSY trace back) posted (#1) for review on master by Kotresh HR

Analysis:

Thanks Csaba (csaba) for the detailed analysis and help. The following is the analysis done by Csaba.

=================================================================================

When you run the schedule script and it ends up in a mess, there are three bogus ex-mountpoints:

```
[root@dhcp42-119 ~]# rmdir -v /tmp/g* 2>&1 | grep -B1 busy
rmdir: removing directory, ‘/tmp/georepsetup_qKu_20’
rmdir: failed to remove ‘/tmp/georepsetup_qKu_20’: Device or resource busy
rmdir: removing directory, ‘/tmp/gsyncd-aux-mount-1rWuvM’
rmdir: failed to remove ‘/tmp/gsyncd-aux-mount-1rWuvM’: Device or resource busy
rmdir: removing directory, ‘/tmp/gsyncd-aux-mount-SWi0_m’
rmdir: failed to remove ‘/tmp/gsyncd-aux-mount-SWi0_m’: Device or resource busy
```

Let me tell you about two kernel inspection features. Each mount has a "device id" associated with it (unique at any given moment). For any file, the stat(3) result carries the device id of the mount it is within in the st_dev field; stat(1) shows this info as "Device" (see also sys_stat(3p)). You can see mounts with their associated device ids in /proc/self/mountinfo. Also, fuse comes with a fusectl virtual fs (commonly mounted on /sys/fs/fuse/connections/) that collects metadata and some control hooks for the existing fuse mounts, in directories named with the device ids of the respective mounts.

So the following command lists those fuse connections which outlived their mounts, i.e. they are listed in fusectl but not in mountinfo:

```
for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c /proc/self/mountinfo || echo BAD $c; done
```

In this situation you'll get:

```
[root@dhcp42-119 ~]# for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c /proc/self/mountinfo || echo BAD $c; done
BAD 41
BAD 42
BAD 43
```

(Actually these are not necessarily bad: this discrepancy is expected when a lazy umount occurs and the fs survives it because a process is still using it. But in this case, AFAIK, no lazy unmount occurred for any of them; and even if I'm wrong about that, after a lazy unmount the directory should not be busy.)

Let's see which fuse servers are behind these entries, by checking the processes that are using /dev/fuse:

```
[root@dhcp42-119 ~]# lsof /dev/fuse | awk '{print $2}' | grep -v PID | xargs ps ww | grep /tmp
13769 ?  Ssl  0:00 /usr/sbin/glusterfs --volfile-server localhost --volfile-id master -l /var/log/glusterfs/geo-replication/schedule_georep.mount.log /tmp/georepsetup_qKu_20
13863 ?  Ssl  0:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.42.88%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick0%2Fb1.gluster.log --volfile-server=localhost --volfile-id=master --client-pid=-1 /tmp/gsyncd-aux-mount-SWi0_m
13864 ?  Ssl  0:01 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.42.88%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick1%2Fb4.gluster.log --volfile-server=localhost --volfile-id=master --client-pid=-1 /tmp/gsyncd-aux-mount-1rWuvM
```

(We can grep for /tmp because we know the bogus mounts are in /tmp.)
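For reference, the same stale-connection check can be written as a short standalone script. This is a rough Python equivalent of the shell one-liner above, not part of any Gluster tooling; it assumes fusectl is mounted at /sys/fs/fuse/connections/:

```python
#!/usr/bin/env python
# Rough Python equivalent of the shell check above: list fuse connections
# that still have a fusectl entry but no longer back any mount in mountinfo.
# Hypothetical helper, not part of gsyncd or schedule_georep.py.
import os

FUSECTL = "/sys/fs/fuse/connections"


def mounted_anon_minors():
    """Minor device ids of all mounts on the anonymous major 0 (fuse lives here)."""
    minors = set()
    with open("/proc/self/mountinfo") as f:
        for line in f:
            # Field 3 of mountinfo is the "major:minor" device id of the mount.
            major, minor = line.split()[2].split(":")
            if major == "0":
                minors.add(minor)
    return minors


def stale_fuse_connections():
    live = mounted_anon_minors()
    return [c for c in sorted(os.listdir(FUSECTL)) if c not in live]


if __name__ == "__main__":
    for conn in stale_fuse_connections():
        print("BAD %s" % conn)
```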
Continuing the investigation, we can see they are still sitting in a read on /dev/fuse:

```
[root@dhcp42-119 ~]# lsof /dev/fuse | awk '{print $2}' | grep -v PID | xargs ps ww | grep /tmp | \
    awk '{print $1}' | while read l; do head /proc/$l/task/*/stack; done | grep -B1 fuse
==> /proc/13769/task/13846/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
--
==> /proc/13863/task/13884/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
--
==> /proc/13864/task/13888/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
```

When a fuse mount is decommissioned, the fuse server's read call on /dev/fuse should return with ENODEV. That apparently didn't happen here.

```
[root@dhcp42-119 ~]# lsof | grep '0,4[123]' | awk '{print $1" "$2" "$NF }' | uniq
python 13800 /tmp/gsyncd-aux-mount-SWi0_m
python 13802 /tmp/gsyncd-aux-mount-1rWuvM
```

Curiously, there is no reference to one of the bogus dirs, the /tmp/georepsetup* one. Let's see who the culprits are:

```
[root@dhcp42-119 ~]# lsof | grep '0,4[123]' | awk '{print $2}' | uniq | xargs ps ww
  PID TTY      STAT   TIME COMMAND
13800 ?        Sl     0:09 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/bricks/brick0/b1 --path=/bricks/brick1/b4 -c /var/lib/glusterd/geo-replication/master_10.70.42.88_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=161b39fa-8c35-4cf0-b837-1af3d5b87b6f 10.70.42.88::slave -N -p --slave-id 7b190f51-42bf-4378-b6d2-e76b422c448d --feedback-fd 15 --local-path /bricks/brick0/b1 --local-node 10.70.42.119 --local-node-id 161b39fa-8c35-4cf0-b837-1af3d5b87b6f --local-id .%2Fbricks%2Fbrick0%2Fb1 --rpc-fd 11,9,8,13 --subvol-num 1 --resource-remote ssh://root.42.88:gluster://localhost:slave
13802 ?        Sl     0:06 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/bricks/brick0/b1 --path=/bricks/brick1/b4 -c /var/lib/glusterd/geo-replication/master_10.70.42.88_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=161b39fa-8c35-4cf0-b837-1af3d5b87b6f 10.70.42.88::slave -N -p --slave-id 7b190f51-42bf-4378-b6d2-e76b422c448d --feedback-fd 15 --local-path /bricks/brick1/b4 --local-node 10.70.42.119 --local-node-id 161b39fa-8c35-4cf0-b837-1af3d5b87b6f --local-id .%2Fbricks%2Fbrick1%2Fb4 --rpc-fd 11,9,8,12 --subvol-num 2 --resource-remote ssh://root.42.88:gluster://localhost:slave
```

That is, the gsyncd workers which have a --resource-remote option:

```
[root@dhcp42-119 ~]# pgrep -f resource-remote
13800
13802
```

I found that all this badness goes away when geo-rep is stopped, so as a more targeted action we can just try to kill the above processes:

```
[root@dhcp42-119 ~]# pkill -f resource-remote
[root@dhcp42-119 ~]# for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c /proc/self/mountinfo || echo BAD $c; done
[root@dhcp42-119 ~]#
```

All the badness is gone!

================================================================================

Based on Csaba's analysis in comment 3, the master volume mount done by the scheduler script is being used by the geo-rep worker. That is really strange. On deeper analysis, it turned out to be related to the private mount namespace used by the worker: the monitor spawns the worker in a private mount namespace.
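As a side note, what a private mount namespace means for the worker can be shown with a small standalone demo. This is hypothetical illustration code, not gsyncd code; it needs root (CAP_SYS_ADMIN), assumes glibc is reachable as libc.so.6, and simply shows why mounting after the workers are up avoids the problem: a tmpfs mounted by the parent after the child has unshared (and made its mounts private) is not visible inside the child, so it cannot get pinned there.

```python
# Hypothetical demo of mount-namespace isolation; run as root on Linux.
import ctypes
import os
import subprocess
import tempfile
import time

CLONE_NEWNS = 0x00020000      # new mount namespace
MS_REC = 0x4000
MS_PRIVATE = 1 << 18

libc = ctypes.CDLL("libc.so.6", use_errno=True)

target = tempfile.mkdtemp(prefix="propagation-demo-")

pid = os.fork()
if pid == 0:
    # "worker" side: detach into a private mount namespace, conceptually the
    # way the gsyncd monitor isolates its workers.
    if libc.unshare(CLONE_NEWNS) != 0:
        os._exit(1)           # needs root / CAP_SYS_ADMIN
    # Make the copied mounts private so events from the parent namespace
    # (which systemd marks as shared) no longer propagate here.
    libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None)
    time.sleep(2)             # parent mounts a tmpfs meanwhile
    print("worker sees the mount: %s" % os.path.ismount(target))   # False
    os._exit(0)

time.sleep(1)                 # let the child unshare first
subprocess.check_call(["mount", "-t", "tmpfs", "tmpfs", target])
print("parent sees the mount: %s" % os.path.ismount(target))       # True
os.waitpid(pid, 0)
subprocess.check_call(["umount", target])
os.rmdir(target)
```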
Once the workers are up, mounts done outside the worker are not propagated to the worker, and vice versa. However, global mounts done prior to the worker spawn are propagated to the worker. Even though such a mount is propagated, it should not prevent the original process that created it from unmounting it and cleaning up the directory, yet that is the behaviour on the system where the issue was reproduced (CentOS 7). So the issue is mitigated by doing the mount only after the worker process is up.

```
[root@dhcp42-119 ~]# uname -a
Linux dhcp42-119.lab.eng.blr.redhat.com 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
```

COMMIT: https://review.gluster.org/20476 committed in master by "Aravinda VK" <avishwan> with the following commit message:

```
geo-rep/scheduler: Fix EBUSY trace back

Fix the trace back during temporary mount cleanup.
Temporary mount is done to touch the root required
for checkpoint to complete.

fixes: bz#1598977
Change-Id: I97fea538e92c4ef0747747e981ef98499504e336
Signed-off-by: Kotresh HR <khiremat>
```

This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/
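For completeness, here is a minimal sketch of the mitigation pattern described above: perform the temporary master mount (needed to touch the root so the checkpoint can complete) only after the geo-rep workers are already up, then unmount and remove the directory. This is hypothetical illustration code, not the actual schedule_georep.py change; the glusterfs invocation mirrors the one visible in the ps output earlier in this bug.

```python
# Hypothetical helper sketching the mitigation; volume name is a placeholder.
import os
import subprocess
import tempfile
import time


def touch_master_root(volume, volfile_server="localhost", rmdir_retries=5):
    """Mount <volume> on a temporary dir, touch its root, then clean up."""
    mnt = tempfile.mkdtemp(prefix="georep-touch-")
    try:
        subprocess.check_call([
            "/usr/sbin/glusterfs",
            "--volfile-server", volfile_server,
            "--volfile-id", volume,
            mnt,
        ])
        try:
            # Updating the root's times is what the checkpoint logic waits for.
            os.utime(mnt, None)
        finally:
            subprocess.check_call(["umount", mnt])
    finally:
        # The umount can take a moment to settle; retry the rmdir briefly
        # instead of tracing back with EBUSY.
        for _ in range(rmdir_retries):
            try:
                os.rmdir(mnt)
                break
            except OSError:
                time.sleep(1)


# Intended usage (hypothetical): call touch_master_root() only once the
# geo-replication status reports the workers as up, so the worker mount
# namespaces never inherit the temporary mount.
```

The ordering is the important part: because the workers are already running in their own private mount namespaces when the temporary mount is made, the mount is never copied into them, and the final rmdir no longer fails with EBUSY.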