Bug 1568896 - [geo-rep]: geo-replication scheduler is failing due to unsuccessful umount
Summary: [geo-rep]: geo-replication scheduler is failing due to unsuccessful umount
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Kotresh HR
QA Contact: Rochelle
URL:
Whiteboard:
Depends On:
Blocks: 1503137 1598977 1600941
 
Reported: 2018-04-18 11:53 UTC by Rahul Hinduja
Modified: 2018-09-14 05:12 UTC
CC List: 11 users

Fixed In Version: glusterfs-3.12.2-14
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1598977 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:46:03 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2018:2607 (last updated 2018-09-04 06:47:38 UTC)

Description Rahul Hinduja 2018-04-18 11:53:47 UTC
Description of problem:
=======================

What's broken: schedule_georep.py fails to complete the transition.
--------------

What this tool does: 
--------------------
schedule_georep.py is a tool to run Geo-replication on demand. It can be used to schedule Geo-replication to run once a day via cron, for example:

   # Run daily at 08:30pm
   30 20 * * * root python /usr/share/glusterfs/scripts/schedule_georep.py \
      --no-color gv1 fvm1 gv2 >> /var/log/glusterfs/schedule_georep.log 2>&1

This tool does the following (a minimal sketch of this flow appears after the list):

1. Stop Geo-replication if it is already started
2. Start Geo-replication
3. Set a checkpoint
4. Check the status in a loop until the checkpoint is complete
5. Once the checkpoint is complete, stop Geo-replication
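
A minimal sketch of that flow, for orientation only (this is not the script's actual implementation; georep_cmd and run_georep_once are hypothetical names, and the checkpoint-completion parsing is only a placeholder):

import subprocess
import time

def georep_cmd(mastervol, slavehost, slavevol, *args):
    # Hypothetical wrapper around the gluster geo-replication CLI.
    cmd = ["gluster", "volume", "geo-replication", mastervol,
           "{0}::{1}".format(slavehost, slavevol)] + list(args)
    return subprocess.check_output(cmd).decode()

def run_georep_once(mastervol, slavehost, slavevol, poll=10):
    try:
        # 1. Stop the session if it is already started
        georep_cmd(mastervol, slavehost, slavevol, "stop")
    except subprocess.CalledProcessError:
        pass  # session was not running
    # 2. Start the session
    georep_cmd(mastervol, slavehost, slavevol, "start")
    # 3. Set a checkpoint
    georep_cmd(mastervol, slavehost, slavevol, "config", "checkpoint", "now")
    # 4. Poll the status until the checkpoint completes
    while True:
        status = georep_cmd(mastervol, slavehost, slavevol, "status", "detail")
        if "Yes" in status:  # placeholder for real "checkpoint completed" parsing
            break
        time.sleep(poll)
    # 5. Stop the session once the checkpoint is complete
    georep_cmd(mastervol, slavehost, slavevol, "stop")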


Actual error:
-------------
[root@dhcp41-227 ~]# python /usr/share/glusterfs/scripts/schedule_georep.py vol0 10.70.42.9 vol1
[    OK] Stopped Geo-replication
[    OK] Set Checkpoint
[    OK] Started Geo-replication and watching Status for Checkpoint completion
[NOT OK] Unable to Remove temp directory /tmp/georepsetup_xO3cms
rmdir: failed to remove ‘/tmp/georepsetup_xO3cms’: Device or resource busy

[root@dhcp41-227 ~]#

df doesn't show the mount
-------------------------

[root@dhcp41-227 ~]# df
Filesystem                           1K-blocks    Used Available Use% Mounted on
/dev/mapper/rhgs-root                 17811456 2572524  15238932  15% /
devtmpfs                               3992712       0   3992712   0% /dev
tmpfs                                  4004780       0   4004780   0% /dev/shm
tmpfs                                  4004780   17124   3987656   1% /run
tmpfs                                  4004780       0   4004780   0% /sys/fs/cgroup
/dev/sda1                              1038336  163040    875296  16% /boot
tmpfs                                   800956       0    800956   0% /run/user/0
/dev/mapper/RHS_vg1-RHS_lv1            8330240   33524   8296716   1% /rhs/brick1
/dev/mapper/RHS_vg2-RHS_lv2            8330240   33524   8296716   1% /rhs/brick2
/dev/mapper/RHS_vg3-RHS_lv3            8330240   33524   8296716   1% /rhs/brick3
10.70.41.227:/gluster_shared_storage  17811456 2756096  15055360  16% /run/gluster/shared_storage
[root@dhcp41-227 ~]#

glusterfs process for the mount is still running
------------------------------------------------

[root@dhcp41-227 ~]# ps -eaf | grep glusterfs | grep tmp
root     21976     1  0 11:23 ?        00:00:00 /usr/sbin/glusterfs --volfile-server localhost --volfile-id vol0 -l /var/log/glusterfs/geo-replication/schedule_georep.mount.log /tmp/georepsetup_xO3cms
root     22096     1  0 11:23 ?        00:00:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/vol0/ssh%3A%2F%2Froot%4010.70.42.9%3Agluster%3A%2F%2F127.0.0.1%3Avol1.%2Frhs%2Fbrick3%2Fb8.gluster.log --volfile-server=localhost --volfile-id=vol0 --client-pid=-1 /tmp/gsyncd-aux-mount-gnFTXn
root     22098     1  0 11:23 ?        00:00:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/vol0/ssh%3A%2F%2Froot%4010.70.42.9%3Agluster%3A%2F%2F127.0.0.1%3Avol1.%2Frhs%2Fbrick2%2Fb5.gluster.log --volfile-server=localhost --volfile-id=vol0 --client-pid=-1 /tmp/gsyncd-aux-mount-vce8hM
root     22112     1  0 11:23 ?        00:00:00 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/vol0/ssh%3A%2F%2Froot%4010.70.42.9%3Agluster%3A%2F%2F127.0.0.1%3Avol1.%2Frhs%2Fbrick1%2Fb2.gluster.log --volfile-server=localhost --volfile-id=vol0 --client-pid=-1 /tmp/gsyncd-aux-mount-CX9Ct9
[root@dhcp41-227 ~]#

Manual umount also fails
------------------------
[root@dhcp41-227 ~]# umount /tmp/georepsetup_xO3cms
umount: /tmp/georepsetup_xO3cms: not mounted
[root@dhcp41-227 ~]# rmdir /tmp/georepsetup_xO3cms
rmdir: failed to remove ‘/tmp/georepsetup_xO3cms’: Device or resource busy
[root@dhcp41-227 ~]# umount /tmp/georepsetup_xO3cms
umount: /tmp/georepsetup_xO3cms: not mounted
[root@dhcp41-227 ~]# 
[root@dhcp41-227 ~]# echo $?
32
[root@dhcp41-227 ~]# 

Additional information:
-----------------------

1. The script checks whether the umount failed before running rmdir, yet that check passes even though the mount directory is still busy (see the sketch after this list).
2. A manual umount of the directory in question also fails; however, if the script is re-executed, the earlier directory is removed successfully while the new mount directory fails in the same way.
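
For illustration only (this is not the shipped fix; cleanup_mount is a hypothetical helper), a more defensive cleanup would only attempt rmdir once the kernel no longer reports the path as a mount point, retrying and falling back to a lazy umount while the directory is still busy:

import errno
import os
import subprocess
import time

def cleanup_mount(mnt, retries=5, delay=1):
    # Illustrative sketch, not the actual fix: unmount, then keep retrying
    # rmdir until the mount is really released or we run out of attempts.
    subprocess.call(["umount", mnt])
    for _ in range(retries):
        if not os.path.ismount(mnt):
            try:
                os.rmdir(mnt)
                return True
            except OSError as e:
                if e.errno != errno.EBUSY:
                    raise
        # Still busy: try a lazy umount and give the FUSE process a moment.
        subprocess.call(["umount", "-l", mnt])
        time.sleep(delay)
    return False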


Version-Release number of selected component (if applicable):
==============================================================

glusterfs-geo-replication-3.12.2-7.el7rhgs.x86_64

How reproducible:
=================

Always


Steps to Reproduce:
===================
1. Set up geo-replication between the master and slave volumes
2. Run the tool with the master volume, slave host, and slave volume parameters


Actual results:
===============
The tool doesn't complete the transition from "touch mount" to "status complete".

Comment 7 Csaba Henk 2018-05-03 06:30:26 UTC
I don't know of any recent changes in the mount code.

I wonder what the following commands give (where $georep_mount is the stuck mount, i.e. in the above case study it would be /tmp/georepsetup_xO3cms):

ls -l /etc/mtab
grep -n ^ /etc/mtab /proc/self/mounts
namei -x $georep_mount
fuser -m $georep_mount
fuser -m $georep_mount | xargs -r ps ww

Comment 15 Vagner Farias 2018-05-15 23:14:15 UTC
I'm facing the same issue.

[root@gluster-1 ~]# python /usr/share/glusterfs/scripts/schedule_georep.py glustervol dr-gluster-1.dc2.example.com dr-vol
[    OK] Stopped Geo-replication
[    OK] Set Checkpoint
[    OK] Started Geo-replication and watching Status for Checkpoint completion
[NOT OK] Unable to Remove temp directory /tmp/georepsetup_GnvrcD
rmdir: failed to remove ‘/tmp/georepsetup_GnvrcD’: Device or resource busy

[root@gluster-1 ~]# ps -eaf | grep glusterfs | grep tmp
root      6975     1  0 19:04 ?        00:00:00 /usr/sbin/glusterfs --volfile-server localhost --volfile-id glustervol -l /var/log/glusterfs/geo-replication/schedule_georep.mount.log /tmp/georepsetup_GnvrcD
root      7026     1  5 19:04 ?        00:00:06 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/glustervol/ssh%3A%2F%2Froot%4010.1.0.31%3Agluster%3A%2F%2F127.0.0.1%3Adr-vol.%2Fgluster%2Fbrick1%2Fglusterbrick1.gluster.log --volfile-server=localhost --volfile-id=glustervol --client-pid=-1 /tmp/gsyncd-aux-mount-A4yHfm

[root@gluster-1 ~]# rpm -q glusterfs
glusterfs-3.8.4-54.8.el7rhgs.x86_64

It's failing every time.

I don't know if it matters, but I've already executed "failover and failback" a couple of times before, waiting for checkpoints to complete instead of using this tool.

Comment 16 Vagner Farias 2018-05-16 00:41:11 UTC
A bit of information that may be useful in debugging.

I manually executed commands that I think the python code[0] is trying to execute:

[root@gluster-1 ~]# mkdir /tmp/test_aaa && /usr/sbin/glusterfs --volfile-server localhost --volfile-id glustervol -l /tmp/mylog /tmp/test_aaa && touch /tmp/test_aaa && umount /tmp/test_aaa && rmdir /tmp/test_aaa
[root@gluster-1 ~]# echo $?
0
[root@gluster-1 ~]# ps -eaf | grep glusterfs | grep test_aaa
[root@gluster-1 ~]# 

In my logfile I get this:

~~~
[2018-05-16 00:24:37.060416] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2018-05-16 00:24:37.062851] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-glustervol-replicate-0: selecting local read_child glustervol-client-0
[2018-05-16 00:24:37.076260] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: initating unmount of /tmp/test_aaa
[2018-05-16 00:24:37.076840] W [glusterfsd.c:1300:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f60f45f0dd5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55a134b880c5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55a134b87eeb] ) 0-: received signum (15), shutting down
[2018-05-16 00:24:37.076897] I [fuse-bridge.c:5834:fini] 0-fuse: Unmounting '/tmp/test_aaa'.
[2018-05-16 00:24:37.076919] I [fuse-bridge.c:5839:fini] 0-fuse: Closing fuse connection to '/tmp/test_aaa'.
~~~

Note the lines showing that the directory is being unmounted and the fuse connection to the directory is being closed.

Now, look at the relevant part of the log when I run the python code:

~~~
[2018-05-16 00:17:55.593833] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2018-05-16 00:17:55.595851] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-glustervol-replicate-0: selecting local read_child glustervol-client-0
~~~

Note that there's nothing in the log file showing that the directory is being unmounted, although it is in fact unmounted by the time the python program exits. But even though the directory is unmounted, the glusterfs process is still running and the directory cannot be removed.
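
To confirm that state (the mount gone from the kernel's table while a glusterfs process still references the directory), something along these lines could be run; this is only a diagnostic sketch, and the path is the one from the failure above:

import glob

def mount_state(mnt):
    # Is the path still listed as a mount point in the kernel's view?
    with open("/proc/self/mounts") as f:
        mounted = any(line.split()[1] == mnt for line in f)
    # Which processes still carry the path on their command line?
    holders = []
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            with open(path) as f:
                if mnt in f.read().replace("\0", " "):
                    holders.append(path.split("/")[2])
        except IOError:
            pass  # process exited while we were scanning
    return mounted, holders

print(mount_state("/tmp/georepsetup_GnvrcD"))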

I'm running RHGS on RHEL 7.5 if that matters.

[root@gluster-1 ~]# rpm -q glusterfs kernel python
glusterfs-3.8.4-54.8.el7rhgs.x86_64
kernel-3.10.0-693.17.1.el7.x86_64
kernel-3.10.0-862.2.3.el7.x86_64
python-2.7.5-68.el7.x86_64


[0]
<..snip..>
def execute(cmd, success_msg="", failure_msg="", exitcode=-1):
    """
    Generic wrapper to execute the CLI commands. Returns Output if success.
    On success it can print message in stdout if specified.
    On failure, exits after writing to stderr.
    """
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    if p.returncode == 0:
        if success_msg:
            output_ok(success_msg)
        return out
    else:
        err_msg = err if err else out
        output_notok(failure_msg, err=err_msg, exitcode=exitcode)

<..snip..>

def cleanup(hostname, volname, mnt):
    """
    Unmount the Volume and Remove the temporary directory
    """
    execute(["/usr/bin/umount", mnt],
            failure_msg="Unable to Unmount Gluster Volume "
            "{0}:{1}(Mounted at {2})".format(hostname, volname, mnt))
    execute(["rmdir", mnt],
            failure_msg="Unable to Remove temp directory "
            "{0}".format(mnt))

@contextmanager
def glustermount(hostname, volname):
    """
    Context manager for Mounting Gluster Volume
    Use as
        with glustermount(HOSTNAME, VOLNAME) as MNT:
            # Do your stuff
    Automatically unmounts it in case of Exceptions/out of context
    """
    mnt = tempfile.mkdtemp(prefix="georepsetup_")
    execute(["/usr/sbin/glusterfs",
             "--volfile-server", hostname,
             "--volfile-id", volname,
             "-l", SESSION_MOUNT_LOG_FILE,
             mnt],
            failure_msg="Unable to Mount Gluster Volume "
            "{0}:{1}".format(hostname, volname))
    if os.path.ismount(mnt):
        yield mnt
    else:
        output_notok("Unable to Mount Gluster Volume "
                     "{0}:{1}".format(hostname, volname))
    cleanup(hostname, volname, mnt)

<..snip..>

def touch_mount_root(mastervol):
    # Create a Mount and Touch the Mount point root,
    # Hack to make sure some event is available after
    # setting Checkpoint. Without this there is a chance that the
    # Checkpoint never completes.
    with glustermount("localhost", mastervol) as mnt:
        execute(["touch", mnt])

Comment 34 errata-xmlrpc 2018-09-04 06:46:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

