Bug 1171662 - libgfapi crashes in glfs_fini for RDMA type volumes
Summary: libgfapi crashes in glfs_fini for RDMA type volumes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: samba
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Ira Cooper
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard: gfapi
Depends On: 1153610
Blocks: 1202842
 
Reported: 2014-12-08 10:30 UTC by Anoop C S
Modified: 2015-07-29 04:37 UTC (History)
13 users

Fixed In Version: glusterfs-3.7.1-1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1153610
Environment:
Last Closed: 2015-07-29 04:37:02 UTC
Embargoed:


Attachments (Terms of Use)
gfapi_perf_test.c (29.75 KB, text/x-csrc)
2015-07-20 11:23 UTC, Nag Pavan Chilakam


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:1495 0 normal SHIPPED_LIVE Important: Red Hat Gluster Storage 3.1 update 2015-07-29 08:26:26 UTC

Description Anoop C S 2014-12-08 10:30:01 UTC
+++ This bug was initially created as a clone of Bug #1153610 +++

Description of problem:

C program which uses libgfapi for RDMA volume crashes in glfs_fini().

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create a 1-brick RDMA volume.
2. Compile the attached C program that uses libgfapi
gcc -pthread -g -O0 -Wall --pedantic -o gfapi_perf_test -I /usr/include/glusterfs/api gfapi_perf_test.c -lgfapi -lrt

3. Do the following exports
export LD_LIBRARY_PATH=/usr/local/lib
export GFAPI_HOSTNAME=<server ip>
export GFAPI_TRANSPORT=rdma
export GFAPI_VOLNAME=<volume name>

4. Run the compiled output as
GFAPI_FILES=1 GFAPI_RECSZ=1024 GFAPI_FSZ=1048576 ./gfapi_perf_test

Actual results:
Segmentation fault (core dumped)

Expected results:
Program executes successfully.


Additional info:

--- Additional comment from Anoop C S on 2014-11-03 01:23:39 EST ---

The root cause of the crash was identified as follows:

When main() returns in a C program, functions registered with __attribute__((destructor)) are invoked, in priority order, before the process terminates. In librdmacm, rdma_cm_fini() is defined as such a destructor. The same function is also invoked as part of the rdma_disconnect() initiated through xlator_notify() inside glfs_fini().

Because rdma_disconnect() and its associated cleanup take some time, the main thread and the RDMA thread race to execute rdma_cm_fini(), which results in a segmentation fault.

--- Additional comment from Anand Avati on 2014-11-05 23:45:26 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#1) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-11-06 12:14:58 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#2) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-11-06 23:35:12 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#3) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-11-27 04:28:57 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#4) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-11-27 08:30:46 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#5) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-11-28 05:23:47 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#6) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-01 02:47:28 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#7) for review on master by Poornima G (pgurusid)

--- Additional comment from Anand Avati on 2014-12-02 01:23:21 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#8) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-02 04:23:54 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#9) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-03 06:55:55 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#10) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-03 09:32:38 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#11) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-03 12:09:47 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#12) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-04 07:40:34 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#13) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-04 09:58:54 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#14) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-05 03:39:42 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#15) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-05 08:03:28 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#16) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-05 08:23:56 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#17) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-05 12:05:58 EST ---

REVIEW: http://review.gluster.org/9060 (libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()) posted (#18) for review on master by Anoop C S (achiraya)

--- Additional comment from Anand Avati on 2014-12-08 04:54:52 EST ---

COMMIT: http://review.gluster.org/9060 committed in master by Niels de Vos (ndevos) 
------
commit cd6ffa93dc2a3cb1fcc5438086aebc54f368c2e9
Author: Anoop C S <achiraya>
Date:   Wed Oct 29 09:12:46 2014 -0400

    libgfapi: Wait for GF_EVENT_CHILD_DOWN in glfs_fini()
    
    Whenever glfs_fini() is being called, currently no
    check is made inside the function to determine whether
    the child is already down or not. This patch will wait
    for GF_EVENT_CHILD_DOWN for the active subvol and
    then exits.
    
    TBD:
    Apart from the active subvol, wait for other CHILD_DOWN
    events generated through operations like volume set in
    future.
    
    Change-Id: I81c64ac07b463bfed48bf306f9e8f46ba0f0a76f
    BUG: 1153610
    Signed-off-by: Anoop C S <achiraya>
    Reviewed-on: http://review.gluster.org/9060
    Reviewed-by: Shyamsundar Ranganathan <srangana>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Reviewed-by: Niels de Vos <ndevos>

Comment 2 Poornima G 2015-03-25 12:25:43 UTC
Fixed as part of the fix for bug 1104039; patch merged at https://code.engineering.redhat.com/gerrit/#/c/42957/

Moving it to ON_QA

Comment 5 Nag Pavan Chilakam 2015-07-20 11:16:19 UTC
I ran the steps which caused the bug and found that it is fixed. Hence moving to verified.
Steps:
1. Identified machines with RDMA cards to be used as the server and clients. Also installed the glusterfs-devel and glusterfs-api-devel packages on the client and server.
2. Enabled RDMA on both server and client using http://fpaste.org/245970/73750811/
3. Created a one-brick volume with transport type RDMA and started it. The brick was specified using the RDMA IP.
4. Copied the C program gfapi_perf_test.c to the client (available on the parent bug, from which this bug was cloned).
5. Compiled the C program, which uses libgfapi:
gcc -pthread -g -O0 -Wall --pedantic -o gfapi_perf_test -I /usr/include/glusterfs/api gfapi_perf_test.c -lgfapi -lrt

6. Issued the following exports on the client shell, as-is:
export GFAPI_HOSTNAME=<server ip> (RDMA IP of the server node)
export GFAPI_TRANSPORT=rdma
export GFAPI_VOLNAME=<volume name>

7. Created a directory "tmp" under /mnt, as the program requires this directory.
8. Ran the compiled binary:
GFAPI_FILES=1 GFAPI_RECSZ=1024 GFAPI_FSZ=1048576 ./gfapi_perf_test

9. Mounted the volume to confirm that files were created by the C program under /mnt/tmp.
Behavior before the bug was fixed:
===================================
Segmentation fault (core dumped)

Fixed behavior:
==============
Program executes successfully.



Version on which QA verified the fix:


[root@rhs-client21 rdma_new]# cat /etc/redhat-*
Red Hat Enterprise Linux Server release 6.7 (Santiago)
Red Hat Gluster Storage Server 3.1
[root@rhs-client21 rdma_new]# rpm -qa|grep gluster
gluster-nagios-common-0.2.0-1.el6rhs.noarch
gluster-nagios-addons-0.2.4-2.el6rhs.x86_64
glusterfs-3.7.1-10.el6rhs.x86_64
glusterfs-cli-3.7.1-10.el6rhs.x86_64
python-gluster-3.7.1-10.el6rhs.x86_64
glusterfs-libs-3.7.1-10.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-10.el6rhs.x86_64
glusterfs-fuse-3.7.1-10.el6rhs.x86_64
glusterfs-server-3.7.1-10.el6rhs.x86_64
glusterfs-rdma-3.7.1-10.el6rhs.x86_64
glusterfs-ganesha-3.7.1-10.el6rhs.x86_64
vdsm-gluster-4.16.20-1.2.el6rhs.noarch
glusterfs-api-3.7.1-10.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-10.el6rhs.x86_64
nfs-ganesha-gluster-2.2.0-5.el6rhs.x86_64
[root@rhs-client21 rdma_new]# gluster --version
glusterfs 3.7.1 built on Jul 15 2015 10:20:49
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@rhs-client21 rdma_new]#

Comment 6 Nag Pavan Chilakam 2015-07-20 11:22:03 UTC
(For information)QE sosreports for verification at qe-admin@rhsqe-repo:/home/repo/sosreports/bug.1171662

CLI logs:
Latest results on successful run

[root@rhs-client36 ~]# mount -t glusterfs 192.168.44.124:/rdma_new
/mnt/
[root@rhs-client36 ~]# ls /mnt/
[root@rhs-client36 ~]# mkdir /mnt/tmp
[root@rhs-client36 ~]# umount /mnt/
[root@rhs-client36 ~]# GFAPI_FILES=1 GFAPI_RECSZ=1024 GFAPI_FSZ=1048576
./gfapi_perf_test
GLUSTER: 
  volume=rdma_new
  transport=rdma
  host=192.168.44.124
  port=24007
  fuse?No
  trace level=0
  start timeout=60
WORKLOAD:
  type = seq-wr 
  threads/proc = 1
  base directory = /tmp
  prefix=f
  file size = 1048576 KB
  file count = 1
  record size = 1024 KB
  files/dir=1000
  fsync-at-close? No 
thread   0:   files written = 1
  files done = 1
  I/O (record) transfers = 1024
  total bytes = 1073741824
  elapsed time    = 1.85      sec
  throughput      = 554.35    MB/sec
  IOPS            = 554.35    (sequential write)
aggregate:   files written = 1
  files done = 1
  I/O (record) transfers = 1024
  total bytes = 1073741824
  elapsed time    = 1.85      sec
  throughput      = 554.35    MB/sec
  IOPS            = 554.35    (sequential write)
[root@rhs-client36 ~]# mount -t glusterfs 192.168.44.124:/rdma_new
/mnt/
[root@rhs-client36 ~]# ls /mnt/tmp/
thrd000-d0000
[root@rhs-client36 ~]# ls /mnt/tmp/thrd000-d0000/
f.0000000

Comment 7 Nag Pavan Chilakam 2015-07-20 11:23:23 UTC
Created attachment 1053839 [details]
gfapi_perf_test.c

Comment 9 errata-xmlrpc 2015-07-29 04:37:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

