Bug 1360553 - Gluster fuse client crashed generating core dump
Summary: Gluster fuse client crashed generating core dump
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: transport
Version: 3.7.13
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard:
Depends On: 1343320 1343374
Blocks:
Reported: 2016-07-27 04:33 UTC by Nithya Balachandran
Modified: 2016-08-02 07:25 UTC (History)

Fixed In Version: glusterfs-3.7.14
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1343374
Environment:
Last Closed: 2016-08-02 07:25:06 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Nithya Balachandran 2016-07-27 04:33:27 UTC
+++ This bug was initially created as a clone of Bug #1343374 +++

+++ This bug was initially created as a clone of Bug #1343320 +++

Description of problem:
Client crash with core dump due to excessive memory consumption


Version-Release number of selected component (if applicable):

3.7.5-19.el7rhgs.x86_64
RHEL 5

Additional info:
Many DNS resolution errors were found in the client logs.

The logs show continuous error messages:
[2016-04-27 10:33:29.833969] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:32.843124] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:35.850581] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:38.858181] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:41.865251] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
The message "E [MSGID: 101075] [common-utils.c:306:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)" repeated 39 times between [2016-04-27 10:31:44.561995] and [2016-04-27 10:33:41.865245]
[2016-04-27 10:33:44.873510] E [MSGID: 101075] [common-utils.c:306:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2016-04-27 10:33:44.873599] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:47.881687] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:50.890768] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
.
.
.
.
.
 
After some time (almost 27 hours later):
 
[2016-04-28 13:47:23.002272] E [socket.c:3124:socket_connect] 0-vol01-client-1: pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:23.002528] E [socket.c:3126:socket_connect] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65] -->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a] -->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-: Assertion failed: 0
[2016-04-28 13:47:26.008762] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1

[2016-04-28 13:47:26.008933] E [socket.c:3124:socket_connect] 0-vol01-client-1: pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:26.009134] E [socket.c:3126:socket_connect] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65] -->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a] -->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-: Assertion failed: 0
[2016-04-28 13:47:29.015862] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1



.
.
.

This continued for almost a week, followed by:

[2016-05-15 04:12:17.272132] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
[2016-05-15 04:12:18.863904] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
.
.
.
.
.
.
[2016-05-15 04:12:31.572526] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (124) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(mem_get+0xb8)[0x3bb805be98]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
.
.
.
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-05-15 04:12:31
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x338)[0x3bb8042378]
/lib64/libc.so.6[0x34f2030030]
/usr/lib64/libglusterfs.so.0(mem_get+0x6e)[0x3bb805be4e]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
/usr/lib64/libglusterfs.so.0(get_new_data+0x20)[0x3bb801f260]
/usr/lib64/libglusterfs.so.0(dict_unserialize+0xf4)[0x3bb801f374]
/usr/lib64/glusterfs/3.7.5/xlator/protocol/client.so(client3_3_lookup_cbk+0x7bc)[0x2b2a741f5acc]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa0)[0x3bb7c0fa70]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1b4)[0x3bb7c0fd34]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x3bb7c0b517]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a3f68]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a4994]
/usr/lib64/libglusterfs.so.0[0x3bb808b363]
/lib64/libpthread.so.0[0x34f280683d]
/lib64/libc.so.6(clone+0x6d)[0x34f20d4fcd]

--- Additional comment from Nithya Balachandran on 2016-06-07 04:50:10 EDT ---

RCA:

There is a memory leak in the socket_connect code in case of failure. 

In socket_connect ():

        /* if sock != -1, then cleanup is done from the event handler */
        if (ret == -1 && sock == -1) {
                /* Cleanup requires sending a notification to the upper layer,
                   which in turn holds the big_lock. There can be a deadlock
                   if big_lock is already held by the current thread.
                   So transfer the ownership to a separate thread for cleanup.
                */
                arg = GF_CALLOC (1, sizeof (*arg),
                                 gf_sock_connect_error_state_t);
                arg->this = THIS;
                arg->trans = this;
                arg->refd = refd;
                th_ret = pthread_create (&th_id, NULL, socket_connect_error_cbk,
                                         arg);
                if (th_ret) {
                        gf_log (this->name, GF_LOG_ERROR, "pthread_create"
                                "failed: %s", strerror(errno));
                        GF_FREE (arg);
                        GF_ASSERT (0);
                }
        }


pthread_create creates a joinable thread by default. Because the thread is never joined or detached, its resources are not released when it exits. socket_connect is called at 3-second intervals, so the leaked resources quickly add up and the process eventually runs out of memory.
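
As a rough illustration of how fast this accumulates, the sketch below counts the threads that could be left behind after the roughly 27 hours mentioned in the report. It assumes that every reconnect attempt takes the failure path and spawns one thread, so the figure is an upper bound. The function names are made up for this example and are not part of GlusterFS.

```c
#include <stdio.h>

/* Upper bound on the number of never-joined threads when one thread is
 * spawned per failed reconnect attempt (assumption: every attempt fails). */
long leaked_threads(long elapsed_seconds, long interval_seconds)
{
    if (interval_seconds <= 0)
        return 0;
    return elapsed_seconds / interval_seconds;
}

/* Print the estimate for the "almost 27 hours" seen in the client logs. */
void print_leak_estimate(void)
{
    long hours = 27;
    long n = leaked_threads(hours * 3600, 3); /* 3 s reconnect interval */

    printf("up to %ld leaked threads after %ld hours\n", n, hours);
}
```

Calling print_leak_estimate() prints "up to 32400 leaked threads after 27 hours". Each of those threads keeps its resources allocated until it is joined or detached.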

--- Additional comment from Vijay Bellur on 2016-06-07 04:56:31 EDT ---

REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not cleanup up) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Nithya Balachandran on 2016-06-07 05:01:17 EDT ---

Fix:

Create a detached thread so all thread resources are cleaned up automatically.
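
A minimal sketch of one standard POSIX way to create a detached thread is shown below. It is not the actual GlusterFS patch. spawn_detached and cleanup_worker are hypothetical names, with cleanup_worker standing in for socket_connect_error_cbk. The semaphore is used only so a caller can observe that the worker ran.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical worker: signals the semaphore passed in arg and exits.
 * Because the thread is detached, its resources are released on exit
 * without any pthread_join call. */
void *cleanup_worker(void *arg)
{
    sem_post((sem_t *)arg);
    return NULL;
}

/* Start fn(arg) in a detached thread. Returns 0 on success or the
 * error number reported by the pthread call that failed. */
int spawn_detached(void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    pthread_t tid;
    int ret;

    ret = pthread_attr_init(&attr);
    if (ret != 0)
        return ret;

    ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (ret == 0)
        ret = pthread_create(&tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);

    /* pthread functions return the error number; they do not set errno. */
    if (ret != 0)
        fprintf(stderr, "pthread_create failed: %s\n", strerror(ret));

    return ret;
}
```

A caller would initialise a semaphore with sem_init(&done, 0, 0), call spawn_detached(cleanup_worker, &done), and wait with sem_wait(&done). An alternative with the same effect is to call pthread_detach on the thread after creating it with default attributes.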

--- Additional comment from Vijay Bellur on 2016-06-07 05:09:38 EDT ---

REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not cleaned up) posted (#2) for review on master by N Balachandran (nbalacha)

--- Additional comment from Vijay Bellur on 2016-07-08 01:18:43 EDT ---

REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Vijay Bellur on 2016-07-08 01:22:40 EDT ---

REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#2) for review on master by N Balachandran (nbalacha)

--- Additional comment from Vijay Bellur on 2016-07-08 01:54:26 EDT ---

REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#3) for review on master by N Balachandran (nbalacha)

--- Additional comment from Vijay Bellur on 2016-07-08 16:17:16 EDT ---

COMMIT: http://review.gluster.org/14875 committed in master by Jeff Darcy (jdarcy) 
------
commit 9886d568a7a8839bf3acc81cb1111fa372ac5270
Author: N Balachandran <nbalacha>
Date:   Fri Jul 8 10:46:46 2016 +0530

    rpc/socket: pthread resources are not cleaned up
    
    A socket_connect failure creates a new pthread which
    is not a detached thread. As no pthread_join is called,
    the thread resources are not cleaned up causing a memory leak.
    
    Now, socket_connect creates a detached thread to handle failure.
    
    Change-Id: Idbf25d312f91464ae20c97d501b628bfdec7cf0c
    BUG: 1343374
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: http://review.gluster.org/14875
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>

Comment 1 Vijay Bellur 2016-07-27 05:03:43 UTC
REVIEW: http://review.gluster.org/15019 (rpc/socket: pthread resources are not cleaned up) posted (#1) for review on release-3.7 by N Balachandran (nbalacha)

Comment 2 Vijay Bellur 2016-07-27 10:13:02 UTC
COMMIT: http://review.gluster.org/15019 committed in release-3.7 by Raghavendra G (rgowdapp) 
------
commit f32fd3b0807e9eeeb3e7deb664459493a099010f
Author: N Balachandran <nbalacha>
Date:   Wed Jul 27 09:59:20 2016 +0530

    rpc/socket: pthread resources are not cleaned up
    
    A socket_connect failure creates a new pthread which
    is not a detached thread. As no pthread_join is called,
    the thread resources are not cleaned up causing a memory leak.
    
    Now, socket_connect creates a detached thread to handle failure.
    
    > Change-Id: Idbf25d312f91464ae20c97d501b628bfdec7cf0c
    > BUG: 1343374
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: http://review.gluster.org/14875
    > Smoke: Gluster Build System <jenkins.org>
    > Reviewed-by: Atin Mukherjee <amukherj>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Jeff Darcy <jdarcy>
    (cherry picked from commit 9886d568a7a8839bf3acc81cb1111fa372ac5270)
    
    Change-Id: If0a65c50fef2a32148cf3a1d7992e63f044bf0ad
    BUG: 1360553
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: http://review.gluster.org/15019
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Tested-by: Oleksandr Natalenko <oleksandr>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 3 Kaushal 2016-08-02 07:25:06 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.14, please open a new bug report.

glusterfs-3.7.14 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-devel/2016-August/050319.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

