+++ This bug was initially created as a clone of Bug #1343320 +++

Description of problem:
Client crash with core dump due to excessive memory consumption

Version-Release number of selected component (if applicable):
3.7.5-19.el7rhgs.x86_64
RHEL 5

Additional info:
Lots of DNS resolution errors found in the client logs. The client log shows these error messages repeating continuously:

[2016-04-27 10:33:29.833969] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:32.843124] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:35.850581] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:38.858181] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:41.865251] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
The message "E [MSGID: 101075] [common-utils.c:306:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)" repeated 39 times between [2016-04-27 10:31:44.561995] and [2016-04-27 10:33:41.865245]
[2016-04-27 10:33:44.873510] E [MSGID: 101075] [common-utils.c:306:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2016-04-27 10:33:44.873599] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:47.881687] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:50.890768] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
...
After some time (almost 27 hours later):

[2016-04-28 13:47:23.002272] E [socket.c:3124:socket_connect] 0-vol01-client-1: pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:23.002528] E [socket.c:3126:socket_connect] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65] -->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a] -->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-: Assertion failed: 0
[2016-04-28 13:47:26.008762] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-28 13:47:26.008933] E [socket.c:3124:socket_connect] 0-vol01-client-1: pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:26.009134] E [socket.c:3126:socket_connect] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65] -->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a] -->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-: Assertion failed: 0
[2016-04-28 13:47:29.015862] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
...

This continued for almost a week, followed by:
[2016-05-15 04:12:17.272132] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
[2016-05-15 04:12:18.863904] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
...
[2016-05-15 04:12:31.572526] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (124) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(mem_get+0xb8)[0x3bb805be98]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
...
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2016-05-15 04:12:31
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x338)[0x3bb8042378]
/lib64/libc.so.6[0x34f2030030]
/usr/lib64/libglusterfs.so.0(mem_get+0x6e)[0x3bb805be4e]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
/usr/lib64/libglusterfs.so.0(get_new_data+0x20)[0x3bb801f260]
/usr/lib64/libglusterfs.so.0(dict_unserialize+0xf4)[0x3bb801f374]
/usr/lib64/glusterfs/3.7.5/xlator/protocol/client.so(client3_3_lookup_cbk+0x7bc)[0x2b2a741f5acc]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa0)[0x3bb7c0fa70]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1b4)[0x3bb7c0fd34]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x3bb7c0b517]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a3f68]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a4994]
/usr/lib64/libglusterfs.so.0[0x3bb808b363]
/lib64/libpthread.so.0[0x34f280683d]
/lib64/libc.so.6(clone+0x6d)[0x34f20d4fcd]
RCA: There is a memory leak in the socket_connect() code in the failure path. In socket_connect():

    /* if sock != -1, then cleanup is done from the event handler */
    if (ret == -1 && sock == -1) {
            /* Cleaup requires to send notification to upper layer
             * which intern holds the big_lock. There can be dead-lock
             * situation if big_lock is already held by the current
             * thread. So transfer the ownership to seperate thread for
             * cleanup.
             */
            arg = GF_CALLOC (1, sizeof (*arg),
                             gf_sock_connect_error_state_t);
            arg->this = THIS;
            arg->trans = this;
            arg->refd = refd;
            th_ret = pthread_create (&th_id, NULL,
                                     socket_connect_error_cbk, arg);
            if (th_ret) {
                    gf_log (this->name, GF_LOG_ERROR, "pthread_create"
                            "failed: %s", strerror(errno));
                    GF_FREE (arg);
                    GF_ASSERT (0);
            }
    }

pthread_create() does not create a detached thread, and nothing ever calls pthread_join() on it, so each thread's resources (stack and control block) are never reclaimed. socket_connect() is retried at 3-second intervals while DNS resolution keeps failing, so the leaked threads quickly add up and the process runs out of memory.
REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not cleanup up) posted (#1) for review on master by N Balachandran (nbalacha)
Fix: Create the error-handling thread as a detached thread, so that its resources are reclaimed automatically when it exits.
REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not cleaned up) posted (#2) for review on master by N Balachandran (nbalacha)
REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#1) for review on master by N Balachandran (nbalacha)
REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#2) for review on master by N Balachandran (nbalacha)
REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#3) for review on master by N Balachandran (nbalacha)
COMMIT: http://review.gluster.org/14875 committed in master by Jeff Darcy (jdarcy)
------
commit 9886d568a7a8839bf3acc81cb1111fa372ac5270
Author: N Balachandran <nbalacha>
Date:   Fri Jul 8 10:46:46 2016 +0530

    rpc/socket: pthread resources are not cleaned up

    A socket_connect failure creates a new pthread which is not a
    detached thread. As no pthread_join is called, the thread
    resources are not cleaned up, causing a memory leak. Now,
    socket_connect creates a detached thread to handle failure.

    Change-Id: Idbf25d312f91464ae20c97d501b628bfdec7cf0c
    BUG: 1343374
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: http://review.gluster.org/14875
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/