+++ This bug was initially created as a clone of Bug #1343320 +++

Description of problem:
Client crash with core dump due to excessive memory consumption

Version-Release number of selected component (if applicable):
3.7.5-19.el7rhgs.x86_64
RHEL 5

Additional info:
Lots of DNS resolution errors found in the client logs. The client log shows these error messages repeating continuously:

[2016-04-27 10:33:29.833969] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:32.843124] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:35.850581] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:38.858181] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:41.865251] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
The message "E [MSGID: 101075] [common-utils.c:306:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)" repeated 39 times between [2016-04-27 10:31:44.561995] and [2016-04-27 10:33:41.865245]
[2016-04-27 10:33:44.873510] E [MSGID: 101075] [common-utils.c:306:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2016-04-27 10:33:44.873599] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:47.881687] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:50.890768] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
...
After some time (almost 27 hours later):

[2016-04-28 13:47:23.002272] E [socket.c:3124:socket_connect] 0-vol01-client-1: pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:23.002528] E [socket.c:3126:socket_connect] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65] -->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a] -->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-: Assertion failed: 0
[2016-04-28 13:47:26.008762] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-28 13:47:26.008933] E [socket.c:3124:socket_connect] 0-vol01-client-1: pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:26.009134] E [socket.c:3126:socket_connect] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65] -->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a] -->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-: Assertion failed: 0
[2016-04-28 13:47:29.015862] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
...

This continued for almost a week, followed by:
[2016-05-15 04:12:17.272132] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
[2016-05-15 04:12:18.863904] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
...
[2016-05-15 04:12:31.572526] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (124) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(mem_get+0xb8)[0x3bb805be98]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
...
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2016-05-15 04:12:31
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x338)[0x3bb8042378]
/lib64/libc.so.6[0x34f2030030]
/usr/lib64/libglusterfs.so.0(mem_get+0x6e)[0x3bb805be4e]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
/usr/lib64/libglusterfs.so.0(get_new_data+0x20)[0x3bb801f260]
/usr/lib64/libglusterfs.so.0(dict_unserialize+0xf4)[0x3bb801f374]
/usr/lib64/glusterfs/3.7.5/xlator/protocol/client.so(client3_3_lookup_cbk+0x7bc)[0x2b2a741f5acc]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa0)[0x3bb7c0fa70]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1b4)[0x3bb7c0fd34]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x3bb7c0b517]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a3f68]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a4994]
/usr/lib64/libglusterfs.so.0[0x3bb808b363]
/lib64/libpthread.so.0[0x34f280683d]
/lib64/libc.so.6(clone+0x6d)[0x34f20d4fcd]
RCA: There is a memory leak in the socket_connect() code in the failure path. In socket_connect():

    /* if sock != -1, then cleanup is done from the event handler */
    if (ret == -1 && sock == -1) {
            /* Cleaup requires to send notification to upper layer
             * which intern holds the big_lock. There can be dead-lock
             * situation if big_lock is already held by the current
             * thread. So transfer the ownership to seperate thread for
             * cleanup.
             */
            arg = GF_CALLOC (1, sizeof (*arg),
                             gf_sock_connect_error_state_t);
            arg->this = THIS;
            arg->trans = this;
            arg->refd = refd;
            th_ret = pthread_create (&th_id, NULL,
                                     socket_connect_error_cbk, arg);
            if (th_ret) {
                    gf_log (this->name, GF_LOG_ERROR, "pthread_create"
                            "failed: %s", strerror(errno));
                    GF_FREE (arg);
                    GF_ASSERT (0);
            }
    }

pthread_create() does not create a detached thread, and nothing ever calls pthread_join() on it, so each thread's resources (stack and control block) are never reclaimed. socket_connect() is retried at 3-second intervals while DNS resolution keeps failing, so the leaked threads quickly add up and the process runs out of memory.
REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not cleanup up) posted (#1) for review on master by N Balachandran (nbalacha)
Fix: Create the error-handling thread as a detached thread, so that its resources are reclaimed automatically when it exits.
REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not cleaned up) posted (#2) for review on master by N Balachandran (nbalacha)
REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#1) for review on master by N Balachandran (nbalacha)
REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#2) for review on master by N Balachandran (nbalacha)
REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#3) for review on master by N Balachandran (nbalacha)
COMMIT: http://review.gluster.org/14875 committed in master by Jeff Darcy (jdarcy)
------
commit 9886d568a7a8839bf3acc81cb1111fa372ac5270
Author: N Balachandran <nbalacha>
Date:   Fri Jul 8 10:46:46 2016 +0530

    rpc/socket: pthread resources are not cleaned up

    A socket_connect failure creates a new pthread which is not a
    detached thread. As no pthread_join is called, the thread
    resources are not cleaned up, causing a memory leak. Now,
    socket_connect creates a detached thread to handle failure.

    Change-Id: Idbf25d312f91464ae20c97d501b628bfdec7cf0c
    BUG: 1343374
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: http://review.gluster.org/14875
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/