Description of problem:
GlusterFS client invoked OOM-Killer while compiling kernel. No other processes were running on the client at this time. Possible memory leak. Nothing reported in the logs, except the messages about disconnected client.

=== SNIP =====
[2014-01-27 09:17:48.299614] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-master-client-0: server 10.70.37.131:49152 has not responded in the last 42 seconds, disconnecting.
[2014-01-27 09:17:48.345687] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3b4300f79d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3b4300f2e3] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3b4300f1fe]))) 0-master-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-01-27 09:16:59.104537 (xid=0xa78d2)
[2014-01-27 09:17:48.345723] W [client-rpc-fops.c:2771:client3_3_lookup_cbk] 0-master-client-0: remote operation failed: Transport endpoint is not connected. Path: /linux-3.10.3/arch/x86/kernel (83ddb088-8747-4094-86a5-b85a97d9d571)
[2014-01-27 09:17:48.346029] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3b4300f79d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3b4300f2e3] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3b4300f1fe]))) 0-master-client-0: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2014-01-27 09:17:06.292463 (xid=0xa78d3)
[2014-01-27 09:17:48.346058] W [client-handshake.c:276:client_ping_cbk] 0-master-client-0: timer must have expired
[2014-01-27 09:17:48.346105] I [client.c:2207:client_rpc_notify] 0-master-client-0: disconnected from 10.70.37.131:49152. Client process will keep trying to connect to glusterd until brick's port is available
[2014-01-27 09:18:02.314486] E [socket.c:2161:socket_connect_finish] 0-master-client-0: connection to 10.70.37.131:24007 failed (No route to host)
[2014-01-27 09:19:13.429637] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-master-client-2: server 10.70.37.82:49152 has not responded in the last 42 seconds, disconnecting.
[2014-01-27 09:19:13.430060] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3b4300f79d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3b4300f2e3] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3b4300f1fe]))) 0-master-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-01-27 09:18:00.698384 (xid=0xa86d6)
[2014-01-27 09:19:13.430096] W [client-rpc-fops.c:2771:client3_3_lookup_cbk] 0-master-client-2: remote operation failed: Transport endpoint is not connected. Path: /linux-3.10.3/arch/x86/include/uapi/linux (00000000-0000-0000-0000-000000000000)
[2014-01-27 09:19:13.430339] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3b4300f79d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3b4300f2e3] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3b4300f1fe]))) 0-master-client-2: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2014-01-27 09:18:31.345806 (xid=0xa86d7)
[2014-01-27 09:19:13.430370] W [client-handshake.c:276:client_ping_cbk] 0-master-client-2: timer must have expired
[2014-01-27 09:19:13.430412] I [client.c:2207:client_rpc_notify] 0-master-client-2: disconnected from 10.70.37.82:49152. Client process will keep trying to connect to glusterd until brick's port is available
[2014-01-27 09:19:26.459650] E [socket.c:2161:socket_connect_finish] 0-master-client-2: connection to 10.70.37.82:24007 failed (No route to host)
=====================================

Version-Release number of selected component (if applicable):
[root@boo ~]# gluster --version
glusterfs 3.4afr2.0 built on Jan 23 2014 23:00:24

How reproducible:
Intermittent

Steps to Reproduce:
1. Create a 2x2 replicate cluster
2. Compile kernel on the client.
3. Disconnect one server each from the pair.
4. glusterfs will get OOM-Killed

Actual results:

Expected results:

Additional info:
Out of memory: Kill process 11274 (glusterfs) score 941 or sacrifice child
Killed process 11274, UID 0, (glusterfs) total-vm:16074548kB, anon-rss:1731584kB, file-rss:76kB
glusterfs invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
glusterfs cpuset=/ mems_allowed=0
Pid: 11275, comm: glusterfs Not tainted 2.6.32-358.el6.x86_64 #1
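For anyone triaging similar reports, the memory figures in the "Killed process" line can be pulled out and converted to GiB with a few lines of Python. This is only a rough sketch: the helper names (parse_oom_kill, kb_to_gib) are made up for illustration, the input line is copied from the report above, and the regex assumes the field layout of that line (it may differ on other kernels).

```python
import re

# Line copied verbatim from the OOM-killer output in this report.
OOM_LINE = ("Killed process 11274, UID 0, (glusterfs) "
            "total-vm:16074548kB, anon-rss:1731584kB, file-rss:76kB")

# Matches the three memory fields as they appear in the line above.
FIELD_RE = re.compile(r"(total-vm|anon-rss|file-rss):(\d+)kB")


def parse_oom_kill(line):
    """Return {field name: size in kB} for the memory fields found in the line."""
    return {name: int(kb) for name, kb in FIELD_RE.findall(line)}


def kb_to_gib(kb):
    """Convert kB (1024-byte units, as the kernel reports them) to GiB."""
    return kb / (1024 * 1024)


if __name__ == "__main__":
    for name, kb in parse_oom_kill(OOM_LINE).items():
        print(f"{name}: {kb} kB = {kb_to_gib(kb):.2f} GiB")
```

On the line above, this puts the resident anonymous memory of the glusterfs client at the time of the kill at roughly 1.65 GiB, against a virtual size of roughly 15.3 GiB.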
Always reproducible. Tried with a client (12 GiB RAM), glusterfs invokes OOM-Killer under heavy IO.
Sac, do you know the reason for the "No route to host" errors? A memory leak/OOM kill is unlikely to cause that error. Also, it was the client which got OOM killed, right?
The "No route to host" errors are because I brought down the interface on two nodes. Yes, it is the client.
This is, IMHO, related to or even a duplicate of Bug 841617. It does not matter whether one works with AFR or not; the leaks are in the fuse code. One is already under review, and another can be seen by creating a gluster volume, mounting it under valgrind, and creating two directories on the volume (inode_new leak).
Even glusterfsd invokes the OOM-Killer. I noticed this on the latest gluster version: glusterfs 3.4afr2.2 built on Feb 12 2014 01:43:08
Sachi, I believe this is because of the writev leak in afrv2. After the fix, I didn't hear about any OOM-killers on afrv2. So closing the bug as Duplicate of the bug that is already verified.

Pranith

*** This bug has been marked as a duplicate of bug 1085511 ***