Bug 1058227 - AFR: glusterfs client invoked oom-killer
Summary: AFR: glusterfs client invoked oom-killer
Keywords:
Status: CLOSED DUPLICATE of bug 1085511
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-01-27 09:37 UTC by Sachidananda Urs
Modified: 2014-05-25 10:20 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-05-25 10:20:40 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Sachidananda Urs 2014-01-27 09:37:08 UTC
Description of problem:
The GlusterFS client invoked the OOM-killer while compiling the kernel. No other processes were running on the client at the time. Possible memory leak.

Nothing was reported in the logs except the messages about the disconnected client.

=== SNIP =====

[2014-01-27 09:17:48.299614] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-master-client-0: server 10.70.37.131:49152 has not responded in the last 42 seconds, disconnecting.
[2014-01-27 09:17:48.345687] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3b4300f79d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3b4300f2e3] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3b4300f1fe]))) 0-master-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-01-27 09:16:59.104537 (xid=0xa78d2)
[2014-01-27 09:17:48.345723] W [client-rpc-fops.c:2771:client3_3_lookup_cbk] 0-master-client-0: remote operation failed: Transport endpoint is not connected. Path: /linux-3.10.3/arch/x86/kernel (83ddb088-8747-4094-86a5-b85a97d9d571)
[2014-01-27 09:17:48.346029] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3b4300f79d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3b4300f2e3] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3b4300f1fe]))) 0-master-client-0: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2014-01-27 09:17:06.292463 (xid=0xa78d3)
[2014-01-27 09:17:48.346058] W [client-handshake.c:276:client_ping_cbk] 0-master-client-0: timer must have expired
[2014-01-27 09:17:48.346105] I [client.c:2207:client_rpc_notify] 0-master-client-0: disconnected from 10.70.37.131:49152. Client process will keep trying to connect to glusterd until brick's port is available
[2014-01-27 09:18:02.314486] E [socket.c:2161:socket_connect_finish] 0-master-client-0: connection to 10.70.37.131:24007 failed (No route to host)
[2014-01-27 09:19:13.429637] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-master-client-2: server 10.70.37.82:49152 has not responded in the last 42 seconds, disconnecting.
[2014-01-27 09:19:13.430060] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3b4300f79d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3b4300f2e3] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3b4300f1fe]))) 0-master-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-01-27 09:18:00.698384 (xid=0xa86d6)
[2014-01-27 09:19:13.430096] W [client-rpc-fops.c:2771:client3_3_lookup_cbk] 0-master-client-2: remote operation failed: Transport endpoint is not connected. Path: /linux-3.10.3/arch/x86/include/uapi/linux (00000000-0000-0000-0000-000000000000)
[2014-01-27 09:19:13.430339] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3b4300f79d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3b4300f2e3] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3b4300f1fe]))) 0-master-client-2: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2014-01-27 09:18:31.345806 (xid=0xa86d7)
[2014-01-27 09:19:13.430370] W [client-handshake.c:276:client_ping_cbk] 0-master-client-2: timer must have expired
[2014-01-27 09:19:13.430412] I [client.c:2207:client_rpc_notify] 0-master-client-2: disconnected from 10.70.37.82:49152. Client process will keep trying to connect to glusterd until brick's port is available
[2014-01-27 09:19:26.459650] E [socket.c:2161:socket_connect_finish] 0-master-client-2: connection to 10.70.37.82:24007 failed (No route to host)

=====================================

Version-Release number of selected component (if applicable):
[root@boo ~]# gluster --version
glusterfs 3.4afr2.0 built on Jan 23 2014 23:00:24

How reproducible:
Intermittent

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume.
2. Compile the kernel on the client.
3. Disconnect one server from each replica pair.
4. glusterfs on the client gets OOM-killed (see the sketch below).
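
A minimal reproduction sketch, assuming host names server1..server4, brick path /bricks/brick1, volume name "master", and a client mount at /mnt/master (all of these are placeholders, not taken from this report):

# On a server: create and start a 2x2 distributed-replicate volume
gluster volume create master replica 2 \
    server1:/bricks/brick1 server2:/bricks/brick1 \
    server3:/bricks/brick1 server4:/bricks/brick1
gluster volume start master

# On the client: mount the volume and generate heavy I/O with a kernel build
mount -t glusterfs server1:/master /mnt/master
cd /mnt/master && tar xf /root/linux-3.10.3.tar.xz && cd linux-3.10.3
make defconfig && make -j"$(nproc)"

# On one server of each replica pair: drop the interface to force the disconnect
ip link set eth0 down        # interface name is an assumption

# On the client: watch the glusterfs RSS grow until the OOM killer fires
watch -n 5 'ps -o pid,vsz,rss,comm -C glusterfs'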

Actual results:


Expected results:


Additional info:

Out of memory: Kill process 11274 (glusterfs) score 941 or sacrifice child
Killed process 11274, UID 0, (glusterfs) total-vm:16074548kB, anon-rss:1731584kB, file-rss:76kB
glusterfs invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
glusterfs cpuset=/ mems_allowed=0
Pid: 11275, comm: glusterfs Not tainted 2.6.32-358.el6.x86_64 #1
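
To narrow down which translator is holding the memory before the kill happens, a statedump of the client process can be taken; this sketch uses the standard SIGUSR1 statedump mechanism, and the volume name and dump file pattern below are assumptions (they can vary by version):

# Ask the client glusterfs process for a statedump (written to /var/run/gluster)
pid=$(pgrep -f 'glusterfs.*master')
kill -USR1 "$pid"

# Per-xlator memory accounting and mem-pool sections show what is accumulating
ls -lt /var/run/gluster/ | head
grep -E 'pool-name|num_allocs' /var/run/gluster/glusterdump."$pid".dump.* | head -20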

Comment 2 Sachidananda Urs 2014-01-27 18:25:08 UTC
Always reproducible. Tried with a client with 12 GiB of RAM; glusterfs invokes the OOM-killer under heavy I/O.

Comment 3 Anand Avati 2014-01-30 08:28:46 UTC
Sac,
do you know the reason for the "No route to host" errors? A memory leak/OOM kill is unlikely to cause that error. Also, it was the client which got OOM killed, right?

Comment 4 Sachidananda Urs 2014-01-30 09:46:51 UTC
The "No route to host" errors are because I brought down the interface on two nodes. Yes, it is the client.

Comment 5 Lukas Bezdicka 2014-01-30 10:15:25 UTC
This is IMHO related to, or even a duplicate of, bug 841617. It does not matter whether one works with AFR or not; the leaks are in the FUSE code. One fix is already under review, and another leak can be seen by creating a gluster volume, mounting it under valgrind, and creating two directories on the volume (inode_new leak).
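
A sketch of the valgrind reproduction described above, assuming a volume named "master" served from server1 and a mount point /mnt/valgrind (both placeholders); glusterfs is kept in the foreground with -N so valgrind can track it:

# Mount the volume with the client running under valgrind (foreground, no daemonizing)
valgrind --leak-check=full --log-file=/tmp/glusterfs-valgrind.log \
    glusterfs -N --volfile-server=server1 --volfile-id=master /mnt/valgrind &
sleep 10   # give the mount time to come up

# Trigger the suspected inode_new leak: create two directories, then unmount
mkdir /mnt/valgrind/dir1 /mnt/valgrind/dir2
umount /mnt/valgrind

# Leak records (e.g. allocations reachable from inode_new) end up in the valgrind log
grep -B2 inode_new /tmp/glusterfs-valgrind.log | head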

Comment 6 Sachidananda Urs 2014-02-13 13:03:29 UTC
Even glusterfsd invokes the OOM-killer. I noticed this on the latest gluster version: glusterfs 3.4afr2.2 built on Feb 12 2014 01:43:08

Comment 7 Pranith Kumar K 2014-05-25 10:20:40 UTC
Sachi,
   I believe this is because of the writev leak in afrv2. After the fix, I haven't heard of any OOM kills on afrv2, so I am closing this bug as a duplicate of the bug that has already been verified.

Pranith

*** This bug has been marked as a duplicate of bug 1085511 ***

