Bug 765507 (GLUSTER-3775)

Summary: server_lookup RPC error (invalid argument: conn) under heavy load
Product: [Community] GlusterFS Reporter: Steve <steved_2k>
Component: rpc    Assignee: Amar Tumballi <amarts>
Status: CLOSED INSUFFICIENT_DATA QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 3.2.4    CC: vraman
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-09-18 05:55:34 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Steve 2011-11-02 16:08:46 UTC
Here is the error I am seeing:

[2011-11-02 15:44:18.105788] W [rpcsvc.c:1066:rpcsvc_error_reply] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpc_transport_notify+0x27) [0x7f348b3d1317] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x7f348b3d03ec] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x140) [0x7f348b3d0090]))) 0-: sending a RPC error reply
[2011-11-02 15:44:32.68532] E [server-helpers.c:873:server_alloc_frame] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x7f348b3d03ec] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x291) [0x7f348b3d01e1] (-->/opt/glusterfs/3.2.4/lib64/glusterfs/3.2.4/xlator/protocol/server.so(server_lookup+0xc1) [0x7f34837acb21]))) 0-server: invalid argument: conn


This is seen under heavy load.

Configuration:
     10 servers running Gluster, 6 bricks per server, each brick 6 TB.  Each server also has 3 copper gigabit links channel-bonded together.  Servers are running RHEL6 with ext4.
     
Heavy load here means running 40 cluster jobs that pull files from the volume, do nothing with them, and then put them back.  With 20 jobs I get around 6 gigabits of throughput; with 40 jobs up to 11 gigabits; and with 60 jobs up to 15 gigabits, but at 60 jobs it will usually start throwing that error within 5 minutes.  Once that error is thrown, the only way to get the server working again is to reboot it.  The errors also come from the 40-job runs, it just takes longer before we hit them.
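
In shell terms, each run is roughly equivalent to something like this (the mount point and file names below are placeholders, not our actual job script):

     MOUNT=/mnt/gluster                            # placeholder fuse mount point
     for i in $(seq 1 40); do                      # 40 concurrent jobs; use 20 or 60 for the other runs
         ( cp "$MOUNT/input.$i" /tmp/job.$i && \
           cp /tmp/job.$i "$MOUNT/input.$i" ) &    # pull the file, then put it back
     done
     wait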

On the client side, when this happens the whole gluster mount becomes essentially inaccessible.  In the server logs I see messages like this for the different clients:

[2011-11-02 16:02:53.678700] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.parc-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.2.1.90:1008)


I'm happy to help in any way I can; let me know if you need more info.

Comment 1 Amar Tumballi 2011-11-08 01:33:55 UTC
Please try the following (instead of rebooting the servers) when this happens; the 60-job scenario is preferred since the error shows up much more quickly there. A combined command sketch follows these steps.

* After you see this log, take a statedump:
     bash# kill -USR1 <pid of glusterfsd brick process>
  and send the statedump file (located at /tmp/glusterdump.<PID>) after compressing it.

* Check the 'dmesg | tail -n 100' output for anything suspicious.

* Do 'gluster volume stop <VOLNAME> force' followed by 'gluster volume start <VOLNAME>' to get it working again (even the clients will come back to life with this).
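
Putting those steps together, a rough command sequence would look like the following (the PID and volume name are placeholders to fill in; this is just a sketch of the steps above):

     bash# pidof glusterfsd                  # find the brick process(es) on the affected server
     bash# kill -USR1 <PID>                  # trigger the statedump
     bash# gzip /tmp/glusterdump.<PID>       # compress the dump before attaching it here
     bash# dmesg | tail -n 100               # look for anything suspicious in the kernel log
     bash# gluster volume stop <VOLNAME> force
     bash# gluster volume start <VOLNAME>
     bash# gluster volume info <VOLNAME>     # should report the volume as started again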

Comment 2 Amar Tumballi 2012-04-17 18:40:50 UTC
We need more information to fix this. Also, if testing with the 3.3.0 beta release is an option, please consider it.

Comment 3 Amar Tumballi 2012-09-18 05:55:34 UTC
There has been no further information on this bug in the last 5 months, a new release has come out since then, and no one else has reported the issue. Closing as INSUFFICIENT_DATA.