Bug 765507 (GLUSTER-3775) - server_lookup RPC error (invalid argument: conn) under heavy load
Summary: server_lookup RPC error (invalid argument: conn) under heavy load
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: GLUSTER-3775
Product: GlusterFS
Classification: Community
Component: rpc
Version: 3.2.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Amar Tumballi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-11-02 16:08 UTC by Steve
Modified: 2013-12-19 00:07 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-09-18 05:55:34 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Steve 2011-11-02 16:08:46 UTC
Here is the error I am seeing:

[2011-11-02 15:44:18.105788] W [rpcsvc.c:1066:rpcsvc_error_reply] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpc_transport_notify+0x27) [0x7f348b3d1317] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x7f348b3d03ec] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x140) [0x7f348b3d0090]))) 0-: sending a RPC error reply
[2011-11-02 15:44:32.68532] E [server-helpers.c:873:server_alloc_frame] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x7f348b3d03ec] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x291) [0x7f348b3d01e1] (-->/opt/glusterfs/3.2.4/lib64/glusterfs/3.2.4/xlator/protocol/server.so(server_lookup+0xc1) [0x7f34837acb21]))) 0-server: invalid argument: conn


This is seen under heavy load.

Configuration:
     10 servers running Gluster, 6 bricks per server, each brick 6 TB. Each server also has 3 copper gigabit links channel-bonded together. The servers run RHEL6 with ext4.
     
Heavy load means running 40 cluster jobs that each pull files from the volume, do nothing with them, and then put them back. With 20 jobs I get around 6 gigabits of throughput; with 40 jobs, up to 11 gigabits; and with 60 jobs, up to 15 gigabits, but at 60 jobs it usually starts throwing that error within 5 minutes. Once that error is thrown, the only way to get the server working again is to reboot it. The errors also show up with the 40-job run, it just takes longer before they appear. (A rough sketch of one of these jobs is below.)
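
For reference, here is a rough sketch of what a single job does (the mount point and data paths below are placeholders, not our actual setup; the real jobs go through our cluster scheduler):

    #!/bin/bash
    # Rough sketch of one load job: pull files off the gluster mount, do
    # nothing with them, then put them back. 40-60 of these run in parallel
    # to generate the load described above.
    MOUNT=/mnt/gluster        # placeholder for the gluster client mount point
    SCRATCH=$(mktemp -d)      # local scratch space on the compute node
    for f in "$MOUNT"/testdata/*; do              # placeholder data set
        cp "$f" "$SCRATCH/"                       # pull the file
        cp "$SCRATCH/$(basename "$f")" "$f"       # put it straight back
    done
    rm -rf "$SCRATCH"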

When this happens, the whole gluster mount becomes pretty much inaccessible on the client side. In the server logs I see messages like this for the different clients:

[2011-11-02 16:02:53.678700] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.parc-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.2.1.90:1008)


I'm happy to help in any way I can; let me know if you need more info.

Comment 1 Amar Tumballi 2011-11-08 01:33:55 UTC
Try the things below (instead of rebooting the servers) when this happens; we prefer the 60-job situation since the error shows up much more quickly there. (A rough shell sketch combining these steps follows the list.)

* After you see this log, take a statedump:
     bash# kill -USR1 <pid of glusterfsd brick process>
  and send the statedump file after compressing it (it is located at /tmp/glusterdump.<PID>).

* Check the 'dmesg | tail -n 100' output and see if there is anything suspicious.

* Run 'gluster volume stop <VOLNAME> force' and then 'gluster volume start <VOLNAME>' to get things working again (even the clients will come back to life with this).
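
If it helps, here is a rough shell sketch putting those steps in one place (the volume name and the choice of brick process are placeholders on my side; adjust them to your setup):

    #!/bin/bash
    # Rough sketch of the diagnostics above. VOLNAME and the brick PID
    # selection are placeholders; adjust to your deployment.
    VOLNAME=myvol                                      # replace with your volume name
    BRICK_PID=$(pidof glusterfsd | awk '{print $1}')   # pick one brick process

    # 1. Trigger a statedump from the brick process and compress it.
    kill -USR1 "$BRICK_PID"
    sleep 2                                # give it a moment to write the dump
    gzip /tmp/glusterdump."$BRICK_PID"     # attach the resulting .gz to this bug

    # 2. Look for anything suspicious in the kernel log.
    dmesg | tail -n 100

    # 3. Restart the volume instead of rebooting the server.
    gluster volume stop "$VOLNAME" force   # may prompt for confirmation
    gluster volume start "$VOLNAME"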

Comment 2 Amar Tumballi 2012-04-17 18:40:50 UTC
We need more information to fix this. Also, if testing with the 3.3.0 beta release is an option, please consider it.

Comment 3 Amar Tumballi 2012-09-18 05:55:34 UTC
Not much information has come in on this bug in the last 5 months, a new release has come out since then, and no one else has reported the issue. Closing as INSUFFICIENT_DATA.

