Bug 765507 (GLUSTER-3775) - server_lookup RPC error (invalid argument: conn) under heavy load
Summary: server_lookup RPC error (invalid argument: conn) under heavy load
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: GLUSTER-3775
Product: GlusterFS
Classification: Community
Component: rpc
Version: 3.2.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Amar Tumballi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-11-02 16:08 UTC by Steve
Modified: 2013-12-19 00:07 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-09-18 05:55:34 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Steve 2011-11-02 16:08:46 UTC
Here is the error I am seeing:

[2011-11-02 15:44:18.105788] W [rpcsvc.c:1066:rpcsvc_error_reply] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpc_transport_notify+0x27) [0x7f348b3d1317] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x7f348b3d03ec] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x140) [0x7f348b3d0090]))) 0-: sending a RPC error reply
[2011-11-02 15:44:32.68532] E [server-helpers.c:873:server_alloc_frame] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x7f348b3d03ec] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x291) [0x7f348b3d01e1] (-->/opt/glusterfs/3.2.4/lib64/glusterfs/3.2.4/xlator/protocol/server.so(server_lookup+0xc1) [0x7f34837acb21]))) 0-server: invalid argument: conn


This is seen under heavy load.

Configuration:
     10 servers running Gluster, 6 bricks per server, each brick 6 TB. Each server also has 3 copper gigabit links channel-bonded together. The servers run RHEL6 with ext4.
     
Heavy load means running 40 cluster jobs that each pull files from the volume, do nothing with them, and then put them back. With 20 jobs I get around 6 gigabits of throughput; with 40 jobs, up to 11 gigabits; and with 60 jobs, up to 15 gigabits, but at 60 jobs it usually starts throwing that error within 5 minutes. Once that error is thrown, the only way to get the server working again is to reboot it. The errors also show up with the 40-job run, it just takes longer before they appear. (A rough sketch of one of these jobs is below.)
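
For reference, here is a rough sketch of what a single job does (the mount point and data paths below are placeholders, not our actual setup; the real jobs go through our cluster scheduler):

    #!/bin/bash
    # Rough sketch of one load job: pull files off the gluster mount, do
    # nothing with them, then put them back. 40-60 of these run in parallel
    # to generate the load described above.
    MOUNT=/mnt/gluster        # placeholder for the gluster client mount point
    SCRATCH=$(mktemp -d)      # local scratch space on the compute node
    for f in "$MOUNT"/testdata/*; do              # placeholder data set
        cp "$f" "$SCRATCH/"                       # pull the file
        cp "$SCRATCH/$(basename "$f")" "$f"       # put it straight back
    done
    rm -rf "$SCRATCH"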

When this happens, the whole gluster mount becomes pretty much inaccessible on the client side. In the server logs I see messages like this for the different clients:

[2011-11-02 16:02:53.678700] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.parc-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.2.1.90:1008)


I'm happy to help in any way I can; let me know if you need more info.

Comment 1 Amar Tumballi 2011-11-08 01:33:55 UTC
Try the things below (instead of rebooting the servers) when this happens; we prefer the 60-job situation since the error shows up much more quickly there. (A rough shell sketch combining these steps follows the list.)

* After you see this log, take a statedump:
     bash# kill -USR1 <pid of glusterfsd brick process>
  and send the statedump file after compressing it (it is located at /tmp/glusterdump.<PID>).

* Check the 'dmesg | tail -n 100' output and see if there is anything suspicious.

* Run 'gluster volume stop <VOLNAME> force' and then 'gluster volume start <VOLNAME>' to get things working again (even the clients will come back to life with this).
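
If it helps, here is a rough shell sketch putting those steps in one place (the volume name and the choice of brick process are placeholders on my side; adjust them to your setup):

    #!/bin/bash
    # Rough sketch of the diagnostics above. VOLNAME and the brick PID
    # selection are placeholders; adjust to your deployment.
    VOLNAME=myvol                                      # replace with your volume name
    BRICK_PID=$(pidof glusterfsd | awk '{print $1}')   # pick one brick process

    # 1. Trigger a statedump from the brick process and compress it.
    kill -USR1 "$BRICK_PID"
    sleep 2                                # give it a moment to write the dump
    gzip /tmp/glusterdump."$BRICK_PID"     # attach the resulting .gz to this bug

    # 2. Look for anything suspicious in the kernel log.
    dmesg | tail -n 100

    # 3. Restart the volume instead of rebooting the server.
    gluster volume stop "$VOLNAME" force   # may prompt for confirmation
    gluster volume start "$VOLNAME"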

Comment 2 Amar Tumballi 2012-04-17 18:40:50 UTC
We need more information to fix this. Also, if testing with the 3.3.0 beta release is an option, please consider it.

Comment 3 Amar Tumballi 2012-09-18 05:55:34 UTC
Not much information has come in on this bug in the last 5 months, a new release has come out since then, and no one else has reported the issue. Closing as INSUFFICIENT_DATA.

