| Summary: | server_lookup RPC error (invalid argument: conn) under heavy load | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Steve <steved_2k> |
| Component: | rpc | Assignee: | Amar Tumballi <amarts> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.2.4 | CC: | vraman |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-09-18 05:55:34 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
When this happens, please try the following (instead of rebooting the servers). We prefer the 60-job scenario, as the error shows up much more quickly there.
* After you see this log, take a statedump:
bash# kill -USR1 <pid of glusterfsd brick process>
and send us the statedump file after compressing it (located at /tmp/glusterdump.<PID>).
* Check the 'dmesg | tail -n 100' output and see if there is anything suspicious.
* Do 'gluster volume stop <VOLNAME> force' and 'gluster volume start <VOLNAME>' to get it working again. (Even the client will come back to life with this.)
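The steps above can be collected into a small operator script; this is only a sketch, with `<brick-pid>` and `<volname>` as placeholders the operator must supply, and a dry-run mode (on by default) that prints the commands instead of executing them, since `kill -USR1` and the `gluster volume` restart are disruptive:

```shell
#!/bin/sh
# Sketch of the debugging steps above. BRICK_PID and VOLNAME are
# placeholder arguments; DRY_RUN=1 (the default) only prints commands.
BRICK_PID="${1:-12345}"      # placeholder brick PID
VOLNAME="${2:-myvolume}"     # placeholder volume name
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1. Trigger a statedump (written to /tmp/glusterdump.<PID>) and compress it.
run kill -USR1 "$BRICK_PID"
run gzip "/tmp/glusterdump.$BRICK_PID"

# 2. Capture recent kernel messages for anything suspicious.
run sh -c 'dmesg | tail -n 100'

# 3. Restart the volume to recover without rebooting the server.
run gluster volume stop "$VOLNAME" force
run gluster volume start "$VOLNAME"
```

With DRY_RUN=0 the same script executes the commands for real; run it on the server hosting the affected brick.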
We need more information to fix this. Also, if testing with the 3.3.0beta release is an option, please consider it. There has been no further information on this bug in the last 5 months, a new release has come out since then, and no one else has reported the issue.
Here is the error I am seeing:

[2011-11-02 15:44:18.105788] W [rpcsvc.c:1066:rpcsvc_error_reply] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpc_transport_notify+0x27) [0x7f348b3d1317] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x7f348b3d03ec] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x140) [0x7f348b3d0090]))) 0-: sending a RPC error reply
[2011-11-02 15:44:32.68532] E [server-helpers.c:873:server_alloc_frame] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x7f348b3d03ec] (-->/opt/glusterfs/3.2.4/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x291) [0x7f348b3d01e1] (-->/opt/glusterfs/3.2.4/lib64/glusterfs/3.2.4/xlator/protocol/server.so(server_lookup+0xc1) [0x7f34837acb21]))) 0-server: invalid argument: conn

This is seen under heavy load.

Configuration: 10 servers running Gluster, 6 bricks each, with each brick being 6 TB. Each server also has 3 copper gigabit links channel-bonded together. The servers are running RHEL6 with EXT4.

Heavy load is described as running 40 cluster jobs that pull files, do nothing, and then put them back. If I run 20 jobs I get around 6 gigabits. If I run 40 jobs I get up to 11 gigabits. And if I run 60 jobs I get up to 15 gigabits, but it will usually start throwing that error within 5 minutes. Once that error is thrown, the only way to get things working again is to reboot the server. The errors also occur in the 40-job run, but it takes longer before we hit them.

On the client side, when this happens it pretty much makes the whole gluster mount inaccessible. In the server logs I see messages like this for the different clients:

[2011-11-02 16:02:53.678700] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.parc-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.2.1.90:1008)

I'm happy to help in any way I can; let me know if you need more info.
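For reference, the workload described above (N parallel jobs that each pull a file from the mount, do nothing, and put it back) can be sketched as a small shell script. This is a hypothetical reproduction, not the reporter's actual job script; the mount point, job count, and file layout are all assumptions:

```shell
#!/bin/sh
# Hypothetical load generator for the reported workload: NUM_JOBS parallel
# workers, each copying a file off the Gluster mount and copying it back.
MOUNT="${MOUNT:-/mnt/gluster}"   # assumed client mount point
NUM_JOBS="${NUM_JOBS:-60}"       # the 60-job case triggers the error fastest

pull_and_put() {
    # "Pull" a file to local scratch, do nothing, then "put" it back.
    src="$1"
    scratch=$(mktemp -d)
    cp "$src" "$scratch/"                          # pull
    cp "$scratch/$(basename "$src")" "$src"        # put back
    rm -rf "$scratch"
}

main() {
    i=0
    for f in "$MOUNT"/*; do
        [ -f "$f" ] || continue
        pull_and_put "$f" &                        # run each job in background
        i=$((i + 1))
        # Throttle to NUM_JOBS concurrent workers.
        [ $((i % NUM_JOBS)) -eq 0 ] && wait
    done
    wait
}
```

Call `main` with MOUNT pointed at the client-side Gluster mount; at 60 concurrent jobs the reporter saw the server_lookup error within about 5 minutes.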