Bug 763929 (GLUSTER-2197)

Summary: Applications running simultaneously on a large number of clients cause ibv_post_send errors on the server.
Product: [Community] GlusterFS
Reporter: Raghavendra G <raghavendra>
Component: rdma
Assignee: Raghavendra G <raghavendra>
Severity: high
Priority: low
Version: mainline
CC: anush, divya, eco, gluster-bugs, lana.deere, vijay
Hardware: All
OS: Linux
Doc Type: Bug Fix
Regression: RTP
Mount Type: fuse
Documentation: DNR

Description Raghavendra G 2010-12-06 18:45:45 EST
On further debugging, it was found that RDMA writes are sometimes posted with more than one sge passed to ibv_post_send, whereas the qp was created with its max_sge limit set to 1. This caused the RDMA writes to fail. However, I have also observed that sending messages inline fails under heavy load. The failures always seem to happen in __rdma_send_reply_type_nomsg.
Comment 1 Raghavendra G 2010-12-06 21:43:07 EST
The bug description, as reported by the user on gluster-users, is given below. The conversation should be read bottom-up.

One other observation is that it seems to be genuinely related to the
number of nodes involved.

If I run, say, 50 instances of my script using 50 separate nodes, then
they almost always generate some failures.

If I run the same number of instances, or even a much greater number,
but using only 10 separate nodes, then they seem always to work OK.

Maybe this is due to some kind of caching behaviour?

.. Lana (lana.deere@gmail.com)

On Mon, Dec 6, 2010 at 11:05 AM, Lana Deere <lana.deere@gmail.com> wrote:
> The gluster configuration is distribute, there are 4 server nodes.
> There are 53 physical client nodes in my setup, each with 8 cores; we
> want to sometimes run more than 400 client processes simultaneously.
> In practice we aren't yet trying that many.
> When I run the commands which break, I am running them on separate
> clients simultaneously.
>    for host in <hosts>; do ssh $host script& done  # Note the &
> When I run on 25 clients simultaneously so far I have not seen it
> fail.  But if I run on 40 or 50 simultaneously it often fails.
> Sometimes I have run more than one command on each client
> simultaneously by listing all the hosts multiple times in the
> for-loop,
>   for host in <hosts> <hosts> <hosts>; do ssh $host script& done
> In the example of 3 at a time I have noticed that when a host works, all
> three on that client will work; but when it fails, all three will fail
> in exactly the same fashion.
> I've attached a tarfile containing two sets of logs.  In both cases I
> had rotated all the log files and rebooted everything then run my
> test.  In the first set of logs, I went directly to approx. 50
> simultaneous sessions, and pretty much all of them just hung.  (When
> the find hangs, even a kill -9 will not unhang it.)  So I rotated the
> logs again and rebooted everything, but this time I gradually worked
> my way up to higher loads.  This time the failures were mostly cases
> with the wrong checksum but no error message, though some of them did
> give me errors like
>    find: lib/kbd/unimaps/cp865.uni: Invalid argument
> Thanks.  I may try downgrading to 3.1.0 just to see if I have the same
> problem there.
> .. Lana (lana.deere@gmail.com)
> On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G <raghavendra@gluster.com> wrote:
>> Hi Lana,
>> I need some clarifications about test setup:
>> * Are you seeing the problem when there are more than 25 clients? If so, are these clients on different physical nodes, or does more than one client share the same node? In other words, on how many physical nodes are clients mounted in your test setup? Also, are you running the command on each of these clients simultaneously?
>> * Or is it that there are more than 25 concurrent invocations of the script? If so, how many clients are present in your test setup, and on how many physical nodes are these clients mounted?
>> regards,
>> ----- Original Message -----
>> From: "Lana Deere" <lana.deere@gmail.com>
>> To: gluster-users@gluster.org
>> Sent: Saturday, December 4, 2010 12:13:30 AM
>> Subject: [Gluster-users] 3.1.1 crashing under moderate load
>> I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
>> transport, native/fuse access.
>> I have a directory which is shared on the gluster.  In fact, it is a clone
>> of /lib from one of the clients, shared so all can see it.
>> I have a script which does
>>    find lib -type f -print0 | xargs -0 sum | md5sum
>> If I run this on my clients one at a time, they all yield the same md5sum:
>>    for host in <<hosts>>; do ssh $host script; done
>> If I run this on my clients concurrently, up to roughly 25 at a time they
>> still yield the same md5sum.
>>    for host in <<hosts>>; do ssh $host script& done
>> Beyond that the gluster share often, but not always, fails.  The errors vary.
>>    - sometimes I get "sum: xxx.so not found"
>>    - sometimes I get the wrong checksum without any error message
>>    - sometimes the job simply hangs until I kill it
>> Some of the server logs show messages like these from the time of the
>> failures (other servers show nothing from around that time):
>> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
>> socket (peer: after handshake is complete
>> [2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e82, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>> [2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e83, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>> On a client which had a failure I see messages like:
>> [2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
>> (peer: after handshake is complete
>> [2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>> op(READ(12)) called at 2010-12-03 10:03:06.20492
>> [2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>> op(READ(12)) called at 2010-12-03 10:03:06.20529
>> [2010-12-03 10:03:06.26827] I
>> [client-handshake.c:993:select_server_supported_programs]
>> RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437),
>> Version (310)
>> [2010-12-03 10:03:06.27029] I
>> [client-handshake.c:829:client_setvolume_cbk] RaidData-client-1:
>> Connected to, attached to remote volume '/data'.
>> [2010-12-03 10:03:06.27067] I
>> [client-handshake.c:698:client_post_handshake] RaidData-client-1: 2
>> fds open - Delaying child_up until they are re-opened
>> Anyone else seen anything like this and/or have suggestions about options I can
>> set to work around this?
>> .. Lana (lana.deere@gmail.com)
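The concurrent reproduction described in the email above can be sketched as a small shell helper. This is only an illustration under assumptions: the function name compare_checksums and the RUN_ON variable are hypothetical and do not appear in the original report. The helper launches the same command on every host at once (as with the trailing & in the original loop), waits for all of them, and prints how many distinct output lines came back; 1 means every host printed the same single line.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original report.
# RUN_ON is an assumed knob: the command used to reach a host (default: ssh).
RUN_ON=${RUN_ON:-ssh}

compare_checksums() {
    # $1 = command to run on each host, remaining args = host names
    if [ "$#" -lt 2 ]; then
        echo "usage: compare_checksums CMD HOST..." >&2
        return 2
    fi
    cmd=$1
    shift
    outdir=$(mktemp -d) || return 1
    i=0
    for host in "$@"; do
        i=$((i + 1))
        # The trailing & starts every host at the same time.
        # stdin is redirected so background jobs do not compete for it.
        $RUN_ON "$host" "$cmd" > "$outdir/$i.out" 2>&1 < /dev/null &
    done
    wait    # block until every background job has finished
    # Count distinct output lines across all hosts.
    distinct=$(cat "$outdir"/*.out | sort -u | wc -l)
    rm -rf "$outdir"
    echo $((distinct))
}
```

For example, `compare_checksums script host1 host2 host3` would run `script` on three hosts at once; a result other than 1 indicates that at least one host returned a different checksum or an error message.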
Comment 2 Anand Avati 2010-12-12 20:07:42 EST
PATCH: http://patches.gluster.com/patch/5843 in master (rpc-transport/rdma: add informative debug messages when posting of send requests fail.)
Comment 3 Anand Avati 2010-12-12 20:07:50 EST
PATCH: http://patches.gluster.com/patch/5844 in master (rpc-transport/rdma: QP configuration changes.)
Comment 4 Vidya Sakar 2010-12-20 03:21:50 EST
*** Bug 2236 has been marked as a duplicate of this bug. ***
Comment 5 Raghavendra G 2010-12-27 02:20:21 EST
Comment 6 Raghavendra G 2010-12-27 02:26:42 EST
Hi Divya,
Please make sure to document that if a glusterfs setup involves a large number of glusterfs clients (or a large number of applications running simultaneously on these clients), the sysadmin has to set the options "transport.rdma.work-request-send-count" and "transport.rdma.work-request-recv-count" to a fairly large number. By default these values are set to 4096. If the user sets these options to smaller values, they might see disconnection logs when a large number of applications run simultaneously.
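
As an illustration only, the two options named above might be set in a volume file along the following lines. This is a sketch under assumptions: the volume name RaidData-server is taken from the logs in this report, the subvolume name is a placeholder, and the counts simply restate the default of 4096 mentioned above. Whether the options belong on the server volume, the client volume, or both should be confirmed against the rdma transport source.

```
volume RaidData-server
    type protocol/server
    option transport-type rdma
    # Assumed placement; values restate the stated default.
    option transport.rdma.work-request-send-count 4096
    option transport.rdma.work-request-recv-count 4096
    subvolumes <brick-volume>
end-volume
```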

Comment 7 Divya 2010-12-27 22:39:43 EST
Hi Raghavendra,

This option is not set through the Gluster CLI. For now, we are not documenting options that are not set through the Gluster CLI, so I am not adding it to the documentation.