I was running dbench on a tcp,rdma volume mounted via tcp, and after the warmup phase I got "Transport endpoint is not connected" errors. The following is the console output of dbench:

  50       987     7.93 MB/sec  warmup 139 sec  latency 44432.278 ms
  50       987     7.88 MB/sec  warmup 140 sec  latency 45432.314 ms
 [996] read failed on handle 10086 (Transport endpoint is not connected)
 [989] read failed on handle 10083 (Transport endpoint is not connected)
 [997] read failed on handle 10086 (Transport endpoint is not connected)
 [1001] read failed on handle 10087 (Transport endpoint is not connected)
 [1010] read failed on handle 10088 (Transport endpoint is not connected)
 [953] read failed on handle 10073 (Transport endpoint is not connected)
 [971] open ./clients/client12/~dmtmp/PWRPNT/NEWTIPS.PPT failed for handle 10080 (Transport endpoint is not connected)
 (972) ERROR: handle 10080 was not found
 [950] read failed on handle 10072 (Transport endpoint is not connected)
 [953] read failed on handle 10073 (Transport endpoint is not connected)
 [950] read failed on handle 10072 (Transport endpoint is not connected)
 [936] open ./clients/client10/~dmtmp/PWRPNT/TRIDOTS.POT failed for handle 10070 (Transport endpoint is not connected)
 Child failed with status 1
 [root@client4 mnt]# [1002] read failed on handle 10087 (Transport endpoint is not connected)
 [1011] read failed on handle 10088 (Transport endpoint is not connected)
 [988] open ./clients/client32/~dmtmp/PWRPNT/NEWTIPS.PPT failed for handle 10083 (Transport endpoint is not connected)
 [971] open ./clients/client0/~dmtmp/PWRPNT/NEWTIPS.PPT failed for handle 10080 (Transport endpoint is not connected)
 [969] rename ./clients/client16/~dmtmp/PWRPNT/NEWTIPS.PPT ./clients/client16/~dmtmp/PWRPNT/PPTC321.TMP failed (Transport endpoint is not connected) - expected NT_STATUS_OK
 ERROR: child 16 failed at line 969
 [1014] read failed on handle 10088 (File descriptor in bad state)
 [971] open ./clients/client40/~dmtmp/PWRPNT/NEWTIPS.PPT failed for handle 10080 (Invalid argument)
 [971] open ./clients/client25/~dmtmp/PWRPNT/NEWTIPS.PPT failed for handle 10080 (Invalid argument)
 [995] open ./clients/client46/~dmtmp/SEED/SMALL.FIL failed for handle 10086 (Transport endpoint is not connected)
 [959] open ./clients/client47/~dmtmp/PWRPNT/PPTC321.TMP failed for handle 10075 (Input/output error)
 [977] open ./clients/client2/~dmtmp/PWRPNT/NEWTIPS.PPT failed for handle 10081 (Transport endpoint is not connected)
 [955] open ./clients/client23/~dmtmp/PWRPNT/NEWTIPS.PPT failed for handle 10074 (Transport endpoint is not connected)
 [985] open ./clients/client41/~dmtmp/PWRPNT/NEWTIPS.PPT failed for handle 10082 (Transport endpoint is not connected)
 [971] open ./clients/client20/~dmtmp/PWRPNT/NEWTIPS.PPT failed for handle 10080 (Input/output error)
 [990] open ./clients/client6/~dmtmp/PWRPNT/PPTC321.TMP failed for handle 10084 (Transport endpoint is not connected)

I have attached the client log.
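For reference, this is roughly how the run was set up. A minimal sketch only; the server name, volume name, mount point, and dbench time limit below are assumptions, not taken from the output above:

  # Mount the tcp,rdma volume over tcp (tcp is the default fuse-mount
  # transport) and run dbench with 50 clients, matching the run above.
  mount -t glusterfs server1:/testvol /mnt
  cd /mnt
  dbench -t 600 50    # 50 clients, 600-second time limit (includes warmup)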
There are a few xdr decoding errors. Can you please take a look?
I see that all the xdr failures are in 'symlink_cbk'. I want to understand whether the behavior is the same on the master branch. Do you still have access to the setup?
I have cornered the error to just one step:

  mkdir -p <254lenpath>/<254lenpath>/.../    # repeat the 254-length component to cross the 2k limit

The logs are:

On the server:

[2011-10-27 21:43:04.138722] E [rpcsvc.c:476:rpcsvc_handle_rpc_call] 0-rpc-service: No vectored handler present
[2011-10-27 21:43:04.138803] W [rpcsvc.c:1066:rpcsvc_error_reply] (-->/usr/local/lib/libgfrpc.so.0(rpc_transport_notify+0x130) [0x7f78721c06ac] (-->/usr/local/lib/libgfrpc.so.0(rpcsvc_notify+0x181) [0x7f78721bac33] (-->/usr/local/lib/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x35f) [0x7f78721ba8ab]))) 0-: sending a RPC error reply

On the client:

[2011-10-27 21:43:04.139385] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x211) [0x7fd6d6759eb5] (-->/usr/local/lib/glusterfs/3.2.5qa1/xlator/protocol/client.so(client3_1_lookup_cbk+0x13b) [0x7fd6d37fc92d] (-->/usr/local/lib/libgfxdr.so.0(xdr_to_lookup_rsp+0x3c) [0x7fd6d653cd02]))) 0-xdr: XDR decoding failed
[2011-10-27 21:43:04.139411] E [client3_1-fops.c:2174:client3_1_lookup_cbk] 0-hosdu-client-1: error
[2011-10-27 21:43:04.139428] I [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-hosdu-client-1: remote operation failed: Invalid argument
[2011-10-27 21:43:04.139449] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 685: LOOKUP() /1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111/1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111/1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111/1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111/1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111/1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 => -1 (Invalid argument)

I suspect the issue arises because of the 2k inline buffer limit in the RDMA transport. I guess we should increase it to 8k buffers to fix the issue.
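For convenience, that reproducer can be written as a short shell snippet. A sketch, assuming the volume is mounted at /mnt; nine 254-character components are enough to push the full path past 2k:

  # Build one 254-character component ("111...1").
  C=$(printf '1%.0s' $(seq 254))
  cd /mnt    # assumed mount point of the tcp,rdma volume
  # 9 components of 254 bytes plus 8 separators ~= 2294 bytes > 2048,
  # i.e. the path no longer fits in the 2k inline buffer of the RDMA
  # transport.
  mkdir -p "$C/$C/$C/$C/$C/$C/$C/$C/$C"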
Also, I saw this in dmesg on the server. The client-side glusterfs log reports a ping timeout (42 secs), which caused the transport timeout:

INFO: task iozone:5571 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 0000000000000003     0  5571  23319 0x00000080
 ffff880009d73cc8 0000000000000086 0000000000000000 ffffffffa0622c99
 ffff880009d73c38 ffffffff81012979 ffff880009d73c78 ffffffff81098d09
 ffff8800375c10b8 ffff880009d73fd8 000000000000f598 ffff8800375c10b8
Call Trace:
 [<ffffffffa0622c99>] ? rpc_run_task+0xd9/0x130 [sunrpc]
 [<ffffffff81012979>] ? read_tsc+0x9/0x20
 [<ffffffff81098d09>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff81098d09>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff8110d3d0>] ? sync_page+0x0/0x50
 [<ffffffff814dbad3>] io_schedule+0x73/0xc0
 [<ffffffff8110d40d>] sync_page+0x3d/0x50
 [<ffffffff814dc33f>] __wait_on_bit+0x5f/0x90
 [<ffffffff8110d5c3>] wait_on_page_bit+0x73/0x80
 [<ffffffff8108e1a0>] ? wake_bit_function+0x0/0x50
 [<ffffffff81123375>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8110d9db>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff8110dba8>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811a098e>] vfs_fsync_range+0x7e/0xe0
 [<ffffffff811a0a5d>] vfs_fsync+0x1d/0x20
 [<ffffffff811a0a9e>] do_fsync+0x3e/0x60
 [<ffffffff811a0af0>] sys_fsync+0x10/0x20
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
nfs: server 10.1.10.24 not responding, timed out
After I reduced the dbench thread count from 50 to 20, the issues went away.
CHANGE: http://review.gluster.com/697 (Change-Id: I1b0b91eb7216f32a5748325d6dda37f3998302c7) merged in master by Vijay Bellur (vijay)
dbench completed fine with tcp,rdma volume mounted via tcp:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     126548    33.138  14992.630
 Close          92136     3.588  13367.404
 Rename          5275    88.705  15013.519
 Unlink         26106    17.507   3048.435
 Qpathinfo     114030    15.512  14954.688
 Qfileinfo      19571    10.957   2362.818
 Qfsinfo        20852     5.931  13366.600
 Sfileinfo      10163    15.627   1841.571
 Find           43740    55.253  14997.428
 WriteX         61575     2.237  13355.511
 ReadX         194303     1.804  14931.607
 LockX            400     4.980     15.509
 UnlockX          400     4.710     17.717
 Flush           8790   156.435  13384.918