Created attachment 1379248 [details]
Script to replicate disperse rdma bug

Description of problem:

In disperse volumes with rdma transport, large directories (containing >= 617 files) can't be listed with `ls`. Attempts to do so result in a "Transport endpoint is not connected" error, and the following log messages appear in the mount log:

[2018-01-09 21:33:15.186370] W [MSGID: 103046] [rdma.c:3604:gf_rdma_decode_header] 0-rpc-transport/rdma: received a msg of type RDMA_ERROR
[2018-01-09 21:33:15.186411] W [MSGID: 103046] [rdma.c:4057:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (10.4.1.60:49152), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)
[2018-01-09 21:33:15.186435] W [MSGID: 114031] [client-rpc-fops.c:2577:client3_3_readdirp_cbk] 0-erasure-client-0: remote operation failed [Transport endpoint is not connected]
[2018-01-09 21:33:15.186503] W [fuse-bridge.c:2897:fuse_readdirp_cbk] 0-glusterfs-fuse: 74631173: READDIRP => -1 (Transport endpoint is not connected)

Repeated attempts to ls the directory cause different peers in the cluster to be identified in the log message, indicating that the problem is not with a misconfigured peer.

Files in the problem directories can be accessed directly as normal (ls, cat, etc. work fine on full file paths within the large directories).

Changing the transport type of the disperse volume to tcp and restarting the volume allows the problem directories to be accessed. The issue also does not occur with distributed volumes, only disperse.

Version-Release number of selected component (if applicable):

3.13.1

How reproducible:

Extremely.

Steps to Reproduce:

General approach outlined here. See attached gluster-disperse-rdma-bug.sh for working script to reproduce bug.

1. Create and start disperse volume with rdma transport
2. Mount disperse volume
3. Create directory in mounted disperse volume and create 616 empty files
4. Verify that the directory can be accessed with ls
5. Create the 617th file in the test directory
6. Verify that the directory can no longer be accessed with ls

Actual results:

Large directory cannot be accessed with ls

Expected results:

Large directory should be accessible with ls
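For anyone without the attached shell script to hand, steps 3 to 6 can be sketched roughly in Python. This is only an illustration and not the attached gluster-disperse-rdma-bug.sh: the function names and the `rdma-readdir-test` directory name are made up here, the volume is assumed to be already created and mounted, and `os.listdir` merely stands in for `ls` (whether it exercises the same READDIRP path on a given FUSE mount is an assumption).

```python
import os


def create_empty_files(directory, start, count):
    """Create `count` empty files named file_NNNN, beginning at index `start`."""
    os.makedirs(directory, exist_ok=True)
    for i in range(start, start + count):
        # Opening in write mode and closing immediately leaves an empty file.
        with open(os.path.join(directory, f"file_{i:04d}"), "w"):
            pass


def can_list(directory):
    """Return (True, entry_count) if listing works, else (False, the OSError)."""
    try:
        return True, len(os.listdir(directory))
    except OSError as exc:
        # On an affected mount this would be expected to carry ENOTCONN
        # ("Transport endpoint is not connected").
        return False, exc


def reproduce(mount_point):
    """Run steps 3-6 under `mount_point` (a hypothetical, already-mounted path)."""
    test_dir = os.path.join(mount_point, "rdma-readdir-test")
    create_empty_files(test_dir, 0, 616)    # step 3
    before = can_list(test_dir)             # step 4
    create_empty_files(test_dir, 616, 1)    # step 5
    after = can_list(test_dir)              # step 6
    return before, after
```

Calling `reproduce()` with the mount point of the disperse volume returns the listing result before and after the 617th file is created; on a local filesystem both listings should simply succeed.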
Kernel: 4.9.0-4-amd64
Distro: Debian Stretch (9.2)
Created attachment 1379250 [details] statedump of problem volume
Using gluster 3.12 I see this same behavior with a replica 2 configuration.
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained.

As a result, this bug is being closed.

If the bug persists on a maintained version of Gluster or against the mainline Gluster repository, request that it be reopened and that the Version field be marked appropriately.
Please reopen as a bug under 3.12. It is present in 3.12.9 using transport=RDMA.