Bug 1532842 - Large directories in disperse volumes with rdma transport can't be accessed with ls
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: rdma
Version: 3.13
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Mohammed Rafi KC
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1692441
Reported: 2018-01-09 21:42 UTC by shane
Modified: 2019-03-25 15:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1692441
Environment:
Last Closed: 2018-06-20 18:25:01 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Script to replicate disperse rdma bug (2.27 KB, application/x-shellscript)
2018-01-09 21:42 UTC, shane
statedump of problem volume (136.66 KB, application/x-gzip)
2018-01-09 21:50 UTC, shane

Description shane 2018-01-09 21:42:42 UTC
Created attachment 1379248
Script to replicate disperse rdma bug

Description of problem:

In disperse volumes with rdma transport, large directories (containing >= 617 files) can't be listed with `ls`. Attempts to do so result in a "Transport endpoint is not connected" error, and the following log messages appear in the mount log:

[2018-01-09 21:33:15.186370] W [MSGID: 103046] [rdma.c:3604:gf_rdma_decode_header] 0-rpc-transport/rdma: received a msg of type RDMA_ERROR
[2018-01-09 21:33:15.186411] W [MSGID: 103046] [rdma.c:4057:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (10.4.1.60:49152), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)
[2018-01-09 21:33:15.186435] W [MSGID: 114031] [client-rpc-fops.c:2577:client3_3_readdirp_cbk] 0-erasure-client-0: remote operation failed [Transport endpoint is not connected]
[2018-01-09 21:33:15.186503] W [fuse-bridge.c:2897:fuse_readdirp_cbk] 0-glusterfs-fuse: 74631173: READDIRP => -1 (Transport endpoint is not connected)

Repeated attempts to ls the directory will cause different peers in the cluster to be identified in the log message, indicating that the problem is not with a misconfigured peer.
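
To check which peers are named across repeated attempts, the warnings can be tallied per peer address. The following is a minimal Python sketch, not part of the original report: the function name `count_rdma_error_peers` and the regular expression are assumptions based only on the log lines quoted above, and other GlusterFS versions may format the message differently.

```python
import re
from collections import Counter

# Assumed pattern: matches the "peer (IP:PORT)" fragment in warnings
# logged by gf_rdma_process_recv, as seen in the excerpt above.
PEER_RE = re.compile(
    r"gf_rdma_process_recv\].*?peer \((\d{1,3}(?:\.\d{1,3}){3}):(\d+)\)"
)


def count_rdma_error_peers(log_lines):
    """Return a Counter mapping "IP:PORT" to the number of RDMA warnings."""
    peers = Counter()
    for line in log_lines:
        match = PEER_RE.search(line)
        if match:
            peers[f"{match.group(1)}:{match.group(2)}"] += 1
    return peers


# Lines copied from the mount log quoted in this report.
SAMPLE_LOG = [
    "[2018-01-09 21:33:15.186370] W [MSGID: 103046] [rdma.c:3604:gf_rdma_decode_header] "
    "0-rpc-transport/rdma: received a msg of type RDMA_ERROR",
    "[2018-01-09 21:33:15.186411] W [MSGID: 103046] [rdma.c:4057:gf_rdma_process_recv] "
    "0-rpc-transport/rdma: peer (10.4.1.60:49152), couldn't encode or decode the msg "
    "properly or write chunks were not provided for replies that were bigger than "
    "RDMA_INLINE_THRESHOLD (2048)",
    "[2018-01-09 21:33:15.186435] W [MSGID: 114031] [client-rpc-fops.c:2577:client3_3_readdirp_cbk] "
    "0-erasure-client-0: remote operation failed [Transport endpoint is not connected]",
]

peer_counts = count_rdma_error_peers(SAMPLE_LOG)
```

Running the same function over the full mount log after several `ls` attempts should show whether the warnings are spread across multiple peers, as described above, or concentrated on one.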

Files in the problem directories can be accessed directly as normal (`ls`, `cat`, etc. work fine on full file paths within the large directories).

Changing the transport type of the disperse volume to tcp and restarting the volume allows the problem directories to be accessed. The issue also does not occur with distributed volumes, only disperse.

Version-Release number of selected component (if applicable):

3.13.1

How reproducible:

Extremely.

Steps to Reproduce:

The general approach is outlined here. See the attached gluster-disperse-rdma-bug.sh for a working script that reproduces the bug.

1. Create and start disperse volume with rdma transport
2. Mount disperse volume
3. Create directory in mounted disperse volume and create 616 empty files
4. Verify that the directory can be accessed with ls
5. Create the 617th file in the test directory
6. Verify that the directory can no longer be accessed with ls
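
Steps 3 to 6 can also be sketched in Python as a search for the file count at which listing first fails. This is an illustrative sketch and not the attached script: it assumes the disperse volume is already created, started, and mounted (steps 1 and 2), and the function name `find_listing_threshold` and the mount path in the usage comment are hypothetical.

```python
import os


def find_listing_threshold(directory, max_files=1000):
    """Create empty files one by one in `directory`.

    Returns the number of files present when os.listdir() first raises
    OSError, or None if listing never fails within `max_files` files.
    """
    for count in range(1, max_files + 1):
        path = os.path.join(directory, f"file{count:04d}")
        # Create an empty file, like `touch`.
        open(path, "w").close()
        try:
            os.listdir(directory)
        except OSError as err:
            # On the affected setup the error text is expected to be
            # "Transport endpoint is not connected".
            print(f"listing failed with {count} files: {err}")
            return count
    return None


# Hypothetical usage on an empty directory inside the mounted volume:
# find_listing_threshold("/mnt/erasure/testdir")
```

On a healthy filesystem the function returns None. On a setup affected as described in this report, it would presumably return 617.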


Actual results:

Large directory cannot be accessed with ls

Expected results:

Large directory should be accessible with ls

Comment 1 shane 2018-01-09 21:49:46 UTC
Kernel: 4.9.0-4-amd64
Distro: Debian Stretch (9.2)

Comment 2 shane 2018-01-09 21:50:25 UTC
Created attachment 1379250
statedump of problem volume

Comment 3 Jim Kinney 2018-06-04 16:10:28 UTC
Using gluster 3.12 I see this same behavior with a replica 2 configuration.

Comment 4 Shyamsundar 2018-06-20 18:25:01 UTC
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of gluster or against the mainline gluster repository, request that it be reopened and the Version field be marked appropriately.

Comment 5 Jim Kinney 2018-06-20 18:36:46 UTC
Please reopen as a bug under 3.12. It is present in 3.12.9 using transport=RDMA.

