Bug 1532842

Summary: Large directories in disperse volumes with rdma transport can't be accessed with ls
Product: [Community] GlusterFS
Component: rdma
Version: 3.13
Hardware: x86_64
OS: Linux
Status: CLOSED EOL
Severity: high
Priority: unspecified
Reporter: shane
Assignee: Mohammed Rafi KC <rkavunga>
CC: bugs, jkinney, rgowdapp, shane
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Bug Blocks: 1692441 (view as bug list)
Last Closed: 2018-06-20 18:25:01 UTC
Type: Bug
Attachments:
  Script to replicate disperse rdma bug
  statedump of problem volume

Description shane 2018-01-09 21:42:42 UTC
Created attachment 1379248 [details]
Script to replicate disperse rdma bug

Description of problem:

In disperse volumes with rdma transport, large directories (containing >= 617 files) can't be listed with `ls`. Attempts to do so result in a "Transport endpoint is not connected" error, and the following log messages appear in the mount log:

[2018-01-09 21:33:15.186370] W [MSGID: 103046] [rdma.c:3604:gf_rdma_decode_header] 0-rpc-transport/rdma: received a msg of type RDMA_ERROR
[2018-01-09 21:33:15.186411] W [MSGID: 103046] [rdma.c:4057:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (10.4.1.60:49152), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)
[2018-01-09 21:33:15.186435] W [MSGID: 114031] [client-rpc-fops.c:2577:client3_3_readdirp_cbk] 0-erasure-client-0: remote operation failed [Transport endpoint is not connected]
[2018-01-09 21:33:15.186503] W [fuse-bridge.c:2897:fuse_readdirp_cbk] 0-glusterfs-fuse: 74631173: READDIRP => -1 (Transport endpoint is not connected)

Repeated attempts to ls the directory cause different peers in the cluster to be identified in the log messages, indicating that the problem is not caused by a single misconfigured peer.

Files in the problem directories can still be accessed directly as normal (ls, cat, etc. work fine on full file paths within the large directories).

Changing the transport type of the disperse volume to tcp and restarting the volume allows the problem directories to be accessed. The issue also does not occur with distributed volumes, only disperse.
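
For reference, the transport change described above can be applied with the standard gluster CLI. This is a minimal sketch, assuming the volume is named "erasure" (inferred from the 0-erasure-client-0 translator name in the log) and can briefly be taken offline; the host name and mount point are placeholders:

  # The transport type can only be changed while the volume is stopped and unmounted.
  gluster volume stop erasure
  gluster volume set erasure config.transport tcp
  gluster volume start erasure
  # Remount over tcp; server1 and /mnt/erasure are placeholder names.
  mount -t glusterfs server1:/erasure /mnt/erasure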

Version-Release number of selected component (if applicable):

3.13.1

How reproducible:

Extremely.

Steps to Reproduce:

The general approach is outlined below; see the attached gluster-disperse-rdma-bug.sh for a working script that reproduces the bug. An illustrative command sketch follows the numbered steps.

1. Create and start a disperse volume with rdma transport
2. Mount the disperse volume
3. Create a directory in the mounted volume and populate it with 616 empty files
4. Verify that the directory can be listed with ls
5. Create a 617th file in the test directory
6. Verify that the directory can no longer be listed with ls
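
For convenience, a minimal command sketch of the steps above (separate from the attached script), assuming a 2+1 disperse layout; the host names, brick paths, volume name, and mount point are placeholders:

  # Placeholders: server1..server3, brick paths, volume name, mount point.
  gluster volume create disperse-test disperse 3 redundancy 1 transport rdma \
      server1:/bricks/b1 server2:/bricks/b2 server3:/bricks/b3
  gluster volume start disperse-test
  mount -t glusterfs -o transport=rdma server1:/disperse-test /mnt/disperse-test

  mkdir /mnt/disperse-test/bigdir
  touch /mnt/disperse-test/bigdir/file{1..616}
  ls /mnt/disperse-test/bigdir       # works

  touch /mnt/disperse-test/bigdir/file617
  ls /mnt/disperse-test/bigdir       # fails: "Transport endpoint is not connected"

The exact file-count threshold presumably depends on file name length, since the log suggests it is the size of the readdirp reply that trips the RDMA_INLINE_THRESHOLD check.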


Actual results:

Large directory cannot be accessed with ls

Expected results:

Large directory should be accessible with ls

Comment 1 shane 2018-01-09 21:49:46 UTC
Kernel: 4.9.0-4-amd64
Distro: Debian Stretch (9.2)

Comment 2 shane 2018-01-09 21:50:25 UTC
Created attachment 1379250 [details]
statedump of problem volume
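
For anyone reproducing this, a statedump like the attached one can normally be generated with the gluster CLI; the volume name below is a placeholder, and the dump files are written on the server nodes (by default under /var/run/gluster):

  # Placeholder volume name; output lands in the statedump directory on the brick hosts.
  gluster volume statedump erasure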

Comment 3 Jim Kinney 2018-06-04 16:10:28 UTC
Using Gluster 3.12, I see the same behavior with a replica 2 configuration.

Comment 4 Shyamsundar 2018-06-20 18:25:01 UTC
This bug was reported against a version of Gluster that is no longer maintained (it has reached EOL). See https://www.gluster.org/release-schedule/ for the versions currently maintained.

As a result, this bug is being closed.

If the bug persists on a maintained version of Gluster or against the mainline Gluster repository, please request that it be reopened and mark the Version field appropriately.

Comment 5 Jim Kinney 2018-06-20 18:36:46 UTC
Please reopen as a bug under 3.12. It is present in 3.12.9 using transport=RDMA.