Bug 1401812 - RFE: Make readdirp parallel in dht
Summary: RFE: Make readdirp parallel in dht
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Poornima G
QA Contact:
URL:
Whiteboard: dht-readdirp-perf
Depends On:
Blocks: 1264911 1415917 1424995
TreeView+ depends on / blocked
 
Reported: 2016-12-06 07:26 UTC by Poornima G
Modified: 2018-12-10 02:51 UTC (History)
3 users (show)

Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1415917 1424995 (view as bug list)
Environment:
Last Closed: 2017-05-30 18:37:22 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Poornima G 2016-12-06 07:26:48 UTC
Description of problem:
Currently readdirp is sequential at the dht layer. This make find and recursive listing of small directories very slow(directory whose content can be accomodated in one readdirp call, eg: ~600 entries if buf size is 128k).

The number of readdirp fops
required to fetch the ls -l -R for nested directories is:
no. of fops = (x + 1) * m * n
n = number of bricks
m = number of directories
x = number of readdirp calls required to fetch the dentries completely
(this depends on the size of the directory and the readdirp buf size)
1 = readdirp fop that is sent to just detect the end of directory.

Eg: Let's say, to list 800 directories with files ~300 each and readdirp
buf size 128K, on distribute 6:
(1+1) * 800 * 6 = 9600 fops

And all the readdirp fops are not sent in parallel to all the bricks, but
sequentially. This patch is a first step towards making readdirp parallel
With parallel readdirp, the number of fops may not decrease drastically
but since they are issued in parallel, it will increase the throughput.

Why its not a straightforward problem to solve:
One needs to briefly understand, how the directory offset is handled in dht.
[1], [2], [3] are some of the links that will hint the same.
- The d_off is in the order of bricks identfied by dht. Hence, the dentries
should always be returned in the same order as bricks. i.e. brick2 entries
shouldn't be returned before brick1 reaches EOD.
- We cannot store any info of offset read so far etc. in inode_ctx or fd_ctx
- In case of a very large directories, and readdirp buf too small to hold
all the dentries in any brick, parallel readdirp is a overhead. Sequential
readdirp best suits the large directories. This demands dht be aware of or
speculate the directory size.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Worker Ant 2016-12-06 11:06:24 UTC
REVIEW: http://review.gluster.org/16042 (readdir-ahead: Enhance EOD detection logic) posted (#1) for review on master by Poornima G (pgurusid)

Comment 2 Worker Ant 2016-12-07 07:12:29 UTC
REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#2) for review on master by Poornima G (pgurusid)

Comment 3 Worker Ant 2016-12-08 09:30:02 UTC
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#1) for review on master by Poornima G (pgurusid)

Comment 4 Worker Ant 2016-12-08 10:40:27 UTC
REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#1) for review on master by Poornima G (pgurusid)

Comment 5 Worker Ant 2016-12-08 11:29:59 UTC
REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child           of dht) posted (#1) for review on master by Poornima G (pgurusid)

Comment 6 Worker Ant 2016-12-29 08:45:41 UTC
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#2) for review on master by Poornima G (pgurusid)

Comment 7 Worker Ant 2016-12-29 08:46:49 UTC
REVIEW: http://review.gluster.org/16042 (readdir-ahead: Enhance EOD detection logic) posted (#2) for review on master by Poornima G (pgurusid)

Comment 8 Worker Ant 2016-12-29 08:47:05 UTC
REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#3) for review on master by Poornima G (pgurusid)

Comment 9 Worker Ant 2016-12-31 11:14:19 UTC
REVIEW: http://review.gluster.org/16311 (dht: reorg hash calculation, dht server stub xlator will be using      it, to filter unhashed directories in readdirp) posted (#1) for review on master by Poornima G (pgurusid)

Comment 10 Worker Ant 2016-12-31 11:14:22 UTC
REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#1) for review on master by Poornima G (pgurusid)

Comment 11 Worker Ant 2016-12-31 11:34:35 UTC
REVIEW: http://review.gluster.org/16311 (dht: move hash calculation code to libglusterfs) posted (#2) for review on master by Poornima G (pgurusid)

Comment 12 Worker Ant 2016-12-31 11:34:37 UTC
REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#2) for review on master by Poornima G (pgurusid)

Comment 13 Worker Ant 2016-12-31 11:44:28 UTC
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#3) for review on master by Poornima G (pgurusid)

Comment 14 Worker Ant 2017-01-01 05:10:44 UTC
REVIEW: http://review.gluster.org/16311 (dht: move hash calculation code to libglusterfs) posted (#3) for review on master by Poornima G (pgurusid)

Comment 15 Worker Ant 2017-01-01 05:10:46 UTC
REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#3) for review on master by Poornima G (pgurusid)

Comment 16 Worker Ant 2017-01-01 08:00:04 UTC
REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#4) for review on master by Poornima G (pgurusid)

Comment 17 Worker Ant 2017-01-05 11:15:13 UTC
REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#4) for review on master by Poornima G (pgurusid)

Comment 18 Worker Ant 2017-01-05 12:57:13 UTC
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#4) for review on master by Poornima G (pgurusid)

Comment 19 Worker Ant 2017-01-06 04:27:55 UTC
COMMIT: http://review.gluster.org/16042 committed in master by Raghavendra G (rgowdapp) 
------
commit f60631904defdaec2f1bae84b3cfd6a3e083cf09
Author: Poornima G <pgurusid>
Date:   Tue Dec 6 16:31:51 2016 +0530

    readdir-ahead: Enhance EOD detection logic
    
    Issue:
    Currently end of directory is identified on obtaining a
    readdirp_cbk with op_ret = 0 (i.e. 0 entries fetched in
    readdirp). Thus an extra readdirp is required for every
    directory just to identify EOD. Consider a case of listing
    large number of small directories. The readdirp fops required
    are doubled in that case.
    
    Solution:
    On reaching the EOD, posix sets the op_errno to ENOENT,
    hence along with looking for 'op_ret == 0' we also
    look for 'operrno == ENOENT' ehile checking for EOD condition
    
    Change-Id: I7a5b52e7b98f5dc236c387635fcc651dac0171b3
    BUG: 1401812
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/16042
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 20 Worker Ant 2017-01-09 07:17:33 UTC
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#5) for review on master by Poornima G (pgurusid)

Comment 21 Worker Ant 2017-01-09 07:24:06 UTC
REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#5) for review on master by Poornima G (pgurusid)

Comment 22 Worker Ant 2017-01-09 08:31:59 UTC
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#6) for review on master by Poornima G (pgurusid)

Comment 23 Worker Ant 2017-01-09 09:24:45 UTC
REVIEW: http://review.gluster.org/16360 (posix, readdir-ahead: Handle genuine ENOENT errors in readdirp) posted (#1) for review on master by Poornima G (pgurusid)

Comment 24 Worker Ant 2017-01-09 18:37:06 UTC
COMMIT: http://review.gluster.org/16039 committed in master by Jeff Darcy (jdarcy) 
------
commit a988741713752c2ec04a0680224d8fa4d42dc203
Author: Poornima G <pgurusid>
Date:   Tue Dec 6 14:43:10 2016 +0530

    dht: At places needed use STACK_WIND_COOKIE
    
    Issue:
    frame has a void * cookie pointer.
    In case of STACK_WIND_COOKIE frame->cookie is assigned
    to what is sent by the caller.
    In case of STACK_WIND frame->cookie is assigned to point
    point to the frame itself.
    
    For ease of coding, at many places, the cookie in the cbk
    is used to get the pointer to the next xl. This is
    inconsistent when STACK_WIND_TAIL comes into picture.
    
    Eg: dht_setxattr () {
        for (i = 0 ; i < conf->subvolume_cnt ; i++) {
           STACK_WIND (..dht_checking_pathinfo_cbk,
                       conf->subvolumes[i] ..);
        }
    
        dht_checking_pathinfo_cbk (...void *cookie...) {
            prev = cookie;
            ...
            for (i = 0; i < conf->subvolume_cnt; i++) {
                if (conf->subvolumes[i] == prev->this) {
                     ...
                }
            }
        }
    
        Consider the below graph:
        dht (parent)
        readdir-ahead  => Doesn't define setxattr and uses STACK_WIND_TAIL
        protocol-client
    
        With this graph, when dht_checking_pathinfo_cbk is called,
        cookie will have frame pointing to protocol-client.
        i.e. prev->this will be protocol-client. But dht was expecting
        it to be readdir-ahead as it has stored in conf->subvolumes[i]
    
    Solution:
        Hence, as a thumb rule, if cbk is using cookie, then we explicitly
        call STACK_WIND_COOKIE.
    
    Change-Id: I83aea1e24c809c5a91a0db7283e908e125471bd4
    BUG: 1401812
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/16039
    Reviewed-by: Raghavendra G <rgowdapp>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 25 Worker Ant 2017-01-10 04:52:33 UTC
COMMIT: http://review.gluster.org/16068 committed in master by Raghavendra G (rgowdapp) 
------
commit c89a065af2adc11d5aca3a4500d2e8c1ea02ed28
Author: Poornima G <pgurusid>
Date:   Mon Jan 9 09:55:26 2017 +0530

    readdir-ahead : Perform STACK_UNWIND outside of mutex locks
    
    Currently STACK_UNWIND is performnd within ctx->lock.
    If readdir-ahead is loaded as a child of dht, then there
    can be scenarios where the function calling STACK_UNWIND
    becomes re-entrant. Its a good practice to not call
    STACK_WIND/UNWIND within local mutex's
    
    Change-Id: If4e869849d99ce233014a8aad7c4d5eef8dc2e98
    BUG: 1401812
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/16068
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 26 Worker Ant 2017-01-10 09:08:53 UTC
REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child           of dht) posted (#2) for review on master by Poornima G (pgurusid)

Comment 27 Worker Ant 2017-01-16 06:22:08 UTC
REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child           of dht) posted (#3) for review on master by Poornima G (pgurusid)

Comment 28 Worker Ant 2017-01-17 08:14:56 UTC
REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child           of dht) posted (#4) for review on master by Poornima G (pgurusid)

Comment 29 Worker Ant 2017-01-18 04:05:16 UTC
COMMIT: http://review.gluster.org/16072 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 2c03c753fe77dfadb7660ecb39fe0bbb6bad026f
Author: Poornima G <pgurusid>
Date:   Thu Dec 8 16:48:55 2016 +0530

    glusterd: Change the volfile to have readdir-ahead as a child
              of dht
    
    As mentioned in feature page http://review.gluster.org/#/c/16090/
    readdir-ahead will be optionally placed below dht.
    
    There are two options:
    1. performance.readdir-ahead
    2. performance.parallel-readdir
    
    If only option is enabled, then readdir ahead is placed at its
    original place as an ancestor of dht. If both the options 1 and 2
    are enabled then readdir ahead is placed as a child of dht.
    
    Also changes have been made to retain the rebalance, quotad,
    snapd vol files to remain unchanged.
    
    Change-Id: I0adf0b476fcbf91251f5a2fee2241786a3d8255a
    BUG: 1401812
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/16072
    Reviewed-by: Raghavendra G <rgowdapp>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>

Comment 30 Worker Ant 2017-01-18 04:29:58 UTC
REVIEW: http://review.gluster.org/16424 (glusterd, rda: If parallel readdir is enabled, split the cache limit) posted (#1) for review on master by Poornima G (pgurusid)

Comment 31 Worker Ant 2017-01-18 04:51:02 UTC
REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#2) for review on master by Poornima G (pgurusid)

Comment 32 Worker Ant 2017-01-18 05:40:22 UTC
REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#3) for review on master by Poornima G (pgurusid)

Comment 33 Worker Ant 2017-01-18 06:13:40 UTC
REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#4) for review on master by Poornima G (pgurusid)

Comment 34 Worker Ant 2017-01-19 07:40:45 UTC
REVIEW: http://review.gluster.org/16424 (glusterd, rda: If parallel readdir is enabled, split the cache limit) posted (#2) for review on master by Poornima G (pgurusid)

Comment 35 Shyamsundar 2017-03-06 17:38:27 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/

Comment 36 Worker Ant 2017-03-10 03:52:17 UTC
REVIEW: https://review.gluster.org/16360 (posix, readdir-ahead: Handle genuine ENOENT errors in readdirp) posted (#2) for review on master by Atin Mukherjee (amukherj)

Comment 37 Poornima G 2017-03-14 11:36:46 UTC
Not sure why was this reopened?

Comment 38 Shyamsundar 2017-05-30 18:37:22 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/


Note You need to log in before you can comment on or make changes to this bug.