Bug 1415917 - RFE: Make readdirp parallel in dht
Summary: RFE: Make readdirp parallel in dht
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.10
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1401812 1424995
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-01-24 04:53 UTC by Raghavendra G
Modified: 2017-02-20 09:47 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1401812
Environment:
Last Closed: 2017-01-27 02:32:55 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)

Description Raghavendra G 2017-01-24 04:53:05 UTC
+++ This bug was initially created as a clone of Bug #1401812 +++

Description of problem:
Currently readdirp is sequential at the dht layer. This make find and recursive listing of small directories very slow(directory whose content can be accomodated in one readdirp call, eg: ~600 entries if buf size is 128k).

The number of readdirp fops
required to fetch the ls -l -R for nested directories is:
no. of fops = (x + 1) * m * n
n = number of bricks
m = number of directories
x = number of readdirp calls required to fetch the dentries completely
(this depends on the size of the directory and the readdirp buf size)
1 = readdirp fop that is sent to just detect the end of directory.

Eg: Let's say, to list 800 directories with files ~300 each and readdirp
buf size 128K, on distribute 6:
(1+1) * 800 * 6 = 9600 fops

And all the readdirp fops are not sent in parallel to all the bricks, but
sequentially. This patch is a first step towards making readdirp parallel
With parallel readdirp, the number of fops may not decrease drastically
but since they are issued in parallel, it will increase the throughput.

Why its not a straightforward problem to solve:
One needs to briefly understand, how the directory offset is handled in dht.
[1], [2], [3] are some of the links that will hint the same.
- The d_off is in the order of bricks identfied by dht. Hence, the dentries
should always be returned in the same order as bricks. i.e. brick2 entries
shouldn't be returned before brick1 reaches EOD.
- We cannot store any info of offset read so far etc. in inode_ctx or fd_ctx
- In case of a very large directories, and readdirp buf too small to hold
all the dentries in any brick, parallel readdirp is a overhead. Sequential
readdirp best suits the large directories. This demands dht be aware of or
speculate the directory size.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Worker Ant on 2016-12-06 06:06:24 EST ---

REVIEW: http://review.gluster.org/16042 (readdir-ahead: Enhance EOD detection logic) posted (#1) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-07 02:12:29 EST ---

REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#2) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-08 04:30:02 EST ---

REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#1) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-08 05:40:27 EST ---

REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#1) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-08 06:29:59 EST ---

REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child           of dht) posted (#1) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-29 03:45:41 EST ---

REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#2) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-29 03:46:49 EST ---

REVIEW: http://review.gluster.org/16042 (readdir-ahead: Enhance EOD detection logic) posted (#2) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-29 03:47:05 EST ---

REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#3) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-31 06:14:19 EST ---

REVIEW: http://review.gluster.org/16311 (dht: reorg hash calculation, dht server stub xlator will be using      it, to filter unhashed directories in readdirp) posted (#1) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-31 06:14:22 EST ---

REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#1) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-31 06:34:35 EST ---

REVIEW: http://review.gluster.org/16311 (dht: move hash calculation code to libglusterfs) posted (#2) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-31 06:34:37 EST ---

REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#2) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2016-12-31 06:44:28 EST ---

REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#3) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-01 00:10:44 EST ---

REVIEW: http://review.gluster.org/16311 (dht: move hash calculation code to libglusterfs) posted (#3) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-01 00:10:46 EST ---

REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#3) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-01 03:00:04 EST ---

REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#4) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-05 06:15:13 EST ---

REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#4) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-05 07:57:13 EST ---

REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#4) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-05 23:27:55 EST ---

COMMIT: http://review.gluster.org/16042 committed in master by Raghavendra G (rgowdapp@redhat.com) 
------
commit f60631904defdaec2f1bae84b3cfd6a3e083cf09
Author: Poornima G <pgurusid@redhat.com>
Date:   Tue Dec 6 16:31:51 2016 +0530

    readdir-ahead: Enhance EOD detection logic
    
    Issue:
    Currently end of directory is identified on obtaining a
    readdirp_cbk with op_ret = 0 (i.e. 0 entries fetched in
    readdirp). Thus an extra readdirp is required for every
    directory just to identify EOD. Consider a case of listing
    large number of small directories. The readdirp fops required
    are doubled in that case.
    
    Solution:
    On reaching the EOD, posix sets the op_errno to ENOENT,
    hence along with looking for 'op_ret == 0' we also
    look for 'operrno == ENOENT' ehile checking for EOD condition
    
    Change-Id: I7a5b52e7b98f5dc236c387635fcc651dac0171b3
    BUG: 1401812
    Signed-off-by: Poornima G <pgurusid@redhat.com>
    Reviewed-on: http://review.gluster.org/16042
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp@redhat.com>

--- Additional comment from Worker Ant on 2017-01-09 02:17:33 EST ---

REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#5) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-09 02:24:06 EST ---

REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#5) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-09 03:31:59 EST ---

REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#6) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-09 04:24:45 EST ---

REVIEW: http://review.gluster.org/16360 (posix, readdir-ahead: Handle genuine ENOENT errors in readdirp) posted (#1) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-09 13:37:06 EST ---

COMMIT: http://review.gluster.org/16039 committed in master by Jeff Darcy (jdarcy@redhat.com) 
------
commit a988741713752c2ec04a0680224d8fa4d42dc203
Author: Poornima G <pgurusid@redhat.com>
Date:   Tue Dec 6 14:43:10 2016 +0530

    dht: At places needed use STACK_WIND_COOKIE
    
    Issue:
    frame has a void * cookie pointer.
    In case of STACK_WIND_COOKIE frame->cookie is assigned
    to what is sent by the caller.
    In case of STACK_WIND frame->cookie is assigned to point
    point to the frame itself.
    
    For ease of coding, at many places, the cookie in the cbk
    is used to get the pointer to the next xl. This is
    inconsistent when STACK_WIND_TAIL comes into picture.
    
    Eg: dht_setxattr () {
        for (i = 0 ; i < conf->subvolume_cnt ; i++) {
           STACK_WIND (..dht_checking_pathinfo_cbk,
                       conf->subvolumes[i] ..);
        }
    
        dht_checking_pathinfo_cbk (...void *cookie...) {
            prev = cookie;
            ...
            for (i = 0; i < conf->subvolume_cnt; i++) {
                if (conf->subvolumes[i] == prev->this) {
                     ...
                }
            }
        }
    
        Consider the below graph:
        dht (parent)
        readdir-ahead  => Doesn't define setxattr and uses STACK_WIND_TAIL
        protocol-client
    
        With this graph, when dht_checking_pathinfo_cbk is called,
        cookie will have frame pointing to protocol-client.
        i.e. prev->this will be protocol-client. But dht was expecting
        it to be readdir-ahead as it has stored in conf->subvolumes[i]
    
    Solution:
        Hence, as a thumb rule, if cbk is using cookie, then we explicitly
        call STACK_WIND_COOKIE.
    
    Change-Id: I83aea1e24c809c5a91a0db7283e908e125471bd4
    BUG: 1401812
    Signed-off-by: Poornima G <pgurusid@redhat.com>
    Reviewed-on: http://review.gluster.org/16039
    Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>

--- Additional comment from Worker Ant on 2017-01-09 23:52:33 EST ---

COMMIT: http://review.gluster.org/16068 committed in master by Raghavendra G (rgowdapp@redhat.com) 
------
commit c89a065af2adc11d5aca3a4500d2e8c1ea02ed28
Author: Poornima G <pgurusid@redhat.com>
Date:   Mon Jan 9 09:55:26 2017 +0530

    readdir-ahead : Perform STACK_UNWIND outside of mutex locks
    
    Currently STACK_UNWIND is performnd within ctx->lock.
    If readdir-ahead is loaded as a child of dht, then there
    can be scenarios where the function calling STACK_UNWIND
    becomes re-entrant. Its a good practice to not call
    STACK_WIND/UNWIND within local mutex's
    
    Change-Id: If4e869849d99ce233014a8aad7c4d5eef8dc2e98
    BUG: 1401812
    Signed-off-by: Poornima G <pgurusid@redhat.com>
    Reviewed-on: http://review.gluster.org/16068
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
    Reviewed-by: Raghavendra G <rgowdapp@redhat.com>

--- Additional comment from Worker Ant on 2017-01-10 04:08:53 EST ---

REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child           of dht) posted (#2) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-16 01:22:08 EST ---

REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child           of dht) posted (#3) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-17 03:14:56 EST ---

REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child           of dht) posted (#4) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-17 23:05:16 EST ---

COMMIT: http://review.gluster.org/16072 committed in master by Pranith Kumar Karampuri (pkarampu@redhat.com) 
------
commit 2c03c753fe77dfadb7660ecb39fe0bbb6bad026f
Author: Poornima G <pgurusid@redhat.com>
Date:   Thu Dec 8 16:48:55 2016 +0530

    glusterd: Change the volfile to have readdir-ahead as a child
              of dht
    
    As mentioned in feature page http://review.gluster.org/#/c/16090/
    readdir-ahead will be optionally placed below dht.
    
    There are two options:
    1. performance.readdir-ahead
    2. performance.parallel-readdir
    
    If only option is enabled, then readdir ahead is placed at its
    original place as an ancestor of dht. If both the options 1 and 2
    are enabled then readdir ahead is placed as a child of dht.
    
    Also changes have been made to retain the rebalance, quotad,
    snapd vol files to remain unchanged.
    
    Change-Id: I0adf0b476fcbf91251f5a2fee2241786a3d8255a
    BUG: 1401812
    Signed-off-by: Poornima G <pgurusid@redhat.com>
    Reviewed-on: http://review.gluster.org/16072
    Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Atin Mukherjee <amukherj@redhat.com>

--- Additional comment from Worker Ant on 2017-01-17 23:29:58 EST ---

REVIEW: http://review.gluster.org/16424 (glusterd, rda: If parallel readdir is enabled, split the cache limit) posted (#1) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-17 23:51:02 EST ---

REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#2) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-18 00:40:22 EST ---

REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#3) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-18 01:13:40 EST ---

REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#4) for review on master by Poornima G (pgurusid@redhat.com)

--- Additional comment from Worker Ant on 2017-01-19 02:40:45 EST ---

REVIEW: http://review.gluster.org/16424 (glusterd, rda: If parallel readdir is enabled, split the cache limit) posted (#2) for review on master by Poornima G (pgurusid@redhat.com)

Comment 1 Raghavendra G 2017-01-27 02:32:55 UTC
The feature made into 3.10 as part of release branching. This issue was cloned to address bugs in the feature. Since this lead to confusion on status of RFE, closing this bug and tracking individual issues through different bugs.


Note You need to log in before you can comment on or make changes to this bug.