+++ This bug was initially created as a clone of Bug #1401812 +++ Description of problem: Currently readdirp is sequential at the dht layer. This make find and recursive listing of small directories very slow(directory whose content can be accomodated in one readdirp call, eg: ~600 entries if buf size is 128k). The number of readdirp fops required to fetch the ls -l -R for nested directories is: no. of fops = (x + 1) * m * n n = number of bricks m = number of directories x = number of readdirp calls required to fetch the dentries completely (this depends on the size of the directory and the readdirp buf size) 1 = readdirp fop that is sent to just detect the end of directory. Eg: Let's say, to list 800 directories with files ~300 each and readdirp buf size 128K, on distribute 6: (1+1) * 800 * 6 = 9600 fops And all the readdirp fops are not sent in parallel to all the bricks, but sequentially. This patch is a first step towards making readdirp parallel With parallel readdirp, the number of fops may not decrease drastically but since they are issued in parallel, it will increase the throughput. Why its not a straightforward problem to solve: One needs to briefly understand, how the directory offset is handled in dht. [1], [2], [3] are some of the links that will hint the same. - The d_off is in the order of bricks identfied by dht. Hence, the dentries should always be returned in the same order as bricks. i.e. brick2 entries shouldn't be returned before brick1 reaches EOD. - We cannot store any info of offset read so far etc. in inode_ctx or fd_ctx - In case of a very large directories, and readdirp buf too small to hold all the dentries in any brick, parallel readdirp is a overhead. Sequential readdirp best suits the large directories. This demands dht be aware of or speculate the directory size. Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: --- Additional comment from Worker Ant on 2016-12-06 06:06:24 EST --- REVIEW: http://review.gluster.org/16042 (readdir-ahead: Enhance EOD detection logic) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-07 02:12:29 EST --- REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#2) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-08 04:30:02 EST --- REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-08 05:40:27 EST --- REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-08 06:29:59 EST --- REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child of dht) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-29 03:45:41 EST --- REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#2) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-29 03:46:49 EST --- REVIEW: http://review.gluster.org/16042 (readdir-ahead: Enhance EOD detection logic) posted (#2) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-29 03:47:05 EST --- REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#3) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-31 06:14:19 EST --- REVIEW: http://review.gluster.org/16311 (dht: reorg hash calculation, dht server stub xlator will be using it, to filter unhashed directories in readdirp) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-31 06:14:22 EST --- REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-31 06:34:35 EST --- REVIEW: http://review.gluster.org/16311 (dht: move hash calculation code to libglusterfs) posted (#2) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-31 06:34:37 EST --- REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#2) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2016-12-31 06:44:28 EST --- REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#3) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-01 00:10:44 EST --- REVIEW: http://review.gluster.org/16311 (dht: move hash calculation code to libglusterfs) posted (#3) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-01 00:10:46 EST --- REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#3) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-01 03:00:04 EST --- REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht to filter the unhashed dir in readdirp) posted (#4) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-05 06:15:13 EST --- REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#4) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-05 07:57:13 EST --- REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#4) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-05 23:27:55 EST --- COMMIT: http://review.gluster.org/16042 committed in master by Raghavendra G (rgowdapp) ------ commit f60631904defdaec2f1bae84b3cfd6a3e083cf09 Author: Poornima G <pgurusid> Date: Tue Dec 6 16:31:51 2016 +0530 readdir-ahead: Enhance EOD detection logic Issue: Currently end of directory is identified on obtaining a readdirp_cbk with op_ret = 0 (i.e. 0 entries fetched in readdirp). Thus an extra readdirp is required for every directory just to identify EOD. Consider a case of listing large number of small directories. The readdirp fops required are doubled in that case. Solution: On reaching the EOD, posix sets the op_errno to ENOENT, hence along with looking for 'op_ret == 0' we also look for 'operrno == ENOENT' ehile checking for EOD condition Change-Id: I7a5b52e7b98f5dc236c387635fcc651dac0171b3 BUG: 1401812 Signed-off-by: Poornima G <pgurusid> Reviewed-on: http://review.gluster.org/16042 NetBSD-regression: NetBSD Build System <jenkins.org> CentOS-regression: Gluster Build System <jenkins.org> Smoke: Gluster Build System <jenkins.org> Reviewed-by: Raghavendra G <rgowdapp> --- Additional comment from Worker Ant on 2017-01-09 02:17:33 EST --- REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#5) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-09 02:24:06 EST --- REVIEW: http://review.gluster.org/16039 (dht: At places needed use STACK_WIND_COOKIE) posted (#5) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-09 03:31:59 EST --- REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND outside of mutex locks) posted (#6) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-09 04:24:45 EST --- REVIEW: http://review.gluster.org/16360 (posix, readdir-ahead: Handle genuine ENOENT errors in readdirp) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-09 13:37:06 EST --- COMMIT: http://review.gluster.org/16039 committed in master by Jeff Darcy (jdarcy) ------ commit a988741713752c2ec04a0680224d8fa4d42dc203 Author: Poornima G <pgurusid> Date: Tue Dec 6 14:43:10 2016 +0530 dht: At places needed use STACK_WIND_COOKIE Issue: frame has a void * cookie pointer. In case of STACK_WIND_COOKIE frame->cookie is assigned to what is sent by the caller. In case of STACK_WIND frame->cookie is assigned to point point to the frame itself. For ease of coding, at many places, the cookie in the cbk is used to get the pointer to the next xl. This is inconsistent when STACK_WIND_TAIL comes into picture. Eg: dht_setxattr () { for (i = 0 ; i < conf->subvolume_cnt ; i++) { STACK_WIND (..dht_checking_pathinfo_cbk, conf->subvolumes[i] ..); } dht_checking_pathinfo_cbk (...void *cookie...) { prev = cookie; ... for (i = 0; i < conf->subvolume_cnt; i++) { if (conf->subvolumes[i] == prev->this) { ... } } } Consider the below graph: dht (parent) readdir-ahead => Doesn't define setxattr and uses STACK_WIND_TAIL protocol-client With this graph, when dht_checking_pathinfo_cbk is called, cookie will have frame pointing to protocol-client. i.e. prev->this will be protocol-client. But dht was expecting it to be readdir-ahead as it has stored in conf->subvolumes[i] Solution: Hence, as a thumb rule, if cbk is using cookie, then we explicitly call STACK_WIND_COOKIE. Change-Id: I83aea1e24c809c5a91a0db7283e908e125471bd4 BUG: 1401812 Signed-off-by: Poornima G <pgurusid> Reviewed-on: http://review.gluster.org/16039 Reviewed-by: Raghavendra G <rgowdapp> Smoke: Gluster Build System <jenkins.org> NetBSD-regression: NetBSD Build System <jenkins.org> CentOS-regression: Gluster Build System <jenkins.org> --- Additional comment from Worker Ant on 2017-01-09 23:52:33 EST --- COMMIT: http://review.gluster.org/16068 committed in master by Raghavendra G (rgowdapp) ------ commit c89a065af2adc11d5aca3a4500d2e8c1ea02ed28 Author: Poornima G <pgurusid> Date: Mon Jan 9 09:55:26 2017 +0530 readdir-ahead : Perform STACK_UNWIND outside of mutex locks Currently STACK_UNWIND is performnd within ctx->lock. If readdir-ahead is loaded as a child of dht, then there can be scenarios where the function calling STACK_UNWIND becomes re-entrant. Its a good practice to not call STACK_WIND/UNWIND within local mutex's Change-Id: If4e869849d99ce233014a8aad7c4d5eef8dc2e98 BUG: 1401812 Signed-off-by: Poornima G <pgurusid> Reviewed-on: http://review.gluster.org/16068 Smoke: Gluster Build System <jenkins.org> NetBSD-regression: NetBSD Build System <jenkins.org> CentOS-regression: Gluster Build System <jenkins.org> Reviewed-by: Jeff Darcy <jdarcy> Reviewed-by: Raghavendra G <rgowdapp> --- Additional comment from Worker Ant on 2017-01-10 04:08:53 EST --- REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child of dht) posted (#2) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-16 01:22:08 EST --- REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child of dht) posted (#3) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-17 03:14:56 EST --- REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have readdir-ahead as a child of dht) posted (#4) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-17 23:05:16 EST --- COMMIT: http://review.gluster.org/16072 committed in master by Pranith Kumar Karampuri (pkarampu) ------ commit 2c03c753fe77dfadb7660ecb39fe0bbb6bad026f Author: Poornima G <pgurusid> Date: Thu Dec 8 16:48:55 2016 +0530 glusterd: Change the volfile to have readdir-ahead as a child of dht As mentioned in feature page http://review.gluster.org/#/c/16090/ readdir-ahead will be optionally placed below dht. There are two options: 1. performance.readdir-ahead 2. performance.parallel-readdir If only option is enabled, then readdir ahead is placed at its original place as an ancestor of dht. If both the options 1 and 2 are enabled then readdir ahead is placed as a child of dht. Also changes have been made to retain the rebalance, quotad, snapd vol files to remain unchanged. Change-Id: I0adf0b476fcbf91251f5a2fee2241786a3d8255a BUG: 1401812 Signed-off-by: Poornima G <pgurusid> Reviewed-on: http://review.gluster.org/16072 Reviewed-by: Raghavendra G <rgowdapp> Smoke: Gluster Build System <jenkins.org> NetBSD-regression: NetBSD Build System <jenkins.org> CentOS-regression: Gluster Build System <jenkins.org> Reviewed-by: Atin Mukherjee <amukherj> --- Additional comment from Worker Ant on 2017-01-17 23:29:58 EST --- REVIEW: http://review.gluster.org/16424 (glusterd, rda: If parallel readdir is enabled, split the cache limit) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-17 23:51:02 EST --- REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#2) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-18 00:40:22 EST --- REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#3) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-18 01:13:40 EST --- REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise option of dht) posted (#4) for review on master by Poornima G (pgurusid) --- Additional comment from Worker Ant on 2017-01-19 02:40:45 EST --- REVIEW: http://review.gluster.org/16424 (glusterd, rda: If parallel readdir is enabled, split the cache limit) posted (#2) for review on master by Poornima G (pgurusid)
The feature made into 3.10 as part of release branching. This issue was cloned to address bugs in the feature. Since this lead to confusion on status of RFE, closing this bug and tracking individual issues through different bugs.