Bug 1325857
Summary: | Multi-threaded SHD support | | |
---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | Pranith Kumar K <pkarampu> |
Component: | replicate | Assignee: | Pranith Kumar K <pkarampu> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 3.7.10 | CC: | atalur, bugs, ndevos, paulds, pkarampu, ravishankar, rkavunga, rwareing |
Target Milestone: | --- | Keywords: | FutureFeature, Patch, Triaged |
Target Release: | --- | | |
Hardware: | All | | |
OS: | All | | |
Whiteboard: | | | |
Fixed In Version: | glusterfs-3.7.12 | Doc Type: | Enhancement |
Doc Text: | | Story Points: | --- |
Clone Of: | 1221737 | Environment: | |
Last Closed: | 2016-06-28 12:14:18 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 1221737 | | |
Bug Blocks: | 1314724 | | |
Description
Pranith Kumar K
2016-04-11 11:18:41 UTC
REVIEW: http://review.gluster.org/13967 (syncop: Add parallel dir scan functionality) posted (#1) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

REVIEW: http://review.gluster.org/13967 (syncop: Add parallel dir scan functionality) posted (#2) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

COMMIT: http://review.gluster.org/13967 committed in release-3.7 by Pranith Kumar Karampuri (pkarampu)
------
commit 02a235b5a5fcfffd17debfbf3fceeddffe171682
Author: Pranith Kumar K <pkarampu>
Date: Thu Mar 17 09:32:02 2016 +0530

syncop: Add parallel dir scan functionality

Most of the ideas behind this functionality were contributed by Richard Wareing in his patch:
https://bugzilla.redhat.com/show_bug.cgi?id=1221737#c1
VERY BIG thanks to him :-).

After I started porting and testing the patch above, I found a few things we could improve in it, based on the results we got in testing.

1) We were reading all the indices before launching self-heals. In some customer cases I worked on, almost 5 million files/directories needed heal. With a number that large, the self-heal daemon would be OOM-killed if we went this route, so I modified the patch to launch heals based on a queue length limit.

2) We found that for directory hierarchies the multi-threaded self-heal patch was not giving better results than single-threaded self-heal, because of ordering problems. We improved the index xlator to report the gfid type, so that all directories in the indices are healed before the files that follow them in that iteration of readdir output (http://review.gluster.org/13553). In our testing this led to zero self-heal errors, as we were doing self-heals in parallel only for files and not for directories. I think we can further improve self-heal speed for directories by doing name heals in parallel, using techniques similar to those in Richard's patch. I think the best approach there would be to introduce synccond_t infra (pthread_cond_t kind of infra for syncops), which I am planning to implement for future releases.

3) Based on 1), 2) and the fact that afr already retries the indices in a loop, I removed the additional retries in the threads.

4) After the refactor, the changes required to bring in multi-threaded self-heal for ec would be just ~10 lines, most of them for options initialization.

Our tests found that we are able to easily saturate the network :-).

High level description of the final feature:

Traditionally the self-heal daemon reads the indices (gfids) that need to be healed from the brick and initiates heal one gfid at a time. The goal of this feature is to parallelize self-heals in a way that does not regress in any case but increases parallelism wherever we can. As part of this, the following knobs are introduced:

1) We can launch 'max-jobs' number of heals in parallel.

2) We can keep reading indices as long as the wait-q for heals doesn't go over 'max-qlen'.

Both are passed as arguments to the multi-threaded dir_scan.

As a first cut, we always heal directories serially, one at a time, but for files we launch heals in parallel. In future we can do name-heals of a directory in parallel, but this is not implemented as of now. The reason for this is already mentioned in '2)' above.

AFR/EC can introduce options like max-shd-threads/wait-qlength which users can set to increase the rate of heals when they want. Please note that the options will take effect only for the next crawl.

>BUG: 1221737
>Change-Id: I8fc0afc334def87797f6d41e309cefc722a317d2
>Signed-off-by: Pranith Kumar K <pkarampu>
>Reviewed-on: http://review.gluster.org/13569
>NetBSD-regression: NetBSD Build System <jenkins.org>
>CentOS-regression: Gluster Build System <jenkins.com>
>Reviewed-by: Jeff Darcy <jdarcy>
>Smoke: Gluster Build System <jenkins.com>

BUG: 1325857
Change-Id: I23235bbb923208eee6a8be711bbfb14350edb11b
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/13967
Smoke: Gluster Build System <jenkins.com>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.com>

REVIEW: http://review.gluster.org/14010 (cluster/afr: Use parallel dir scan functionality) posted (#1) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

COMMIT: http://review.gluster.org/14010 committed in release-3.7 by Pranith Kumar Karampuri (pkarampu)
------
commit 80fd2a0d8b3da20755a38195f62fc4d7fc5f7b52
Author: Pranith Kumar K <pkarampu>
Date: Thu Mar 17 09:32:17 2016 +0530

cluster/afr: Use parallel dir scan functionality

>BUG: 1221737
>Change-Id: I0ed71a72f0e33bd733723e00a01cf28378c5534e
>Signed-off-by: Pranith Kumar K <pkarampu>
>Reviewed-on: http://review.gluster.org/13755
>Reviewed-on: http://review.gluster.org/13992
>NetBSD-regression: NetBSD Build System <jenkins.org>
>CentOS-regression: Gluster Build System <jenkins.com>
>Smoke: Gluster Build System <jenkins.com>
>Reviewed-by: Jeff Darcy <jdarcy>

BUG: 1325857
Change-Id: I7c6b2ea065edd7f5dafffeb42fd6c601b4ab8d14
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/14010
Smoke: Gluster Build System <jenkins.com>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.com>

REVIEW: http://review.gluster.org/14017 (op-version: Bump up op-version to 3.7.12) posted (#1) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

COMMIT: http://review.gluster.org/14017 committed in release-3.7 by Pranith Kumar Karampuri (pkarampu)
------
commit e4aef8290e8aac8d7fa345db8703a9c3f95a9f66
Author: Pranith Kumar K <pkarampu>
Date: Mon Apr 18 12:28:34 2016 +0530

op-version: Bump up op-version to 3.7.12

BUG: 1325857
Change-Id: I49286ba60281d543f2acacf45c4f824627ef4167
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/14017
Smoke: Gluster Build System <jenkins.com>
Reviewed-by: Krutika Dhananjay <kdhananj>
CentOS-regression: Gluster Build System <jenkins.com>
NetBSD-regression: NetBSD Build System <jenkins.org>

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.12, please open a new bug report.

glusterfs-3.7.12 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-devel/2016-June/049918.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
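
Illustration (not part of the original bug thread): the scheduling scheme described in the commit message for review 13967 can be sketched roughly as below. This is a minimal Python sketch under stated assumptions, not the GlusterFS C code. The names parallel_dir_scan, heal and the (gfid, is_dir) entry tuples are hypothetical. Healing directories inline in the scanning thread is just one simple way to get the "directories before the files that follow" ordering; whether the real syncop code does it this way is an assumption. The parameters max_jobs and max_qlen stand in for the 'max-jobs' and 'max-qlen' knobs.

```python
import queue
import threading


def parallel_dir_scan(entries, heal, max_jobs=4, max_qlen=16):
    """Heal index entries with bounded parallelism (illustrative only).

    entries: iterable of (gfid, is_dir) tuples, in readdir order.
    heal:    callable taking a gfid; may raise on failure.
    Returns the list of gfids whose heal failed. They are not retried
    here, on the assumption that the caller crawls the index again.
    """
    # Bounded wait queue: put() blocks once max_qlen heals are waiting,
    # so the scanner never reads the whole index into memory.
    wait_q = queue.Queue(maxsize=max_qlen)
    failed = []
    lock = threading.Lock()

    def run_heal(gfid):
        try:
            heal(gfid)
        except Exception:
            with lock:
                failed.append(gfid)

    def worker():
        while True:
            gfid = wait_q.get()
            if gfid is None:  # sentinel: no more work
                return
            run_heal(gfid)

    workers = [threading.Thread(target=worker) for _ in range(max_jobs)]
    for w in workers:
        w.start()

    for gfid, is_dir in entries:
        if is_dir:
            # Directories are healed serially by the scanner itself, so a
            # directory is fully healed before later files are queued.
            run_heal(gfid)
        else:
            wait_q.put(gfid)  # file heals run in parallel on the workers

    for _ in workers:
        wait_q.put(None)
    for w in workers:
        w.join()
    return failed
```

With max_jobs=1 this degrades to roughly the traditional one-gfid-at-a-time behaviour, which is the sense in which the design is meant not to regress. Memory use is bounded by max_qlen regardless of how many entries the index holds.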