| Summary: | Add xattr prefetch to speed up directory listings | ||
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Jeff Darcy <jdarcy> |
| Component: | core | Assignee: | Anand Avati <aavati> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | low | Docs Contact: | |
| Priority: | medium | ||
| Version: | mainline | CC: | aavati, amarts, chrisw, gluster-bugs, mohitanchlia |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| URL: | http://git.fedorahosted.org/git/?p=CloudFS.git;a=commit;h=3cd06b2b486fc59c3649c529e2da241feaca7165 | ||
| Whiteboard: | |||
| Fixed In Version: | glusterfs-3.4.0 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-07-24 17:20:40 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Jeff Darcy
2011-04-26 17:38:18 UTC
Comment 1
Anand Avati

Jeff, I clearly see the value of this patch and find it crucial. Some questions on the implementation. I see a parallel pattern with stat-prefetch, both in design and in intention. By design - prefetch operations in the readdir callback, anticipating what's coming next. By intention - quicken "ls -l"-style crawling operations.

1. Because of that intention and the weak caching semantics in stat-prefetch, there was a conscious decision to place the cache within the short-lived fd context instead of the long-lived inode context, and also to subject only the same process to the weak caching semantics (with the explicit pid check).

2. The lifetimes of the newly created inodes are not clear. When are they forgotten? On the client side, the inode table is subject to FORGETs coming from FUSE. On the server side, an LRU algorithm prunes out old inodes. Both of these are manifested as inode->nlookup becoming 0. On the client side, inode->nlookup stays in lock-step with the kernel's LOOKUP and FORGET counts. On the server side, the LRU algorithm resets inode->nlookup to 0 to force the inode to get purged. Of course, in both cases inode->ref must also drop to 0 for the inode to be destroyed. I don't see what the strategy is for bookkeeping the inodes (their refs and nlookups) which are allocated by the translator.

3. Probably a small oversight - there is a leak of the allocated loc_t used to perform the lookup.

I have a few more comments which I'll post later after some thought.

Avati

Comment 2
Jeff Darcy

(In reply to comment #1)
> 1. Because of that intention and the weak caching semantics in stat-prefetch,
> there was a conscious decision to place the cache within the short-lived fd
> context instead of the long-lived inode context, and also to subject only the
> same process to the weak caching semantics (with the explicit pid check).

Good point. I don't think it would be that hard to associate the cached xattrs with the opendir fd and/or pid.

> 2. The lifetimes of the newly created inodes are not clear. When are they
> forgotten? On the client side, the inode table is subject to FORGETs coming
> from FUSE.
> ...
> I don't see what the strategy is for bookkeeping the inodes (their refs and
> nlookups) which are allocated by the translator.

Note that we need an inode - not just a number - to call lookup. This probably shouldn't be the case, but it is currently because of the way other code works. Are you suggesting that the translator simply do periodic sweeps to drop nlookup and/or call xp_forget, or that we actually free the inode when it's no longer needed for the prefetch? In the latter case, it would only be re-created a moment later when the getxattr comes from FUSE, and we'd have to do a more complicated lookup to re-associate it with the cached values.

> 3. Probably a small oversight - there is a leak of the allocated loc_t used to
> perform the lookup.

Are you sure? The loc_t is passed as the cookie for xp_pre_lookup_cbk, which does GF_FREE it.

Comment 3
Anand Avati

(In reply to comment #2)
> Note that we need an inode - not just a number - to call lookup. This probably
> shouldn't be the case, but it is currently because of the way other code works.
> Are you suggesting that the translator simply do periodic sweeps to drop
> nlookup and/or call xp_forget, or that we actually free the inode when it's no
> longer needed for the prefetch?

Will reply to this in another post.

> > 3. Probably a small oversight - there is a leak of the allocated loc_t used
> > to perform the lookup.
>
> Are you sure? The loc_t is passed as the cookie for xp_pre_lookup_cbk, which
> does GF_FREE it.

Just GF_FREE still leaks the references on loc->{inode,parent} and also loc->path.

Avati
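To make the leak in point 3 concrete: a loc_t owns references on loc->inode and loc->parent plus a heap-allocated loc->path, so calling only GF_FREE on the struct leaks all three. The sketch below shows how a speculative child lookup's loc_t might be built and then fully released with loc_wipe(); the xp_* names, the memory-accounting type, and the surrounding flow are hypothetical stand-ins, not the code from the patch.

```c
/* Sketch only: builds a loc_t for a speculative lookup on a readdir entry
 * and releases it completely afterwards.  xp_build_child_loc/xp_destroy_loc
 * are hypothetical; only the loc_t handling is the point here. */

#include <string.h>

#include "xlator.h"     /* loc_t, inode_t, loc_wipe */
#include "mem-pool.h"   /* GF_CALLOC, GF_FREE, gf_asprintf */
#include "mem-types.h"  /* gf_common_mt_char */

static loc_t *
xp_build_child_loc (loc_t *parent_loc, const char *name)
{
        /* Real translator code would use its own memory-accounting type
         * instead of gf_common_mt_char. */
        loc_t *loc = GF_CALLOC (1, sizeof (*loc), gf_common_mt_char);
        if (!loc)
                return NULL;

        /* Take our own references: the directory's loc may be gone by the
         * time the lookup callback runs. */
        loc->parent = inode_ref (parent_loc->inode);
        loc->inode  = inode_new (parent_loc->inode->table);

        if (gf_asprintf ((char **) &loc->path, "%s/%s",
                         parent_loc->path, name) < 0)
                goto err;
        loc->name = strrchr (loc->path, '/') + 1;

        return loc;

err:
        loc_wipe (loc);
        GF_FREE (loc);
        return NULL;
}

static void
xp_destroy_loc (loc_t *loc)
{
        /* GF_FREE (loc) alone is the leak described above: it releases the
         * struct but not loc->inode, loc->parent or loc->path.  loc_wipe()
         * drops the two inode refs and frees the path first. */
        loc_wipe (loc);
        GF_FREE (loc);
}
```

Dropping the translator's own ref this way is only one half of the bookkeeping question in point 2; the nlookup side still has to be handled wherever the prefetched inode gets linked into the table.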
Comment 4
Jeff Darcy

I've incorporated the review comments into a new commit, at http://git.fedorahosted.org/git/?p=CloudFS.git;a=commit;h=600baf631510c6f095cf1f3a03fba19a5f50b87f with the following commit message.

---
Move cache to xlator, fix leak, add stats.

From review suggestions. The cache is now xlator-global and of fixed size (a mere 1024 entries, which should be tunable some day). It's managed as a four-way set-associative cache with LRU. I also fixed a memory leak and added statistics accessible via getxattr(trusted.xattr.stats). This shows a hit rate of nearly 99.9% for my 100K-file test despite the small cache size, with approximately three hits and one eviction per file - exactly as it should be, as the oldest entries constantly get pushed out.
---

The code was written with an eye toward adding a periodic "sweeper" to prune stale entries, but with such a small fixed-size cache (not even linear with respect to fd count, as stat-prefetch is) I'm not sure it's even worth it. What do you guys think?
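For context on the structure Jeff describes: with 1024 entries and four ways there are 256 sets; each key hashes to exactly one set, a hit moves the entry to the front of its set, and a miss evicts only that set's least recently used entry, which is why roughly one eviction per file is expected once the cache is warm. The sketch below is a generic illustration of such a cache with made-up names and a simplified payload; it is not the code from the linked commit.

```c
/* Illustrative 4-way set-associative LRU cache keyed by a 64-bit hash
 * (e.g. of gfid + xattr name).  Names and payload are hypothetical. */

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define XP_CACHE_ENTRIES 1024
#define XP_CACHE_WAYS    4
#define XP_CACHE_SETS    (XP_CACHE_ENTRIES / XP_CACHE_WAYS)

typedef struct {
        uint64_t  key;        /* hash of the cached item's identity */
        char     *value;      /* cached xattr value (owned by the cache) */
        int       valid;
} xp_entry_t;

typedef struct {
        /* ways[0] is most recently used, ways[WAYS-1] least recently used */
        xp_entry_t ways[XP_CACHE_WAYS];
        uint64_t   hits;
        uint64_t   evictions;
} xp_set_t;

static xp_set_t xp_cache[XP_CACHE_SETS];

/* Move way 'i' to the front of its set, shifting the others down by one. */
static void
xp_promote (xp_set_t *set, int i)
{
        xp_entry_t tmp = set->ways[i];
        memmove (&set->ways[1], &set->ways[0], i * sizeof (xp_entry_t));
        set->ways[0] = tmp;
}

char *
xp_cache_get (uint64_t key)
{
        xp_set_t *set = &xp_cache[key % XP_CACHE_SETS];

        for (int i = 0; i < XP_CACHE_WAYS; i++) {
                if (set->ways[i].valid && set->ways[i].key == key) {
                        set->hits++;
                        xp_promote (set, i);
                        return set->ways[0].value;
                }
        }
        return NULL;    /* miss */
}

void
xp_cache_put (uint64_t key, char *value)
{
        xp_set_t   *set = &xp_cache[key % XP_CACHE_SETS];
        xp_entry_t *lru = &set->ways[XP_CACHE_WAYS - 1];

        /* A miss evicts only the least recently used entry of this set. */
        if (lru->valid) {
                set->evictions++;
                free (lru->value);
        }
        lru->key   = key;
        lru->value = value;
        lru->valid = 1;
        xp_promote (set, XP_CACHE_WAYS - 1);
}
```

A handler for a virtual xattr such as the trusted.xattr.stats name mentioned in the commit message would only need to sum the per-set hit and eviction counters to produce the figures Jeff quotes.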
Comment 5
Mohit Anchlia

Is this patch scheduled to be released sometime soon?

Comment 6
Anand Avati

(In reply to comment #5)
> Is this patch scheduled to be released sometime soon?

Mohit, any specific reason why you are actively awaiting this patch? As long as you do not have posix-acl support (not yet in mainline) and have not enabled selinux, prefetching those xattrs has little to no performance benefit.

Avati

Comment 7
Mohit Anchlia

(In reply to comment #6)
> Any specific reason why you are actively awaiting this patch? As long as you
> do not have posix-acl support (not yet in mainline) and have not enabled
> selinux, prefetching those xattrs has little to no performance benefit.

I am experiencing very slow self-healing when dealing with millions of files. While testing node failures and running self-heal, listing or even a summary is very painful and takes hours or days to finish.

Even a simple listing of 30K directories with no files takes a very long time.

Would this help me with the above directory and file listing issues?

Thanks for asking.

Comment 8
Anand Avati

(In reply to comment #7)
> I am experiencing very slow self-healing when dealing with millions of files.
> While testing node failures and running self-heal, listing or even a summary
> is very painful and takes hours or days to finish.
>
> Even a simple listing of 30K directories with no files takes a very long time.
>
> Would this help me with the above directory and file listing issues?

What you are experiencing requires fixes elsewhere. You will certainly benefit from the proactive/asynchronous self-heal feature coming in 3.3. The patch under discussion in this bug will be useful only after the introduction of the posix-acl feature (or if you have selinux enabled).

Avati

Comment 9
Mohit Anchlia

(In reply to comment #8)
> What you are experiencing requires fixes elsewhere. You will certainly benefit
> from the proactive/asynchronous self-heal feature coming in 3.3. The patch
> under discussion in this bug will be useful only after the introduction of the
> posix-acl feature (or if you have selinux enabled).

Thanks! Is there a separate bug/enhancement to speed up directory and file listings?

I am also thinking that when millions of files are involved, I would currently hesitate to run find or ls commands if I need the set of files matching some criteria.
Comment 10
Anand Avati

> Thanks! Is there a separate bug/enhancement to speed up directory and file
> listings?
>
> I am also thinking that when millions of files are involved, I would currently
> hesitate to run find or ls commands if I need the set of files matching some
> criteria.

Have you not seen any benefit from stat-prefetch? Have you considered the NFS access protocol?

Avati
Comment 11
Mohit Anchlia

(In reply to comment #10)
> Have you not seen any benefit from stat-prefetch? Have you considered the NFS
> access protocol?

I have not done any tuning of stat-prefetch. Do I need to tune something? I thought it was enabled by default.

I am really trying not to use NFS, for various reasons, and prefer to use the native client.

Comment 12

Can we close it now, as we have 'md-cache' in upstream?