Bug 764576 (GLUSTER-2844) - Add xattr prefetch to speed up directory listings
Summary: Add xattr prefetch to speed up directory listings
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-2844
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: x86_64
OS: Linux
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Anand Avati
QA Contact:
URL: http://git.fedorahosted.org/git/?p=Cl...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-04-26 17:38 UTC by Jeff Darcy
Modified: 2015-09-01 23:05 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-24 17:20:40 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Jeff Darcy 2011-04-26 17:38:18 UTC
In my investigation of why directory listings are slow, I found that most of the time was spent not fetching stat(2) data, but fetching the following xattrs:

* security.capability
* security.selinux
* system.posix_acl_access
* system.posix_acl_default

To address this, I've implemented a new translator that pre-fetches these xattrs as soon as the file appears in a readdir result, so that by the time somebody actually asks for them they'll already be there.  The results for a 100K-file directory are fairly dramatic:

* GlusterFS trunk as of today: 2m24.900s
* Trunk plus new translator: 0m19.532s

User and system time are also reduced, and the listing output itself is unchanged. Because we're pre-fetching, there is some potential for inconsistency, but these are infrequently-changing xattrs and there is built-in protection against using results more than two seconds old.  This improvement seems particularly important since directory listings are also used as part of self-heal and rebalance operations, and some users have had to wait days for those operations to complete.
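
The two-second staleness guard can be pictured with a minimal sketch like the following (the entry layout and names are hypothetical illustrations, not the actual translator code):

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define XATTR_MAX_AGE_SEC 2

typedef struct {
    char   *value;       /* prefetched xattr value */
    size_t  value_len;
    time_t  fetched_at;  /* when the prefetch completed */
} prefetched_xattr_t;

/* Serve a prefetched value only if it was fetched within the last
 * two seconds; older entries are ignored and fetched again. */
static bool xattr_entry_is_fresh(const prefetched_xattr_t *entry)
{
    return (time(NULL) - entry->fetched_at) <= XATTR_MAX_AGE_SEC;
}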

This code was developed as part of CloudFS and is currently set up to build "out of tree" using the Fedora glusterfs-devel package.  I'm willing to donate it upstream if you want it, though, and we can sort out packaging issues and such if/when you make that decision.

Comment 1 Anand Avati 2011-04-27 03:47:13 UTC
Jeff,
  I clearly see the value of this patch and find it crucial. A few questions on the implementation: I see a parallel pattern with stat-prefetch, both in design and in intention. In design, both prefetch operations in the readdir callback, anticipating what is coming next; in intention, both aim to speed up 'ls -l'-style crawling operations.

1. Due to the nature of that intention and the weak caching semantics, stat-prefetch makes a conscious decision to place its cache in the short-lived fd context rather than the long-lived inode context, and to subject only the same process to the weak caching semantics (via an explicit pid check).

2. The lifetime of the new inodes being created is not clear. When are they forgotten? On the client side, the inode table is subject to FORGETs coming from FUSE; on the server side, an LRU algorithm prunes out old inodes. Both of these are manifested as inode->nlookup dropping to 0. On the client side, inode->nlookup stays in lockstep with the kernel's LOOKUP and FORGET counts; on the server side, the LRU algorithm resets inode->nlookup to 0 to force the inode to be purged. Of course, in both cases inode->ref must also be 0 for the inode to be destroyed. I don't see what the strategy is for bookkeeping the inodes (their refs and nlookups) allocated by the translator. (See the simplified sketch of this destruction condition after item 3.)

3. Probably a small oversight: there is a leak of the allocated loc_t used to perform the lookup.
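
For point 2, a minimal self-contained sketch of the destruction condition described there (the types and names are hypothetical illustrations, not GlusterFS code):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t nlookup;  /* driven by kernel LOOKUP/FORGET on the client,
                          or reset by the LRU pruner on the server */
    uint64_t ref;      /* internal references held by translators */
} toy_inode_t;

/* An inode may only be destroyed once both counters have dropped to zero. */
static bool toy_inode_maybe_destroy(toy_inode_t *inode)
{
    if (inode->nlookup == 0 && inode->ref == 0) {
        free(inode);
        return true;
    }
    return false;
}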

I have a few more comments which I'll post later after some thought.

Avati

Comment 2 Jeff Darcy 2011-04-27 09:58:05 UTC
(In reply to comment #1)
> 1. Due to the nature of the intention and weak caching semantics in
> stat-prefetch, there is a conscious decision of placing the cache within the
> short-lived fd context instead of the long-termed inode context. And also
> subject the same process alone to the weak caching semantics (with the 
> explicit pid check).

Good point.  I don't think it would be that hard to associate the cached xattrs
with the opendir fd and/or pid.

> 2. Lifetimes of the new inodes created is not clear. When are they forgotten?
> In the client side, the inode table is subject to 'FORGET's coming from FUSE.
> ...
> I don't see what the strategy is for bookkeeping 
> the inodes (their refs and nlookups) which are allocated by the translator.

Note that we need an inode - not just a number - to call lookup. This probably
shouldn't be the case, but it is currently because of the way other code works.
Are you suggesting that the translator simply do periodic sweeps to drop
nlookup and/or call xp_forget, or that we actually free the inode when it's no
longer needed for the prefetch?  In that case, it would only be re-created
a moment later when the getxattr comes from FUSE, and we'd have to do a more
complicated lookup to re-associate it with the cached values.

> 3. Probably a small overlook - there is leak of the allocated loc_t used to
> perform the lookup.

Are you sure?  The loc_t is passed as the cookie to xp_pre_lookup_cbk, which
GF_FREEs it.

Comment 3 Anand Avati 2011-05-04 10:12:47 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > 1. Due to the nature of the intention and weak caching semantics in
> > stat-prefetch, there is a conscious decision of placing the cache within the
> > short-lived fd context instead of the long-termed inode context. And also
> > subject the same process alone to the weak caching semantics (with the 
> > explicit pid check).
> 
> Good point.  I don't think it would be that hard to associate the cached xattrs
> with the opendir fd and/or pid.
> 
> > 2. Lifetimes of the new inodes created is not clear. When are they forgotten?
> > In the client side, the inode table is subject to 'FORGET's coming from FUSE.
> > ...
> > I don't see what the strategy is for bookkeeping 
> > the inodes (their refs and nlookups) which are allocated by the translator.
> 
> Note that we need an inode - not just a number - to call lookup. This probably
> shouldn't be the case, but it is currently because of the way other code works.
> Are you suggesting that the translator simply do periodic sweeps to drop
> nlookup and/or call xp_forget, or that we actually free the inode when it's no
> longer needed for the prefetch?  In that case, it would only be re-created
> a moment later when the getxattr comes from FUSE, and we'd have to do a more
> complicated lookup to re-associate it with the cached values.

Will reply to this in another post.


> > 3. Probably a small overlook - there is leak of the allocated loc_t used to
> > perform the lookup.
> 
> Are you sure?  The loc_t is passed as the cookie for xp_pre_lookup_cbk, which
> does GF_FREE it.

Just GF_FREEing the loc_t still leaks the references on loc->{inode,parent} and also loc->path.
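
A self-contained sketch of the kind of cleanup needed (the structure and helper names are simplified stand-ins, not the actual loc_t API):

#include <stdlib.h>

typedef struct toy_inode { int refcount; } toy_inode_t;

typedef struct {
    toy_inode_t *inode;   /* ref held on the target inode */
    toy_inode_t *parent;  /* ref held on the parent inode */
    char        *path;    /* heap-allocated path string */
} toy_loc_t;

static void toy_inode_unref(toy_inode_t *i)
{
    if (i && --i->refcount == 0)
        free(i);
}

/* Freeing the container alone would leak the two inode refs and the
 * path string; each must be released before the loc itself. */
static void toy_loc_destroy(toy_loc_t *loc)
{
    if (!loc)
        return;
    toy_inode_unref(loc->inode);
    toy_inode_unref(loc->parent);
    free(loc->path);
    free(loc);
}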

Avati

Comment 4 Jeff Darcy 2011-05-06 21:47:50 UTC
I've incorporated the review comments into a new commit, at http://git.fedorahosted.org/git/?p=CloudFS.git;a=commit;h=600baf631510c6f095cf1f3a03fba19a5f50b87f with the following comment.

---
Move cache to xlator, fix leak, add stats.

From review suggestions.  The cache is now xlator-global and fixed size (a
mere 1024 entries which should be tunable some day).  It's managed as a
four-way set-associative cache with LRU.  I also fixed a memory leak and
added statistics accessible via getxattr(trusted.xattr.stats).  This shows
a hit rate of nearly 99.9% for my 100K-file test despite the small cache
size, with approximately three hits and one eviction per file - exactly as
it should be as the oldest entries constantly get pushed out.
---

The code was written with an eye toward adding a periodic "sweeper" to prune stale entries, but with such a small fixed-size cache (not even linear with respect to fd count like stat-prefetch is) I'm not sure it's even worth it.  What do you guys think?
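
For context, a minimal self-contained sketch of a four-way set-associative cache with per-set LRU and hit/miss/eviction counters, in the spirit of the design described in the commit message (the sizes, names, and key type are illustrative assumptions, not the actual patch):

#include <stdint.h>
#include <string.h>

#define CACHE_SETS 256   /* 256 sets x 4 ways = 1024 entries */
#define CACHE_WAYS 4

typedef struct {
    uint64_t key;          /* e.g. a hash of (gfid, xattr name) */
    char     value[256];
    uint64_t last_used;    /* LRU timestamp within the set */
    int      valid;
} cache_entry_t;

static cache_entry_t cache[CACHE_SETS][CACHE_WAYS];
static uint64_t      tick;                      /* monotonic LRU clock */
static uint64_t      hits, misses, evictions;   /* simple statistics */

static cache_entry_t *cache_lookup(uint64_t key)
{
    cache_entry_t *set = cache[key % CACHE_SETS];
    for (int i = 0; i < CACHE_WAYS; i++) {
        if (set[i].valid && set[i].key == key) {
            set[i].last_used = ++tick;
            hits++;
            return &set[i];
        }
    }
    misses++;
    return NULL;
}

static void cache_insert(uint64_t key, const char *value)
{
    cache_entry_t *set = cache[key % CACHE_SETS];
    cache_entry_t *victim = &set[0];

    for (int i = 0; i < CACHE_WAYS; i++) {
        if (!set[i].valid) {            /* prefer an empty way */
            victim = &set[i];
            goto fill;
        }
        if (set[i].last_used < victim->last_used)
            victim = &set[i];           /* otherwise track the LRU way */
    }
    evictions++;                        /* all ways full: evict the LRU way */
fill:
    victim->key = key;
    strncpy(victim->value, value, sizeof(victim->value) - 1);
    victim->value[sizeof(victim->value) - 1] = '\0';
    victim->valid = 1;
    victim->last_used = ++tick;
}

In the actual translator the counters would presumably be reported through the getxattr(trusted.xattr.stats) interface mentioned above rather than read directly.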

Comment 5 mohitanchlia 2011-06-20 17:29:51 UTC
Is this patch scheduled to be released sometime soon?

Comment 6 Anand Avati 2011-06-21 00:11:37 UTC
(In reply to comment #5)
> Is this patch scheduled to be released sometime soon?

Mohit,
  Any specific reason why you are actively awaiting this patch? Unless you have posix-acl support (not yet in mainline) or have SELinux enabled, prefetching those xattrs has little to no performance benefit.

Avati

Comment 7 mohitanchlia 2011-06-21 01:32:29 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > Is this patch scheduled to be released sometime soon?
> Mohit,
>   Any specific reasons why you are actively awaiting this patch? As long as you
> do not have posix-acl support (not yet in mainline) and have not enabled
> selinux, prefetching those xattrs have little to no benefit for performance.
> Avati

I am experiencing very slow self-healing when dealing with millions of files. While testing node failures and running self-heal, even a listing or a summary is very painful and takes hours or days to finish.

Even a simple listing of 30K directories with no files takes a very long time.

Would this help with the above directory and file listing issues?

Thanks for asking.

Comment 8 Anand Avati 2011-06-21 01:37:29 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > Is this patch scheduled to be released sometime soon?
> > Mohit,
> >   Any specific reasons why you are actively awaiting this patch? As long as you
> > do not have posix-acl support (not yet in mainline) and have not enabled
> > selinux, prefetching those xattrs have little to no benefit for performance.
> > Avati
> 
> I am experiencing very slow self healing when dealing with millions of files.
> While testing node failures and running self heal listing or even summary is
> very painful and takes hours or days to finish.
> 
> Even simple listing of 30K direcotries with no files take very very long time.
> 
> Would this help me with above directory or file listing issues?

What you are experiencing requires fixes elsewhere. You will certainly benefit from the proactive/asynchronous self-heal feature coming in 3.3. The patch under discussion in this bug will be useful only after the introduction of the posix-acl feature (or if you have SELinux enabled).

Avati

Comment 9 mohitanchlia 2011-06-21 12:53:41 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > (In reply to comment #5)
> > > > Is this patch scheduled to be released sometime soon?
> > > Mohit,
> > >   Any specific reasons why you are actively awaiting this patch? As long as you
> > > do not have posix-acl support (not yet in mainline) and have not enabled
> > > selinux, prefetching those xattrs have little to no benefit for performance.
> > > Avati
> > 
> > I am experiencing very slow self healing when dealing with millions of files.
> > While testing node failures and running self heal listing or even summary is
> > very painful and takes hours or days to finish.
> > 
> > Even simple listing of 30K direcotries with no files take very very long time.
> > 
> > Would this help me with above directory or file listing issues?
> What you are experiencing requires fixes elsewhere. You will certainly benefit
> from the proactive/asynchronous self-heal feature coming in 3.3. The patch in
> discussion in this bug will be useful only after the introduction of posix-acl
> feature (or if you have selinux enabled).
> Avati

Thanks! Is there a separate bug/enhancement to speed up directory and file listings?
I am also thinking that, with millions of files involved, I would currently hesitate to run find or ls commands if I need the set of files matching some criteria.

Comment 10 Anand Avati 2011-06-21 13:18:28 UTC
> Thanks! Is there a separate Bug/Enhancement to speed up directory/files
> listing?
> I am also thinking when million files are involved currently I would hesitate
> to run find or ls commands if I need set of files matching some criteria.

Have you not seen any benefit from stat-prefetch? Have you considered the NFS access protocol?

Avati

Comment 11 mohitanchlia 2011-06-21 14:21:33 UTC
(In reply to comment #10)
> > Thanks! Is there a separate Bug/Enhancement to speed up directory/files
> > listing?
> > I am also thinking when million files are involved currently I would hesitate
> > to run find or ls commands if I need set of files matching some criteria.
> Have you not seen any benefits of stat-prefetch? Have you considered NFS access
> protocol?
> Avati

I have not done any tuning of stat-prefetch. Do I need to tune something? I thought it was enabled by default. For various reasons I am really trying not to use NFS, and I prefer to use the native client.

Comment 12 Amar Tumballi 2012-04-28 03:10:54 UTC
Can we close it now, as we have 'md-cache' in upstream?

