Bug 903175

Summary: Possible performance issue with md-cache
Product: [Community] GlusterFS
Component: unclassified
Version: mainline
Reporter: Xavi Hernandez <jahernan>
Assignee: Brian Foster <bfoster>
QA Contact: SATHEESARAN <sasundar>
CC: bfoster, gluster-bugs, sasundar
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.5.0
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-04-17 11:40:20 UTC

Description Xavi Hernandez 2013-01-23 11:41:55 UTC
Description of problem: On a volume with md-cache and quick-read xlators active, a recursive 'ls' takes more time than the 'ls' on the same volume with these translators removed.


Version-Release number of selected component (if applicable): head of the release-3.3 branch (commit 47b2bc274a)


How reproducible: This behaviour is reproduced in all the tests I have made, with very similar times.

I'm using two servers (node2 and node3).

Steps to Reproduce:

1. Start the daemons on both nodes (/etc/init.d/glusterd start)
2. Add peers (from node2: gluster peer probe node3)
3. Create a simple replica 2 volume (gluster volume create test1 replica 2 transport tcp node2:/bricks/b01 node3:/bricks/b01)
4. Start the volume (gluster volume start test1)
5. Mount the volume (mount -t glusterfs node2:/test1 /gluster/test1)
6. Copy some data (I've made the tests with ~21000 files and directories)
7. Unmount the volume, stop the volume and stop the daemons on both nodes
8. Restart everything
9. Do the timed recursive ls (time ls -alR /gluster/test1 | wc -l)
10. Repeat the recursive ls (to see caching effects)
11. Stop everything
12. Modify volume files to remove md-cache and/or quick-read
13. Restart everything
14. Repeat the recursive ls (two times)
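
For convenience, the sequence above condensed into shell commands (a sketch only; node names, brick paths and the volume name are the ones used above, and the data-copy source path is a placeholder):

/etc/init.d/glusterd start                      # on node2 and node3
gluster peer probe node3                        # from node2
gluster volume create test1 replica 2 transport tcp \
    node2:/bricks/b01 node3:/bricks/b01
gluster volume start test1
mount -t glusterfs node2:/test1 /gluster/test1
cp -a /path/to/test-data/. /gluster/test1/      # ~21000 files and directories (placeholder source)
umount /gluster/test1                           # then stop the volume and the daemons, restart everything
mount -t glusterfs node2:/test1 /gluster/test1
time ls -alR /gluster/test1 | wc -l             # first run
time ls -alR /gluster/test1 | wc -l             # second run, to see caching effects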


Actual results:

With md-cache and quick-read:
-----------------------------

[root@node2 gluster]# time ls -alR /gluster/test1/ | wc -l
31528

real	1m2.967s
user	0m1.322s
sys	0m4.066s
[root@node2 gluster]# time ls -alR /gluster/test1/ | wc -l
31528

real	0m50.418s
user	0m1.355s
sys	0m3.788s

Without md-cache and quick-read:
--------------------------------

[root@node2 gluster]# time ls -alR /gluster/test1/ | wc -l
31528

real	0m54.795s
user	0m1.273s
sys	0m4.076s
[root@node2 gluster]# time ls -alR /gluster/test1/ | wc -l
31528

real	0m48.713s
user	0m1.262s
sys	0m3.675s

With only one of md-cache or quick-read enabled, the elapsed time falls between these two results.


Expected results:

md-cache and quick-read should not add latency to simple lookups and should be at least as fast as running without them.


Additional info:

As a side effect, I've seen that both translators keep some information in the context of each inode using a dict_t structure. This means that every file discovered consumes two dict_t structures from the memory pool. The pool is initialized with 4096 entries, so after about 2048 files it is exhausted and the log (in debug mode) fills with messages saying that a dict_t needs to be allocated from regular memory because the pool is full.

This makes the dict_t memory pool useless, as it will be allocating dict_t structures from regular memory very soon.
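
For reference, one way to observe this (assuming the statedump facility covers mem-pools on the client side) is to trigger a statedump of the fuse client process and look at the dict_t pool counters:

# Sketch only: the dump directory (/var/run/gluster or /tmp, depending on the
# version) and the exact section names in the dump may differ.
kill -USR1 $(pidof glusterfs)        # ask the fuse client to write a statedump
grep -A 4 dict_t /tmp/glusterdump.*  # check the hot/cold counts of the dict_t mem-pool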

Could this be the cause of the performance degradation when using these translators?

I'm not sure if it's important, but I compiled the code with '--disable-ibverbs --disable-georeplication' passed to configure. Both servers are connected through a dedicated 1 GbE switch. The servers are dual-core Intel Atoms with hyperthreading (CPU utilization never goes higher than 50% of one core). Bricks are 100 GB physical partitions formatted as ext4.

gluster volume info:

Volume Name: test1
Type: Replicate
Volume ID: 23f88b96-6b14-43d4-9b84-daf15216775b
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: node2:/bricks/b01
Brick2: node3:/bricks/b01

Comment 1 Brian Foster 2013-01-23 19:30:30 UTC
Running a recursive ls over a kernel source tree on a single DHT volume takes me around 26s with no client-side xlators. Adding md-cache pushes it to around 29s, and adding quick-read on top of that pushes it to 35s (quick-read alone is at 32s).

With regard to the expected quick-read results, I think we'd consider the impact of quick-read here not a bug: quick-read effectively turns some lookups into read operations, so a tradeoff should be expected.

The following are some further results from my tests focused only on md-cache (my numbers here got better for some reason):

- w/o md-cache

[root@heka-client-05 ~]# umount /mnt/; mount -t glusterfs heka-srv-01.sbu.lab.eng.bos.redhat.com:test /mnt/; time ls -alR /mnt/ | wc -l
56924

real	0m19.825s
user	0m0.404s
sys	0m1.073s

- w/ md-cache

[root@heka-client-05 ~]# umount /mnt/; mount -t glusterfs heka-srv-01.sbu.lab.eng.bos.redhat.com:test /mnt/; time ls -alR /mnt/ | wc -l
56924

real	0m22.764s
user	0m0.454s
sys	0m1.309s

I'll try to hack around md-cache a bit to see if we can reasonably discern where the hit on the initial ls originates from.

Comment 2 Brian Foster 2013-01-23 23:26:55 UTC
It seems that the "hit" in this case is due to readdirplus operations. Note that md-cache converts incoming readdir operations to readdirplus, which will induce storage/posix to run a stat on each entry and package that data back up in the reply (to be cached by md-cache). I suppose it makes sense from a file operation perspective for this to take longer, but it doesn't quite explain why the initial directory traversal seems to take slightly longer when readdirp operations are enabled, particularly if the purpose is to cache data for the expected/subsequent lookup.

One thing to note is that this effect is more pronounced when fuse attribute/entry caching is in effect. In other words, md-cache is a huge win with the fuse timeouts set to 0, but it seems to have less of an effect when using the default timeout of 1s. The data I have collected so far is below and shows the results with md-cache (with readdirp enabled and disabled) and without it, with fuse caching enabled and disabled.

I'll probably have to collect some fop stats or do some more detailed tracing to really break down what is going on here.

Command: time ls -alR /mnt/linux-3.7.4/ | wc -l

--- timeouts = 0

- !md-cache - 9m30s
- md-cache - 3m7s
- md-cache, !readdirp - 3m0s

--- timeouts = 1

- !md-cache - 21s
- md-cache - 24s
- md-cache, !readdirp - 21s
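
For reference, the timeouts above are the attribute/entry timeouts of the fuse mount, which (assuming the standard glusterfs client mount options) can be set at mount time like so:

# timeouts = 0 case; the default for both is 1 second
mount -t glusterfs -o attribute-timeout=0,entry-timeout=0 \
    heka-srv-01.sbu.lab.eng.bos.redhat.com:test /mnt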

Comment 3 Xavi Hernandez 2013-01-24 10:20:59 UTC
I've repeated my tests with fuse timeouts set to 0 and I also see a huge difference (about 2.5 times faster with md-cache). In this case the use of md-cache really matters. However, everything is much slower than with the default timeout of 1 second (more than 10 times slower).

Regarding the quick-read xlator, I understand that there should be a small slowdown if it caches some information that will be needed later; however, for simple tasks like a recursive ls, a find, an rsync with few modified files, and others, this may have a huge impact.

Consider a gluster volume with millions of files where we execute a find command. If we read 64 KB on each lookup, we will have read more than 64 GB of useless data.

Maybe it would be better to read the data when the user declares the intention to access the file, i.e. in the open fop. This has the added advantage that we can release the information when the last fd of the inode is released (this final release could be optional, to improve the performance of programs that continuously open and close files). That would help to reduce the utilization of dict_t structures (they are used to cache the GF_CONTENT_KEY value) and to reduce global memory utilization by only caching opened files.

Comment 4 Brian Foster 2013-01-24 14:20:48 UTC
(In reply to comment #3)
> I've repeated my tests with fuse timeouts set to 0 and I also appreciate a
> huge difference (about 2.5 times faster with md-cache). In this case the use
> of md-cache really matters. However it is much slower than using the default
> timeouts of 1 second (more than 10 times slower).
> 

Right. I'm not suggesting to use attribute-timeout=0, just dropping a data point that caching is a factor here.

> Regarding the quick-read xlator, I understand that there should be a small
> slowdown if it caches some information that will be needed later, however
> for simple tasks like a recursive ls, a find, a rsync with few modified
> files, and some others, this may have a huge impact.
> 
> Consider a gluster volume with millions of files where we execute a find
> command. If we read 64 KB on each lookup we will have read more than 64 GB
> of useless data.
> 

Yeah, it is certainly worth considering whether quick-read is something users want enabled in their environments (perhaps whether it should be enabled by default is another question). Note that there are several tunables that determine the maximum file size (64k default), total cache size, timeout, etc. And of course you can always disable it entirely: gluster volume set myvol quick-read disable.
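
(If that short form is not accepted by the CLI, the fully-qualified option key should work; the key name here assumes the usual performance.* namespace:)

gluster volume set myvol performance.quick-read off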

> Maybe it would be better to read the data when the user declares his
> intention to access the file, like in the open fop. This has the added
> advantage that we can release the information when the last fd of the inode
> is released (this final release could be optional to improve the performance
> of programs that continuously open and close files). That would help to
> reduce the utilization of dict_t structures (they are used to cache the
> GF_CONTENT_KEY value) and to reduce global memory utilization by only
> caching opened files.

That's a good idea. My only request is to file a separate RFE (enhancement) bug for quick-read to call out this flaw, so we can keep this focused on md-cache and separate out the analysis that has to take place for quick-read, which is a very different problem. :)

Comment 5 Xavi Hernandez 2013-01-24 16:47:12 UTC
(In reply to comment #4)
> That's a good idea. My only request is to file a separate RFE (enhancement)
> bug for quick-read to call out this flaw, so we can keep this focused on
> md-cache and separate out the analysis that has to take place for
> quick-read, which is a very different problem. :)

I've filed bug 903723 to follow this question.

Comment 6 Brian Foster 2013-01-24 21:00:38 UTC
Some interesting profile results on readdir vs. readdirp during this test:

- with readdir:

      0.30      31.88 us      23.00 us      50.00 us            399        STAT
      6.06      40.47 us      22.00 us     280.00 us           6432     READDIR

- with readdirp:

      0.00      38.33 us      34.00 us      47.00 us              3        STAT
     46.94     548.53 us      60.00 us    2613.00 us           6432    READDIRP

So readdirp is effective in reducing the number of stat calls, but unfortunately the readdirp fop is experiencing a huge spike in latency (+500us). In fact, if you calculate the time saved on stat calls (roughly 396 calls avoided * 32us = ~13ms) and compare it to the additional time spent in readdirp calls (6432 * ~500us = ~3s), it seems apparent where that time is being spent.

Doing some brute force hacking down in storage/posix in this codepath points me at posix_handle_path() and the repeated invocations thereof. This is speculative in that it's possible my hacking so far disrupted some other return codepaths that also affect latency, but gives me something to dig into further...

As an aside, a possible short term workaround here might be to support toggling the conversion of readdir to readdirp in md-cache with a translator option. That is not a real solution (I wouldn't close this bug based on that alone), but I wonder if it would be considered useful.

Comment 7 Brian Foster 2013-01-25 18:35:47 UTC
NOTE: I discovered that dht can convert a readdir to readdirp as well, so a bypass might be required there for experimentation purposes.

Some further breakdown of the codepath where I'm observing the additional latency:

- posix_readdirp_fill():
  - 2 calls to posix_handle_path():
    - at least one lstat()
  - for each directory entry:
      - posix_pstat()
      - posix_entry_xattr_fill():
        - 2 calls to posix_handle_path()
        - alloc a dict, for each xattr request (posix_lookup_xattr_fill()):
                - at least one lgetxattr(), 2 if key exists

In my tests, skipping posix_readdirp_fill() drops the latency to a level comparable with readdir (expected). Skipping only the xattr requests cuts the latency by more than half (down to ~200us).
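
As a cross-check from outside gluster, the per-entry syscall pattern (lstat/lgetxattr per directory entry) can also be observed by attaching strace to the brick daemon while the recursive ls runs on the client; a rough sketch, with the pid lookup being environment-specific:

# Attach to the brick's glusterfsd on one server (adjust the pgrep pattern to the brick path),
# run the recursive ls on the client, then interrupt strace to get the per-syscall counts.
strace -c -f -p $(pgrep -f 'glusterfsd.*bricks/b01')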

It seems like there might be some room for some micro optimizations here:

- The initial calls to posix_handle_path() are superfluous if the directory list is empty, which seems to happen once for each high-level directory read.
- We repeat the posix_handle_path() call several times per-entry. It would be interesting to see whether we could do this once and pass the resulting path down the callchain (and possibly make use of stat() vs. lstat() to avoid the readlink sequence that occurs down in posix_handle_pump()).
- Requesting the xattrs (2 of which come from posix-acl) could be unnecessary here, or made tunable at least.

That said, reducing this call down to the bare minimum per-entry stat still adds to the latency, so the effects of these approaches might be minimal. Some macro optimizations might be possible as well:

- An enhanced implementation (perhaps a standalone translator) that predicts a full directory read and preloads the full directory contents to serve subsequent readdirs could be interesting, if possible to manage the opaque offsets and whatnot.
- XFS supports a bulkstat feature; I don't know whether it would address this as it stands, but it would be interesting to see whether it could be enhanced to support a per-directory range (or a custom, user-specified list) of inodes.
- I noticed recently that fuse limits readdir requests to 4k, even when ls appears to use a 32k buffer. I have an ugly hack lying around that bumps up the readdir request size, but hasn't been tested extensively enough.


Xavier,

This all of course assumes you're observing the same behavior I observe, which according to IRC, might not be the case. When you have a chance, it would be interesting to see whether you reproduce similar profiling results.
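
In case it helps with reproducing, per-fop stats like the above can be gathered with the volume profiling support, roughly (volume name as in the original report):

gluster volume profile test1 start    # enable io-stats profiling on the volume
# ... run the recursive ls on the client ...
gluster volume profile test1 info     # dump per-fop call counts and latencies
gluster volume profile test1 stop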

Comment 8 Xavi Hernandez 2013-01-26 18:50:36 UTC
(In reply to comment #7)
> NOTE: I discovered that dht can convert a readdir to readdirp as well, so a
> bypass might be required there for experimentation purposes.
> 
Yes, that's what was giving me inconsistent results. Now I have made some tests with the minimum set of xlators and the results confirm your observations.

without readdirp

      0.17      25.03 us      21.00 us      47.00 us            318        STAT
      7.58      75.61 us      13.00 us    1655.00 us           4674     READDIR

with readdirp

      0.01      29.75 us      23.00 us      48.00 us             20        STAT
     63.59    1597.51 us      91.00 us   12273.00 us           4674    READDIRP

My tests show a very big difference between the two options, much larger than yours. Maybe it's related to the processor (I'm using Intel Atoms).

> Some further breakdown of the codepath where I'm observing the additional
> latency:
> 
> - posix_readdirp_fill():
>   - 2 calls to posix_handle_path():
>     - at least one lstat()
>   - for each directory entry:
>       - posix_pstat()
>       - posix_entry_xattr_fill():
>         - 2 calls to posix_handle_path()
>         - alloc a dict, for each xattr request (posix_lookup_xattr_fill()):
>                 - at least one lgetxattr(), 2 if key exists
> 
> In my tests, skipping posix_readdirp_fill() drops the latency to a level
> comparable with readdir (expected). Skipping only the xattr requests cuts
> the latency by more than half (down to ~200us).
> 
In my tests I've seen that, on average, for each directory entry the posix_pstat() takes ~20% of the total time. The pair of posix_handle_path() calls take ~35%, and the remaining ~45% is spent, basically, on the lgetxattr() calls.

However, there's one thing that seems weird to me. If we do not convert readdir into readdirp fops, the md-cache xlator will request the same xattrs in the lookup fop, and the posix translator seems to do very similar work to read the xattrs there (it calls posix_lookup_xattr_fill()). I'll do some more tests on this on Monday.

> It seems like there might be some room for some micro optimizations here:
> 
> - The initial calls to posix_handle_path() are superfluous if the directory
> list is empty, which seems to happen once for each high-level directory read.
> - We repeat the posix_handle_path() call several times per-entry. It would
> be interesting to see whether we could do this once and pass the resulting
> path down the callchain (and possibly make use of stat() vs. lstat() to
> avoid the readlink sequence that occurs down in posix_handle_pump()).
> - Requesting the xattrs (2 of which come from posix-acl) could be
> unnecessary here, or made tunable at least.
> 
I'm not so sure about the possibility of replacing lstat() with stat(). I'm not very familiar with this part of the code and I'm not sure what the consequences might be. The other optimizations seem good.

> That said, reducing this call down to the bare minimum per-entry stat still
> adds to the latency, so the effects of these approaches might be minimal.
> Some macro optimizations might be possible as well:
> 
> - An enhanced implementation (perhaps a standalone translator) that predicts
> a full directory read and preloads the full directory contents to serve
> subsequent readdirs could be interesting, if possible to manage the opaque
> offsets and whatnot.
> - XFS supports a bulkstat feature; I don't know whether it would address
> this as it stands, but it would be interesting to see whether it could be
> enhanced to support a per-directory range (or a custom, user-specified list)
> of inodes.
I think we cannot depend on this. It would require adding specific code to the posix xlator to detect the underlying file system and implementing more than one way of doing the same thing depending on what is detected. This would increase the complexity considerably.

> - I noticed recently that fuse limits readdir requests to 4k, even when ls
> appears to use a 32k buffer. I have an ugly hack lying around that bumps up
> the readdir request size, but hasn't been tested extensively enough.
> 
That would minimize network latency at least.

Maybe an intermediate solution could be to use the readdirp fop only to declare the intention to read the specified xattrs on a subsequent lookup. This way, the readdirp fop would return as fast as readdir (no data returned), but it would start reading the xattrs in the background. When the lookup request arrives, the data would already be available and could be returned very quickly.

> 
> Xavier,
> 
> This all of course assumes you're observing the same behavior I observe,
> which according to IRC, might not be the case. When you have a chance, it
> would be interesting to see whether you reproduce similar profiling results.
It was my fault. Once I made the tests with a bare-minimum volume, I saw the same behavior as you.

(In reply to comment #6)
> As an aside, a possible short term workaround here might be to support
> toggling the conversion of readdir to readdirp in md-cache with a translator
> option. That is not a real solution (I wouldn't close this bug based on that
> alone), but I wonder if it would be considered useful.
I think it's always useful to have the possibility to tweak the behavior of a feature. However, as you say, this shouldn't be the solution to this bug.

Comment 9 Vijay Bellur 2013-01-26 19:37:37 UTC
CHANGE: http://review.gluster.org/4432 (storage/posix: skip path construction when dentry list is empty) merged in master by Anand Avati (avati)

Comment 10 Vijay Bellur 2013-01-29 01:26:51 UTC
CHANGE: http://review.gluster.org/4440 (performance/md-cache: add force-readdirp flag to make readdirp configurable) merged in master by Anand Avati (avati)

Comment 11 Brian Foster 2013-01-29 14:40:25 UTC
I submitted a few minor changes (above), one of which was removing the readdir->readdirp conversion in dht. Avati explained that this has a purpose beyond caching, so that particular change was dropped.
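
For reference, with the force-readdirp change referenced in comment 10, the readdir->readdirp conversion in md-cache should be controllable per volume along these lines (the exact option key is assumed from the change subject and may differ):

gluster volume set test1 performance.force-readdirp off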

In doing some tracing, I discovered that md-cache is basically ineffective during the first traversal of the directory structure (e.g., a lookup following the first readdirp of each entry is a "miss" and the lookup proceeds to the server). Avati pointed out that this is a known issue with at least one currently pending fix:

http://review.gluster.org/3483

What looks to be the case is that the internal inodes are constructed in the client translator and not linked into the inode table when they are passed back up to fuse-bridge. So md-cache performs all the right operations and receives the relevant data, but the framework can't actually track it in the inode until that inode has received a real lookup. This mostly explains the added cost of the initial ls observed when using md-cache, as we do a bunch of work to collect extra information on dentries and throw it all away.

The immediate/preferred fix is gluster support for the recently merged fuse readdirp request support:

http://review.gluster.org/#change,3905

... but the first change referenced above is also valid for older fuse clients or clients where fuse is absent.

Comment 12 Niels de Vos 2014-04-17 11:40:20 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.0, please reopen this bug report.

glusterfs-3.5.0 has been announced on the Gluster Developers mailing list [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user