Bug 850426 - gfs2: Add xgetdents syscall to the kernel
Status: ASSIGNED
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Assigned To: Abhijith Das
QA Contact: Fedora Extras Quality Assurance
Keywords: FutureFeature, Reopened
Depends On:
Blocks:
Reported: 2012-08-21 10:51 EDT by Abhijith Das
Modified: 2015-03-25 08:14 EDT
CC: 11 users

Doc Type: Enhancement
Last Closed: 2012-08-21 11:32:24 EDT
Type: Bug


Attachments
All attachments were posted by Abhijith Das.

First bash at a patch (47.08 KB, patch) - 2012-08-21 10:55 EDT
Another revision (28.14 KB, patch) - 2012-09-25 10:36 EDT
Another revision (40.98 KB, patch) - 2012-10-11 13:12 EDT
Updated to latest upstream kernel (nmw tree) (41.14 KB, patch) - 2013-01-15 10:23 EST
Updated patch (41.31 KB, patch) - 2013-01-17 09:06 EST
seekwatcher graph (121.62 KB, image/png) - 2013-01-17 10:28 EST
Newer version of patch with vector of pages storage (47.58 KB, patch) - 2013-03-11 09:57 EDT
David Howells' xstat patch - vfs only bits (25.72 KB, patch) - 2013-04-03 14:11 EDT
0001-vfs-xgetdents-syscall-hooks-and-xreaddir-fop.patch (5.94 KB, patch) - 2013-04-03 14:13 EDT
0002-gfs2-vector-of-pages-buffer.patch (8.24 KB, patch) - 2013-04-03 14:13 EDT
0003-gfs2-sort-with-opaque-context.patch (2.16 KB, patch) - 2013-04-03 14:14 EDT
0004-gfs2-xreaddir-op-and-supporting-functions.patch (49.42 KB, patch) - 2013-04-03 14:14 EDT
Test program to run xgetdents syscall against a directory (20.83 KB, text/x-csrc) - 2013-04-03 14:17 EDT
Program to create a bunch of entries in a directory w/ xattrs (2.09 KB, text/x-csrc) - 2013-04-03 14:19 EDT
0002-gfs2-vector-of-pages-buffer.patch (8.25 KB, patch) - 2013-04-08 12:21 EDT
0004-gfs2-xreaddir-op-and-supporting-functions.patch (49.19 KB, patch) - 2013-04-08 12:22 EDT
blktrace comparison of readdir+stat+getxattr vs readdirplus (107.38 KB, image/png) - 2013-04-08 17:18 EDT
Another variant that adds a syscall to do readahead on the relevant blocks (13.66 KB, patch) - 2013-12-10 10:57 EST
Updated readahead patch with tunable limits (14.67 KB, patch) - 2014-01-21 18:04 EST
First of 3 patch set (2.68 KB, patch) - 2014-06-17 07:13 EDT
Second patch (4.43 KB, patch) - 2014-06-17 07:14 EDT
Third patch (13.84 KB, patch) - 2014-06-17 07:35 EDT
Revised readahead syscall patch after discussions with swhiteho (13.42 KB, patch) - 2014-06-27 06:05 EDT
Updated readahead syscall with buffer allocation in the VFS (13.85 KB, patch) - 2014-07-10 23:39 EDT
Variant of the previous patch with allocation of buffer in the filesystem (15.11 KB, patch) - 2014-07-10 23:43 EDT
Latest revision of the xgetdents patches (24.98 KB, application/gzip) - 2014-07-25 04:03 EDT
Revised dirreadahead patches (6.23 KB, application/gzip) - 2014-07-25 04:05 EDT
Program to test the readahead system call (1.75 KB, text/x-csrc) - 2014-07-25 04:07 EDT
Script to automate runs of xgetdents/dirreadahead syscalls and measure performance (5.85 KB, text/plain) - 2014-07-25 04:08 EDT
Comparison of the various syscalls for different sized directories (83.07 KB, image/png) - 2014-07-25 04:11 EDT
seekwatcher graph of 'ls -l' vs xgetdents() with various tunable values (409.86 KB, image/png) - 2014-07-25 04:12 EDT
seekwatcher graph of 'ls -l' vs dirreadahead() with various tunable values (420.54 KB, image/png) - 2014-07-25 04:13 EDT
Multithreaded userspace program to readahead directory inodes (11.30 KB, text/x-csrc) - 2014-10-17 13:34 EDT

Description Abhijith Das 2012-08-21 10:51:32 EDT
Description of problem:
Add an xgetdents() syscall (a.k.a. readdirplus) to the Linux kernel to return xstat info (based on David Howells' recent xstat patches) along with dirents.
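
For illustration, a rough sketch of what such an interface might look like from userspace; the struct layout and prototype below are placeholders in the spirit of dhowells' xstat work, not the definitions used by the attached patches.

/* Hypothetical sketch only -- names and layout are placeholders,
 * not the definitions used by the attached patches. */
struct linux_xdirent {
        unsigned long  xd_ino;     /* inode number */
        unsigned long  xd_off;     /* offset to the next entry */
        unsigned short xd_reclen;  /* length of this record */
        unsigned char  xd_type;    /* file type */
        struct xstat   xd_stat;    /* per-entry stat data (dhowells' xstat) */
        char           xd_name[];  /* entry name; xattr data may follow */
};

/* Fill 'buf' with up to 'count' bytes of directory entries from 'fd',
 * each carrying stat (and optionally xattr) information. */
long xgetdents(int fd, struct linux_xdirent *buf, size_t count, unsigned int flags);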
Comment 1 Abhijith Das 2012-08-21 10:55:36 EDT
Created attachment 605960 [details]
First bash at a patch

This patch also includes the vfs and ext4 bits of David Howells' xstat patch (patches 1of6 and 2of6) and is against Steve Whitehouse's latest nmw tree.
Comment 2 Dave Jones 2012-08-21 11:32:24 EDT
we don't add syscalls until they're upstream.
Comment 3 Abhijith Das 2012-09-25 10:36:28 EDT
Created attachment 617073 [details]
Another revision

This is another revision of the patch. It does not contain the xstat bits from dhowells' patchset, just the vfs bits for xgetdents and supporting bits in gfs2. Please refer to dhowells' patches here: https://lists.samba.org/archive/samba-technical/2012-April/082906.html

Things to do:
1. See if it's possible to avoid allocating and deallocating xattrs repeatedly. Possibly bypass the whole xdirent_info struct and directly manipulate the __user linux_xdirent struct.

2. Possibly add a new readdirplus.c and header file and move all relevant stuff there to make things cleaner and easier to read.

3. Add bits to other fs' so they support this syscall too.
Comment 4 Steve Whitehouse 2012-09-25 10:53:54 EDT
Some comments:

It is not a good idea to look up the inodes in do_filldir_main() since you are only getting a few inodes at a time on each call, and ideally we want to be working on larger batches. Also, I can't see anything which sorts the inodes to be looked up, so it appears we are looking up in hash order rather than disk block order. One of the main benefits of doing this as a syscall is that we should be able to use an ordering which is beneficial to the filesystem.

It would be better to do the directory reading into a reasonably sized buffer to start with, and then to do all the lookups afterwards. That will mean that a more optimal ordering can be achieved as well as reducing the possibility of any locking issues. The filesystem needs the freedom to order things in the way best suited to it.

In addition it ought to be possible to send all the lookups off at the same time, in order to reduce the i/o latency, rather than doing them sequentially. That should provide a big performance win, although it also increases complexity due to the lock ordering issues.
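
A minimal sketch of the two-phase scheme being suggested here (collect the dirents first, then sort by disk location, then look up as a batch); every function and type name below is illustrative rather than taken from the patch.

/* Illustrative only -- not the actual patch code. */
static int xreaddir_two_phase(struct file *dir, struct xdir_buf *buf)
{
        int i, err;

        /* Phase 1: read raw dirents into a kernel buffer (held under
         * the directory glock for as short a time as possible). */
        err = collect_dirents(dir, buf);
        if (err)
                return err;

        /* Phase 2: sort by inode disk block so lookups happen in disk
         * order, then issue the lookups as one batch (ideally sending
         * them off together to hide the i/o latency). */
        sort_by_disk_block(buf);
        for (i = 0; i < buf->nr_entries; i++) {
                err = lookup_one_entry(dir, buf, i);
                if (err)
                        break;
        }
        return err;
}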
Comment 5 Abhijith Das 2012-10-11 13:12:37 EDT
Created attachment 625592 [details]
Another revision

Steve,

This revised patch hopefully addresses some of the issues you mentioned. When you get a chance, can you review and comment?

Thanks!
--Abhi
Comment 6 Steve Whitehouse 2012-10-17 05:18:02 EDT
The big question is what kind of speed up you get with this? The larger it is, the easier it will be to make an argument for including it upstream.

Some other comments:

 - You don't need __inline__ in the kernel, inline will do on its own
 - Probably some of those inline functions don't actually need to be inlined anyway. The compiler will inline small functions automatically in many cases.
 - I'd prefer not to go back to using the _i suffix for functions, but to name them in a way which makes clear what they do
 - void pointers don't need casts to/from other pointer types
 - We probably need to look again at the gfs2_alloc/realloc/free functions... we should not be using vmalloc in places which are performance critical at all, since vmalloc can be very slow. Can we have a system of using a vector of pages, or does that give us problems with respect to sorting?

Overall though this looks a lot cleaner than the previous version and I think it is moving in the right direction.

One further thought... rather than copying directly to user space, would it be easier if we had a buffer (per struct file, for example) which would be used as temporary space into which to write results? That way larger requests could be made to the fs, even if userspace only asked for small amounts at a time. In addition it may help to resolve some of the buffer sizing issues that you've run into. The downside is that it increases complexity and also we'd need to know when to invalidate the buffer in case of changes in the directory. But something to consider perhaps...
Comment 7 Abhijith Das 2012-10-17 11:51:18 EDT
IRC chat with Steve - more details regarding his feedback in comment #6

<abhi``> swhiteho, Thanks for taking a look at the xgetdents patch again. The minor things are easy to fix, but I'm a bit unclear about a couple of things: 1. The per-struct-file buffer idea to use as temporary space and 2. alloc/realloc/free functions and using vector of pages instead. Can you elaborate a bit?
<swhiteho> ok, I'll see if I can explain a bit better :-)
<swhiteho> 1. my thought is "copy into buffer in large chunks while holding the glock" and "copy to user from buffer in chunks specified by the user while not holding the glock"
<swhiteho> so if the user asks for less than the buffer size, you can leave the buffer on the struct file and if they then don't lseek before requesting the next chunk of data, you don't need a new call to the fs to copy it out
<swhiteho> that means the buffer size to the fs can be a known fixed size
<swhiteho> which may make the fs implementation easier
<swhiteho> or it may slow things down due to an extra buffer copy
<swhiteho> so its tricky to know what is best, but it might be worth experimenting with
<swhiteho> 2. kmalloc is basically "allocate a load of pages, and use the page tables to map them contiguously into kernel memory space"
<swhiteho> s/kmalloc/vmalloc
<abhi`> swhiteho, it might not slow down much, as we're copying most of the stuff into a temp kernel buffer anyway... the one that hangs off of the ctx struct
<swhiteho> in that case, it may not be even that much of a slow down then I guess
<abhi`> ok, so the buffer hangs from the struct file to the directory in question
<swhiteho> one of the issues with vmalloc is that there are various slow things compared with kmalloc
<swhiteho> one is setting up and tearing down the vm mappings, and another is that you land up using up TLB entries when accessing the memory
<swhiteho> so its ok for large hash tables for example, but for anything where performance is an issue, it is better to just get a vector of pages directly
<swhiteho> kmalloc also is layered on top of the page allocator, but via slab (or similar) and there is no virtual address translation involved
<abhi`> right
<swhiteho> now we can ask for higher order pages too, but those have a decreasing probability of being available as the order increases
<swhiteho> so if we could use just a list of single pages, that should be a better compromise perhaps
<abhi`> I see, certainly worth experimenting with.
<abhi`> swhiteho, so here's a question, if the amount of allocation I need is greater than KMALLOC_MAX_SIZE, but I still call kmalloc(), will it fail? And if it does, does it mean there are no more pages available to allocate?
<abhi`> If so, wouldn't it affect the vector-of-pages method similarly?
<swhiteho> if it is greater than KMALLOC_MAX_SIZE I suspect it will just fail anyway
<swhiteho> since that depends on the largest slab cache
<swhiteho> vector of pages just needs individual pages to be free, so that they don't need to be contiguous pages
<abhi`> ah, I got confused. I was thinking vmalloc when I wrote kmalloc above. It makes sense now. So I'll have to have some sort of header in each page to say how many entries are in it and so on.
<swhiteho> something like that, or just have an array of (page pointer, size) type tuples
<abhi`> yeah, or that. Ok, I'm going to try this out - have a page vector hanging off of the struct file of the directory.
<abhi`> I think it'll actually make my life easier without all the alloc/realloc
<swhiteho> abhi`: yes, I suspect it might
<swhiteho> abhi`: did you get any timings btw?
<swhiteho> would be interesting to know how much faster it is than readdir + stat
<abhi`> swhiteho, I had half a mind to do it, but wasn't sure if there were more changes needed, so I wanted to wait for a round of feedback+changes
<swhiteho> worth doing anyway I think, just to check that it is working as hoped
<abhi`> swhiteho, I did write up a test program and test that it works as it's designed. But have not stress-tested it or anything. I will write up a script to compare numbers for this vs readdir+stat+getattr and also vs the new page-buffer method when it's ready
<swhiteho> ok, sounds good
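
To make the page-vector idea above concrete, here is a rough sketch assuming a simple array of individually allocated pages and hand-rolled offset arithmetic; the names are illustrative, not the vbuf code that appears in the later patches. Note that in this simple form a record must not straddle a page boundary.

#include <linux/mm.h>
#include <linux/slab.h>

/* Illustrative sketch of a page-vector buffer, not the actual vbuf code. */
struct page_vec {
        struct page **pages;
        unsigned int nr_pages;
        size_t used;                    /* bytes written so far */
};

static int pvec_alloc(struct page_vec *pv, unsigned int nr_pages)
{
        unsigned int i;

        pv->pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
        if (!pv->pages)
                return -ENOMEM;
        for (i = 0; i < nr_pages; i++) {
                pv->pages[i] = alloc_page(GFP_KERNEL);
                if (!pv->pages[i])
                        return -ENOMEM; /* caller frees what was allocated */
        }
        pv->nr_pages = nr_pages;
        pv->used = 0;
        return 0;
}

/* Translate a linear offset into the right page and offset within it. */
static void *pvec_ptr(struct page_vec *pv, size_t off)
{
        return page_address(pv->pages[off >> PAGE_SHIFT]) + (off & ~PAGE_MASK);
}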
Comment 8 Abhijith Das 2013-01-15 10:23:58 EST
Created attachment 678841 [details]
Updated to latest upstream kernel (nmw tree)
Comment 9 Abhijith Das 2013-01-17 09:06:48 EST
Created attachment 680222 [details]
Updated patch

This is an updated version of the previous patch; it fixes a couple of issues I found while testing.
The results of this patch are not as expected though. It seems to be quite inefficient when compared to simply doing getdents() followed by open() and fxstat()+flistxattr()+fgetxattr() on every entry.

The test was on a loopback-device-backed gfs2 fs using lock_nolock.
For a set of 5000 files in a single directory with up to 25 xattrs per file,
readdirplus(xgetdents) took

real	0m17.977s
user	0m0.001s
sys	0m13.863s

whereas, getdents()+open()+fxstat()+flistxattr()+fgetxattr() took

real	0m3.685s
user	0m0.072s
sys	0m3.230s

There could be a bug in the patch somewhere that could explain this or the implementation itself might be inefficient (memory allocation or locking, perhaps).
Comment 10 Steve Whitehouse 2013-01-17 09:09:39 EST
Glad to see that it is working now - maybe try a blktrace/seekwatcher to compare the two cases?
Comment 11 Abhijith Das 2013-01-17 10:28:33 EST
Created attachment 680282 [details]
seekwatcher graph

The bottom graph corresponds to the readdirplus patch and the top graph is using the conventional set of syscalls.
Comment 12 Abhijith Das 2013-01-17 17:51:03 EST
Some instrumentation -

[root@smoke-05 userland]# perf stat -a -e kmem:kmalloc time ./readdirtest -q /mnt/gfs2/foo
0.00user 18.85system 0:28.39elapsed 66%CPU (0avgtext+0avgdata 1444maxresident)k
85952inputs+0outputs (0major+385minor)pagefaults 0swaps

 Performance counter stats for 'time ./readdirtest -q /mnt/gfs2/foo':

         1,060,423 kmem:kmalloc                                                

      28.401295160 seconds time elapsed

/var/log/messages:
[19509.681640] ALL DONE. Printing kmalloc/krealloc/kfree stats
[19509.687887] total kmalloc duration: 11.321093680 seconds
[19509.693831] total krealloc duration: 0.368829520 seconds
[19509.699787] total kfree duration: 7.982661000 seconds
[19509.705559] total op counts - kmalloc:1035948   krealloc:31129   kfree:1063625

Allocation/deallocation is taking up most of the time.
Comment 13 Abhijith Das 2013-03-11 09:57:31 EDT
Created attachment 708389 [details]
Newer version of patch with vector of pages storage

This is a draft version (still needs some cleanup and a couple of features) of readdirplus with a dynamic buffer backed by a vector of pages to store intermediate data.

This scheme is faster than using kmalloc/krealloc/kfree etc. For a 1000-file directory, here's the comparison against conventional readdir()+stat()+getxattr(). The conventional method corresponds to the '-r' switch in the readdirtest program.

Cold cache:
[root@smoke-05 gfs2-3.0-nmw]# time /root/userland/readdirtest -q /mnt/gfs2/foo

real	0m1.536s
user	0m0.001s
sys	0m1.039s

[root@smoke-05 gfs2-3.0-nmw]# time /root/userland/readdirtest -rq /mnt/gfs2/foo

real	0m2.202s
user	0m0.021s
sys	0m1.674s

Cached results:
[root@smoke-05 gfs2-3.0-nmw]# time /root/userland/readdirtest -q /mnt/gfs2/foo

real	0m0.099s
user	0m0.000s
sys	0m0.099s

[root@smoke-05 gfs2-3.0-nmw]# time /root/userland/readdirtest -rq /mnt/gfs2/foo

real	0m0.639s
user	0m0.019s
sys	0m0.620s
Comment 14 Abhijith Das 2013-04-03 14:09:53 EDT
Posting the latest version of my readdirplus work rebased to the latest gfs2-3.0-nmw tree. I've split it into 4 parts. I'm also attaching David Howells' not-yet-upstream xstat patch (only the relevant vfs bits) for completeness.

dhowells-xstat-vfs-only.patch
0001-vfs-xgetdents-syscall-hooks-and-xreaddir-fop.patch
0002-gfs2-vector-of-pages-buffer.patch
0003-gfs2-sort-with-opaque-context.patch
0004-gfs2-xreaddir-op-and-supporting-functions.patch

There's a known bug in this patchset: when the system is under memory pressure, some of the entries of a directory are skipped and not returned. On my test machine, I can reproduce this when I set the collection buffer to unlimited (using the tunables) and run the syscall on a dir with 500k files. I'm still working to fix this issue.

New features:

- small improvements to vector of pages buffer code (I call it vbuf). I'm still not sure if this is the best way to implement such a thing. I don't know if there's a way to translate these random pages into a linear address space without much cost. Couldn't find anything suitable in the kernel, so I'm doing the arithmetic myself.

- added 2 tunables to limit the amount of memory that the syscall uses and all associated code to manage collecting/outputting data in chunks. By adjusting these tunables, one can also tweak performance. At some future point, we should teach the syscall to adapt to memory conditions/data sizes and choose appropriate collection strategies.

- added ctx_sort function (based on lib/sort.c) that accepts an opaque context that it passes to the compare() and swap() methods. This is used to sort the dirent ptr array in the vbuf because the vbuf doesn't have a linear address space and special things need to be done to access the pointers.
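
For reference, the context-aware sort described above might look roughly like this; the ctx_sort signature and the xdirent header/field names are illustrative, and the compare callback goes through the page-vector arithmetic (pvec_ptr from the earlier sketch) to dereference each entry rather than using plain pointers.

/* Illustrative only -- approximates the ctx_sort idea, not the patch. */
typedef int (*ctx_cmp_func_t)(void *ctx, const void *a, const void *b);
typedef void (*ctx_swap_func_t)(void *ctx, void *a, void *b, int size);

void ctx_sort(void *base, size_t num, size_t size, void *ctx,
              ctx_cmp_func_t cmp_func, ctx_swap_func_t swap_func);

/* Example compare callback: 'a' and 'b' are linear offsets into a
 * page-vector buffer rather than real pointers. */
static int cmp_by_inode_block(void *ctx, const void *a, const void *b)
{
        struct page_vec *pv = ctx;
        const struct xdirent_hdr *da = pvec_ptr(pv, *(const size_t *)a);
        const struct xdirent_hdr *db = pvec_ptr(pv, *(const size_t *)b);

        if (da->inode_blkno < db->inode_blkno)
                return -1;
        return da->inode_blkno > db->inode_blkno;
}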
Comment 15 Abhijith Das 2013-04-03 14:11:21 EDT
Created attachment 731269 [details]
David Howells' xstat patch - vfs only bits
Comment 16 Abhijith Das 2013-04-03 14:13:12 EDT
Created attachment 731270 [details]
0001-vfs-xgetdents-syscall-hooks-and-xreaddir-fop.patch
Comment 17 Abhijith Das 2013-04-03 14:13:43 EDT
Created attachment 731272 [details]
0002-gfs2-vector-of-pages-buffer.patch
Comment 18 Abhijith Das 2013-04-03 14:14:15 EDT
Created attachment 731273 [details]
0003-gfs2-sort-with-opaque-context.patch
Comment 19 Abhijith Das 2013-04-03 14:14:40 EDT
Created attachment 731274 [details]
0004-gfs2-xreaddir-op-and-supporting-functions.patch
Comment 20 Abhijith Das 2013-04-03 14:17:50 EDT
Created attachment 731278 [details]
Test program to run xgetdents syscall against a directory

 Usage: /root/userland/readdirtest [OPTION] <dir>

 -b SIZE   Buffer size in KB (default 1MB)
 -r        Use readdir instead of readdirplus (default)
 -l LEVEL, where LEVEL:
           0 - Only entries
           1 - entries and stat
           2 - entries, stat and xattr keys
           3 - entries, stat, xattr keys and values (default) 
 -c        Compact output - single line for each file
 -q        Quiet... don't output information, just process
Comment 21 Abhijith Das 2013-04-03 14:19:25 EDT
Created attachment 731279 [details]
Program to create a bunch of entries in a directory w/ xattrs

Usage: ./create_readdir_dir <topdir> <seed> <num files> <max xattrs/file>
Comment 22 Abhijith Das 2013-04-08 12:21:56 EDT
Created attachment 732739 [details]
0002-gfs2-vector-of-pages-buffer.patch

Fixed a few minor issues reported by Bob when he reviewed the patch set.
Comment 23 Abhijith Das 2013-04-08 12:22:53 EDT
Created attachment 732740 [details]
0004-gfs2-xreaddir-op-and-supporting-functions.patch

Fixed a few minor issues reported by Bob when he reviewed the patch set.
Comment 24 Abhijith Das 2013-04-08 17:18:39 EDT
Created attachment 732845 [details]
blktrace comparison of readdir+stat+getxattr vs readdirplus

On the left is readdir()+stat()+getxattr() and on the right is readdirplus()
Comment 25 Abhijith Das 2013-12-10 10:57:35 EST
Created attachment 834840 [details]
Another variant that adds a syscall to do readahead on the relevant blocks

This is a first-attempt patch at a syscall that does readahead on the inode blocks of a directory's entries.
A couple of things are missing:
1. A way to limit the amount of readahead so we don't run out of memory.
2. xattr blocks are not read ahead yet.

With this patch, the 'ls -l' (readdir() + stat() on each entry) operation is significantly faster on a directory with 100000 files.

Without the patch:
[root@skol-04 gfs2-3.0-nmw]# echo 3 > /proc/sys/vm/drop_caches
[root@skol-04 gfs2-3.0-nmw]# sleep 5
[root@skol-04 gfs2-3.0-nmw]# mount -t gfs2 /dev/loop0 /mnt/gfs2
[root@skol-04 gfs2-3.0-nmw]# time ls -l /mnt/gfs2/foo > /dev/null

real	8m39.396s
user	0m1.376s
sys	0m3.424s
[root@skol-04 gfs2-3.0-nmw]# umount /mnt/gfs2

With the patch:
[root@skol-04 gfs2-3.0-nmw]# echo 3 > /proc/sys/vm/drop_caches
[root@skol-04 gfs2-3.0-nmw]# sleep 5
[root@skol-04 gfs2-3.0-nmw]# mount -t gfs2 /dev/loop0 /mnt/gfs2
[root@skol-04 gfs2-3.0-nmw]# time /root/ra_test /mnt/gfs2/foo
Return value from rddir_readahead: 0

real	0m22.606s
user	0m0.000s
sys	0m0.321s
[root@skol-04 gfs2-3.0-nmw]# time ls -l /mnt/gfs2/foo > /dev/null

real	0m5.074s
user	0m1.195s
sys	0m2.662s
[root@skol-04 gfs2-3.0-nmw]# umount /mnt/gfs2
Comment 26 Abhijith Das 2014-01-21 18:04:39 EST
Created attachment 853502 [details]
Updated readahead patch with tunable limits
Comment 27 Steve Whitehouse 2014-02-03 05:56:01 EST
The general structure of this looks good. Just a few comments:

 - The naming of the call could perhaps be better, but we can figure that out later
 - gfs2_rddir_ra_inodes() doesn't appear to do the right thing... looks like it's using the directory's address space, and not actually reading in the inodes correctly, so it is not doing the readahead properly. This bit should look like a batch ->lookup() since that is what it is doing.
 - I don't want to add any more items to the tunable structures, so we need to find a way to set that limit automatically; however, for the purposes of gathering some performance info this should be perfectly ok.
Comment 28 Abhijith Das 2014-06-17 07:13:29 EDT
Created attachment 909528 [details]
First of 3 patch set

VFS interface for the new rddir_readahead system call
Comment 29 Abhijith Das 2014-06-17 07:14:37 EDT
Created attachment 909529 [details]
Second patch

GFS2 - rename delete workqueue to ra_and_del so we can use it to do offline inode readaheads as well
Comment 30 Abhijith Das 2014-06-17 07:35:13 EDT
Created attachment 909534 [details]
Third patch

GFS2's implementation of the rddir_ra syscall.

This 3-patch set is a preview... still needs a bit of work:
1. locking around the rra struct
2. error handling - currently works only in the happy path
3. some way to limit the amount of memory consumed
4. A notification mechanism to userspace when readahead is complete

Having said that, I see some promising results with this patch:

On a gfs2 fs with a directory 'foo' with 100000 files, without the patch, a 'ls -l' on 'foo' takes over 4 minutes with a cold cache.

[root@skol-04 gfs2-3.0-nmw]# time ls -l /mnt/gfs2/foo 2>&1 > /dev/null

real	4m10.660s
user	0m0.992s
sys	0m4.163s

With the patch, I see the following:
[root@skol-04 gfs2-3.0-nmw]# time /root/ra_test /mnt/gfs2/foo

real	0m30.169s
user	0m0.000s
sys	0m0.146s

^^ In this bit, we go through the directory collecting all the inode numbers contained in it and pass on the collection to the workqueue to look them up. (Should we delegate the inode number collection to the workqueue as well, instead of doing it synchronously?)

[root@skol-04 gfs2-3.0-nmw]# time sleep 45

real	0m45.001s
user	0m0.000s
sys	0m0.000s

^^ This is an arbitrary sleep period to cover the time it takes to look up all the 100000 inodes. Currently, there's no mechanism to notify the userland caller that readahead is complete, but when that feature is added, this won't be necessary. Looking at kernel printks at appropriate locations, I see that the lookups take about 36 seconds, i.e.:

[90583.996953] Start gfs2_rddir_ra syscall
[90614.126061] Collected 100000 entries
[90614.129643] End gfs2_rddir_ra syscall
[90640.523517] fs/gfs2/rddir_ra.c gfs2_rddir_ra_inodes 109: Read in 100000 inodes

Another thing to note here is that it's not necessary to sleep or do something else while the readahead is being performed. If you issue an 'ls -l' or similar command immediately after the call to rddir_ra, it will still be faster than doing a regular 'ls -l', but not quite as fast as it can get, given that not all inodes have been read in yet. This delay can be controlled by the user by adjusting the max number of inode-readaheads requested.

[root@skol-04 gfs2-3.0-nmw]# time ls -l /mnt/gfs2/foo 2>&1 > /dev/null

real	0m2.219s
user	0m0.712s
sys	0m1.431s

^^ Finally, with all the inodes read in, 'ls -l' takes 2.2 seconds. The total time taken is 1m8s. Considerably faster than 'ls -l' by itself.
Comment 31 Steve Whitehouse 2014-06-24 09:26:49 EDT
RE: comment #28

The patch looks good generally, but we might want to think a bit more about the naming: dir_readahead() might be better or even dirreadahead() since there are few syscalls with '_' in them.

RE: comment #29

The naming of the workqueue again is a bit ugly, but otherwise I think it looks ok.

RE: comment #30

Results look promising I think. Some comments on the code:

If the buffer size is greater than the max size for kmalloc, then we should probably limit it to that size (which is pretty large these days) since otherwise we may land up always using __vmalloc, which is bad for performance.

Probably you could also fit the struct gfs2_rddir_ra into the buffer, and thus avoid one of the allocations too.

Is it not possible to use the directory read ctx to avoid needing to pass down an extra argument? Even if you have to add a field to this structure that would be a cleaner solution.

What does the HASHCOLL bit do?

It looks like you are still iterating over the inodes individually though, even though they are on the work queue. I would suggest sorting them in gfs2_rddir_ra() and then, for each inode, allocating a structure with the workqueue head and inode number and submitting it to the workqueue. If kmalloc returns NULL at any stage, you can always stop early, since you are returning the number of items read, and I assume also updating the current position.

That way you'll land up with the maximum number of things happening in parallel. If you need to use a workqueue specifically for readahead, then that would be better than trying to force it into the existing delete workqueue if it is going to be a pain to do that.

However, all in all looking promising I think, and very glad to see that you are getting good performance results already.
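
A minimal sketch of the per-inode work item scheme suggested above: sort the collected inode numbers, then queue one small work struct per inode so the lookups proceed in parallel on the workqueue. The structures and the fs_iget_by_inum() helper are stand-ins, not the actual gfs2 code.

/* Illustrative only. */
struct ra_work {
        struct work_struct work;
        struct super_block *sb;
        u64 ino;
};

static void ra_work_func(struct work_struct *w)
{
        struct ra_work *rw = container_of(w, struct ra_work, work);
        struct inode *inode;

        /* fs_iget_by_inum() is a stand-in for the filesystem's lookup by
         * inode number; pulling the inode in is the readahead effect. */
        inode = fs_iget_by_inum(rw->sb, rw->ino);
        if (!IS_ERR(inode))
                iput(inode);
        kfree(rw);
}

/* Called for each collected inode number, already sorted by disk order. */
static int queue_one_ra(struct workqueue_struct *wq,
                        struct super_block *sb, u64 ino)
{
        struct ra_work *rw = kmalloc(sizeof(*rw), GFP_NOFS);

        if (!rw)
                return -ENOMEM; /* caller can stop early here */
        rw->sb = sb;
        rw->ino = ino;
        INIT_WORK(&rw->work, ra_work_func);
        queue_work(wq, &rw->work);
        return 0;
}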
Comment 32 Abhijith Das 2014-06-27 06:05:27 EDT
Created attachment 912745 [details]
Revised readahead syscall patch after discussions with swhiteho

This patch still needs a bit of cleanup and splitting into logical pieces.
Here's what is new, compared to the previous version:

1. Renaming the syscall to something less ugly: dir_readahead

2. Left the delete_wq alone and created a new one called dir_ra_wq for async inode lookups.

3. Use the dir_context struct to preserve state needed to perform operation. By passing around the state in this struct, we also avoid adding a new parameter to a bunch of functions in fs/gfs2/dir.c

4. Allocate and deallocate the above state in the VFS (fs/readdir.c)

5. Once inode numbers are collected, create a new work struct for every inode lookup and add them to the workqueue.

There are a few things that I'd like to change about this patch:

a) The readahead context (buf, count, size, req) stored in the dir_context struct seems too fs-specific. I think it might be a better idea to simply have a 'void *opaque' pointer in dir_context and have the fs alloc/dealloc it and fill it with whatever content it needs.

b) Another (minor) benefit for gfs2 from a) is that we can add a flag for the hash-collision situation while collecting inode numbers, so we can persist until we get the entire set of requested inodes. Otherwise, we'll end up doing short reads of the dirents every time we encounter a dirent hash collision.

c) We need the gfs2 superblock in the workqueue to perform inode lookups. What happens if the superblock disappears (unmount) while the workqueue is still processing inodes for that fs? Does it get initialized to NULL? Will a simple check for NULL suffice?

Regardless, I'm still seeing a fair bit of speedup with this patch for the same 100000 file directory.

[root@skol-04 ~]# modprobe -r gfs2
[root@skol-04 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@skol-04 ~]# sleep 5
[root@skol-04 ~]# mount -t gfs2 /dev/skolo/50g /mnt/gfs2/
[root@skol-04 ~]# time /root/ra_test /mnt/gfs2/foo
   offset: 0
   return err code: 5732
   offset: 122920247
   return err code: 84562
   offset: 1942302922
   return err code: 9706
   offset: 2147480613
   return err code: 0

real	0m47.085s
user	0m0.000s
sys	0m0.219s
[root@skol-04 ~]# time ls -l /mnt/gfs2/foo 2>&1 > /dev/null

real	0m5.042s
user	0m0.734s
sys	0m1.354s
[root@skol-04 ~]# umount /mnt/gfs2
[root@skol-04 ~]# modprobe -r gfs2
[root@skol-04 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@skol-04 ~]# mount -t gfs2 /dev/skolo/50g /mnt/gfs2/
[root@skol-04 ~]# time ls -l /mnt/gfs2/foo 2>&1 > /dev/null

real	5m2.768s
user	0m1.007s
sys	0m4.202s
[root@skol-04 ~]# umount /mnt/gfs2

52s with readahead vs 5m2s w/o to perform an 'ls -l' of the dir.
Comment 33 Abhijith Das 2014-07-10 23:39:18 EDT
Created attachment 917238 [details]
Updated readahead syscall with buffer allocation in the VFS

This patch fixes a couple of issues I found with the previous version of this patch. Along the way, I also answered my own question of what happens in the workqueues if the superblock disappears. Looks like the answer is to flush the workqueues at unmount time to prevent that scenario.
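
In other words, something along these lines in the unmount path (assuming the dir_ra_wq workqueue from comment 32; the function name is just a stand-in for wherever gfs2 tears down its super block):

/* Illustrative: drain pending readahead lookups before the super block
 * goes away, so no work item ever sees a stale sb. */
static void fs_teardown_dir_ra(struct super_block *sb)
{
        flush_workqueue(dir_ra_wq);
        /* ... normal unmount teardown continues ... */
}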
Comment 34 Abhijith Das 2014-07-10 23:43:39 EDT
Created attachment 917239 [details]
Variant of the previous patch with allocation of buffer in the filesystem

The dir_context struct only has a void* opaque pointer, which is allocated/deallocated as necessary by gfs2. This gives more flexibility with the amount of state we can have. In gfs2's specific case, we can identify the hash-collision scenario during directory reading and not have to return prematurely having collected fewer than the requested entries.
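
Roughly, the shape this takes is a single opaque pointer on dir_context that only the filesystem interprets; the existing dir_context fields and the gfs2-side state below are sketched from memory and are not the patch's exact definitions.

/* VFS side (sketch): dir_context carries an opaque, fs-owned pointer. */
struct dir_context {
        const filldir_t actor;
        loff_t pos;
        void *opaque;           /* allocated, filled and freed by the fs */
};

/* gfs2 side (sketch): whatever readahead state the fs wants to keep. */
struct gfs2_dir_ra {
        u64 *inums;             /* collected inode numbers */
        unsigned int count;
        unsigned int size;
        unsigned int req;       /* number of entries requested */
        unsigned int flags;     /* e.g. keep going across hash collisions */
};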
Comment 35 Abhijith Das 2014-07-25 04:03:51 EDT
Created attachment 920895 [details]
Latest revision of the xgetdents patches
Comment 36 Abhijith Das 2014-07-25 04:05:30 EDT
Created attachment 920896 [details]
Revised dirreadahead patches
Comment 37 Abhijith Das 2014-07-25 04:07:02 EDT
Created attachment 920897 [details]
Program to test the readahead system call
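
Roughly, such a test program only needs to open the directory and issue the new syscall; the syscall number below is a placeholder (the call was never assigned one upstream) and the argument list is simplified compared to the attached program, which also tracks a directory offset across repeated calls.

/* Illustrative userspace sketch -- __NR_dirreadahead is a placeholder. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_dirreadahead 317   /* placeholder, not a real syscall number */

int main(int argc, char **argv)
{
        long ret;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <dir>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECTORY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Ask the kernel to read ahead the inodes of up to 100000 entries. */
        ret = syscall(__NR_dirreadahead, fd, 0, 100000);
        printf("dirreadahead returned %ld\n", ret);
        close(fd);
        return 0;
}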
Comment 38 Abhijith Das 2014-07-25 04:08:37 EDT
Created attachment 920898 [details]
Script to automate runs of xgetdents/dirreadahead syscalls and measure performance
Comment 39 Abhijith Das 2014-07-25 04:11:34 EDT
Created attachment 920899 [details]
Comparison of the various syscalls for different sized directories

Comparison of the 3 ways of doing readdirplus:
1. 'ls -l' (getdents() followed by stat() of each entry)
2. xgetdents() syscall
3. dirreadahead() syscall, followed by 'ls -l'
Comment 40 Abhijith Das 2014-07-25 04:12:59 EDT
Created attachment 920902 [details]
seekwatcher graph of 'ls -l' vs xgetdents() with various tunable values
Comment 41 Abhijith Das 2014-07-25 04:13:41 EDT
Created attachment 920903 [details]
seekwatcher graph of 'ls -l' vs dirreadahead() with various tunable values
Comment 42 Abhijith Das 2014-10-17 13:34:12 EDT
Created attachment 947987 [details]
Multithreaded userspace program to readahead directory inodes

Note You need to log in before you can comment on or make changes to this bug.