Description of problem: Add an xgetdents() syscall (a.k.a. readdirplus) to the Linux kernel that returns xstat info (based on David Howells' recent xstat patches) along with dirents.
Created attachment 605960 [details] First bash at a patch. This patch also includes the VFS and ext4 bits of David Howells' xstat patch (patches 1of6 and 2of6) and is against Steve Whitehouse's latest nmw tree.
We don't add syscalls until they're upstream.
Created attachment 617073 [details] Another revision

This is another revision of the patch. It does not contain the xstat bits from dhowells' patchset, just the VFS bits for xgetdents and the supporting bits in gfs2. Please refer to dhowells' patches here: https://lists.samba.org/archive/samba-technical/2012-April/082906.html

Things to do:
1. See if it's possible to avoid allocating and deallocating xattrs repeatedly. Possibly bypass the whole xdirent_info struct and directly manipulate the __user linux_xdirent struct (see the sketch after this comment).
2. Possibly add a new readdirplus.c and header file and move all the relevant code there to make things cleaner and easier to read.
3. Add support to other filesystems so they can use this syscall too.
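For orientation, a rough sketch of the kind of per-entry record a readdirplus-style call could return, combining dirent fields with stat-like data. The layout and names here are illustrative assumptions only; the real layout is defined by the linux_xdirent/xstat structures in the attached patches.

/* Illustrative layout only; the real record is defined by the linux_xdirent
 * and xstat structures in the attached patches. */
struct example_xdirent {
        unsigned long long xd_ino;      /* inode number */
        long long          xd_off;      /* offset to the next record */
        unsigned short     xd_reclen;   /* total length of this record */
        unsigned char      xd_type;     /* DT_REG, DT_DIR, ... */
        /* stat-like data (mode, size, timestamps, ...) would sit here,
         * filled from the xstat information for the entry */
        char               xd_name[];   /* NUL-terminated entry name */
};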
Some comments: It is not a good idea to look up the inodes in do_filldir_main(), since you are only getting a few inodes at a time on each call and ideally we want to be working on larger batches. Also, I can't see anywhere which sorts the inodes to be looked up, so it appears we are looking up in hash order rather than disk block order. One of the main benefits of doing this as a syscall is that we should be able to use an ordering which is beneficial to the filesystem.

It would be better to do the directory reading into a reasonably sized buffer to start with, and then to do all the lookups afterwards. That will mean that a more optimal ordering can be achieved, as well as reducing the possibility of any locking issues. The filesystem needs the freedom to order things in the way best suited to it.

In addition, it ought to be possible to send all the lookups off at the same time, in order to reduce the I/O latency, rather than doing them sequentially. That should provide a big performance win, although it also increases complexity due to the lock ordering issues.
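To make the suggestion above concrete, here is a minimal sketch of the "gather first, sort by disk address, then look up" pattern, using the kernel's lib/sort.c sort(). The batch structure and the lookup_one_inode() helper are hypothetical names for illustration, not code from the patch.

/* Minimal sketch: collect dirents into a batch, sort by inode disk address,
 * then do the lookups, so I/O proceeds in disk block order rather than
 * directory hash order. Names here are illustrative only. */
#include <linux/fs.h>
#include <linux/sort.h>
#include <linux/types.h>

struct batch_entry {
        u64 no_addr;            /* disk address of the inode */
        u64 no_formal_ino;
};

struct dirent_batch {
        unsigned int count;
        struct batch_entry ent[];
};

static int cmp_by_disk_addr(const void *a, const void *b)
{
        const struct batch_entry *x = a, *y = b;

        if (x->no_addr < y->no_addr)
                return -1;
        return x->no_addr > y->no_addr;
}

/* Hypothetical stand-in for the fs-specific inode lookup; the real code
 * would read the inode in here (ideally asynchronously). */
static void lookup_one_inode(struct super_block *sb, u64 no_addr, u64 no_formal_ino)
{
}

static void lookup_batch(struct super_block *sb, struct dirent_batch *batch)
{
        unsigned int i;

        /* Sort so lookups happen in disk block order, not hash order. */
        sort(batch->ent, batch->count, sizeof(batch->ent[0]),
             cmp_by_disk_addr, NULL);

        for (i = 0; i < batch->count; i++)
                lookup_one_inode(sb, batch->ent[i].no_addr,
                                 batch->ent[i].no_formal_ino);
}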
Created attachment 625592 [details] Another revision Steve, This revised patch hopefully addresses some of the issues you mentioned. When you get a chance, can you review and comment? Thanks! --Abhi
The big question is what kind of speedup you get with this? The larger it is, the easier it will be to make an argument for including it upstream.

Some other comments:
- You don't need __inline__ in the kernel; inline will do on its own.
- Probably some of those inline functions don't actually need to be inlined anyway. The compiler will inline small functions automatically in many cases.
- I'd prefer not to go back to using the _i suffix for functions, but to name them in a way which makes clear what each one does.
- void pointers don't need casts to/from other pointer types.
- We probably need to look again at the gfs2_alloc/realloc/free functions... we should not be using vmalloc in places which are performance critical at all, since vmalloc can be very slow. Can we have a system of using a vector of pages, or does that give us problems wrt sorting?

Overall, though, this looks a lot cleaner than the previous version and I think it is moving in the right direction.

One further thought... rather than copying directly to user space, would it be easier if we had a buffer (per struct file, for example) which would be used as temporary space into which to write results? That way larger requests could be made to the fs, even if userspace only asked for small amounts at a time. In addition, it may help to resolve some of the buffer sizing issues that you've run into. The downside is that it increases complexity, and also we'd need to know when to invalidate the buffer in case of changes in the directory. But something to consider perhaps...
IRC chat with Steve - more details regarding his feedback in comment #6

<abhi``> swhiteho, Thanks for taking a look at the xgetdents patch again. The minor things are easy to fix, but I'm a bit unclear about a couple of things: 1. The per-struct-file buffer idea to use as temporary space and 2. alloc/realloc/free functions and using vector of pages instead. Can you elaborate a bit?
<swhiteho> ok, I'll see if I can explain a bit better :-)
<swhiteho> 1. my thought is "copy into buffer in large chunks while holding the glock" and "copy to user from buffer in chunks specified by the user while not holding the glock"
<swhiteho> so if the user asks for less than the buffer size, you can leave the buffer on the struct file and if they then don't lseek before requesting the next chunk of data, you don't need a new call to the fs to copy it out
<swhiteho> that means the buffer size to the fs can be a known fixed size
<swhiteho> which may make the fs implementation easier
<swhiteho> or it may slow things down due to an extra buffer copy
<swhiteho> so its tricky to know what is best, but it might be worth experimenting with
<swhiteho> 2. kmalloc is basically "allocate a load of pages, and use the page tables to map them contiguously into kernel memory space"
<swhiteho> s/kmalloc/vmalloc
<abhi`> swhiteho, it might not slow down much, as we're copying most of the stuff into a temp kernel buffer anyway... the one that hangs off of the ctx struct
<swhiteho> in that case, it may not be even that much of a slow down then I guess
<abhi`> ok, so the buffer hangs from the struct file to the directory in question
<swhiteho> one of the issues with vmalloc is that there are various slow things compared with kmalloc
<swhiteho> one is setting up and tearing down the vm mappings, and another is that you land up using up TLB entries when accessing the memory
<swhiteho> so its ok for large hash tables for example, but for anything where performance is an issue, it is better to just get a vector of pages directly
<swhiteho> kmalloc also is layered on top of the page allocator, but via slab (or similar) and there is no virtual address translation involved
<abhi`> right
<swhiteho> now we can ask for higher order pages too, but those have a decreasing probability of being available as the order increases
<swhiteho> so if we could use just a list of single pages, that should be a better compromise perhaps
<abhi`> I see, certainly worth experimenting with.
<abhi`> swhiteho, so here's a question, if the amount of allocation I need is greater than KMALLOC_MAX_SIZE, but I still call kmalloc(), will it fail? And if it does, does it mean there are no more pages available to allocate?
<abhi`> If so, wouldn't it affect the vector-of-pages method similarly?
<swhiteho> if it is greater than KMALLOC_MAX_SIZE I suspect it will just fail anyway
<swhiteho> since that depends on the largest slab cache
<swhiteho> vector of pages just needs individual pages to be free, so that they don't need to be contiguous pages
<abhi`> ah, I got confused. I was thinking vmalloc when I wrote kmalloc above. It makes sense now. So I'll have to have some sort of header in each page to say how many entries are in it and so on.
<swhiteho> something like that, or just have an array of (page pointer, size) type tuples
<abhi`> yeah, or that. Ok, I'm going to try this out - have a page vector hanging off of the struct file of the directory.
<abhi`> I think it'll actually make my life easier without all the alloc/realloc
<swhiteho> abhi`: yes, I suspect it might
<swhiteho> abhi`: did you get any timings btw?
<swhiteho> would be interesting to know how much faster it is than readdir + stat
<abhi`> swhiteho, I had half a mind to do it, but wasn't sure if there were more changes needed, so I wanted to wait for a round of feedback+changes
<swhiteho> worth doing anyway I think, just to check that it is working as hoped
<abhi`> swhiteho, I did write up a test program and test that it works as it's designed. But have not stress-tested it or anything. I will write up a script to compare numbers for this vs readdir+stat+getattr and also vs the new page-buffer method when it's ready
<swhiteho> ok, sounds good
Created attachment 678841 [details] Updated to latest upstream kernel (nmw tree)
Created attachment 680222 [details] Updated patch

This is an updated version of the previous patch; it fixes a couple of issues I found while testing. The results are not as expected, though. It seems to be quite inefficient compared to simply doing getdents() followed by open() and fxstat()+flistxattr()+fgetxattr() on every entry. The test was on a loopback-device-backed gfs2 fs using lock_nolock. For a set of 5000 files in a single directory with up to 25 xattrs per file, readdirplus (xgetdents) took

real    0m17.977s
user    0m0.001s
sys     0m13.863s

whereas getdents()+open()+fxstat()+flistxattr()+fgetxattr() took

real    0m3.685s
user    0m0.072s
sys     0m3.230s

There could be a bug in the patch somewhere that could explain this, or the implementation itself might be inefficient (memory allocation or locking, perhaps).
Glad to see that it is working now - maybe try a blktrace/seekwatcher to compare the two cases?
Created attachment 680282 [details] seekwatcher graph The bottom graph corresponds to the readdirplus patch and the top graph is using the conventional set of syscalls.
Some instrumentation:

[root@smoke-05 userland]# perf stat -a -e kmem:kmalloc time ./readdirtest -q /mnt/gfs2/foo
0.00user 18.85system 0:28.39elapsed 66%CPU (0avgtext+0avgdata 1444maxresident)k
85952inputs+0outputs (0major+385minor)pagefaults 0swaps

 Performance counter stats for 'time ./readdirtest -q /mnt/gfs2/foo':

     1,060,423 kmem:kmalloc

  28.401295160 seconds time elapsed

/var/log/messages:
[19509.681640] ALL DONE. Printing kmalloc/krealloc/kfree stats
[19509.687887] total kmalloc duration: 11.321093680 seconds
[19509.693831] total krealloc duration: 0.368829520 seconds
[19509.699787] total kfree duration: 7.982661000 seconds
[19509.705559] total op counts - kmalloc:1035948 krealloc:31129 kfree:1063625

Allocation/deallocation is taking up most of the time.
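For context, a minimal sketch of the kind of wrapper that could produce the "total kmalloc duration" numbers above. The wrapper and counter names are illustrative assumptions, not the actual debug code used for this run.

/* Illustrative only: time each allocation and accumulate totals, mirroring
 * the "total kmalloc duration" style of output above. */
#include <linux/atomic.h>
#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/slab.h>

static atomic64_t kmalloc_ns;           /* total time spent in kmalloc() */
static atomic64_t kmalloc_calls;        /* number of calls */

static inline void *timed_kmalloc(size_t size, gfp_t flags)
{
        ktime_t start = ktime_get();
        void *p = kmalloc(size, flags);

        atomic64_add(ktime_to_ns(ktime_sub(ktime_get(), start)), &kmalloc_ns);
        atomic64_inc(&kmalloc_calls);
        return p;
}

static void print_alloc_stats(void)
{
        pr_info("total kmalloc duration: %lld ns over %lld calls\n",
                (long long)atomic64_read(&kmalloc_ns),
                (long long)atomic64_read(&kmalloc_calls));
}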
Created attachment 708389 [details] Newer version of patch with vector of pages storage

This is a draft version (it still needs some cleanup and a couple of features) of readdirplus with a dynamic buffer backed by a vector of pages to store intermediate data. This scheme is faster than using kmalloc/krealloc/kfree etc. For 1000 files, here's the comparison against conventional readdir()+stat()+getxattr(). The conventional method corresponds to the '-r' switch in the readdirtest program.

Cold cache:
[root@smoke-05 gfs2-3.0-nmw]# time /root/userland/readdirtest -q /mnt/gfs2/foo
real    0m1.536s
user    0m0.001s
sys     0m1.039s
[root@smoke-05 gfs2-3.0-nmw]# time /root/userland/readdirtest -rq /mnt/gfs2/foo
real    0m2.202s
user    0m0.021s
sys     0m1.674s

Cached results:
[root@smoke-05 gfs2-3.0-nmw]# time /root/userland/readdirtest -q /mnt/gfs2/foo
real    0m0.099s
user    0m0.000s
sys     0m0.099s
[root@smoke-05 gfs2-3.0-nmw]# time /root/userland/readdirtest -rq /mnt/gfs2/foo
real    0m0.639s
user    0m0.019s
sys     0m0.620s
Posting the latest version of my readdirplus work, rebased to the latest gfs2-3.0-nmw tree. I've split it into 4 parts. I'm also attaching David Howells' not-yet-upstream xstat patch (only the relevant VFS bits) for completeness.

dhowells-xstat-vfs-only.patch
0001-vfs-xgetdents-syscall-hooks-and-xreaddir-fop.patch
0002-gfs2-vector-of-pages-buffer.patch
0003-gfs2-sort-with-opaque-context.patch
0004-gfs2-xreaddir-op-and-supporting-functions.patch

There's a known bug with this patchset: when the system is under memory pressure, some of the entries of a directory are skipped and not returned. On my test machine, I can reproduce this when I set the collection buffer to unlimited (using the tunables) and run the syscall on a dir with 500k files. I'm still working to fix this issue.

New features:
- Small improvements to the vector-of-pages buffer code (I call it vbuf). I'm still not sure if this is the best way to implement such a thing. I don't know if there's a way to translate these random pages into a linear address space without much cost. I couldn't find anything suitable in the kernel, so I'm doing the arithmetic myself (see the sketch after this comment).
- Added 2 tunables to limit the amount of memory that the syscall uses, plus all the associated code to manage collecting/outputting data in chunks. By adjusting these tunables, one can also tweak performance. At some future point, we should teach the syscall to adapt to memory conditions/data sizes and choose appropriate collection strategies.
- Added a ctx_sort function (based on lib/sort.c) that accepts an opaque context which it passes to the compare() and swap() methods. This is used to sort the dirent pointer array in the vbuf, because the vbuf doesn't have a linear address space and special things need to be done to access the pointers.
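To make the vbuf idea concrete, here is a minimal sketch of a vector-of-pages buffer addressed by a hand-computed linear offset, plus a possible ctx_sort prototype. The structure layout, helper names, and the ctx_sort signature are assumptions for illustration; they are not the code in the attached patches.

/* Illustrative sketch of a "vector of pages" buffer (vbuf): single pages are
 * allocated independently (no contiguous allocation needed) and addressed by
 * doing the page-index/offset arithmetic by hand. */
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/string.h>

struct vbuf {
        struct page **pages;
        unsigned int npages;
};

static int vbuf_alloc(struct vbuf *vb, unsigned int npages)
{
        unsigned int i;

        vb->pages = kcalloc(npages, sizeof(*vb->pages), GFP_KERNEL);
        if (!vb->pages)
                return -ENOMEM;
        for (i = 0; i < npages; i++) {
                vb->pages[i] = alloc_page(GFP_KERNEL);
                if (!vb->pages[i])
                        return -ENOMEM; /* caller frees whatever was allocated */
        }
        vb->npages = npages;
        return 0;
}

/* Copy 'len' bytes into the vbuf at linear offset 'off', crossing page
 * boundaries as needed. */
static void vbuf_write(struct vbuf *vb, size_t off, const void *src, size_t len)
{
        while (len) {
                unsigned int pg = off >> PAGE_SHIFT;
                unsigned int pgoff = off & (PAGE_SIZE - 1);
                size_t chunk = min_t(size_t, len, PAGE_SIZE - pgoff);
                void *dst = kmap_atomic(vb->pages[pg]);

                memcpy(dst + pgoff, src, chunk);
                kunmap_atomic(dst);
                off += chunk;
                src += chunk;
                len -= chunk;
        }
}

/* A possible prototype for the context-aware sort: like lib/sort.c's sort(),
 * but cmp() and swap() also receive an opaque ctx (e.g. the vbuf) so they can
 * reach entries that do not live in a linear address space. */
void ctx_sort(void *base, size_t num, size_t size, void *ctx,
              int (*cmp)(void *ctx, const void *a, const void *b),
              void (*swap)(void *ctx, void *a, void *b, int size));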
Created attachment 731269 [details] David Howells' xstat patch - vfs only bits
Created attachment 731270 [details] 0001-vfs-xgetdents-syscall-hooks-and-xreaddir-fop.patch
Created attachment 731272 [details] 0002-gfs2-vector-of-pages-buffer.patch
Created attachment 731273 [details] 0003-gfs2-sort-with-opaque-context.patch
Created attachment 731274 [details] 0004-gfs2-xreaddir-op-and-supporting-functions.patch
Created attachment 731278 [details] Test program to run xgetdents syscall against a directory

Usage: /root/userland/readdirtest [OPTION] <dir>
  -b SIZE   Buffer size in KB (default 1MB)
  -r        Use readdir instead of readdirplus (default)
  -l LEVEL  where LEVEL:
            0 - only entries
            1 - entries and stat
            2 - entries, stat and xattr keys
            3 - entries, stat, xattr keys and values (default)
  -c        Compact output - single line for each file
  -q        Quiet... don't output information, just process
Created attachment 731279 [details] Program to create a bunch of entries in a directory w/ xattrs Usage: ./create_readdir_dir <topdir> <seed> <num files> <max xattrs/file>
Created attachment 732739 [details] 0002-gfs2-vector-of-pages-buffer.patch Fixed a few minor issues reported by Bob when he reviewed the patch set.
Created attachment 732740 [details] 0004-gfs2-xreaddir-op-and-supporting-functions.patch Fixed a few minor issues reported by Bob when he reviewed the patch set.
Created attachment 732845 [details] blktrace comparison of readdir+stat+getxattr vs readdirplus On the left is readdir()+stat()+getxattr() and on the right is readdirplus()
Created attachment 834840 [details] Another variant that adds a syscall to do readahead on the relevant blocks

This is a first-attempt patch at a syscall that does readahead on the inode blocks of a directory's entries. A couple of things are missing:
1. A way to limit the amount of readahead so we don't run out of memory.
2. xattr blocks are not read ahead yet.

With this patch, the 'ls -l' (readdir() + stat() on each entry) operation is significantly faster on a directory with 100000 files.

Without the patch:
[root@skol-04 gfs2-3.0-nmw]# echo 3 > /proc/sys/vm/drop_caches
[root@skol-04 gfs2-3.0-nmw]# sleep 5
[root@skol-04 gfs2-3.0-nmw]# mount -t gfs2 /dev/loop0 /mnt/gfs2
[root@skol-04 gfs2-3.0-nmw]# time ls -l /mnt/gfs2/foo > /dev/null
real    8m39.396s
user    0m1.376s
sys     0m3.424s
[root@skol-04 gfs2-3.0-nmw]# umount /mnt/gfs2

With the patch:
[root@skol-04 gfs2-3.0-nmw]# echo 3 > /proc/sys/vm/drop_caches
[root@skol-04 gfs2-3.0-nmw]# sleep 5
[root@skol-04 gfs2-3.0-nmw]# mount -t gfs2 /dev/loop0 /mnt/gfs2
[root@skol-04 gfs2-3.0-nmw]# time /root/ra_test /mnt/gfs2/foo
Return value from rddir_readahead: 0
real    0m22.606s
user    0m0.000s
sys     0m0.321s
[root@skol-04 gfs2-3.0-nmw]# time ls -l /mnt/gfs2/foo > /dev/null
real    0m5.074s
user    0m1.195s
sys     0m2.662s
[root@skol-04 gfs2-3.0-nmw]# umount /mnt/gfs2
Created attachment 853502 [details] Updated readahead patch with tunable limits
The general structure of this looks good. Just a few comments:
- The naming of the call could perhaps be better, but we can figure that out later.
- gfs2_rddir_ra_inodes() doesn't appear to do the right thing... it looks like it's using the directory's address space and not actually reading in the inodes correctly, so it is not doing the readahead properly. This bit should look like a batch ->lookup(), since that is what it is doing.
- I don't want to add any more items to the tunable structures, so we need to find a way to set that limit automatically; however, for the purposes of gathering some performance info, this should be perfectly ok.
Created attachment 909528 [details] First of a 3-patch set: VFS interface for the new rddir_readahead system call
Created attachment 909529 [details] Second patch: GFS2 - rename the delete workqueue to ra_and_del so we can use it to do offline inode readaheads as well
Created attachment 909534 [details] Third patch: GFS2's implementation of the rddir_ra syscall

This 3-patch set is a preview... it still needs a bit of work:
1. Locking around the rra struct.
2. Error handling - currently it works only in the happy path.
3. Some way to limit the amount of memory consumed.
4. A notification mechanism to userspace when readahead is complete.

Having said that, I see some promising results with this patch. On a gfs2 fs with a directory 'foo' containing 100000 files, without the patch, an 'ls -l' on 'foo' takes over 4 minutes with a cold cache.

[root@skol-04 gfs2-3.0-nmw]# time ls -l /mnt/gfs2/foo 2>&1 > /dev/null
real    4m10.660s
user    0m0.992s
sys     0m4.163s

With the patch, I see the following:

[root@skol-04 gfs2-3.0-nmw]# time /root/ra_test /mnt/gfs2/foo
real    0m30.169s
user    0m0.000s
sys     0m0.146s

^^ In this bit, we go through the directory collecting all the inode numbers contained in it and pass the collection on to the workqueue to look them up. (Should we delegate the inode number collection to the workqueue as well, instead of doing it synchronously?)

[root@skol-04 gfs2-3.0-nmw]# time sleep 45
real    0m45.001s
user    0m0.000s
sys     0m0.000s

^^ This is an arbitrary sleep period to cover the time it takes to look up all 100000 inodes. Currently there's no mechanism to notify the userland caller that readahead is complete, but when that feature is added, this won't be necessary. Looking at kernel printks at appropriate locations, I see that the lookups take about 36 seconds, i.e.:

[90583.996953] Start gfs2_rddir_ra syscall
[90614.126061] Collected 100000 entries
[90614.129643] End gfs2_rddir_ra syscall
[90640.523517] fs/gfs2/rddir_ra.c gfs2_rddir_ra_inodes 109: Read in 100000 inodes

Another thing to note here is that it's not necessary to sleep or do something else while the readahead is being performed. If you issue an 'ls -l' or similar command immediately after the call to rddir_ra, it will still be faster than doing a regular 'ls -l', but not quite as fast as it can get, given that not all inodes have been read in yet. This delay can be controlled by the user by adjusting the max number of inode readaheads requested.

[root@skol-04 gfs2-3.0-nmw]# time ls -l /mnt/gfs2/foo 2>&1 > /dev/null
real    0m2.219s
user    0m0.712s
sys     0m1.431s

^^ Finally, with all the inodes read in, 'ls -l' takes 2.2 seconds. The total time taken is 1m8s - considerably faster than 'ls -l' by itself.
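For illustration, a minimal sketch of the async pattern described above: the syscall path collects inode numbers and hands the batch to a workqueue, whose worker does the lookups. The structure, the workqueue, and readahead_one_inode() are hypothetical names, not the actual patch code.

/* Illustrative sketch only: a batch of inode numbers handed to a workqueue
 * so the lookups happen asynchronously, off the syscall path. */
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/workqueue.h>

struct ra_batch {
        struct work_struct work;
        struct super_block *sb;
        unsigned int count;
        u64 inum[];                     /* inode numbers collected from the dir */
};

/* Assumed to be created with alloc_workqueue() at mount/module init time. */
static struct workqueue_struct *ra_wq;

/* Hypothetical stand-in for the fs-specific lookup that actually reads the
 * inode (and, later, its xattr blocks) into the cache. */
static void readahead_one_inode(struct super_block *sb, u64 inum)
{
}

static void ra_batch_worker(struct work_struct *work)
{
        struct ra_batch *batch = container_of(work, struct ra_batch, work);
        unsigned int i;

        for (i = 0; i < batch->count; i++)
                readahead_one_inode(batch->sb, batch->inum[i]);
        kfree(batch);
}

/* Called from the syscall path once the directory scan has collected inums. */
static int ra_submit(struct super_block *sb, const u64 *inums, unsigned int n)
{
        struct ra_batch *batch;

        batch = kmalloc(sizeof(*batch) + n * sizeof(batch->inum[0]), GFP_KERNEL);
        if (!batch)
                return -ENOMEM;
        batch->sb = sb;
        batch->count = n;
        memcpy(batch->inum, inums, n * sizeof(batch->inum[0]));
        INIT_WORK(&batch->work, ra_batch_worker);
        queue_work(ra_wq, &batch->work);
        return 0;
}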
RE: comment #28 - The patch looks good generally, but we might want to think a bit more about the naming: dir_readahead() might be better, or even dirreadahead(), since there are few syscalls with '_' in them.

RE: comment #29 - The naming of the workqueue again is a bit ugly, but otherwise I think it looks ok.

RE: comment #30 - The results look promising, I think. Some comments on the code:

If the buffer size is greater than the max size for kmalloc, then we should probably limit it to that size (which is pretty large these days), since otherwise we may land up always using __vmalloc, which is bad for performance. Probably you could also fit the struct gfs2_rddir_ra into the buffer, and thus avoid one of the allocations too.

Is it not possible to use the directory read ctx to avoid needing to pass down an extra argument? Even if you have to add a field to this structure, that would be a cleaner solution. What does the HASHCOLL bit do?

It looks like you are still iterating over the inodes individually, even though they are on the work queue. I would suggest sorting them in gfs2_rddir_ra() and then, for each inode, allocating a structure with the workqueue head and inode number, and submitting it to the workqueue. If kmalloc returns NULL at any stage, you can always stop early, since you are returning the number of items read, and I assume also updating the current position. That way you'll land up with the maximum number of things happening in parallel.

If you need a workqueue specifically for readahead, then that would be better than trying to force this into the existing delete workqueue if doing that is going to be a pain.

All in all, though, this is looking promising I think, and I'm very glad to see that you are getting good performance results already.
Created attachment 912745 [details] Revised readahead syscall patch after discussions with swhiteho

This patch still needs a bit of cleanup and splitting into logical pieces. Here's what is new compared to the previous version:
1. Renamed the syscall to something less ugly: dir_readahead.
2. Left the delete_wq alone and created a new one called dir_ra_wq for async inode lookups.
3. Use the dir_context struct to preserve the state needed to perform the operation. By passing around the state in this struct, we also avoid adding a new parameter to a bunch of functions in fs/gfs2/dir.c.
4. Allocate and deallocate the above state in the VFS (fs/readdir.c).
5. Once inode numbers are collected, create a new work struct for every inode lookup and add them to the workqueue.

There are a few things that I'd like to change about this patch:
a) The readahead context (buf, count, size, req) stored in the dir_context struct seems too fs-specific. I think it might be a better idea to simply have a 'void *opaque' pointer in dir_context and have the fs alloc/dealloc it and fill it with whatever content it needs.
b) Another (minor) benefit for gfs2 from a) is that we can add a flag to check for the hash-collision situation while collecting inode numbers, where we can persist to get the entire set of requested inodes. Otherwise, we'll end up doing short reads of the dirents every time we encounter a dirent hash collision.
c) We need the gfs2 superblock in the workqueue to perform inode lookups. What happens if the superblock disappears (unmount) while the workqueue is still processing inodes for that fs? Does it get initialized to NULL? Will a simple check for NULL suffice?

Regardless, I'm still seeing a fair bit of speedup with this patch for the same 100000-file directory (a sketch of the test loop follows this comment):

[root@skol-04 ~]# modprobe -r gfs2
[root@skol-04 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@skol-04 ~]# sleep 5
[root@skol-04 ~]# mount -t gfs2 /dev/skolo/50g /mnt/gfs2/
[root@skol-04 ~]# time /root/ra_test /mnt/gfs2/foo
offset: 0 return err code: 5732
offset: 122920247 return err code: 84562
offset: 1942302922 return err code: 9706
offset: 2147480613 return err code: 0
real    0m47.085s
user    0m0.000s
sys     0m0.219s
[root@skol-04 ~]# time ls -l /mnt/gfs2/foo 2>&1 > /dev/null
real    0m5.042s
user    0m0.734s
sys     0m1.354s
[root@skol-04 ~]# umount /mnt/gfs2
[root@skol-04 ~]# modprobe -r gfs2
[root@skol-04 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@skol-04 ~]# mount -t gfs2 /dev/skolo/50g /mnt/gfs2/
[root@skol-04 ~]# time ls -l /mnt/gfs2/foo 2>&1 > /dev/null
real    5m2.768s
user    0m1.007s
sys     0m4.202s
[root@skol-04 ~]# umount /mnt/gfs2

52s with readahead vs 5m2s without to perform an 'ls -l' of the dir.
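Based on the "offset: ... return err code: ..." output above, a minimal sketch of what a test loop like ra_test's might look like. The syscall number and argument list are pure assumptions for illustration; the real ABI and test program are in the attachments.

/* Illustrative only: loop over a (hypothetical) dir_readahead syscall,
 * resuming from the offset it hands back until no more entries remain.
 * The syscall number and argument list are assumptions, not the real ABI. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_dir_readahead
#define __NR_dir_readahead 313  /* placeholder syscall number */
#endif

int main(int argc, char **argv)
{
        long long off = 0;
        long ret;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY | O_DIRECTORY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        do {
                long long cur = off;
                /* Assumed signature: (dirfd, in/out offset, max entries). */
                ret = syscall(__NR_dir_readahead, fd, &off, 100000);
                printf("offset: %lld return err code: %ld\n", cur, ret);
        } while (ret > 0);
        close(fd);
        return ret < 0 ? 1 : 0;
}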
Created attachment 917238 [details] Updated readahead syscall with buffer allocation in the VFS

This patch fixes a couple of issues I found with the previous version. Along the way, I also answered my own question about what happens in the workqueues if the superblock disappears. It looks like the answer is to flush the workqueues at unmount time to prevent that scenario.
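A minimal sketch of the unmount-time flush described above, assuming a dedicated readahead workqueue hanging off hypothetical per-superblock private data; the names are illustrative only.

/* Illustrative only: make sure no readahead work can still reference the
 * superblock once it is torn down. */
#include <linux/fs.h>
#include <linux/workqueue.h>

struct example_sbinfo {                         /* hypothetical per-sb private data */
        struct workqueue_struct *dir_ra_wq;     /* the async readahead workqueue */
};

static void example_put_super(struct super_block *sb)
{
        struct example_sbinfo *sbi = sb->s_fs_info;

        /* Wait for all queued inode-readahead work to finish before the
         * superblock (and its inodes) go away. */
        flush_workqueue(sbi->dir_ra_wq);
        destroy_workqueue(sbi->dir_ra_wq);

        /* ... the rest of the normal put_super teardown ... */
}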
Created attachment 917239 [details] Variant of the previous patch with allocation of buffer in the filesystem

The dir_context struct only has a void *opaque pointer, which is allocated/deallocated as necessary by gfs2. This gives more flexibility in the amount of state we can have. In gfs2's specific case, we can identify the hash-collision scenario during directory reading and not have to return prematurely, having collected fewer than the requested entries.
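To illustrate the idea, a sketch of what the proposed dir_context extension might look like (the struct is renamed here to avoid clashing with the real VFS definition; the field name is an assumption based on the description above):

#include <linux/fs.h>   /* for filldir_t, loff_t */

/* Sketch of the proposed extension to struct dir_context (renamed here so it
 * doesn't clash with the real VFS definition); not the actual patch. */
struct dir_context_proposed {
        const filldir_t actor;  /* existing: per-entry callback */
        loff_t pos;             /* existing: current directory position */
        void *opaque;           /* proposed: fs-private state, allocated and
                                 * freed by the filesystem (e.g. gfs2's
                                 * readahead/hash-collision context) */
};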
Created attachment 920895 [details] Latest revision of the xgetdents patches
Created attachment 920896 [details] Revised dirreadahead patches
Created attachment 920897 [details] Program to test the readahead system call
Created attachment 920898 [details] Script to automate runs of xgetdents/dirreadahead syscalls and measure performance
Created attachment 920899 [details] Comparison of the various syscalls for different sized directories

Comparison of the 3 ways of doing readdirplus:
1. 'ls -l' (getdents() followed by stat() of each entry)
2. xgetdents() syscall
3. dirreadahead() syscall, followed by 'ls -l'
Created attachment 920902 [details] seekwatcher graph of 'ls -l' vs xgetdents() with various tunable values
Created attachment 920903 [details] seekwatcher graph of 'ls -l' vs dirreadahead() with various tunable values
Created attachment 947987 [details] Multithreaded userspace program to readahead directory inodes
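For reference, a minimal sketch of the multithreaded userspace approach: readdir() the directory once, then fan stat() calls out across worker threads so the inode reads are issued in parallel. This is an assumption about how such a program might be structured, not the attached program itself.

/* Illustrative only: warm the inode cache by stat()ing directory entries
 * from several threads in parallel. Build with -pthread. */
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define NTHREADS 8
#define MAXFILES 200000

static char *names[MAXFILES];
static int nfiles;
static const char *dirpath;

static void *worker(void *arg)
{
        long id = (long)arg;
        char path[4096];
        struct stat st;

        /* Each thread takes every NTHREADS-th entry. */
        for (long i = id; i < nfiles; i += NTHREADS) {
                snprintf(path, sizeof(path), "%s/%s", dirpath, names[i]);
                stat(path, &st);   /* result discarded; we only want the I/O */
        }
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tid[NTHREADS];
        struct dirent *de;
        DIR *d;

        if (argc < 2)
                return 1;
        dirpath = argv[1];
        d = opendir(dirpath);
        if (!d)
                return 1;
        while ((de = readdir(d)) && nfiles < MAXFILES)
                if (de->d_name[0] != '.')
                        names[nfiles++] = strdup(de->d_name);
        closedir(d);

        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}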