Description of problem:
Running ls -laRt on a Ganesha mount point takes about 3.6 hours for roughly 11 lakh (1.1 million) files.

Version-Release number of selected component (if applicable):
nfs-ganesha-gluster-2.5.5-7.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-7.el7rhgs.x86_64
nfs-ganesha-2.5.5-7.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-11.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a 6-node Ganesha cluster.
2. Create a 4 x 3 = 12 Distributed-Replicate volume.
3. Export the volume via Ganesha.
4. Mount the volume on 4 different clients via NFS 4.0.
5. Run the following workload from the 4 clients:
   Client 1: Create 2 lakh files using the touch command
   Client 2: Run bonnie
   Client 3, Client 4: Create files in a loop using dd
   (for i in {1..1000000};do dd if=/dev/urandom of=stressc3$i conv=fdatasync bs=100 count=10000;done)
   Stop the I/O after some time (once the I/O on Client 1 and Client 2 has finished).
6. Run ls -laRt from a single client.

Actual results:
ls -laRt took about 3.6 hours for 11 lakh files.
-----------------
4 directories, 1129061 files
real 218m24.789s
user 0m9.004s
sys 0m51.261s
-----------------

Expected results:
ls -laRt should not take this long.

Additional info:
When ls -laRt was run on a FUSE client, with the same volume and the same data mounted via glusterfs, it took just under 3 minutes.
----------
real 2m53.605s
user 0m10.317s
sys 0m19.533s
----------
The after-reboot is the same directory tree? So after reboot, Ganesha is *faster* than FUSE? That's very surprising... I wonder if this is creations causing a degeneration in readdir chunks... Frank?
So you do a bunch of creates, then when all the creates are done, do ls -laRt? The creates will fill the dirent cache (up to the limit of Dir_Chunk * Detached_Mult). The dirents will be in the "unattached" chunk. Assuming your mdcache is big enough, all of the fsal_obj_handles will be cached with attributes. Hmm, is the ls -laRt from the same client that did the creates? If so, and it has cached the dirents client side, it MAY just do a LOOKUP and/or GETATTR, not a READDIR... Assuming the client DOES do a READDIR, I'm not sure what the implications are of the readdir callback passing an fsal_obj_handle that duplicates one already in the mdcache.
(In reply to Daniel Gryniewicz from comment #5)
> The after-reboot is the same directory tree? So after reboot, Ganesha is
> *faster* than FUSE? That's very surprising...
>
> I wonder if this is creations causing a degeneration in readdir chunks...
> Frank?

Yes. After restarting the ganesha service, it's the same data set on which ls -laRt was performed (1 lakh files). After writing the data set on the mount, ls -laRt took around 15 minutes. But after restarting the ganesha service, it took far less time, about 28 seconds.

Note: The same client was used in both iterations.
chunks_hwmark is 5000; that's the number of chunks, so the dirent cache can hold 640,000 dirents and your entire directory can live in it. Since nothing is changing any attributes, the directory is not being invalidated (and the activity keeps the mdcache entry for the directory from being reaped). We need to set entries_hwmark to something larger than chunks_hwmark * dir_chunk. Ideally we should have a dirent cache large enough for the largest directory, and thus enough mdcache entries as well. But we still need reasonable performance for a directory larger than that. Looking into why mdcache_readdir_chunked is calling getattrs: it's because the mdcache entry for the dirent was no longer present. Clearly Gluster's performance doing everything needed to re-instantiate an entry is far worse than doing a readdir plus. Given that, I would strongly suggest we set entries_hwmark to at least 10 * chunks_hwmark * dir_chunk to ensure that we almost always have the mdcache entries corresponding to a dirent cached.
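For reference, the sizing arithmetic above can be written out as a rough sketch. The dir_chunk value of 128 is inferred from the figures in this comment (5000 chunks giving 640,000 dirents), and the helper name is made up for illustration; it is not a Ganesha function.

```python
def recommended_entries_hwmark(chunks_hwmark, dir_chunk, factor=10):
    """Suggested lower bound: factor * chunks_hwmark * dir_chunk (hypothetical helper)."""
    return factor * chunks_hwmark * dir_chunk


chunks_hwmark = 5000
dir_chunk = 128  # assumption: inferred from 640,000 dirents / 5000 chunks

# Number of dirents the dirent cache can hold.
dirent_capacity = chunks_hwmark * dir_chunk
print(dirent_capacity)  # 640000

# Suggested minimum for entries_hwmark with the 10x factor from this comment.
print(recommended_entries_hwmark(chunks_hwmark, dir_chunk))  # 6400000
```

With these values, the suggested entries_hwmark comes out at 6.4 million, comfortably above the roughly 1.1 million files in the reported data set.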
A dirent doesn't actually contain any information about the entry itself except the key to look it up. So we can't recreate the entry directly from the dirent; instead, if the entry has been reaped, we need to re-create it with a lookup/getattrs pair. You will always have this problem if your entries_hwmark is lower than the number of dirents in a directory. You've created a situation where ganesha's cache cannot hold an entire directory, so listing the directory will have to do lookups every time. We have the odd situation here where lookup/getattrs is much, much slower than readdir_p, so the first readdir is faster, but the best solution is to have a large enough cache. It may be possible to have a "large directory" detection where we dump the dirent cache for very large directories, if the FSAL has a setting that wants it? I'm not sure if it will help, but it could be tried. It would be interesting to see the performance of readdir for #dirents vs. entries_hwmark. Is there a ratio of (dirents) / (entries_hwmark) that starts getting really bad? Your example here is (20k) / (5k) = 4. Obviously 1 is fine. Is 2 bad? How is 3? It'd be nice to have some numbers we can recommend to customers.
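As a toy illustration of why a too-small entry cache forces lookups on every listing, here is a minimal sketch that models the entry cache as a strict LRU and replays a sequential directory listing. This is an assumption made for illustration only: Ganesha's actual reaping policy may not be strict LRU, entries_hwmark may not be a hard cap, and the function name is hypothetical. The first pass stands in for the initial population of the cache; misses in later passes stand in for entries that would have to be re-created with a lookup/getattrs pair.

```python
from collections import OrderedDict


def count_reinstantiations(num_dirents, entries_hwmark, passes=2):
    """Replay sequential listings against a strict-LRU cache (toy model).

    Returns a list with the number of cache misses in each pass.
    """
    cache = OrderedDict()
    misses_per_pass = []
    for _ in range(passes):
        misses = 0
        for key in range(num_dirents):
            if key in cache:
                # Hit: mark the entry as most recently used.
                cache.move_to_end(key)
            else:
                # Miss: the entry has to be (re)created and inserted.
                misses += 1
                cache[key] = True
                if len(cache) > entries_hwmark:
                    # Evict the least recently used entry.
                    cache.popitem(last=False)
        misses_per_pass.append(misses)
    return misses_per_pass


# Ratio 4, as in the example above: every entry misses on the second listing.
print(count_reinstantiations(20000, 5000))  # [20000, 20000]

# Ratio 1: the second listing is served entirely from the cache.
print(count_reinstantiations(5000, 5000))   # [5000, 0]
```

In this strict-LRU model any ratio above 1 misses on every entry of a repeated sequential listing, because each entry is evicted before the scan comes back to it. Real behaviour at ratios such as 2 or 3 would still need to be measured, since the model ignores how Ganesha actually reaps entries.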
I don't think this should be a blocker. If the handle cache is large enough, there's no problem; and if the directory is huge, it will be slow. But it won't be any slower than previous versions, which didn't have readdir_p. In the medium term, I have a proposal to fix this. It will complicate the readdir code to the point where I'm not comfortable squeezing it into a release that is about to come out, but it can go into the next one, once properly tested.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0260