Bug 1639463
Summary: | [GSS] Metadata performance degrades as number of files goes up. | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ben Turner <bturner> |
Component: | glusterfs | Assignee: | Xavi Hernandez <jahernan> |
Status: | CLOSED DEFERRED | QA Contact: | Bala Konda Reddy M <bmekala> |
Severity: | urgent | Docs Contact: | |
Priority: | high | | |
Version: | rhgs-3.4 | CC: | amukherj, bkunal, bturner, guillaume.pavese, jahernan, jbautist, nchilaka, olim, rhs-bugs, sheggodu, srangana, vbellur |
Target Milestone: | --- | Keywords: | Performance, Triaged, ZStream |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-02-10 09:29:46 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 1651050 | | |
Bug Blocks: | | | |
Description
Ben Turner
2018-10-15 19:09:39 UTC
For archival purposes. (In reply to Shyamsundar from comment #16)

> Even if smallfile and ls -lR approach listing differently, the LOOKUP bloat
> on the Big volume should not occur or, if it does, should also occur against
> the Little volume.
>
> The above needs more analysis, to understand why the extra LOOKUPs are
> generated and from where, such that the bloat occurs only with the Big volume.
>
> strace was run against the ls -lR and smallfile tests to understand what
> syscalls are being invoked, and which ones are causing an increased latency
> for the application. (strace run with the -ff -T -o options)
>
> strace patterns against the 2 tools are as follows:
>
> NOTE: Volume types did not matter; the application does not change behaviour
> based on whether it runs against the Little or Big volume, and only Big volume
> results are presented below.
>
> ls -lR strace pattern (cold cache):
>
> openat(AT_FDCWD, "/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3 <0.000270>
> <snip>
> getdents(3, /* 19 entries */, 32768) = 1136 <0.103618>
> lstat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_1_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000101>
> lgetxattr("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_1_", "security.selinux", 0x1e8b310, 255) = -1 ENODATA (No data available) <0.000028>
> lstat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_2_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000009>
> lgetxattr("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_2_", "security.selinux", 0x1e8b310, 255) = -1 ENODATA (No data available) <0.000023>
> <snip>
> getdents(3, /* 19 entries */, 32768) = 1216 <0.014153>
> lstat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_18_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000042>
> lgetxattr("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_18_", "security.selinux", 0x1e8b610, 255) = -1 ENODATA (No data available) <0.000022>
> <snip>
>
> smallfile strace pattern (cold cache):
>
> openat(AT_FDCWD, "/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 22 <0.038498>
> getdents(22, /* 19 entries */, 32768) = 1136 <0.398268>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.056523>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.202871>
> <snip>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.001830>
> getdents(22, /* 4 entries */, 32768) = 256 <0.000569>
> getdents(22, /* 0 entries */, 32768) = 0 <0.000246>
> close(22) = 0 <0.000076>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_1_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.004337>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_2_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000522>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_3_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000506>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_4_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000667>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_5_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000559>
> <snip>
>
> smallfile strace pattern (warm cache):
>
> openat(AT_FDCWD, "/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 22 <0.001342>
> getdents(22, /* 19 entries */, 32768) = 1136 <0.002651>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.003035>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.003023>
> <snip>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.001462>
> getdents(22, /* 4 entries */, 32768) = 256 <0.000442>
> getdents(22, /* 0 entries */, 32768) = 0 <0.000198>
> close(22) = 0 <0.000014>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_1_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000210>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_2_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000031>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_3_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000030>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_4_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000038>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_5_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000045>
>
> Observations:
> - ls -lR calls getdents followed by lstat and lgetxattr, and the lstat and
>   lgetxattr return with low latency

I think the processing of dentries is happening within the default entry and attribute-timeout values (1s) of the FUSE kernel dentry cache, so the stat and entry information fetched in the readdirplus response is being used. For misses in the kernel dentry cache, lookups might be sent to md-cache. But I don't think md-cache helps much, as its default timeout is 1 second too; if an entry has timed out in the kernel cache, it is likely to have timed out in md-cache as well.

> - smallfile calls getdents until there are no more entries, and then calls
>   stat on each entry
> - stat shows higher latency in the cold cache case as compared to the warm
>   cache case

Same reason as above: since there is a considerable time gap between getdents and the actual processing of the dentries (which triggers stat/lookup), the kernel dentry cache and md-cache cannot absorb the lookup/fstat/getxattr calls.

> Next steps:
> The current thought is that in the cold cache case against the Big volume the
> stat that follows getdents is causing a LOOKUP to be sent to the bricks,
> whereas this LOOKUP does not even occur with the warm cache.
>
> Either the FUSE xlator is sending the extra lookup or the FUSE kernel/VFS is.
> Further, this could happen if getdents (IOW readdirp) is not fetching the
> required iatt information and caching it or returning it to the FUSE xlator
> and above.

I think readdirp is fetching all relevant information, as the dentries are all files. It is likely that the cache has timed out.

> This needs to be debugged further to understand what is occurring in this
> case.

We can repeat the tests with higher values for the entry and attribute timeouts (a few hundred/thousand seconds?) and enable the group metadata-cache in glusterfs. Note that this will also help the cold cache scenario, as the caches are populated from readdirplus and hence will be useful even for a fresh access.
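A minimal sketch of that suggested experiment, assuming the volume name (testvol) and server hostname used elsewhere in this report; the 600-second value is only an example, not a tested recommendation:

```bash
# Sketch of the suggested experiment: larger FUSE entry/attribute timeouts
# plus the md-cache "group metadata-cache" option profile.
# Volume name (testvol) and server hostname are taken from the test setup
# in this report; the 600-second timeout is only an example value.

# Enable the metadata-cache option group so the gluster client stack can
# keep serving lookups/stats even when the FUSE kernel cache misses.
gluster volume set testvol group metadata-cache

# Remount the client with larger FUSE entry and attribute timeouts.
umount /gluster-mount
glusterfs --volfile-server=dell-r730-1.gsslab.rdu2.redhat.com \
          --volfile-id=/testvol \
          --entry-timeout=600 --attribute-timeout=600 \
          /gluster-mount
```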
(In reply to Shyamsundar from comment #23)

> To clear out the cache timeout issue as analyzed with strace in comment #22,
> redid a test with increased FUSE attribute and entry timeout values as
> follows.
>
> This test showed ~80 LOOKUP calls in io-stats on the FUSE mount point and
> listed ~20k entries/sec on a cold cache on the Big volume.
>
> volume mounted as: glusterfs --volfile-server=dell-r730-1.gsslab.rdu2.redhat.com --volfile-id=/testvol --attribute-timeout=5 --entry-timeout=5 /gluster-mount
>
> Tests run as in comment #14
>
> Tail of client io-stats as below:
> # tail -10 /var/run/gluster/io-stats-bigV-post-atttimeout-5.txt.testvol
> FSYNC           1     31745.00 us   31745.00 us   31745.00 us
> SETXATTR        1       673.00 us     673.00 us     673.00 us
> OPENDIR        25     12684.52 us     226.00 us   93697.00 us
> CREATE          2     83378.00 us   80460.00 us   86296.00 us
> LOOKUP         83     42591.22 us       2.00 us  207896.00 us
> READDIRP     4258      3872.57 us     103.00 us  199101.00 us
> RELEASE         2          0 us          0 us          0 us
> RELEASEDIR     25          0 us          0 us          0 us
> ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
>
> So to reiterate, the problem is in the cache timeouts, which leads to a large
> slowdown when performing the test on a cold cache.

For the sake of easy access for someone just stumbling on this bz (as this has more visibility), I am listing the work we had done earlier on this problem.

We had identified this problem of LOOKUP bloat in our presentation (see the large directory section): https://events.static.linuxfound.org/sites/events/files/slides/Gluster_DirPerf_Vault2017_0.pdf. We had then suggested turning on "group metadata-cache" to overcome this problem.

The other problem related to directory listing was with large-scale distribute, where readdirp latency is considerable (due to the serial nature of readdir across distribute subvols). This latency was most visible with what we called small directories (all dentries can be fetched in a single readdir call). We had suggested performance.parallel-readdir for this problem (see the small directories section in the above presentation). Note, however, that the serialization costs are not visible for a large directory, as there is no wasted work (each readdir is likely to have fetched enough dentries) and the serialization costs (if any) are amortized across the large number of readdirps needed to fetch the entire directory.

Even with the above two, readdirp latency on the brick stack is still significant, as observed in bz 1478411. This can be due to many reasons:
* The XFS issue which Xavi pointed out earlier
* High CPU utilization due to processing of the dentry list by many xlators on the brick stack
* Too many getxattrs and lstats for each dentry fetched in readdirplus (information that is sometimes thrown away by DHT if the dentry points to a directory)

Also see the Future work section of the presentation, whose points are likely still valid.
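For completeness, a hedged sketch of how the brick-side fop latencies mentioned above can be observed, and of the parallel-readdir option named in the presentation. The volume name (testvol) is assumed from the mount command in comment #23; `gluster volume profile` is the standard per-brick fop statistics interface:

```bash
# Sketch: look at per-brick READDIRP/LOOKUP latency, and enable
# parallel-readdir for the small-directory case described above.
# Volume name (testvol) is assumed from the mount command in comment #23.

# Per-brick fop statistics; READDIRP and LOOKUP latencies here reflect the
# brick-stack cost discussed above (bz 1478411).
gluster volume profile testvol start
# ... run the ls -lR / smallfile listing test ...
gluster volume profile testvol info | grep -E 'Brick:|READDIRP|LOOKUP'

# Fetch dentries from the distribute subvolumes in parallel instead of
# serially (helps most when directories are small).
gluster volume set testvol performance.parallel-readdir on
```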
Summary till now:

# Observation 1: Directory listing using smallfile on a volume with a lot of entries is slower in cold cache situations than on a volume with very few entries.

- Test: The test attempts to list 80k entries in both volume cases, spread/created in the same manner
- Root cause:
  - The tool lists the entire directory contents and then does a stat on each entry
  - Each directory has 5k files in it
  - The listing takes more than 1.5 seconds, which causes FUSE to time out the entry and attr cache, causing the stat to actually be sent to the bricks, which slows down the entire listing
- How to determine whether this problem is limiting listing in a user situation (see the sketch after Observation 2 below):
  - A simple strace of the application/command that shows the slowness will also demonstrate that the application fetches all entries before stat'ing them (IOW, getdents calls in strace will run until there are no more entries before the stat is sent)
  - Further, the strace will also show that the time taken for stat is significant, demonstrating that the stat is not actually being served from the cache
  - Another way of detecting the same is to use glusterfs io-stats, checking how many LOOKUP calls are being made at the client/FUSE layer and how many are reaching the brick io-stats:
    - If these are close to the number of entries being listed, then caching is not effective, and that causes the delay
    - If these are high in the client io-stats but far lower in the brick io-stats, then the client-side glusterfs cache is effective, but the FUSE-level cache is not
- Alleviating the symptom:
  - Increase the FUSE-level entry and attr cache timeouts (using the mount options entry-timeout and attribute-timeout, increasing the timeout to, say, 5/10 seconds)
    - Caution: With higher timeouts, FUSE will assume its cache is up to date and serve results from it, even if another client mount has updated the file attributes or contents. This should only be used when strict attribute freshness can be relaxed in favor of performance.
  - Increase the md-cache timeouts, so that FUSE still sends the lookup to gluster, but the gluster client cache can serve it
- Other notes:
  - A test using 'ls -lR' on the same directory does not demonstrate the slowness: ls gets a set of entries from a getdents call and immediately calls lstat on them, which falls within the 1-second FUSE attribute and entry timeouts, and is thus served from the FUSE cache.

# Observation 2: Directory listing using smallfile on a volume containing a large number of files, on a client that has a warm cache, is slower by a factor of 2x compared to a volume with a lower number of entries (~40k entries/sec on the smaller volume, against ~20k entries/sec on the large volume)

- Test: same as observation 1 (except run on a warm cache)
- Root causes:
  - The directory listing is still sent to the bricks, but as the cache is warm on the bricks and on the XFS mount there, it returns faster than before, hence serving more entries from the FUSE cache than in the prior case. This shows a speed-up in entry listing.
  - Further, just dropping the server cache (and not restarting the client mount process) brings back the slowness in listing.
- Pending analysis:
  - While root-causing observation 1, the brick processes were restarted, after which listing on the larger volume also showed ~40k entries/sec when using a warm cache. Need to determine why this speeds up the listing.
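As referenced in the detection bullets under Observation 1, a minimal sketch of checking whether the stats/LOOKUPs are being served from cache. The io-stats dump via a `trusted.io-stats-dump` setxattr is the usual io-stats debug trigger, but the exact dump file naming below is an assumption; the mount point, host and volume name follow the test setup in this report:

```bash
# Sketch: is the listing slowness caused by cache-miss LOOKUPs?
# Mount point and volume name follow the test setup in this report; the
# io-stats dump file naming is an assumption -- check /var/run/gluster.

# 1. strace the slow application: getdents running to completion before any
#    stat, with slow stat calls, matches the smallfile cold-cache pattern.
strace -ff -T -o /tmp/app-trace <slow-listing-command>
grep -c '^getdents(' /tmp/app-trace.*          # entries fetched up-front
grep -E '^l?stat\(' /tmp/app-trace.* | tail    # per-stat latency in <...>

# 2. Count LOOKUPs seen by the client stack via an io-stats dump, then
#    compare with what reaches the bricks (requires
#    'gluster volume profile testvol start' beforehand, as sketched earlier).
setfattr -n trusted.io-stats-dump -v lookup-check /gluster-mount
grep -w LOOKUP /var/run/gluster/io-stats-lookup-check*
gluster volume profile testvol info | grep -E 'Brick:|LOOKUP'
```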
# Observation 3: (more a pending analysis) Why is the initial listing of directory contents on the larger volume taking 2-2.5x more time?

- This seems well covered (and similar) in terms of analysis in bz #1478411; parts of that will be repeated to understand which layer contributes the most latency to getdents from the gluster stack POV.

(In reply to Raghavendra G from comment #25)
> Also see the Future work section of the presentation whose points are likely
> valid even now.

Among the things listed in future work is readdirplus for directories. This is also a problem related to this bug, i.e., lookup bloat bringing down performance. If we change the smallfile workload to create directories instead of files, we will hit both problems: ls being slow and the smallfile tool being slow on a cold cache. This is currently an unsolved problem for the cold-cache case. For the hot-cache case, this problem is not present.

Yes I can, Shyam, I'll ping you for more detail. -b

Xavi, any updates? Anything we can do with the information we have right now?

(In reply to Yaniv Kaul from comment #44)
> Xavi,
> Any updates? Anything we can do with the information we have right now?

I've done some (very preliminary) tests with a highly fragmented XFS that are showing higher latencies on cold cache (3 seconds for an 'ls -l' of a directory with 256 entries directly on XFS, no gluster involved).

Right now I'm working on the Global Thread Pool, which has higher priority. As soon as I can retake this, my intention is to finish the tests to see the impact of XFS latency on Gluster, and to see if it's possible to predict Gluster latency directly from the XFS layout and disk access times. If that's possible, I'll try to correlate the results with some real-world use case that is currently experiencing high latencies for 'ls'. I'll also try to see whether the high fragmentation I created manually can arise naturally from a combination of an aged XFS and LVM thin provisioning.

The final step would be to make changes to the code to improve performance. If this is really caused by the back-end file system, we cannot eliminate the latencies, but maybe we can hide some of them using background operations or read-ahead policies. Nithya has done some work [1] related to this which seems promising [2]. However, this approach is still experimental and needs to be validated.

[1] https://review.gluster.org/c/glusterfs/+/22366/
[2] https://docs.google.com/spreadsheets/d/1KA-qM8rivRrTdFTOPA_CY99Jq18BqGuPy4ZPZXNvjeE/edit?usp=sharing
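For anyone wanting to reproduce the back-end measurement described above, a rough sketch; the brick path is a hypothetical placeholder, and the commands must be run on the brick server as root since they drop kernel caches:

```bash
# Sketch: time a cold-cache 'ls -l' directly on a brick's XFS backing store,
# with Gluster out of the path, as in the preliminary XFS test above.
# BRICK_DIR is a hypothetical placeholder for a directory on the brick.
BRICK_DIR=/bricks/brick1/testvol/file_srcdir

# Drop the page cache, dentries and inodes so the listing has to go to disk
# (the "dropping the server cache" step mentioned in Observation 2).
sync
echo 3 > /proc/sys/vm/drop_caches

# On a heavily fragmented or aged XFS this step carries most of the latency,
# which then surfaces as READDIRP/LOOKUP latency on the Gluster brick stack.
time ls -l "$BRICK_DIR" > /dev/null
```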