Bug 1639463
Summary: | [GSS] Metadata performance degrades as number of files goes up. | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ben Turner <bturner> |
Component: | glusterfs | Assignee: | Xavi Hernandez <jahernan> |
Status: | CLOSED DEFERRED | QA Contact: | Bala Konda Reddy M <bmekala> |
Severity: | urgent | Docs Contact: | |
Priority: | high | | |
Version: | rhgs-3.4 | CC: | amukherj, bkunal, bturner, guillaume.pavese, jahernan, jbautist, nchilaka, olim, rhs-bugs, sheggodu, srangana, vbellur |
Target Milestone: | --- | Keywords: | Performance, Triaged, ZStream |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-02-10 09:29:46 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 1651050 | | |
Bug Blocks: | | | |
Description
Ben Turner
2018-10-15 19:09:39 UTC
For archival purposes. (In reply to Shyamsundar from comment #16)

> Even if smallfile and ls -lR approach listing differently, the LOOKUP bloat
> on the Big volume should not occur or, if it does, should also occur against
> the Little volume.
>
> The above needs more analysis, to understand why the extra LOOKUPs are
> generated and from where, such that the bloat occurs only with the Big volume.
>
> strace was run against the ls -lR and smallfile tests to understand what
> syscalls are being invoked, and which ones are causing an increased latency
> for the application. (strace run with the -ff -T -o options)
>
> strace patterns against the 2 tools are as follows:
>
> NOTE: Volume types did not matter; the application does not change behaviour
> based on whether it runs against the Little or Big volume, and only Big volume
> results are presented below.
>
> ls -lR strace pattern (cold cache):
>
> openat(AT_FDCWD, "/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3 <0.000270>
> <snip>
> getdents(3, /* 19 entries */, 32768) = 1136 <0.103618>
> lstat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_1_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000101>
> lgetxattr("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_1_", "security.selinux", 0x1e8b310, 255) = -1 ENODATA (No data available) <0.000028>
> lstat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_2_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000009>
> lgetxattr("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_2_", "security.selinux", 0x1e8b310, 255) = -1 ENODATA (No data available) <0.000023>
> <snip>
> getdents(3, /* 19 entries */, 32768) = 1216 <0.014153>
> lstat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_18_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000042>
> lgetxattr("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_18_", "security.selinux", 0x1e8b610, 255) = -1 ENODATA (No data available) <0.000022>
> <snip>
>
> smallfile strace pattern (cold cache):
>
> openat(AT_FDCWD, "/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 22 <0.038498>
> getdents(22, /* 19 entries */, 32768) = 1136 <0.398268>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.056523>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.202871>
> <snip>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.001830>
> getdents(22, /* 4 entries */, 32768) = 256 <0.000569>
> getdents(22, /* 0 entries */, 32768) = 0 <0.000246>
> close(22) = 0 <0.000076>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_1_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.004337>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_2_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000522>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_3_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000506>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_4_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000667>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_00/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_00_5_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000559>
> <snip>
>
> smallfile strace pattern (warm cache):
>
> openat(AT_FDCWD, "/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 22 <0.001342>
> getdents(22, /* 19 entries */, 32768) = 1136 <0.002651>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.003035>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.003023>
> <snip>
> getdents(22, /* 19 entries */, 32768) = 1216 <0.001462>
> getdents(22, /* 4 entries */, 32768) = 256 <0.000442>
> getdents(22, /* 0 entries */, 32768) = 0 <0.000198>
> close(22) = 0 <0.000014>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_1_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000210>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_2_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000031>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_3_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000030>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_4_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000038>
> stat("/gluster-mount/file_srcdir/dell-r730-1.gsslab.rdu2.redhat.com/thrd_04/d_000/_dell-r730-1.gsslab.rdu2.redhat.com_04_5_", {st_mode=S_IFREG|0755, st_size=1024, ...}) = 0 <0.000045>
>
> Observations:
> - ls -lR calls getdents followed by lstat and lgetxattr, and the lstat and
>   lgetxattr return with low latency

I think the processing of dentries is happening within the default entry and attribute-timeout values (1s) of the FUSE kernel dentry cache, so the stat and entry information fetched in the readdirplus response is being used. For misses in the kernel dentry cache, lookups might be sent to md-cache. But I don't think md-cache helps much, as its default timeout is 1 second too; if an entry has timed out in the kernel cache, it is likely to have timed out in md-cache as well.

> - smallfile calls getdents until there are no more entries, and then calls
>   stat on each entry
> - stat shows higher latency in the cold cache case as compared to the warm
>   cache case

Same reason as above: since there is a considerable time gap between getdents and the actual processing of the dentries (which triggers stat/lookup), the kernel dentry cache and md-cache cannot absorb the lookup/fstat/getxattr calls.

> Next steps:
> The current thought is that in the cold cache case against the Big volume the
> stat that follows getdents is causing a LOOKUP to be sent to the bricks,
> whereas this LOOKUP does not even occur with the warm cache.
>
> Either the FUSE xlator is sending the extra lookup or the FUSE kernel/VFS is.
> Further, this could happen if getdents (IOW readdirp) is not fetching the
> required iatt information and caching it or returning it to the FUSE xlator
> and above.

I think readdirp is fetching all relevant information, as the dentries are all files. It is likely that the cache has timed out.

> This needs to be debugged further to understand what is occurring in this
> case.

We can repeat the tests with higher values for the entry and attribute timeouts (a few hundred/thousand seconds?) and enable the group metadata-cache in glusterfs. Note that this will also help the cold cache scenario, as the caches are populated from readdirplus and hence will be useful even for a fresh access.
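A minimal sketch of that suggested experiment, assuming the volume name (testvol) and server hostname used elsewhere in this report; the 600-second value is only an example, not a tested recommendation:

```bash
# Sketch of the suggested experiment: larger FUSE entry/attribute timeouts
# plus the md-cache "group metadata-cache" option profile.
# Volume name (testvol) and server hostname are taken from the test setup
# in this report; the 600-second timeout is only an example value.

# Enable the metadata-cache option group so the gluster client stack can
# keep serving lookups/stats even when the FUSE kernel cache misses.
gluster volume set testvol group metadata-cache

# Remount the client with larger FUSE entry and attribute timeouts.
umount /gluster-mount
glusterfs --volfile-server=dell-r730-1.gsslab.rdu2.redhat.com \
          --volfile-id=/testvol \
          --entry-timeout=600 --attribute-timeout=600 \
          /gluster-mount
```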
(In reply to Shyamsundar from comment #23)

> To clear out the cache timeout issue as analyzed with strace in comment #22,
> redid a test with increased FUSE attribute and entry timeout values as
> follows.
>
> This test showed ~80 LOOKUP calls in io-stats on the FUSE mount point and
> listed ~20k entries/sec on a cold cache on the Big volume.
>
> volume mounted as: glusterfs --volfile-server=dell-r730-1.gsslab.rdu2.redhat.com --volfile-id=/testvol --attribute-timeout=5 --entry-timeout=5 /gluster-mount
>
> Tests run as in comment #14
>
> Tail of client io-stats as below:
> # tail -10 /var/run/gluster/io-stats-bigV-post-atttimeout-5.txt.testvol
> FSYNC           1     31745.00 us   31745.00 us   31745.00 us
> SETXATTR        1       673.00 us     673.00 us     673.00 us
> OPENDIR        25     12684.52 us     226.00 us   93697.00 us
> CREATE          2     83378.00 us   80460.00 us   86296.00 us
> LOOKUP         83     42591.22 us       2.00 us  207896.00 us
> READDIRP     4258      3872.57 us     103.00 us  199101.00 us
> RELEASE         2          0 us          0 us          0 us
> RELEASEDIR     25          0 us          0 us          0 us
> ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
>
> So to reiterate, the problem is in the cache timeouts, which leads to a large
> slowdown when performing the test on a cold cache.

For the sake of easy access for someone just stumbling on this bz (as this has more visibility), I am listing the work we had done earlier on this problem.

We had identified this problem of LOOKUP bloat in our presentation (see the large directory section): https://events.static.linuxfound.org/sites/events/files/slides/Gluster_DirPerf_Vault2017_0.pdf. We had then suggested turning on "group metadata-cache" to overcome this problem.

The other problem related to directory listing was with large-scale distribute, where readdirp latency is considerable (due to the serial nature of readdir across distribute subvols). This latency was most visible with what we called small directories (all dentries can be fetched in a single readdir call). We had suggested performance.parallel-readdir for this problem (see the small directories section in the above presentation). Note, however, that the serialization costs are not visible for a large directory, as there is no wasted work (each readdir is likely to have fetched enough dentries) and the serialization costs (if any) are amortized across the large number of readdirps needed to fetch the entire directory.

Even with the above two, readdirp latency on the brick stack is still significant, as observed in bz 1478411. This can be due to many reasons:
* The XFS issue which Xavi pointed out earlier
* High CPU utilization due to processing of the dentry list by many xlators on the brick stack
* Too many getxattrs and lstats for each dentry fetched in readdirplus (information that is sometimes thrown away by DHT if the dentry points to a directory)

Also see the Future work section of the presentation, whose points are likely still valid.
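For completeness, a hedged sketch of how the brick-side fop latencies mentioned above can be observed, and of the parallel-readdir option named in the presentation. The volume name (testvol) is assumed from the mount command in comment #23; `gluster volume profile` is the standard per-brick fop statistics interface:

```bash
# Sketch: look at per-brick READDIRP/LOOKUP latency, and enable
# parallel-readdir for the small-directory case described above.
# Volume name (testvol) is assumed from the mount command in comment #23.

# Per-brick fop statistics; READDIRP and LOOKUP latencies here reflect the
# brick-stack cost discussed above (bz 1478411).
gluster volume profile testvol start
# ... run the ls -lR / smallfile listing test ...
gluster volume profile testvol info | grep -E 'Brick:|READDIRP|LOOKUP'

# Fetch dentries from the distribute subvolumes in parallel instead of
# serially (helps most when directories are small).
gluster volume set testvol performance.parallel-readdir on
```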
Summary till now:

# Observation 1: Directory listing using smallfile on a volume with a lot of entries is slower in cold cache situations than on a volume with very few entries.

- Test: The test attempts to list 80k entries in both volume cases, spread/created in the same manner
- Root cause:
  - The tool lists the entire directory contents and then does a stat on each entry
  - Each directory has 5k files in it
  - The listing takes more than 1.5 seconds, which causes FUSE to time out the entry and attr cache, causing the stat to actually be sent to the bricks, which slows down the entire listing
- How to determine whether this problem is limiting listing in a user situation (see the sketch after Observation 2 below):
  - A simple strace of the application/command that shows the slowness will also demonstrate that the application fetches all entries before stat'ing them (IOW, getdents calls in strace will run until there are no more entries before the stat is sent)
  - Further, the strace will also show that the time taken for stat is significant, demonstrating that the stat is not actually being served from the cache
  - Another way of detecting the same is to use glusterfs io-stats, checking how many LOOKUP calls are being made at the client/FUSE layer and how many are reaching the brick io-stats:
    - If these are close to the number of entries being listed, then caching is not effective, and that causes the delay
    - If these are high in the client io-stats but far lower in the brick io-stats, then the client-side glusterfs cache is effective, but the FUSE-level cache is not
- Alleviating the symptom:
  - Increase the FUSE-level entry and attr cache timeouts (using the mount options entry-timeout and attribute-timeout, increasing the timeout to, say, 5/10 seconds)
    - Caution: With higher timeouts, FUSE will assume its cache is up to date and serve results from it, even if another client mount has updated the file attributes or contents. This should only be used when strict attribute freshness can be relaxed in favor of performance.
  - Increase the md-cache timeouts, so that FUSE still sends the lookup to gluster, but the gluster client cache can serve it
- Other notes:
  - A test using 'ls -lR' on the same directory does not demonstrate the slowness: ls gets a set of entries from a getdents call and immediately calls lstat on them, which falls within the 1-second FUSE attribute and entry timeouts, and is thus served from the FUSE cache.

# Observation 2: Directory listing using smallfile on a volume containing a large number of files, on a client that has a warm cache, is slower by a factor of 2x compared to a volume with a lower number of entries (~40k entries/sec on the smaller volume, against ~20k entries/sec on the large volume)

- Test: same as observation 1 (except run on a warm cache)
- Root causes:
  - The directory listing is still sent to the bricks, but as the cache is warm on the bricks and on the XFS mount there, it returns faster than before, hence serving more entries from the FUSE cache than in the prior case. This shows a speed-up in entry listing.
  - Further, just dropping the server cache (and not restarting the client mount process) brings back the slowness in listing.
- Pending analysis:
  - While root-causing observation 1, the brick processes were restarted, after which listing on the larger volume also showed ~40k entries/sec when using a warm cache. Need to determine why this speeds up the listing.
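As referenced in the detection bullets under Observation 1, a minimal sketch of checking whether the stats/LOOKUPs are being served from cache. The io-stats dump via a `trusted.io-stats-dump` setxattr is the usual io-stats debug trigger, but the exact dump file naming below is an assumption; the mount point, host and volume name follow the test setup in this report:

```bash
# Sketch: is the listing slowness caused by cache-miss LOOKUPs?
# Mount point and volume name follow the test setup in this report; the
# io-stats dump file naming is an assumption -- check /var/run/gluster.

# 1. strace the slow application: getdents running to completion before any
#    stat, with slow stat calls, matches the smallfile cold-cache pattern.
strace -ff -T -o /tmp/app-trace <slow-listing-command>
grep -c '^getdents(' /tmp/app-trace.*          # entries fetched up-front
grep -E '^l?stat\(' /tmp/app-trace.* | tail    # per-stat latency in <...>

# 2. Count LOOKUPs seen by the client stack via an io-stats dump, then
#    compare with what reaches the bricks (requires
#    'gluster volume profile testvol start' beforehand, as sketched earlier).
setfattr -n trusted.io-stats-dump -v lookup-check /gluster-mount
grep -w LOOKUP /var/run/gluster/io-stats-lookup-check*
gluster volume profile testvol info | grep -E 'Brick:|LOOKUP'
```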
# Observation 3: (more a pending analysis) Why is the initial listing of directory contents on the larger volume taking 2-2.5x more time?

- This seems well covered (and similar) in terms of analysis in bz #1478411; parts of that will be repeated to understand which layer contributes the most latency to getdents from the gluster stack POV.

(In reply to Raghavendra G from comment #25)
> Also see the Future work section of the presentation whose points are likely
> valid even now.

Among the things listed in future work is readdirplus for directories. This is also a problem related to this bug, i.e., lookup bloat bringing down performance. If we change the smallfile workload to create directories instead of files, we will hit both problems: ls being slow and the smallfile tool being slow on a cold cache. This is currently an unsolved problem for the cold-cache case. For the hot-cache case, this problem is not present.

Yes I can, Shyam, I'll ping you for more detail. -b

Xavi, any updates? Anything we can do with the information we have right now?

(In reply to Yaniv Kaul from comment #44)
> Xavi,
> Any updates? Anything we can do with the information we have right now?

I've done some (very preliminary) tests with a highly fragmented XFS that are showing higher latencies on cold cache (3 seconds for an 'ls -l' of a directory with 256 entries directly on XFS, no gluster involved).

Right now I'm working on the Global Thread Pool, which has higher priority. As soon as I can retake this, my intention is to finish the tests to see the impact of XFS latency on Gluster, and to see if it's possible to predict Gluster latency directly from the XFS layout and disk access times. If that's possible, I'll try to correlate the results with some real-world use case that is currently experiencing high latencies for 'ls'. I'll also try to see whether the high fragmentation I created manually can arise naturally from a combination of an aged XFS and LVM thin provisioning.

The final step would be to make changes to the code to improve performance. If this is really caused by the back-end file system, we cannot eliminate the latencies, but maybe we can hide some of them using background operations or read-ahead policies. Nithya has done some work [1] related to this which seems promising [2]. However, this approach is still experimental and needs to be validated.

[1] https://review.gluster.org/c/glusterfs/+/22366/
[2] https://docs.google.com/spreadsheets/d/1KA-qM8rivRrTdFTOPA_CY99Jq18BqGuPy4ZPZXNvjeE/edit?usp=sharing
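For anyone wanting to reproduce the back-end measurement described above, a rough sketch; the brick path is a hypothetical placeholder, and the commands must be run on the brick server as root since they drop kernel caches:

```bash
# Sketch: time a cold-cache 'ls -l' directly on a brick's XFS backing store,
# with Gluster out of the path, as in the preliminary XFS test above.
# BRICK_DIR is a hypothetical placeholder for a directory on the brick.
BRICK_DIR=/bricks/brick1/testvol/file_srcdir

# Drop the page cache, dentries and inodes so the listing has to go to disk
# (the "dropping the server cache" step mentioned in Observation 2).
sync
echo 3 > /proc/sys/vm/drop_caches

# On a heavily fragmented or aged XFS this step carries most of the latency,
# which then surfaces as READDIRP/LOOKUP latency on the Gluster brick stack.
time ls -l "$BRICK_DIR" > /dev/null
```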