Description of problem:

For tiered volume tests where files on the hot tier are accessed, large-file performance is close to what the hot tier is capable of. There is some degradation because of ctr, but when ctr is disabled the gap is only a few percent. For small files, however, performance is much less than what the hot tier is capable of, even when ctr is disabled.

Version-Release number of selected component (if applicable):
glusterfs-server-3.7.5-7.el7rhgs.x86_64
Red Hat Enterprise Linux Server release 7.1 (Maipo)

How reproducible:
Consistently.

Steps to Reproduce:
1. Create a 2x(8+4) volume. Attach a 2x2 SSD tier. Disable ctr. Create a small-file data set that fits in the hot tier (less than the migration watermark). Perform a multi-threaded read and note throughput.
2. Create a non-tiered volume on the 2x2 SSD tier. Create the same data set again and note read throughput.
3. Repeat steps 1 and 2 for a large-file workload.

Actual results:
For the small-file workload, read throughput:
tiered volume: 270 MB/s
non-tiered volume on hot tier bricks: 490 MB/s

For the large-file workload, read throughput:
tiered volume: 1038 MB/s
non-tiered volume on hot tier bricks: 1103 MB/s

Expected results:
For this type of test, we would like to see tiered volume small-file performance within a few percent of hot tier performance.

Additional info:
The poor performance for small-file workloads probably comes from the extra round trip in the tiered case for accessing dht link files on the cold tier. Will update this bz as we make progress on RCA.
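The reproduction steps above can be sketched with the gluster CLI. This is a sketch only: the volume name (tiervol), server names, and brick paths are placeholders, and the brick counts follow the 2x(8+4) and 2x2 layouts described.

```shell
# Sketch only -- volume/host/brick names are invented; adjust to the setup.
# 2x(8+4) dispersed base volume across 4 servers:
gluster volume create tiervol disperse 12 redundancy 4 \
    server{1..4}:/bricks/hdd{1..6}/brick
gluster volume start tiervol

# Attach a 2x2 (distributed-replicated) SSD hot tier from 2 servers:
gluster volume attach-tier tiervol replica 2 \
    ssdserver{1..2}:/bricks/ssd{1..2}/brick

# Disable the change-time-recorder for the ctr-off runs:
gluster volume set tiervol features.ctr-enabled off
```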
More details on the test:

Tiered volume:
2x(8+4) base volume (4 servers)
lookup-optimize on
server.event-threads 4
client.event-threads 4
attach-tier 2x2 SAS-SSD bricks (2 servers)
ctr-enabled on
tier-mode cache
mounted on 4 clients

Workload:
1. smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 8 --files 16378 --file-size 64 --record-size 64 --fsync N --operation create
2. smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 8 --files 16378 --file-size 64 --record-size 64 --fsync N --operation read

Note that the data set created is only 32GB, whereas the hot tier has a capacity of 360GB. So all accesses should be going to the hot tier.

Compare performance to (a) a non-tiered volume created on the 2x(8+4) EC volume, (b) a non-tiered volume created on the 2x2 SSD volume. For the tiered volume, also repeat with ctr-enabled off. Create performance issues with the tiered volume are being tracked elsewhere, so I will focus on the read results here.

Results:
ctr on: 220.862684 MB/sec
ctr off: 260.501528 MB/sec
2x(8+4) ec volume: 101.108332 MB/sec
2x2 SSD volume: 490.835347 MB/sec

Now, substitute file-size as 16384 and files-per-thread as 16 in the smallfile command line. So we are creating 16MB files now.

Results for 16MB file size:
ctr on: 945.299436 MB/sec
ctr off: 978.266685 MB/sec
2x(8+4) ec volume: 849.426594 MB/sec
2x2 SSD volume: 1146.902127 MB/sec

For large files, tiering gets close to the ideal for this configuration, which is the hot tier performance. For small files, however, there is a big gap between tiered volume performance and hot tier performance.
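As a sanity check on the stated data-set size, the smallfile parameters above multiply out to roughly 32GB (4 clients, per the mount description):

```shell
# 4 clients x 8 threads x 16378 files x 64KB per file
clients=4 threads=8 files=16378 size_kb=64
total_kb=$(( clients * threads * files * size_kb ))
echo "total: $(( total_kb / 1024 / 1024 )) GB"   # prints 31 GB (integer division; ~32 GB)
```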
(In reply to Manoj Pillai from comment #2)
> More details on the test:

Clarification... ${top_dir} in the smallfile command line is a directory created after attach-tier and just before the actual run.

Also, results with an 8MB file size are given below:
tiered vol, ctr on: 975.318167 MB/sec
tiered vol, ctr off: 1012.929358 MB/sec
2x(8+4) ec volume: 672.716669 MB/sec
2x2 SSD volume: 1121.599904 MB/sec

It is easier to see in this set of results that tiered vol performance is close to SSD volume performance, rather than EC volume performance.
The tests above were run with sql-cache size 12500 pages and sql-wal size 25000 pages.

Option: features.ctr-sql-db-cachesize
Default Value: 12500
Description: Defines the cache size of the sqlite database of the changetimerecorder xlator. The input to this option is in pages. Each page is 4096 bytes. The default value is 12500 pages, i.e. ~49 MB. The max value is 262144 pages, i.e. 1 GB, and the min value is 1000 pages, i.e. ~4 MB.

Option: features.ctr-sql-db-wal-autocheckpoint
Default Value: 25000
Description: Defines the autocheckpoint of the sqlite database of changetimerecorder. The input to this option is in pages. Each page is 4096 bytes. The default value is 25000 pages, i.e. ~98 MB. The max value is 262144 pages, i.e. 1 GB, and the min value is 1000 pages, i.e. ~4 MB.
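The page-to-byte conversions quoted in the option descriptions can be checked directly from the 4096-byte page size; the options themselves are set with gluster volume set <volname> <option> <pages>.

```shell
page_bytes=4096
echo "$(( 12500  * page_bytes / 1024 / 1024 )) MiB"         # cachesize default, ~49 MB
echo "$(( 25000  * page_bytes / 1024 / 1024 )) MiB"         # wal-autocheckpoint default, ~98 MB
echo "$(( 262144 * page_bytes / 1024 / 1024 / 1024 )) GiB"  # max for both options
```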
This particular performance issue is not primarily ctr-related. Note that in comment #2 and comment #3, turning ctr off yields only a small benefit. I expect this particular issue to be the result of the tier xlator's extra round trip for accessing dht-link files, and the impact of this on small-file performance.

In any case, I re-ran the same tests with glusterfs*-3.7.5-14.el7.x86_64, with sql-cache and sql-wal size set as in comment #4. I'm not seeing any significant improvement compared to the results in comment #2.
This issue is most likely because of the necessity of a lookup following a readdirp. For files present on the hot tier, entry resolution (path to gfid conversion) and attribute/stat fetching has to be done on the hot tier, but the readdir is issued on the cold tier. This necessitates a lookup on the file. In a small-file workload, the overhead of this extra lookup is relatively significant and hence might be causing the performance drop.
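A toy model (all numbers here are assumed for illustration, not measured: 300us for the extra LOOKUP round trip, 500 MB/s transfer bandwidth) shows why one extra round trip hurts small files disproportionately:

```shell
# Assumed numbers, for illustration only.
rtt_us=300 bw_mbps=500
for size_kb in 64 16384; do
    xfer_us=$(( size_kb * 1000 / bw_mbps ))          # data transfer time
    pct=$(( 100 * rtt_us / (xfer_us + rtt_us) ))     # lookup's share of per-file time
    echo "${size_kb}KB file: extra lookup = ${pct}% of per-file time"
done
```

Under these assumptions the extra lookup is ~70% of the per-file time at 64KB but under 1% at 16MB, matching the pattern in the measurements above.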
One not-so-neat fix can be:

1. Initiate a readdir on the hot tier too when we initiate a readdir on the cold tier.
2. Link the inodes while handling the readdir response from the hot tier.
3. Directory entries are always returned from the cold tier. If the dentry corresponds to a linkto file (as is the case for files present on the hot tier), do an inode_find (gfid). Since the linkto file present on the cold tier also contains the gfid, this should be possible. If the inode is found in the inode table (i.e., the readdir on the hot tier has fetched the inode corresponding to this dentry), use that inode and send a response dentry with a valid inode.

This, I feel, is an experiment worth trying. However, the flip side of this approach is memory consumption (for linked inodes which may or may not be required) for large directories on the hot tier. Probably we can refine this idea and see how it fares in tests.
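The steps above can be sketched as a toy simulation (all names invented; a bash associative array stands in for the client inode table keyed by gfid):

```shell
declare -A inode_table   # stand-in for the inode table, keyed by gfid

# Steps 1+2: a readdir issued on the hot tier links inodes for its files.
inode_table[gfid-a]="inode:file-a"
inode_table[gfid-b]="inode:file-b"

# Step 3: a cold-tier dentry that is a linkto file carries the gfid, so try
# inode_find(gfid) first, before falling back to an explicit LOOKUP.
resolve_dentry() {
    local gfid=$1
    echo "${inode_table[$gfid]:-LOOKUP-required}"
}

resolve_dentry gfid-a    # hit: dentry can be returned with a valid inode
resolve_dentry gfid-x    # miss: extra LOOKUP still needed
```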
Upgraded my setup to add PCI SSDs on the hot tier (previous results were with SAS SSDs). Now we can switch between PCI-SSDs and SAS-SSDs on the hot tier.

I'm repeating the small-file test with the 'smallfile' benchmark, as described in comment #0. For the tiered volume case, the test is simply run on an empty tiered volume (hence files are created on the hot tier, and read back from the hot tier). The data set size is small: 32GB of data on a 320GB hot tier. Since all accesses for the tiered volume case are to the SSD tier, a reasonable expectation is performance close to the SSD volume.

2x(8+4) volume:
create phase: 113.934506 MB/sec
read phase: 100.424073 MB/sec

PCI-SSD results:

2x2 PCI-SSD volume:
create phase: 218.233151 MB/sec
read phase: 704.615155 MB/sec

tiered volume with 2x(8+4) cold tier; 2x2 PCI-SSD hot tier:
create phase: 109.017227 MB/sec
read phase: 279.837460 MB/sec

Comparing with SAS-SSD results:

2x2 SAS-SSD volume:
create phase: 54.649998 MB/sec
read phase: 487.091146 MB/sec

tiered volume with 2x(8+4) cold tier; 2x2 SAS-SSD hot tier:
create phase: 52.703876 MB/sec
read phase: 260.699312 MB/sec

So, when we replaced the SAS-SSD hot tier with a PCI-SSD hot tier that is 44% faster on reads, we got only a 7% boost for the tiered volume. Or:
in the SAS-SSD case, the tiered volume is at 53% of hot tier performance on reads;
in the PCI-SSD case, the tiered volume is at 39% of hot tier performance on reads.

The current implementation seems to be allowing the cold tier to drag down performance, which becomes more obvious as you use faster storage at the hot tier.
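The percentages quoted can be reproduced from the read-phase numbers above (truncated to whole percent):

```shell
awk 'BEGIN {
    printf "%d%% faster\n", (704.615 / 487.091 - 1) * 100  # PCI-SSD vs SAS-SSD volume reads
    printf "%d%% boost\n",  (279.837 / 260.699 - 1) * 100  # tiered vol: PCI vs SAS hot tier
    printf "%d%% of hot tier\n", 260.699 / 487.091 * 100   # tiered vs SAS-SSD hot tier
    printf "%d%% of hot tier\n", 279.837 / 704.615 * 100   # tiered vs PCI-SSD hot tier
}'
```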
Tried some variations to see the impact of cold tier layout on this particular issue. Results in this comment are for the read phase of a smallfile test with 64KB files, 32GB data set.

Results for 2x2 volume on PCIe SSDs: 695 MB/s

Results for tiered volume with hot tier on 2x2 PCIe SSDs, with different cold tier layouts:
2x(8+4) cold tier: 293 MB/s
4x(4+2) cold tier: 349 MB/s
8x3 cold tier: 544 MB/s

Note that in each case the files are being read from the hot tier, so ideally we would see the same performance for the tiered volume and the 2x2 SSD volume. However, tiered volume performance with the 2x(8+4) cold tier is only about 42% of the ideal; in contrast, the 8x3 cold tier gives about 78% of the ideal. An EC cold tier seems to drag down the performance of the tiered volume for this workload, with (4+2) better than (8+4).
Created attachment 1133705 [details]
gluster volume profile for tiered volume for read phase of smallfile benchmark

gluster volume profile for the test in comment #15, with 2x(8+4) cold tier and 2x2 PCIe SSD hot tier. gprfc082 and gprfc83 are the hot tier servers. Bricks on these servers receive a lot of lookup AND read requests. gprfs045-gprfs048 are the cold tier servers. No significant read requests are seen on the cold tier bricks, because the entire data set fits within the hot tier. Lookups are seen, as expected. Also, inodelk requests with very high latencies in a few cases.
I have written a fix upstream, "optimize lookups for tiering", which may mitigate some of the overhead. DHT performs a "revalidate" lookup on every subvolume to confirm the layout for a directory has not changed. This is not necessary for the tiering translator: it does not use DHT layouts. An existence test can be performed by checking only the cold tier. This will cut the number of lookups and improve small-file performance. I can build downstream RPMs with this fix, so we can quantify the benefit.
Created attachment 1138520 [details]
gluster volume profile for tiered volume for read phase, 20s interval

This profile covers the same case as comment #16. However, info is gathered at 20s intervals to better separate startup ops from regular ops.
Updating the bz with results of some investigations that have been going on.

We tried runs with private builds incorporating this patch: http://review.gluster.org/#/c/13605/

With disperse.eager-lock off, we found some improvement when running with an EC cold tier. Results for the small-file read test:

SSD volume: 695 MB/s
tiered vol with 8x3 cold tier: 544 MB/s
tiered vol with 2x(8+4) cold tier: 293 MB/s
tiered vol with 2x(8+4) cold tier, eager-lock off: 371 MB/s
tiered vol with 4x(4+2) cold tier: 349 MB/s
tiered vol with 4x(4+2) cold tier, eager-lock off: 466 MB/s

So, turning disperse.eager-lock off does give some benefit in performance, but still doesn't get us where we would like to be. The plan now is to try changes to the tier xlator.
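For reference, the setting toggled in the eager-lock runs above (volume name is a placeholder):

```shell
gluster volume set tiervol disperse.eager-lock off
gluster volume get tiervol disperse.eager-lock    # verify the current value
```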
We have RPMs to try.

Location: root.96.31:/root/rpmbuild/RPMS/x86_64/

Email if you have trouble obtaining them.

The patch implements comment 16. We direct lookups to the hot tier rather than the cold tier. I measured a 10-20% improvement with it, using a similar configuration and workload as described in comment 1.

The client needs to "learn" where the files are and cache this in its inode. The first time smallfile runs, it learns; the second time, the location is known and cached. In my measurements I saw the benefit the second time smallfile ran.
Another set of RPMs was created on 5/1 in the same location, fixing a bug related to low-memory clients. This set should be used for performance tests.
http://review.gluster.org/13601 Upstream patch
The tests so far have been with caches dropped between the create/write and read phases. I'm also adding a test where caches are not dropped between the phases.

Another change in the tests reported in this comment: so far I have been running the smallfile benchmark with 4 clients, 8 threads per client; the standard test is 16K files per thread, 64KB file size. Recently, I've been running into problems where some threads seem to be getting starved and the benchmark reports a "not enough files processed" error. To get around the error, I'm now running with 4 clients, 4 threads per client, 32K files per thread, 64KB file size. So the total data set size is the same, but expect to see lower throughput across the board (i.e. for tiered and non-tiered volumes) compared to the earlier tests.

Since I didn't have the baselines for these modified tests from RHGS 3.1.2, I re-ran with those rpms.

Results:

Test: smallfile test with NO drop-cache between create/write and read
smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 4 --files 32768 --file-size 64 --record-size 64 --fsync Y --response-times N --operation create
smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 4 --files 32768 --file-size 64 --record-size 64 --response-times N --operation read

RHGS 3.1.2 GA
2x(8+4): write: 29 MB/s; read: 177 MB/s
2x2 SSD volume: write: 153 MB/s; read: 615 MB/s
TIER: write: 81 MB/s; read: 462 MB/s

New private build
2x(8+4): write: 28; read: 176
2x2 SSD volume: write: 150; read: 607
TIER: write: 72; read: 446

Test: smallfile test with drop-cache between create/write and read

RHGS 3.1.2 GA
2x(8+4): write: 28 MB/s; read: 90 MB/s
2x2 SSD volume: write: 154 MB/s; read: 468 MB/s
TIER: write: 79 MB/s; read: 201 MB/s

New private build
2x(8+4): write: 28; read: 91
2x2 SSD volume: write: 150; read: 461
TIER: write: 71; read: 183

New private build:
glusterfs-libs-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-client-xlators-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-fuse-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-server-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-api-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-cli-3.7.9-2.13.git852f5ea.el7rhs.x86_64

Summary: no gains seen in the private build over 3.1.2 GA.
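For the drop-cache variant of the tests above, the usual step between the create/write and read phases is the standard Linux cache drop (run as root on the clients, and on the servers too if brick-side caching should be excluded):

```shell
sync                                  # flush dirty pages first
echo 3 > /proc/sys/vm/drop_caches     # drop pagecache, dentries and inodes
```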
Can you get a profile and tell me if you see LOOKUPs on the cold tier? If you do see them on the cold tier, the patch has a bug or your testing methods are different than mine. If you do not, then the theory of LOOKUPs being expensive is incorrect.
Dan, we're both using BAGL systems and the smallfile benchmark, so our conclusions should not be too different. I'd ask you to post your tests and results, where you're seeing performance gains with this patch. If this patch is showing a benefit, it will be good to see under what circumstances.
Sure, I can do TWO people's jobs. I have all the time in the world, right? I'll post the results.
Targeting this BZ for 3.2.0.
The metadata translator has been updated to cache file stat information on the client. For this bug's smallfile workload, LOOKUPs were done on each directory level. The cumulative overhead of the round trips for each of those exceeded any benefit from the SSD file transfer. The effect, along with the benefits of client-side caching, is described further in [1]. The upcall support should be tested together with tiering.

[1] http://blog.gluster.org/2016/10/gluster-tiering-and-small-file-performance/
Here we are trying to see whether the md-cache implementation in RHGS 3.2 helps in improving smallfile performance on a tiered volume.

Benchmark: smallfile (bengland2/smallfile · GitHub)
Base Volume: 2x(8+4) disperse volume on HDD (6 servers)
Cache Tier: 2x2 NVMe SSD/JBOD (4 servers)
OS and software: RHEL 7.3; glusterfs*-3.8.4-13.el7rhgs.x86_64

The performance comparison in this test is between:
> tiered volume
> volume created on the slow tier storage (in other words, 2x(8+4) disperse volume on HDD/JBOD)
> volume created on fast tier storage (in other words, 2x2 distributed-replicated volume on NVMe-SSD/JBOD)

Workload details:
> small files, 32KB file size
> total data set size of 32GB (a million files)
> 5 clients, 4 threads per client, 32K files per thread
> top_dir (see commands below) is a directory within the mount point, created just before the benchmark run
> tests were run with FSYNC (see command below) set to Y

Tests for default and metadata-cache-enabled settings:

smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 4 --files 52432 --file-size 32 --fsync ${FSYNC} --operation create
smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 4 --files 52432 --file-size 32 --operation read

Note above, cache was not dropped between tests.

Test Results

Default Metadata Cache Settings:
Dispersed Volume Create (Write): 259 Files/Sec
NVMe SSD Volume Create (Write): 5406 Files/Sec
Tiered Volume Create (Write): 589 Files/Sec
Dispersed Volume Read: 3480 Files/Sec
NVMe SSD Volume Read: 17325 Files/Sec
Tiered Volume Read: 5569 Files/Sec

Metadata Cache Enabled on Volume:
Dispersed Volume Create (Write): 257 Files/Sec
NVMe SSD Volume Create (Write): 5378 Files/Sec
Tiered Volume Create (Write): 736 Files/Sec
Dispersed Volume Read: 3479 Files/Sec
NVMe SSD Volume Read: 17884 Files/Sec
Tiered Volume Read: 16438 Files/Sec

Observations:
Performance of the create (write) phase is seen to be a problem with the tiered volume both with metadata cache enabled
as well as with the default (no) metadata cache setting. There is an 86% drop in performance for tiered volume create (write) when metadata cache is enabled. MD-cache has no effect on create (write) performance, as write is the first phase of testing.

Performance of the read phase has shown a 195% improvement when metadata cache is enabled versus when no md-cache is there. When metadata cache is enabled, the read phase performs only 8% less than the standalone NVMe SSD volume. This may be due to tiering db overhead and overhead in handling of operations (like lookup) in the tiered architecture.

The results show that md-cache significantly helps in overcoming the tiering overhead for the read phase, but the tiering overhead still persists for the create phase for the smallfile workload.
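For reference, a sketch of the settings typically used to enable client-side metadata caching with cache invalidation in RHGS 3.2 (the volume name is a placeholder; the exact set of options used in these runs may differ):

```shell
gluster volume set tiervol features.cache-invalidation on
gluster volume set tiervol features.cache-invalidation-timeout 600
gluster volume set tiervol performance.stat-prefetch on
gluster volume set tiervol performance.cache-invalidation on
gluster volume set tiervol performance.md-cache-timeout 600
```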
In addition to the tests performed in comment 34, performance numbers were also captured by dropping the cache between the create (write) and read phases.

Test Results with cache dropped:

Default Metadata Cache Settings:
Dispersed Volume Create (Write): 288 Files/Sec
NVMe SSD Volume Create (Write): 5384 Files/Sec
Tiered Volume Create (Write): 522 Files/Sec
Dispersed Volume Read: 1217 Files/Sec
NVMe SSD Volume Read: 17476 Files/Sec
Tiered Volume Read: 1823 Files/Sec

Metadata Cache Enabled on Volume:
Dispersed Volume Create (Write): 254 Files/Sec
NVMe SSD Volume Create (Write): 5302 Files/Sec
Tiered Volume Create (Write): 657 Files/Sec
Dispersed Volume Read: 1098 Files/Sec
NVMe SSD Volume Read: 17717 Files/Sec
Tiered Volume Read: 2863 Files/Sec

Observations:
When the cache was dropped between the create and read phases, create phase performance on the tiered volume continues to suffer with both the default and the enabled metadata cache settings. Read performance is 57% better for the tiered volume with md-cache enabled compared to the default md-cache setting, but the read phase is 519% slower for the tiered volume compared to the SSD volume with md-cache enabled, when the cache is dropped between the create and read phases.

The results show that md-cache significantly helps in overcoming the tiering overhead for the read phase when the cache is not dropped between tests, but the tiering overhead still persists for the create phase, as well as for the read phase when the cache is dropped, for the smallfile workload.
Correction: in comment 34, in the observations section, it's mentioned:

"There is an 86% drop in performance for tiered volume create (write) when metadata cache is enabled."

Instead, it should be:

The write performance is 630% slower for tiered volume create (write) with metadata cache enabled, compared to the SSD volume case.
Manoj and I had a discussion on this. The following are the things we think are bottlenecks in the create phase:

1. Negative lookup

Tier doesn't have lookup-optimize set. This means a lookup is done on the cold tier, followed by parallel lookups on the hot and cold tiers. So the latency is likely to be twice the latency of a negative lookup on the cold tier. Given the current design, with the cold tier holding the entire directory structure and being chosen as the hashed subvol for all files/directories, there is no need to do lookup_everywhere. IOW, lookup-optimize behavior should be enabled for the tier translator.

2. Create

Since data files exist on the hot tier, a linkto file will be created on the cold tier. This linkto creation can drag down performance.

3. Writes

There is unlikely to be any overhead involved here unless the file is under migration. Manoj confirmed that there are no writes here.
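For reference, the existing DHT-level option referred to in point 1 is toggled like this on a regular volume (volume name is a placeholder); the point above is that an equivalent behavior should apply inside the tier translator itself:

```shell
gluster volume set tiervol cluster.lookup-optimize on
gluster volume get tiervol cluster.lookup-optimize   # verify the current value
```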
(In reply to Raghavendra G from comment #42)
> Me and Manoj had a discussion on this. Following are the things we think
> that are bottlenecks in the create phase:
>
> 1. Negative lookup
>
> Tier doesn't have lookup-optimize set. This means a lookup is done on cold
> tier followed by parallel lookups on hot and cold tier.

However, lookup-optimize is set on dht itself. So the hot and cold tiers won't be doing lookup-everywhere.
As tier is not being actively developed, I'm closing this bug. Feel free to reopen it if necessary.