Bug 1292391 - limited performance gain for small-file workloads on tiered volumes
Status: ASSIGNED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tier
Version: unspecified
Hardware: x86_64 Linux
Priority: unspecified
Severity: unspecified
Assigned To: Dan Lambright
QA Contact: Rahul Hinduja
Whiteboard: tier-performance
Keywords: Performance
Depends On:
Blocks: 1314586
Reported: 2015-12-17 05:12 EST by Manoj Pillai
Modified: 2017-07-12 12:50 EDT (History)
CC: 16 users

See Also:
Fixed In Version:
Doc Type: Release Note
Doc Text:
Small-file performance has been improved in Gluster via enhancements to the metadata cache translator. The enhancement benefits the tiering feature as well as the rest of Gluster.
Story Points: ---
Clone Of:
: 1314586 1427783 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
gluster volume profile for tiered volume for read phase of smallfile benchmark (101.46 KB, text/plain)
2016-03-07 04:55 EST, Manoj Pillai
gluster volume profile for tiered volume for read phase 20s interval (723.71 KB, text/plain)
2016-03-21 05:22 EDT, Manoj Pillai

Description Manoj Pillai 2015-12-17 05:12:08 EST
Description of problem:

For tiered-volume tests where files on the hot tier are accessed, large-file performance is close to what the hot tier is capable of. There is some degradation because of ctr, but when ctr is disabled the gap is only a few percent.

For small files, however, performance is much less than what the hot tier is capable of, even when ctr is disabled.

Version-Release number of selected component (if applicable):

glusterfs-server-3.7.5-7.el7rhgs.x86_64
Red Hat Enterprise Linux Server release 7.1 (Maipo)

How reproducible:
consistently.

Steps to Reproduce:
1. Create a 2x(8+4) volume. Attach a 2x2 SSD tier. Disable ctr. Create a small-file data set that fits in the hot tier (below the migration watermark). Perform a multi-threaded read and note the throughput.

2. Create a non-tiered volume on the 2x2 SSD bricks. Create the same data set again and note the read throughput.

3. Repeat steps 1 and 2 for a large-file workload.
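As a concrete sketch, the reproduction might look like the gluster CLI sequence below. This is illustrative only: the volume name, hostnames, and brick paths are placeholders, and the attach-tier syntax varied across 3.7.x releases.

```shell
# Illustrative only: volume name, hostnames and brick paths are placeholders.
# 2x(8+4) distributed-disperse base volume across 4 servers.
gluster volume create tiervol disperse-data 8 redundancy 4 \
    server{1..4}:/bricks/hdd{1..6}/brick
gluster volume start tiervol

# Attach a 2x2 SSD hot tier (exact syntax depends on the 3.7.x release).
gluster volume attach-tier tiervol replica 2 \
    ssd1:/bricks/ssd/brick ssd2:/bricks/ssd/brick \
    ssd3:/bricks/ssd/brick ssd4:/bricks/ssd/brick

# Disable ctr for the test.
gluster volume set tiervol features.ctr-enabled off
```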

Actual results:
for the small-file workload, read throughput:

tiered volume: 270 MB/s
non-tiered volume on hot tier bricks: 490 MB/s

For large-file workload, read throughput:
tiered volume: 1038 MB/s
non-tiered volume on hot tier bricks: 1103 MB/s

Expected results:

For this type of test, we would like to see tiered volume small-file performance within a few percent of hot tier performance.

Additional info:
The poor performance for small-file workloads probably comes from the extra round trip in the tiered case for accessing dht link files on the cold tier. Will update this bz as we make progress on RCA.
Comment 2 Manoj Pillai 2015-12-17 08:52:52 EST
More details on the test:

Tiered volume:
2x(8+4) base volume (4 servers)
lookup-optimize on
server.event-threads 4
client.event-threads 4

attach-tier 2x2 SAS-SSD bricks (2 servers)
ctr-enabled on
tier-mode cache

mounted on 4 clients

workload:
1. smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 8 --files 16378 --file-size 64 --record-size 64 --fsync N --operation create

2. smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 8 --files 16378 --file-size 64 --record-size 64 --fsync N --operation read

Note that data set created is only 32GB whereas hot tier has a capacity of 360GB. So, all accesses should be going to the hot tier.

Compare performance to (a) non-tiered volume created on the 2x(8+4) EC volume (b) non-tiered volume created on the 2x2 SSD volume.

For tiered volume, also repeat with ctr-enabled off.

Create-performance issues with the tiered volume are being tracked elsewhere, so the focus here is on the read results.

Results:
ctr on: 220.862684 MB/sec
ctr off: 260.501528 MB/sec
2x(8+4) ec volume: 101.108332 MB/sec
2x2 SSD volume: 490.835347 MB/sec

Now, substitute file-size as 16384 and files-per-thread as 16 in smallfile command line. So, we are creating 16MB files now.

Results for 16MB file size:
ctr on: 945.299436 MB/sec
ctr off: 978.266685 MB/sec
2x(8+4) ec volume: 849.426594 MB/sec
2x2 SSD volume: 1146.902127 MB/sec

For large files, tiering gets close to the ideal for this configuration, which is the hot-tier performance. For small files, however, there is a big gap between tiered-volume performance and hot-tier performance.
Comment 3 Manoj Pillai 2015-12-17 10:04:01 EST
(In reply to Manoj Pillai from comment #2)
> More details on the test:

clarification...

${top_dir} in the smallfile command line is a directory created after attach-tier and just before the actual run.

Also, results with an 8MB file size are given below:
tiered vol, ctr on: 975.318167 MB/sec
tiered vol, ctr off: 1012.929358 MB/sec
2x(8+4) ec volume: 672.716669 MB/sec
2x2 SSD volume: 1121.599904 MB/sec

It is easier to see in this set of results that tiered-volume performance is close to SSD-volume performance rather than to EC-volume performance.
Comment 4 Joseph Elwin Fernandes 2016-01-11 03:41:08 EST
The test above was repeated with sql-cache set to 12500 pages and sql-wal size set to 25000 pages.

Option: features.ctr-sql-db-cachesize
Default Value: 12500
Description: Defines the cache size of the sqlite database of the changetimerecorder xlator. The input to this option is in pages. Each page is 4096 bytes. Default value is 12500 pages, i.e. ~49 MB. The max value is 262144 pages (1 GB) and the min value is 1000 pages (~4 MB).

Option: features.ctr-sql-db-wal-autocheckpoint
Default Value: 25000
Description: Defines the autocheckpoint of the sqlite database of changetimerecorder. The input to this option is in pages. Each page is 4096 bytes. Default value is 25000 pages, i.e. ~98 MB. The max value is 262144 pages (1 GB) and the min value is 1000 pages (~4 MB).
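For reference, both options are tuned via gluster volume set; a hedged example with a placeholder volume name:

```shell
# Placeholder volume name; values are in 4096-byte pages.
gluster volume set tiervol features.ctr-sql-db-cachesize 12500           # ~49 MB cache
gluster volume set tiervol features.ctr-sql-db-wal-autocheckpoint 25000  # ~98 MB WAL
```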
Comment 6 Manoj Pillai 2016-01-12 04:01:34 EST
This particular performance issue is not primarily ctr-related. Note that in comment #2 and comment #3, turning ctr off yields only a small benefit. I expect this issue to be the result of the tier xlator: the extra round trip for accessing dht-link files, and its impact on small-file performance.

In any case, I re-ran the same tests with glusterfs*-3.7.5-14.el7.x86_64, with sql-cache and sql-wal size set as in comment #4. I'm not seeing any significant improvement compared to results in comment #2.
Comment 11 Raghavendra G 2016-01-29 00:15:23 EST
This issue is most likely due to the need for a lookup following a readdirp. For files present on the hot tier, entry resolution (path-to-gfid conversion) and attribute/stat fetching have to be done on the hot tier, but the readdir is issued on the cold tier. This necessitates a lookup on the file. In a small-file workload, the overhead of this extra lookup is relatively significant and hence might be causing the performance drop.
Comment 12 Raghavendra G 2016-01-29 00:27:34 EST
One not-so-neat fix could be:

1. Initiate a readdir on the hot tier whenever we initiate a readdir on the cold tier.
2. Link the inodes while handling the readdir response from the hot tier.
3. Directory entries are always returned from the cold tier. If a dentry corresponds to a linkto file (as is the case for files present on the hot tier), do an inode_find (gfid). Since the linkto file on the cold tier also contains the gfid, this should be possible. If the inode is found in the inode table (i.e., the readdir on the hot tier has fetched the inode corresponding to this dentry), use that inode and send a response dentry with a valid inode.

I feel this is an experiment worth trying. The flip side of this approach is memory consumption (for linked inodes that may or may not be needed) for large directories on the hot tier. We can probably refine the idea and see how it fares in tests.
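To make the proposal concrete, here is a toy Python model of the readdir merge described above. All names (InodeTable, readdir_tiered) are hypothetical illustrations for this sketch, not actual tier xlator code:

```python
class InodeTable:
    """Toy stand-in for gluster's per-client inode table, keyed by gfid."""

    def __init__(self):
        self._by_gfid = {}

    def link(self, gfid, inode):
        self._by_gfid[gfid] = inode

    def find(self, gfid):
        return self._by_gfid.get(gfid)


def readdir_tiered(cold_entries, hot_entries, itable):
    # Steps 1-2: a readdir is also issued on the hot tier, and the inodes
    # from its response are linked into the inode table.
    for _name, gfid, inode in hot_entries:
        itable.link(gfid, inode)
    # Step 3: dentries are always returned from the cold tier. A linkto
    # dentry is resolved via inode_find(gfid) instead of a per-file LOOKUP.
    resolved = []
    for name, gfid, is_linkto in cold_entries:
        inode = itable.find(gfid) if is_linkto else f"cold-inode({gfid})"
        resolved.append((name, gfid, inode))
    return resolved
```

In this model, the per-file LOOKUP round trip to the hot tier disappears whenever the hot-tier readdir has already populated the table, at the cost of holding those inodes in memory.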
Comment 13 Manoj Pillai 2016-02-04 11:05:19 EST
Upgraded my setup to add PCI SSDs on the hot tier (previous results were with SAS SSDs). Now we can switch between PCI-SSDs and SAS-SSDs on the hot tier.

I'm repeating the small-file test with the 'smallfile' benchmark, as described in comment #0. For the tiered-volume case, the test simply runs on an empty tiered volume (hence files are created on the hot tier and read back from the hot tier). The data set is small: 32GB of data on a 320GB hot tier. Since all accesses in the tiered-volume case are to the SSD tier, a reasonable expectation is performance close to the SSD volume.


2x(8+4) volume:
create phase: 113.934506 MB/sec
read phase: 100.424073 MB/sec

PCI-SSD results:

2x2 PCI-SSD volume:
create phase: 218.233151 MB/sec
read phase: 704.615155 MB/sec

tiered volume with 2x(8+4) cold tier; 2x2 PCI-SSD hot tier:
create phase: 109.017227 MB/sec
read phase: 279.837460 MB/sec


Comparing with SAS-SSD results

2x2 SAS-SSD volume:
create phase: 54.649998 MB/sec
read phase: 487.091146 MB/sec

tiered volume with 2x(8+4) cold tier; 2x2 SAS-SSD hot tier:
create phase: 52.703876 MB/sec
read phase: 260.699312 MB/sec

So, when we replaced the SAS-SSD hot tier with a PCI-SSD hot tier that is 44% faster on reads, we got a 7% boost for the tiered volume.

Or: in the SAS-SSD case, the tiered volume is at 53% of hot-tier read performance; in the PCI-SSD case, it is at 39%.

The current implementation seems to be allowing the cold tier to drag down its performance, which becomes more obvious as you use faster storage at the hot tier.
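The percentages quoted above follow directly from the measured read throughputs; spelled out as a quick arithmetic check:

```python
# Measured read throughputs in MB/s, taken from the results above.
sas_ssd_vol, pci_ssd_vol = 487.091146, 704.615155   # non-tiered hot-tier volumes
tier_sas, tier_pci = 260.699312, 279.837460          # tiered volumes

hot_tier_speedup = (pci_ssd_vol / sas_ssd_vol - 1) * 100  # ~44.7% faster hot tier
tier_speedup = (tier_pci / tier_sas - 1) * 100            # ~7.3% tiered-volume gain
sas_fraction = tier_sas / sas_ssd_vol * 100               # ~53.5% of hot tier (SAS)
pci_fraction = tier_pci / pci_ssd_vol * 100               # ~39.7% of hot tier (PCI)
```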
Comment 15 Manoj Pillai 2016-03-07 04:09:57 EST
Tried some variations to see the impact of cold tier layout on this particular issue. Results in this comment are for the read phase of a smallfile test with 64KB files, 32GB data set.

Results for 2x2 volume on PCIe SSDs: 695 MB/s

Results for tiered volume with hot tier on 2x2 PCIe SSDs. Tried this with different cold tier layouts.

2x(8+4) cold tier: 293 MB/s
4x(4+2) cold tier: 349 MB/s
8x3 cold tier: 544 MB/s

Note that in each case the files are being read from the hot tier, so ideally we would see the same performance for the tiered volume and the 2x2 SSD volume. However, tiered-volume performance with a 2x(8+4) cold tier is only about 42% of the ideal; in contrast, an 8x3 cold tier gives about 78% of the ideal. An EC cold tier seems to drag down the performance of the tiered volume for this workload, with (4+2) better than (8+4).
Comment 16 Manoj Pillai 2016-03-07 04:55 EST
Created attachment 1133705 [details]
gluster volume profile for tiered volume for read phase of smallfile benchmark


gluster volume profile for test in comment #15 with 2x(8+4) cold tier and 2x2 PCIe SSD hot tier.

gprfc082 and gprfc83 are the hot tier servers. bricks on these servers receive a lot of lookup AND read requests.

gprfs045-gprfs048 are the cold tier servers. No significant read requests are seen on the cold-tier bricks, because the entire data set fits within the hot tier. Lookups are seen, as expected. Also, inodelk requests show very high latencies in a few cases.
Comment 18 Dan Lambright 2016-03-07 10:25:41 EST
I have written a fix upstream, "optimize lookups for tiering", which may mitigate some of the overhead. DHT performs a "revalidate" lookup on every subvolume to confirm that the layout for a directory has not changed. This is not necessary for the tiering translator, which does not use DHT layouts. An existence test can be performed by checking only the cold tier. This will cut the number of lookups and improve small-file performance.

I can build downstream RPMs with this fix, so we can quantify the benefit.
Comment 20 Manoj Pillai 2016-03-21 05:22 EDT
Created attachment 1138520 [details]
gluster volume profile for tiered volume for read phase 20s interval

This profile covers the same case as comment #16; however, info is gathered at 20s intervals to better separate startup ops from regular ops.
Comment 21 Manoj Pillai 2016-03-24 05:45:44 EDT
Updating the bz with results of some investigations that have been going on.

We tried runs with private builds incorporating this patch: http://review.gluster.org/#/c/13605/

With disperse.eager-lock off, we found some improvement when running with an EC cold tier.

Results for small-file read test:

SSD volume: 695 MB/s
tiered vol with 8x3 cold tier: 544 MB/s

tiered vol with 2x(8+4) cold tier: 293 MB/s
tiered vol with 2x(8+4) cold tier with eager-lock off: 371 MB/s

tiered vol with 4x(4+2) cold tier: 349 MB/s
tiered vol with 4x(4+2) cold tier with eager-lock off: 466 MB/s

So, turning disperse.eager-lock off does give some performance benefit, but it still doesn't get us where we would like to be.

Plan now is to try changes to tier xlator.
Comment 22 Dan Lambright 2016-04-28 10:00:45 EDT
We have RPMs to try at this location: root@10.19.96.31:/root/rpmbuild/RPMS/x86_64/

Email if you have trouble obtaining them.

The patch implements comment 16. We direct lookups to the hot tier rather than the cold tier. I measured 10-20% improvement with it using a similar configuration and workload as described in comment 1. 

The client needs to "learn" where the files are and cache this in its inode. The first time smallfile runs, it learns; the second time, the location is known and cached. In my measurements I saw the benefit on the second smallfile run.
Comment 23 Dan Lambright 2016-05-01 17:11:25 EDT
Another set of RPMs was created 5/1 in the same location, fixing a bug related to low memory clients. This set should be used for performance tests.
Comment 24 Joseph Elwin Fernandes 2016-05-02 09:39:48 EDT
http://review.gluster.org/13601
Upstream patch
Comment 25 Manoj Pillai 2016-05-03 04:52:50 EDT
The tests so far have been with caches dropped between create/write and read phases. I'm also adding a test where caches are not dropped between create/write and read phases.

Another change in the tests reported in this comment: 
So far I have been running the smallfile benchmark with 4 clients and 8 threads per client. The standard test is 16K files per thread with a 64KB file size. Recently, I've been running into problems where some threads seem to be getting starved and the benchmark reports a "not enough files processed" error. To get around the error, I'm now running with 4 clients, 4 threads per client, 32K files per thread, 64KB file size.

So total data set size is the same, but expect to see lower throughput across the board (i.e. for tiered and non-tiered volumes) compared to the earlier tests.

Since I didn't have the baselines for these modified tests from RHGS 3.1.2, re-ran with those rpms.

Results:

Test: smallfile test with NO drop-cache between create/write and read

smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 4 --files 32768 --file-size 64 --record-size 64 --fsync Y --response-times N --operation create

smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 4 --files 32768 --file-size 64 --record-size 64 --response-times N --operation read

RHGS 3.1.2 GA

2x(8+4): write: 29 MB/s; read: 177 MB/s
2x2 SSD volume: write: 153 MB/s; read: 615 MB/s
TIER: write: 81 MB/s; read: 462 MB/s

New private build
2x(8+4): write: 28; read: 176
2x2 SSD volume: write: 150; read: 607
TIER: write: 72; read: 446


Test: smallfile test with drop-cache between create/write and read

RHGS 3.1.2 GA
2x(8+4): write: 28 MB/s; read: 90 MB/s
2x2 SSD volume: write: 154 MB/s; read: 468 MB/s
TIER: write: 79 MB/s; read: 201 MB/s

New private build
2x(8+4): write: 28 ; read: 91
2x2 SSD volume: write: 150 ; read: 461
TIER: write: 71 ; read: 183

New private build:
glusterfs-libs-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-client-xlators-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-fuse-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-server-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-api-3.7.9-2.13.git852f5ea.el7rhs.x86_64
glusterfs-cli-3.7.9-2.13.git852f5ea.el7rhs.x86_64

Summary:
No gains seen in the private build over 3.1.2 GA.
Comment 26 Dan Lambright 2016-05-03 07:08:48 EDT
Can you get a profile and tell me if you see LOOKUPs on the cold tier. 

If you do see LOOKUPs on the cold tier, the patch has a bug or your testing method differs from mine. If you do not, then the theory that LOOKUPs are expensive is incorrect.
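For anyone reproducing this check, the per-brick FOP counts come from the volume profile interface; a hedged example with a placeholder volume name:

```shell
gluster volume profile tiervol start
# ... run the smallfile read phase against the mount ...
gluster volume profile tiervol info   # per-brick FOP counts, including LOOKUP
```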
Comment 27 Manoj Pillai 2016-05-03 07:37:23 EDT
Dan, we're both using BAGL systems and the smallfile benchmark, so our conclusions should not be too different. I'd ask you to post your tests and results, where you're seeing performance gains with this patch. If this patch is showing a benefit, it will be good to see under what circumstances.
Comment 29 Dan Lambright 2016-05-03 08:37:56 EDT
Sure, I can do TWO people's jobs. I have all the time in the world, right? I'll post the results.
Comment 31 Nithya Balachandran 2016-08-03 03:11:16 EDT
Targeting this BZ for 3.2.0.
Comment 33 Dan Lambright 2016-11-01 09:28:48 EDT
The metadata translator has been updated to cache file stat information on the client. For this bug's smallfile workload, LOOKUPs were done on each directory level. The cumulative overhead of those round trips exceeded any benefit from the SSD file transfer. The effect, along with the benefits of client-side caching, is described further in [1]. Upcall support should be tested together with tiering.

[1]
http://blog.gluster.org/2016/10/gluster-tiering-and-small-file-performance/
Comment 34 Shekhar Berry 2017-02-08 01:23:57 EST
Here we are trying to see whether md-cache implementation in RHGS 3.2 helps in improving smallfile performance on tiered volume.

Benchmark: smallfile (bengland2/smallfile on GitHub)
Base Volume: 2x(8+4) disperse volume on HDD (6 servers)
Cache Tier: 2x2 NVMe SSD/JBOD (4 servers)
OS and software: RHEL 7.3; glusterfs*-3.8.4-13.el7rhgs.x86_64

The performance comparison in this test is between:

- tiered volume
- volume created on the slow-tier storage (in other words, a 2x(8+4) disperse volume on HDD/JBOD)
- volume created on the fast-tier storage (in other words, a 2x2 distributed-replicated volume on NVMe-SSD/JBOD)
 

Workload details:

- small files, 32KB file size
- total data set size of 32GB (a million files)
- 5 clients, 4 threads per client, 32K files per thread
- top_dir (see commands below) is a directory within the mount point, created just before the benchmark run
- tests were run with FSYNC (see command below) set to Y

Tests run for both the default and metadata-cache-enabled settings:
            smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 4 --files 52432 --file-size 32 --fsync ${FSYNC} --operation create
            smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 4 --files 52432 --file-size 32  --operation read

Note: in the tests above, caches were not dropped between the create and read phases.

Test Results

Default Metadata Cache Settings:

Dispersed Volume Create(Write) : 259 Files/Sec
NVMe SSD Volume Create(Write)  : 5406 Files/Sec
Tiered Volume Create(Write)    : 589 Files/Sec

Dispersed Volume Read : 3480 Files/Sec
NVMe SSD Volume Read  : 17325 Files/Sec
Tiered Volume Read    : 5569 Files/Sec

Metadata Cache Enabled on Volume:

Dispersed Volume Create(Write) : 257 Files/Sec
NVMe SSD Volume Create(Write)  : 5378 Files/Sec
Tiered Volume Create(Write)    : 736 Files/Sec

Dispersed Volume Read : 3479 Files/Sec
NVMe SSD Volume Read  : 17884 Files/Sec
Tiered Volume Read    : 16438 Files/Sec

Observations:

    Performance of the create (write) phase is a problem with the tiered volume both with metadata cache enabled and without it. There is an 86% drop in performance for tiered-volume create (write) when metadata cache is enabled. MD-cache has no effect on create (write) performance, as write is the first phase of testing.

    Performance of the read phase shows a 195% improvement with metadata cache enabled versus without it. With metadata cache enabled, the read phase performs only 8% below the standalone NVMe SSD volume. The remaining gap may be due to tiering db overhead and the handling of operations (like lookup) in the tiered architecture.

    The results show that md-cache significantly helps in overcoming the tiering overhead for the read phase, but the tiering overhead still persists for the create phase of the smallfile workload.
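The improvement and gap figures above follow from the Files/Sec numbers; as a quick arithmetic check:

```python
# Tiered-volume read throughput (Files/Sec) from the results above.
tier_read_default = 5569    # default metadata-cache settings
tier_read_mdcache = 16438   # metadata cache enabled
nvme_read_mdcache = 17884   # standalone NVMe SSD volume, md-cache enabled

improvement = (tier_read_mdcache / tier_read_default - 1) * 100  # ~195%
gap_vs_ssd = (1 - tier_read_mdcache / nvme_read_mdcache) * 100   # ~8% below SSD
```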
Comment 35 Shekhar Berry 2017-02-08 01:54:25 EST
In addition to tests performed in Comment 34, performance numbers were also captured by dropping cache in between create (write) and read phase.

Test Results with cache dropped: 

Default Metadata Cache Settings:

Dispersed Volume Create(Write) : 288 Files/Sec
NVMe SSD Volume Create(Write)  : 5384 Files/Sec
Tiered Volume Create(Write)    : 522 Files/Sec

Dispersed Volume Read : 1217 Files/Sec
NVMe SSD Volume Read  : 17476 Files/Sec
Tiered Volume Read    : 1823 Files/Sec

Metadata Cache Enabled on Volume:

Dispersed Volume Create(Write) : 254 Files/Sec
NVMe SSD Volume Create(Write)  : 5302 Files/Sec
Tiered Volume Create(Write)    : 657 Files/Sec

Dispersed Volume Read : 1098 Files/Sec
NVMe SSD Volume Read  : 17717 Files/Sec
Tiered Volume Read    : 2863 Files/Sec

Observations:

   When caches were dropped between the create and read phases, create-phase performance on the tiered volume continues to suffer with both the default and the enabled metadata-cache settings.

   The read performance is 57% better for the tiered volume with md-cache enabled compared to the default md-cache setting, but with md-cache enabled the tiered volume's read phase is still 519% slower than the SSD volume when caches are dropped between the create and read phases.

   The results show that md-cache significantly helps in overcoming the tiering overhead for the read phase when caches are not dropped between tests; but when caches are dropped, the tiering overhead persists for both the create and read phases of the smallfile workload.
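As with comment 34, the quoted percentages can be rederived from the Files/Sec numbers above:

```python
# Read throughput (Files/Sec) with caches dropped between phases.
tier_read_default = 1823    # tiered volume, default md-cache settings
tier_read_mdcache = 2863    # tiered volume, md-cache enabled
nvme_read_mdcache = 17717   # NVMe SSD volume, md-cache enabled

mdcache_gain = (tier_read_mdcache / tier_read_default - 1) * 100   # ~57% better
slower_than_ssd = (nvme_read_mdcache / tier_read_mdcache - 1) * 100  # ~519% slower
```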
Comment 36 Shekhar Berry 2017-02-08 07:59:01 EST
Correction:

The observations section of comment 34 says:

There is an 86% drop in performance for tiered-volume create (write) when metadata cache is enabled.

That should instead read:

The write performance is 630% slower for tiered-volume create (write) with metadata cache enabled, compared to the SSD volume.
Comment 42 Raghavendra G 2017-07-12 12:49:00 EDT
Manoj and I discussed this. The following are what we think are the bottlenecks in the create phase:

1. Negative lookup

Tier doesn't have lookup-optimize set. This means a lookup is done on the cold tier, followed by parallel lookups on the hot and cold tiers. So the latency is likely to be twice the latency of a negative lookup on the cold tier. Given the current design, where the cold tier holds the entire directory structure and is chosen as the hashed subvolume for all files/directories, there is no need to do lookup_everywhere. IOW, lookup-optimize behavior should be enabled for the tier translator.

2. create

Since data files exist on the hot tier, a linkto file is created on the cold tier. This linkto creation can drag down performance.

3. writes

There is unlikely to be any overhead here unless the file is under migration. Manoj confirmed that there are no writes here.
Comment 43 Raghavendra G 2017-07-12 12:50:26 EDT
(In reply to Raghavendra G from comment #42)
> Me and Manoj had a discussion on this. Following are the things we think
> that are bottlenecks in the create phase:
> 
> 1. Negative lookup
> 
> Tier doesn't have lookup-optimize set. This means a lookup is done on cold
> tier followed by parallel lookups on hot and cold tier. 

However, lookup-optimize is set on dht itself, so the hot and cold tiers won't be doing lookup-everywhere.
