Bug 1629589

Summary: Gluster-file volume underperforming compared to Gluster-block volume for postgresql workload
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shekhar Berry <shberry>
Component: glusterfs
Assignee: Susant Kumar Palai <spalai>
Status: CLOSED WONTFIX
QA Contact: Bala Konda Reddy M <bmekala>
Severity: high
Priority: high
Version: rhgs-3.4
CC: amukherj, csaba, dcain, ekuric, guillaume.pavese, jahernan, kdhananj, mpillai, pkarampu, psuriset, puebele, rhs-bugs, rsussman, shberry, spalai, vbellur
Keywords: Performance
Type: Bug
Last Closed: 2020-01-09 12:26:30 UTC
Bug Depends On: 1563509, 1648781, 1670710, 1670719, 1676468

Description Shekhar Berry 2018-09-17 07:04:10 UTC
Description of problem:

Using NVMe drives as backend bricks, a replica 3 gluster file volume and a replica 3 gluster block volume were created in a stand-alone bare-metal environment.

The volume was mounted on a single client, where a PostgreSQL 9.6 (pgbench tool) database was run on the mounted gluster volume.

The database was scaled to 120GB using the following command:

/usr/pgsql-9.6/bin/pgbench -i -s 8000 <db_name>

and then the workload was run on the database using the following command:

/usr/pgsql-9.6/bin/pgbench -c 10 -j 2 -t 10000 <db_name>

For the Gluster file volume the reported throughput was ~540 TPS, whereas for the Gluster block volume it was ~890 TPS.

Here is the detailed output:

Gluster File
------------
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 8000
query mode: simple
number of clients: 10
number of threads: 2
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 18.523 ms
tps = 539.879180 (including connections establishing)
tps = 539.901685 (excluding connections establishing)

Gluster Block
-------------
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 8000
query mode: simple
number of clients: 10
number of threads: 2
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 11.208 ms
tps = 892.228552 (including connections establishing)
tps = 892.253554 (excluding connections establishing)

As can be seen, the Gluster file volume performs ~40% slower than the Gluster block volume.


Version-Release number of selected component (if applicable):

Server
------
rpm -qa | grep gluster
gluster-block-debuginfo-0.2.1-26.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-18.el7rhgs.x86_64
glusterfs-api-3.12.2-18.el7rhgs.x86_64
glusterfs-3.12.2-18.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.el7rhgs.x86_64
glusterfs-fuse-3.12.2-18.el7rhgs.x86_64
glusterfs-server-3.12.2-18.el7rhgs.x86_64
glusterfs-devel-3.12.2-18.el7rhgs.x86_64
gluster-block-0.2.1-26.el7rhgs.x86_64
glusterfs-libs-3.12.2-18.el7rhgs.x86_64
glusterfs-api-devel-3.12.2-18.el7rhgs.x86_64



Client
------
rpm -qa | grep gluster
glusterfs-libs-3.12.2-18.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-18.el7rhgs.x86_64
glusterfs-api-3.12.2-18.el7rhgs.x86_64
glusterfs-3.12.2-18.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.el7rhgs.x86_64
glusterfs-fuse-3.12.2-18.el7rhgs.x86_64
glusterfs-server-3.12.2-18.el7rhgs.x86_64


How reproducible:
The test was run 10 times and the result was always reproducible.

Steps to Reproduce:
1. Create a gluster file volume and a gluster block volume
2. Set up PostgreSQL 9.6
3. Scale the DB and execute the runs (a command sketch follows)
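
A rough command sketch of these steps (the volume name and brick paths are taken from comment 2 below; the mount point and database name are placeholders):

# 1. Create and start a replica-3 file volume, then mount it on the client
#    (the block volume is exported on top of a similar volume via gluster-block)
gluster volume create filetest replica 3 172.17.40.33:/bricks/b/g1 172.17.40.34:/bricks/b/g1 172.17.40.35:/bricks/b/g1
gluster volume start filetest
mount -t glusterfs 172.17.40.33:/filetest /mnt/pgdata

# 2-3. Point the PostgreSQL 9.6 data directory at the mount, then scale and run pgbench
/usr/pgsql-9.6/bin/pgbench -i -s 8000 <db_name>
/usr/pgsql-9.6/bin/pgbench -c 10 -j 2 -t 10000 <db_name>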

Comment 2 Shekhar Berry 2018-09-17 10:26:46 UTC
Gluster Volume information for gluster file volume is as follows:
-----------------------------------------------------------------
gluster v info
 
Volume Name: filetest
Type: Replicate
Volume ID: b2a6479c-e7a4-4849-acd5-573353e829d3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.17.40.33:/bricks/b/g1
Brick2: 172.17.40.34:/bricks/b/g1
Brick3: 172.17.40.35:/bricks/b/g1
Options Reconfigured:
performance.readdir-ahead: off
performance.io-cache: off
performance.read-ahead: off
performance.strict-o-direct: on
performance.quick-read: off
performance.open-behind: off
performance.write-behind: off
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.brick-multiplex: on


Gluster Volume information for gluster block volume is as follows:
-----------------------------------------------------------------

gluster v info
 
Volume Name: blocktest
Type: Replicate
Volume ID: b2a6479c-e7a4-4849-acd5-573353e829d3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.17.40.33:/bricks/b/g1
Brick2: 172.17.40.34:/bricks/b/g1
Brick3: 172.17.40.35:/bricks/b/g1
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.open-behind: off
performance.readdir-ahead: off
performance.strict-o-direct: on
network.remote-dio: disable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
features.shard-block-size: 64MB
user.cifs: off
server.allow-insecure: on
cluster.choose-local: off

Comment 3 Xavi Hernandez 2018-09-17 11:57:03 UTC
I've run the same configuration for gluster file on another setup. I have observed an average throughput of 766 TPS:

tps = 773.017263 (including connections establishing)
tps = 768.233319 (including connections establishing)
tps = 769.983772 (including connections establishing)
tps = 758.845237 (including connections establishing)
tps = 761.061668 (including connections establishing)
Average TPS (5 runs): 766.23

This is higher than expected compared to Shekhar's setup; it could be related to hardware differences. I'll try to run gluster block on the same setup for comparison.

Anyway, we observed that the FUSE thread was highly utilized (over 75%) most of the time. To reduce this utilization we enabled the 'client-io-threads' option, but this seemed to move the load to the epoll threads, so we also increased the event-thread count to 4 (the corresponding volume-set commands are sketched after the numbers below). With these changes we got 839 TPS:

tps = 845.564980 (including connections establishing)
tps = 841.551618 (including connections establishing)
tps = 838.360506 (including connections establishing)
tps = 837.230256 (including connections establishing)
tps = 836.270000 (including connections establishing)
Average TPS (5 runs): 839.80
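
For reference, those tunings correspond roughly to the following volume options (a hedged sketch; 'filetest' is the volume name from comment 2):

gluster volume set filetest performance.client-io-threads on
gluster volume set filetest client.event-threads 4
gluster volume set filetest server.event-threads 4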

With this run we saw a high number of FINODELK and FXATTROP requests. Since AFR uses eager-lock, the number shouldn't be so high unless the files are opened more than once at the same time; in that case, eager-lock is virtually disabled.

We did a quick test preventing AFR from releasing the lock even when the file is opened more than once. With this change, performance improved to 1397 TPS:

tps = 1413.390864 (including connections establishing)
tps = 1396.972724 (including connections establishing)
tps = 1396.434035 (including connections establishing)
tps = 1391.433783 (including connections establishing)
tps = 1388.235859 (including connections establishing)
Average TPS (5 runs): 1397.29

Just for reference, I ran a test with the same config but using a dispersed volume (disperse already uses a different way to detect inodelk contention that should work fine even if there are multiple fds opened on the same file). The performance was 1494 TPS:

tps = 1495.779561 (including connections establishing)
tps = 1495.808580 (including connections establishing)
tps = 1492.475832 (including connections establishing)
tps = 1491.736496 (including connections establishing)
tps = 1496.090887 (including connections establishing)
Average TPS (5 runs): 1494.38

All this is without any perf xlator enabled.

Raghavendra will update with his findings about perf xlators.

Comment 4 Raghavendra G 2018-09-18 04:07:26 UTC
Continuing from the optimal setup of comment #1 (eager-locking fixes, client-io-threads, 4 server-event-threads, 4 client-event-threads), I did the following experiments:

Enabled write-behind:
tps = 1498.302861 (excluding connections establishing)
tps = 1492.893168 (excluding connections establishing)
tps = 1493.730860 (excluding connections establishing)
tps = 1503.250656 (excluding connections establishing)
tps = 1496.409899 (excluding connections establishing)
tps = 1500.046490 (excluding connections establishing)
Average TPS (6 runs): 1497

We observed a very high number of fstats in the client profile info. Since fstats wait on cached writes, the latency saved in the write call comes back as increased fstat latency. We figured out that they are issued in the read path to maintain atime consistency. Xavi found that the hack of passing --fopen-keep-cache=no during mount cuts down the number of fstats significantly. With --fopen-keep-cache=no and write-behind enabled:

tps = 1584.606963 (excluding connections establishing)
tps = 1581.794582 (excluding connections establishing)
tps = 1606.354984 (excluding connections establishing)
tps = 1571.827779 (excluding connections establishing)
tps = 1588.033822 (excluding connections establishing)
tps = 1571.671687 (excluding connections establishing)
Average TPS (6runs): 1584

I also tried https://review.gluster.org/21035 for write-behind, both with and without --fopen-keep-cache=off:

patch https://review.gluster.org/21035 and fopen-keep-cache=off
tps = 1737.741466 (excluding connections establishing)
tps = 1732.987942 (excluding connections establishing)
tps = 1748.621267 (excluding connections establishing)
tps = 1745.229389 (excluding connections establishing)
tps = 1749.041447 (excluding connections establishing)
tps = 1766.022566 (excluding connections establishing)
Average TPS (6 runs): 1747

patch https://review.gluster.org/21035 with --fopen-keep-cache=on
tps = 1790.407151 (excluding connections establishing)
tps = 1794.975445 (excluding connections establishing)
tps = 1804.874935 (excluding connections establishing)
tps = 1783.922527 (excluding connections establishing)
tps = 1806.218331 (excluding connections establishing)
tps = 1808.044262 (excluding connections establishing)
Average TPS (6 runs): 1798

From the above two runs it is clear that patch https://review.gluster.org/21035 outperforms --fopen-keep-cache=off and hence will be used in further runs.

Since --fopen-keep-cache=on gave slightly better numbers than --fopen-keep-cache=off (this might be because "off" throws away the file's cache on every open), I decided to keep --fopen-keep-cache=on (which is the default value anyway and was the configuration used in the tests from comment #1) for further runs.
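
For reference, this setting is passed on the glusterfs command line at mount time; a hedged sketch of a mount that turns it off (server, volume and mount point are placeholders):

glusterfs --fopen-keep-cache=no --volfile-server=<server> --volfile-id=<volname> <mountpoint>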

After enabling performance.open-behind with the above combination,
tps = 2009.272855 (excluding connections establishing)
tps = 2009.649776 (excluding connections establishing)
tps = 2012.016365 (excluding connections establishing)
tps = 2022.746001 (excluding connections establishing)
tps = 2020.497890 (excluding connections establishing)
tps = 2030.787985 (excluding connections establishing)
Average TPS (6 runs): 2018

So the best configuration so far gives 2018 TPS. The configuration is (a command sketch follows the list):
1. enable client-io-threads
2. set client-event-threads, server-event-threads to 4
3. Make sure eager-locking is not disabled even when multiple fds are opened on same file
4. enable write-behind with patch https://review.gluster.org/21035
5. enable open-behind
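
The tunable parts of this configuration can be applied with the gluster CLI; a hedged sketch, assuming the volume is named 'filetest' (items 3 and 4 additionally require the AFR hack and the write-behind patch, which are code changes, not volume options):

gluster volume set filetest performance.client-io-threads on
gluster volume set filetest client.event-threads 4
gluster volume set filetest server.event-threads 4
gluster volume set filetest performance.write-behind on
gluster volume set filetest performance.open-behind on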

Note that item 3 is a hacky fix and we need a better lock-contention detection algorithm in AFR. I'll put a needinfo on Pranith and Xavi (who have discussed this) to shed more light on that.

Comment 5 Raghavendra G 2018-09-18 04:20:35 UTC
From the tests in comment #2, I was surprised that open-behind was adding 200 TPS, which is around 15% of the improvement we could get over the baseline of 766. Just to be sure, I did a run with write-behind off and open-behind on:
tps = 1534.866032 (excluding connections establishing)
tps = 1542.862411 (excluding connections establishing)
tps = 1507.131634 (excluding connections establishing)
tps = 1531.524721 (excluding connections establishing)
tps = 1520.333410 (excluding connections establishing)
tps = 1533.194226 (excluding connections establishing)
Average TPS(6 runs): 1528

This indeed shows that there is some benefit open-behind can provide for the pgbench workload. I see quite a number of open calls, and open-behind brings down the average latency of open and flush. However, I've not yet investigated whether this decreased latency gets added to other fops. Nevertheless, the test runs show some performance improvement.

Out of curiosity, I disabled the eager-lock fixes and did a run with the optimal configuration of comment #2:
tps = 1092.786407 (excluding connections establishing)
tps = 1089.363814 (excluding connections establishing)
tps = 1087.911619 (excluding connections establishing)
tps = 1084.864208 (excluding connections establishing)
tps = 1069.001257 (excluding connections establishing)
tps = 1076.230238 (excluding connections establishing)
Average TPS(6 runs): 1083

Though there is an increase of around 300 TPS, this is still less than the boost of around 600 TPS with the same perf-xlator combination when the eager-lock fixes are present. This shows the compounding effect of the eager-lock fixes.

Comment 6 Raghavendra G 2018-09-18 05:06:47 UTC
Since fstats issued during reads can interfere with the performance of write-behind (as reads and writes are interleaved), the noatime functionality of the fuse kernel module becomes very important for this use case. noatime is not yet implemented for fuse filesystems; bz 1563509 tracks the work.

Comment 7 Pranith Kumar K 2018-09-18 05:19:39 UTC
(In reply to Raghavendra G from comment #4)
> 
> Note that 3 is a hacky fix and we need a better lock contention detection
> algorithm in afr. I'll put a needinfo on Pranith and Xavi (who had discussed
> about this to shed more light on that).

I am working on a patch which I will send out by EOD if everything works.

Comment 8 Shekhar Berry 2018-09-18 05:26:36 UTC
In my setup I did some tuning of the postgres configuration to improve gluster file performance. Here are my experiments and the corresponding results:

Baseline as mentioned in the problem statement: 540 TPS (average of 10 runs)

Checkpoint Time to 120min
-------------------------
latency average = 17.575 ms
tps = 568.977479 (including connections establishing)
tps = 568.980689 (excluding connections establishing)


Checkpoint Time to 120min and Max WAL size to 30GB
-----------------------------------------------
latency average = 24.069 ms
tps = 415.473125 (including connections establishing)
tps = 415.475571 (excluding connections establishing)

Checkpoint Time to 120min and Max WAL size to 30GB and checkpoint completion target 0.9
--------------------------------------------------------------------------------------------
latency average = 24.225 ms
tps = 412.799726 (including connections establishing)
tps = 412.801901 (excluding connections establishing)

Performance goes back to 540 TPS when max_wal_size and checkpoint_completion_target are returned to their defaults (1GB and 0.5, respectively).

Checkpoint Time to 120min and shared buffer 2GB, checkpoint_completion is 0.9
------------------------------------------------------------------------------
latency average = 19.970 ms
tps = 500.755293 (including connections establishing)
tps = 500.758218 (excluding connections establishing)

Checkpoint Time to 120min and shared buffer 2GB, work_mem 64MB checkpoint_completion is 0.9
--------------------------------------------------------------------------------------------
latency average = 18.730 ms
tps = 533.893499 (including connections establishing)
tps = 533.896463 (excluding connections establishing)


Checkpoint Time to 120min and work_mem 64MB checkpoint_completion is 0.9
-------------------------------------------------------------------------
latency average = 18.802 ms
tps = 531.867076 (including connections establishing)
tps = 531.870136 (excluding connections establishing)

Checkpoint Time to 120min and work_mem 64MB and rest all default
----------------------------------------------------------------
latency average = 16.366 ms
tps = 611.017608 (including connections establishing)
tps = 611.020588 (excluding connections establishing)

Checkpoint Time to 120min and work_mem 128MB and rest all default
------------------------------------------------------------------
latency average = 15.975 ms
tps = 625.961877 (including connections establishing)
tps = 625.965580 (excluding connections establishing)

Checkpoint Time to 120min and work_mem 2GB and rest all default
---------------------------------------------------------------
latency average = 15.960 ms
tps = 626.572056 (including connections establishing)
tps = 626.575451 (excluding connections establishing)

Checkpoint Time to 120min and work_mem 128MB and wal_sync_method is open_sync
----------------------------------------------------------------------------
latency average = 13.673 ms
tps = 731.356256 (including connections establishing)
tps = 731.360554 (excluding connections establishing)


Averaging over 10 runs with this setting (checkpoint_timeout 120min, work_mem 128MB, wal_sync_method open_sync), we get ~690 TPS, which is almost a 25% improvement over the baseline.
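
The corresponding postgresql.conf changes would look roughly like this (a hedged sketch; PostgreSQL needs to be restarted afterwards):

checkpoint_timeout = 120min
work_mem = 128MB
wal_sync_method = open_sync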

As you can see, the major difference in TPS happened when the sync method was changed from fsync (the default) to open_sync. Xavi, Raghavendra, can you try the same setting on your setups and check whether performance improves?

Comment 9 Raghavendra G 2018-09-18 06:41:24 UTC
Adding the postgres tunings Shekhar suggested in comment #8 to the optimal glusterfs configuration of comment #2:

tps = 3183.616650 (excluding connections establishing)
tps = 3155.548769 (excluding connections establishing)
tps = 3171.704790 (excluding connections establishing)
tps = 3163.629948 (excluding connections establishing)
tps = 3165.597652 (excluding connections establishing)
tps = 3150.118107 (excluding connections establishing)
Average TPS (6 runs): 3165

Same configuration with open-behind and write-behind off:
tps = 2014.693320 (excluding connections establishing)
tps = 2045.314307 (excluding connections establishing)
tps = 2019.992084 (excluding connections establishing)
tps = 2025.906543 (excluding connections establishing)
tps = 2023.521044 (excluding connections establishing)
tps = 2030.544424 (excluding connections establishing)
Average TPS (6 runs): 2027

Same configuration with open-behind off and write-behind on:
tps = 2805.473293 (excluding connections establishing)
tps = 2806.299131 (excluding connections establishing)
tps = 2810.771077 (excluding connections establishing)
tps = 2774.695476 (excluding connections establishing)
tps = 2819.857811 (excluding connections establishing)
tps = 2788.863432 (excluding connections establishing)
Average TPS (6 runs): 2801

Also note how write-behind's performance benefit compounds with the postgres tuning - from 400-odd TPS to 800 TPS.

Comment 10 Xavi Hernandez 2018-09-18 09:53:04 UTC
I've tested gluster-block on the same setup. I've used the same gluster settings that Shekhar used, but with unmodified gluster packages from 3.12.2-18 (the latest changes crash gluster-blockd). I've also used the optimized postgres settings.

The result is a bit unexpected:

tps = 819.670875 (including connections establishing)
tps = 834.248977 (including connections establishing)
tps = 820.079496 (including connections establishing)
tps = 840.435009 (including connections establishing)
tps = 829.633259 (including connections establishing)
Average TPS (5 runs): 828.81

I installed gluster-block following the instructions from https://github.com/gluster/gluster-block. I'm not sure if I need to configure anything else in gluster-block, tcmu or iSCSI.

Comment 11 Shekhar Berry 2018-09-18 10:04:05 UTC
Hi Xavi,

In my setup, when the postgres settings are tuned for gluster block, the following is observed:

Gluster-Block baseline
----------------------
TPS: 920


checkpoint_timeout: 120Min
---------------------------
TPS: 931

checkpoint_timeout: 120min and work_mem: 128MB
------------------------------------------------
TPS: 934

Checkpoint Time to 120min and work_mem 128MB and wal_sync_method is open_sync
------------------------------------------------------------------------------
TPS: 537

So open_sync causes the performance to drop in the case of gluster block.

Comment 12 Xavi Hernandez 2018-09-18 10:47:03 UTC
(In reply to Shekhar Berry from comment #11)
> So open_sync causes the performance to drop in the case of gluster block.

I've run a test with the default postgres settings, and the performance has improved significantly:

tps = 1415.715298 (including connections establishing)
tps = 1408.253112 (including connections establishing)
tps = 1425.867129 (including connections establishing)
tps = 1426.394465 (including connections establishing)
tps = 1408.044910 (including connections establishing)
Average TPS (5 runs): 1416.85

The baseline for gluster file was 766.23, which means it's ~46% slower than gluster block. This looks quite similar to the numbers shown in comment #1, though with different absolute values.

Given these numbers, it seems gluster file can match gluster block performance by setting client-io-threads and event-threads options, and improving AFR eager-locking. Additional changes can improve performance even more.

Pranith is working on an approach [1] that could improve eager lock in this case. I'll test the patch and update.

[1] https://review.gluster.org/21107

Comment 13 Xavi Hernandez 2018-09-18 11:41:54 UTC
With Pranith's patch, client-io-threads=on and event-threads=4, performance improves a bit:

tps = 1114.136673 (including connections establishing)
tps = 1108.371303 (including connections establishing)
tps = 1111.904912 (including connections establishing)
tps = 1108.996427 (including connections establishing)
tps = 1102.639386 (including connections establishing)
Average TPS (5 runs): 1109.21

But not as much as avoiding the check on the number of open fd's.

Comment 14 Xavi Hernandez 2018-09-19 15:31:10 UTC
I've repeated the tests using the same configuration used in comment #13 but adding another patch from Pranith: https://review.gluster.org/21210

This time the performance was quite good:

tps = 2311.928116 (including connections establishing)
tps = 2281.287233 (including connections establishing)
tps = 2300.971594 (including connections establishing)
tps = 2316.992359 (including connections establishing)
tps = 2287.787727 (including connections establishing)
Average TPS (5 runs): 2299.79

I tested the same configuration with gluster block, but with one exception: I used fdatasync instead of open_sync in the postgres configuration. The improvement is minimal:

tps = 1497.435156 (including connections establishing)
tps = 1489.498803 (including connections establishing)
tps = 1514.883587 (including connections establishing)
tps = 1504.363248 (including connections establishing)
tps = 1484.684683 (including connections establishing)
Average TPS (5 runs): 1498.17

So the current best known stable configuration for gluster file is (a command sketch follows the option list):

performance.write-behind-window-size: 1MB
performance.md-cache-timeout: 600
performance.read-after-open: yes
server.event-threads: 4
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.strict-o-direct: on
performance.write-behind: off
performance.open-behind: on
performance.readdir-ahead: off
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
performance.read-ahead: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
client.event-threads: 4
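
A hedged sketch of applying the options above with the gluster CLI, assuming the volume is named 'filetest':

vol=filetest
for opt in "server.event-threads 4" "client.event-threads 4" \
           "performance.client-io-threads on" "performance.open-behind on" \
           "performance.read-after-open yes" "performance.write-behind off" \
           "performance.strict-o-direct on" "performance.md-cache-timeout 600" \
           "performance.write-behind-window-size 1MB"; do
    # each entry word-splits into "<option> <value>"
    gluster volume set $vol $opt
done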

The configuration for postgres:

work_mem = 120MB
wal_sync_method = open_sync
checkpoint_timeout = 1h

Comment 22 Raghavendra G 2018-10-23 08:10:06 UTC
(In reply to Raghavendra G from comment #6)
> Since fstats issued during reads can interfere with performing of
> write-behind (as reads and writes are interleaved), the noatime
> functionality of fuse kernel module becomes very important for this use
> case. noatime is not implemented yet for fuse filesystems and bz 1563509
> tracks the work.

https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/commit/?h=for-next&id=802dc0497be2b538ca4300704b45b59bffe29585 will help this case. This should be available in subsequent RHEL updates.

Comment 23 Yaniv Kaul 2018-11-01 11:04:17 UTC
(In reply to Raghavendra G from comment #22)
> (In reply to Raghavendra G from comment #6)
> > Since fstats issued during reads can interfere with performing of
> > write-behind (as reads and writes are interleaved), the noatime
> > functionality of fuse kernel module becomes very important for this use
> > case. noatime is not implemented yet for fuse filesystems and bz 1563509
> > tracks the work.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/commit/
> ?h=for-next&id=802dc0497be2b538ca4300704b45b59bffe29585 will help this case.
> This should be available in subsequent RHEL updates.

How will it be available? Have we asked for a backport?

Comment 25 Raghavendra G 2018-11-12 06:33:45 UTC
(In reply to Yaniv Kaul from comment #23)
> (In reply to Raghavendra G from comment #22)
> > (In reply to Raghavendra G from comment #6)
> > > Since fstats issued during reads can interfere with performing of
> > > write-behind (as reads and writes are interleaved), the noatime
> > > functionality of fuse kernel module becomes very important for this use
> > > case. noatime is not implemented yet for fuse filesystems and bz 1563509
> > > tracks the work.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/commit/
> > ?h=for-next&id=802dc0497be2b538ca4300704b45b59bffe29585 will help this case.
> > This should be available in subsequent RHEL updates.
> 
> How will it be available? Have we asked for a backport?

bz 1648781 has been filed to track this.

Comment 26 Raghavendra G 2019-02-07 03:47:14 UTC
Since the fix to bz 1648781 became available and we found an issue with fuse auto-invalidation purging the kernel page cache (bz 1664934), I decided to test the effects of:

* Enabling write-behind with the fix to bz 1648781
* Disabling fuse auto-invalidation

Test bed details:
=================

glusterfs version: rhgs-3.4.3 + consistency fixes to md-cache so that md-cache can be turned on.

RHEL kernel version: 3.10.0-983.el7.x86_64

Glusterfs configuration: 1x3 replica with all three bricks on the same machine. The bricks are on NVMe. The only performance xlators I have enabled are quick-read, open-behind, write-behind and md-cache. Since I had a custom-modified volfile, I am pasting the configuration file below:

volume nvme-r3-replicate-0
    type cluster/replicate
    option use-compound-fops off
    option afr-pending-xattr nvme-r3-client-0,nvme-r3-client-1,nvme-r3-client-2
    subvolumes nvme-r3-client-0 nvme-r3-client-1 nvme-r3-client-2
end-volume

volume nvme-r3-dht
    type cluster/distribute
    option lock-migration off
    subvolumes nvme-r3-replicate-0
end-volume

volume nvme-r3-md-cache
    type performance/md-cache
    option md-cache-timeout 600
    subvolumes nvme-r3-dht
end-volume

volume nvme-r3-write-behind
    type performance/write-behind
    option window-size 20MB
    subvolumes nvme-r3-md-cache
end-volume

volume nvme-r3-open-behind
    type performance/open-behind
    option read-after-open yes
    subvolumes nvme-r3-write-behind
end-volume

volume nvme-r3-quick-read
    type performance/quick-read
    subvolumes nvme-r3-open-behind
end-volume

volume nvme-r3
    type debug/io-stats
    option count-fop-hits on
    option latency-measurement on
    option log-level ERROR
    subvolumes nvme-r3-quick-read
end-volume

Scenario 1: with fuse-auto-invalidations on
======================================

1. without write-behind

[root@shakthi1 ~]# grep tps  ./v3.12.2-25/8000.transactions.kernel-fix.fuse-autoinval.ob.mdc-child-of-wb.mdc-no-invalidation | grep excluding | tail -6; 
tps = 1817.978205 (excluding connections establishing)
tps = 1604.961127 (excluding connections establishing)
tps = 1578.851723 (excluding connections establishing)
tps = 1590.383650 (excluding connections establishing)
tps = 1616.648636 (excluding connections establishing)
tps = 1593.510207 (excluding connections establishing)

Avg tps: 1633

2. with write-behind

[root@shakthi1 ~]# grep tps  ./v3.12.2-25/8000.transactions.kernel-fix.fuse-autoinval.ob.wb.mdc-child-of-wb.mdc-no-invalidation | grep excluding | tail -6; 
tps = 1792.493271 (excluding connections establishing)
tps = 1781.594031 (excluding connections establishing)
tps = 1796.461456 (excluding connections establishing)
tps = 1767.254992 (excluding connections establishing)
tps = 1767.826877 (excluding connections establishing)
tps = 1780.897752 (excluding connections establishing)

Avg tps: 1780.5

Scenario 2: with fuse-auto-invalidations off
============================================

1. without write-behind

[root@shakthi1 ~]# grep tps  ./v3.12.2-25/8000.transactions.kernel-fix.no-fuse-autoinval.ob.mdc-child-of-wb.mdc-no-invalidation | grep excluding | tail -6; 
tps = 1580.503541 (excluding connections establishing)
tps = 1621.528844 (excluding connections establishing)
tps = 1670.834145 (excluding connections establishing)
tps = 1642.694686 (excluding connections establishing)
tps = 1663.766583 (excluding connections establishing)
tps = 1694.190916 (excluding connections establishing)

Avg tps: 1645

2. with write-behind

[root@shakthi1 ~]# grep tps  ./v3.12.2-25/8000.transactions.kernel-fix.no-fuse-autoinval.ob.wb.mdc-child-of-wb.mdc-no-invalidation | grep excluding | tail -6; 
tps = 1887.674495 (excluding connections establishing)
tps = 1853.722739 (excluding connections establishing)
tps = 1875.815394 (excluding connections establishing)
tps = 1865.298756 (excluding connections establishing)
tps = 1929.438132 (excluding connections establishing)
tps = 1898.301352 (excluding connections establishing)

Avg tps: 1884.5

So, turning off fuse auto-invalidation and enabling write-behind with the fix for bz 1648781 improved throughput from 1633 TPS to 1884.5 TPS.
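
For reference, a client mount from a custom volfile like the one above can be done roughly as follows (a hedged sketch; the volfile path and mount point are placeholders):

glusterfs --volfile=/path/to/nvme-r3.vol /mnt/nvme-r3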

Comment 27 Raghavendra G 2019-02-07 04:07:51 UTC
Since comment #14 (which had the best numbers on bare metal) was run with postgres tunings, I continued the experiment of comment #26 by applying the postgres tunings of comment #14. Following are the numbers:

[root@shakthi1 ~]# grep tps ./v3.12.2-25/8000.transactions.kernel-fix.postgres-tunings.no-fuse-autoinval.ob.wb.mdc-child-of-wb.mdc-no-invalidation | grep excl
tps = 2169.899908 (excluding connections establishing)
tps = 2413.172544 (excluding connections establishing)
tps = 2471.477298 (excluding connections establishing)
tps = 2497.028224 (excluding connections establishing)
tps = 2474.056120 (excluding connections establishing)
tps = 2466.499389 (excluding connections establishing)

Avg tps: 2415

Note that one crucial difference between the setup of comment #14 and this test is that all bricks for this test are on a single machine. Comment #14 had one machine for each brick.

Comment 28 Raghavendra G 2019-02-07 04:14:32 UTC
(In reply to Raghavendra G from comment #27)
> Comment #14 had on machine for each single brick.

s/on/one

Comment 31 Raghavendra G 2019-09-12 02:55:06 UTC
(In reply to Raghavendra G from comment #26)
> Since fix to bz 1648781 became available and we found an issue with
> fuse-auto-invalidation purging kernel page-cache - bz 1664934 - I thought of
> testing the effects of:
> 
> * Enabling write-behind with fix to bz 1648781
> * Disable fuse-auto-invalidation

We need to turn off performance.global-cache-invalidation as well, so that md-cache also does not purge the kernel page cache. This tuning should be done in addition to disabling fuse auto-invalidation through the glusterfs command-line option --auto-invalidation while mounting (a command sketch follows the option descriptions below).

Option: performance.global-cache-invalidation
Default Value: true
Description: When "on", purges all read caches in kernel and glusterfs stack whenever a stat change is detected. Stat changes can be detected while processing responses to file operations (fop) or through upcall notifications. Since purging caches can be an expensive operation, it's advised to have this option "on" only when a file can be accessed from multiple different Glusterfs mounts and caches across these different mounts are required to be coherent. If a file is not accessed across different mounts (simple example is having only one mount for a volume), its advised to keep this option "off" as all file modifications go through caches keeping them coherent. This option overrides value of performance.cache-invalidation.


mount option:
      --auto-invalidation[=BOOL]   controls whether fuse-kernel can
                             auto-invalidate attribute, dentry and page-cache.
                             Disable this only if same files/directories are
                             not accessed across two different mounts
                             concurrently [default: "on"]
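
A hedged sketch of combining both settings (volume name, server and mount point are placeholders):

gluster volume set <volname> performance.global-cache-invalidation off
glusterfs --auto-invalidation=no --volfile-server=<server> --volfile-id=<volname> <mountpoint>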