Bug 1491785 - Poor write performance on gluster-block
Summary: Poor write performance on gluster-block
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Pranith Kumar K
QA Contact: Sweta Anandpara
URL:
Whiteboard:
Depends On: 1499644
Blocks: 1503134 1583733
 
Reported: 2017-09-14 16:27 UTC by Manoj Pillai
Modified: 2018-09-17 11:09 UTC (History)
CC List: 18 users

Fixed In Version: glusterfs-3.12.2-6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1583733 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:36:24 UTC
Embargoed:


Attachments (Terms of Use)
fuse-remote vs block-remote (63.68 KB, image/png)
2018-01-17 07:05 UTC, Pranith Kumar K
fuse-local vs block-local (59.04 KB, image/png)
2018-01-17 07:05 UTC, Pranith Kumar K
profile for fuse randread test (4.40 KB, text/plain)
2018-02-20 06:41 UTC, Manoj Pillai
profile for fuse randwrite test (5.34 KB, text/plain)
2018-02-20 06:42 UTC, Manoj Pillai
profile for gluster-block randread test (6.97 KB, text/plain)
2018-02-20 06:43 UTC, Manoj Pillai
profile for gluster-block randwrite test (5.57 KB, text/plain)
2018-02-20 06:44 UTC, Manoj Pillai


Links
Red Hat Product Errata RHSA-2018:2607 - 2018-09-04 06:38:14 UTC

Description Manoj Pillai 2017-09-14 16:27:07 UTC
Description of problem:

Seeing poor write performance on gluster-block. This is true for both sequential and random writes, but random write is the more important workload for gluster-block to handle well.

Random write performance on an fio test:
fuse: 8434 iops
gluster-block: 4321 iops (51% of fuse)

The job file used in this test is below (some options like directory and filename_format have to do with the way I run the test; can be changed as needed).

<quote>
[global]
rw=randwrite
end_fsync=1
startdelay=0
ioengine=libaio
direct=1
bs=8k

[randwrite]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=16
numjobs=2
nrfiles=4
openfiles=4
filesize=10g
size=40g
io_size=2048m
</quote>
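
For reference, a minimal way to drive this job on a single client (assuming the job file above is saved as randwrite.fio, and that the gluster volume or the block-device filesystem is already mounted under the directory= path) is:

# per-client directory expected by the job file above
mkdir -p /mnt/glustervol/${HOSTNAME}
# run the job; ioengine, direct=1 and block size come from the job file
fio --output=randwrite.out randwrite.fio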

Version-Release number of selected component (if applicable):

glusterfs-libs-3.8.4-44.el7rhgs.x86_64
glusterfs-fuse-3.8.4-44.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-44.el7rhgs.x86_64
glusterfs-api-3.8.4-44.el7rhgs.x86_64
glusterfs-cli-3.8.4-44.el7rhgs.x86_64
glusterfs-3.8.4-44.el7rhgs.x86_64
glusterfs-server-3.8.4-44.el7rhgs.x86_64

gluster-block-0.2.1-10.el7rhgs.x86_64

tcmu-runner-1.2.0-12.el7rhgs.x86_64

How reproducible:
Consistently

Expected results:

With the number of jobs, files and iodepth used in the test above, random write performance should be close to what fuse can give us. [In bz#1455992 we see that gluster-block random read performance has improved to the point that it is close to fuse performance for an fio test with similar options].

Additional info:
Potentially related to bz #1480188

Comment 2 Manoj Pillai 2017-10-04 08:11:18 UTC
Reporting results from a random I/O write test with 3 clients. 3 gluster block devices created on a replica-3 gluster vol on 3 servers. Each client using 1 block device; each client using a different server as the target.

Distributed fio test running on 3 clients. Other than that, the test is similar to the one in comment #0, with the same job file. Since there are 3 clients, the total data set size and accessed size are 3x what they are in comment #0:
data set size = 240g
accessed size = 12g

Comparing results between glusterfs-fuse runs and gluster-block:
glusterfs-fuse: 24400 iops
gluster-block: 7950 iops

In this case, gluster-block only gets 32.5% of glusterfs-fuse iops.
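
For reference, a distributed run like this can be driven with fio's client/server mode; a sketch (hostnames are placeholders, and the job file from comment #0 is assumed to be saved as randwrite.fio):

# on each of the 3 clients
fio --server
# on the node driving the test; repeat --client once per client host
fio --client=client1 --client=client2 --client=client3 randwrite.fio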

Comment 5 Pranith Kumar K 2018-01-11 10:53:00 UTC
I tried out a smaller fio job (with help from Manoj) on my VMs and found that the volume profile shows roughly 1 fsync for every 3 writes in the case of FS-based I/O, whereas it is 1 fsync per write in the case of block-based I/O.

On FS:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
      1.33   81453.41 us     371.58 us  693940.98 us          24939       WRITE
     98.39 18539332.22 us     288.80 us 49637840.86 us           8128       FSYNC

On Block:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
     16.32   11502.98 us     340.74 us  229053.95 us          66345       WRITE
     82.80   62249.78 us     276.28 us  443554.46 us          62204       FSYNC

I will do some more debugging to find out the reason for this difference and post my findings.
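
For reference, the per-fop numbers above come from gluster volume profiling; a minimal sketch of how such a table can be collected (volume name is a placeholder):

gluster volume profile <volname> start
# run the fio job against the fuse mount or the block device
gluster volume profile <volname> info > profile.txt
gluster volume profile <volname> stop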

Comment 6 Pranith Kumar K 2018-01-13 04:51:55 UTC
Did more tests and found that both fuse and block give similar performance when O_DIRECT is enabled. We have been running fuse tests without enabling O_DIRECT correctly, i.e. the volume should be mounted with "direct-io-mode=yes":

mount -t glusterfs -o direct-io-mode=yes 192.168.122.184:/rep3 /mnt/fs

When I run I/O like that even on fuse we are seeing one fsync per write:
     18.99    3190.52 us     248.82 us  204002.75 us          60836       WRITE
     77.03   12939.99 us     224.67 us  725944.71 us          60845       FSYNC

In my tests block is performing slightly better for a similar workload.

Ravi/Du/Amar,
   Do you know why direct-io-mode is not enabled by default for fuse-mounts?

Pranith

Comment 7 Pranith Kumar K 2018-01-13 07:48:28 UTC
(In reply to Pranith Kumar K from comment #6)
> Did more tests and found that both fuse and block give similar performance
> when O_DIRECT is enabled. We have been running fuse tests without enabling
> O_DIRECT correctly, i.e. the volume should be mounted with
> "direct-io-mode=yes":
> 
> mount -t glusterfs -o direct-io-mode=yes 192.168.122.184:/rep3 /mnt/fs
> 
> When I run I/O like that even on fuse we are seeing one fsync per write:
>      18.99    3190.52 us     248.82 us  204002.75 us          60836      
> WRITE
>      77.03   12939.99 us     224.67 us  725944.71 us          60845      
> FSYNC
> 
> In my tests block is performing slightly better for a similar workload.
> 
> Ravi/Du/Amar,
>    Do you know why direct-io-mode is not enabled by default for fuse-mounts?
> 
> Pranith

I think I found the answer here:
http://lists.gluster.org/pipermail/gluster-devel/2016-February/048427.html

So direct-io-mode bypasses page-cache for all fds.

My next question is: why is the performance different between direct-io-mode enabled and disabled when the file is opened with O_DIRECT either way? Is there a bug somewhere in fuse?

Comment 8 Ravishankar N 2018-01-14 04:32:21 UTC
If the application (FIO) opened the file(s) with O_DIRECT successfully, then I assume it is *not* the one sending fsyncs (otherwise it defeats the purpose of direct I/O, which bypasses the page cache). That means the fsyncs are coming from a client-side xlator. 

> We have been running fuse tests without enabling o-direct correctly. 
1. Did you also enable performance.strict-o-direct for write-behind?

2. Is it safe to assume that it is AFR that sends the 1:1 fsyncs for the writes (because it is random I/O)? If yes, then we need to figure out what it is that masks the fsyncs when the "direct-io-mode=yes" mount flag is not used.

3. Does disabling write-behind altogether re-introduce the 1:1 fsyncs irrespective of whether the "direct-io-mode=yes" mount flag is used?
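
For reference, the knobs mentioned in the questions above map to the following commands (volume name is a placeholder); this is just a sketch of where the toggles live, not a recommended configuration:

gluster volume set <volname> performance.strict-o-direct on
gluster volume set <volname> performance.write-behind off
gluster volume set <volname> network.remote-dio disable
mount -t glusterfs -o direct-io-mode=yes <server>:/<volname> /mnt/fs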

Comment 9 Pranith Kumar K 2018-01-14 04:55:54 UTC
(In reply to Ravishankar N from comment #8)
> If the application (FIO) opened the file(s) with O_DIRECT successfully, then
> I assume it is *not* the one sending fsyncs (otherwise it defeats the purpose
> of direct I/O, which bypasses the page cache). That means the fsyncs are
> coming from a client-side xlator. 

fsyncs are being sent by AFR; I verified it. Side note: sending writes with O_DIRECT doesn't mean the application doesn't need to send fsync; it may still have to do that if it wants the write to be placed on disk.

> 
> > We have been running fuse tests without enabling o-direct correctly. 
> 1. Did you also enable performance.strict-o-direct for write-behind?

Yes, we are using the gluster-block profile, so strict-o-direct is enabled and the client xlator is not filtering O_DIRECT. So that part is fine. Eager-lock is disabled, so only piggy-backing is enabled.

For full set of options, refer:
https://github.com/gluster/glusterfs/blob/master/extras/group-gluster-block

> 
> 2. Is it safe to assume that it is AFR that sends the 1:1 fsyncs for the
> writes (because it is random I/O)? If yes, then we need to figure out what
> it is that masks the fsyncs when the "direct-io-mode=yes" mount flag is not
> used.
> 
> 3. Does disabling write-behind altogether re-introduce the 1:1 fsyncs
> irrespective of whether the "direct-io-mode=yes" mount flag is used?

From whatever I could gather, direct-io-mode, which is the option between the fuse kernel and fuse-bridge, changes the number of writes that can be piggy-backed in AFR. What ends up happening is that, on average, AFR gets 3 writes before it issues an fsync; this is not the case either in fuse with direct-io-mode=yes or with tcmu-runner doing gfapi-based writes with O_DIRECT.

I wanted to understand why fuse behaves differently when direct-io-mode is enabled vs disabled even when the file is opened with o-direct.
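
For reference, the effective value of these options on a volume can be checked with "gluster volume get" (volume name is a placeholder):

gluster volume get <volname> performance.strict-o-direct
gluster volume get <volname> network.remote-dio
gluster volume get <volname> cluster.eager-lock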

Comment 10 Pranith Kumar K 2018-01-15 05:16:31 UTC
Also adding csaba to the thread.

Comment 11 Pranith Kumar K 2018-01-15 13:56:30 UTC
Found that with just O_DIRECT, writes are sent asynchronously, whereas with direct-io-mode=yes, writes are sent synchronously. Ravi confirmed the same by looking at the code. I checked some more on the gluster-block side and found that gluster-block is sending write-cmds in parallel. Will need to continue debugging to find out why they are not leading to piggy-backs.

Comment 12 Pranith Kumar K 2018-01-17 07:02:32 UTC
Did more tests and found that the timing issue in gluster-block is probably happening because the writes are not getting pumped fast enough, as there is an extra network hop from initiator to target in the case of gluster-block. So it looks like fuse will be faster than gluster-block as long as this is the architecture.

I am attaching the IOPS comparison graphs, which show that the IOPS are closer when gluster-block and fuse run on one of the trusted storage pool nodes instead of on separate machines.

I had a discussion with Manoj about this data and he said he will do one more round of this test and provide the new data. I'll resume working on this then.

Based on all this data, it looks like the best we can do is to enable eager-lock to reduce xattrop/fsync.

Comment 13 Pranith Kumar K 2018-01-17 07:05:03 UTC
Created attachment 1382238 [details]
fuse-remote vs block-remote

Comment 14 Pranith Kumar K 2018-01-17 07:05:35 UTC
Created attachment 1382239 [details]
fuse-local vs block-local

Comment 15 Pranith Kumar K 2018-01-17 07:18:36 UTC
Manoj,
      Could you capture profile info of the new runs and attach them to the bz?

Pranith

Comment 16 Manoj Pillai 2018-01-22 08:57:37 UTC
I did a round of tests with the following configurations:

1. 1 client with iscsi block device backed by gluster-block.
2. 1 client fuse-mounting glusterfs volume with following tunable parameters set:
performance.strict-o-direct: on
network.remote-dio: disable
performance.io-cache: off
performance.read-ahead: off
3. 1 client fuse-mounting glusterfs volume on which the "group gluster-block" tuning profile has been applied.

In each case, the glusterfs volume is replica-3, on 3 servers, none co-located with the client. The gluster brick on each server is on an NVMe SSD.

Results for an fio random I/O test with bs=8K

1. gluster-block performance:
read: IOPS=18.3k, BW=143Mi (150M)(8192MiB/57235msec)
write: IOPS=2234, BW=17.5Mi (18.3M)(8192MiB/469356msec)

2. glusterfs-fuse with strict-o-direct, remote-dio, io-cache and read-ahead tuned:
read: IOPS=19.4k, BW=151Mi (159M)(8192MiB/54110msec)
write: IOPS=11.6k, BW=90.4Mi (94.8M)(8192MiB/90652msec)

3. glusterfs-fuse with gluster-block group profile applied on the volume:
read: IOPS=6429, BW=50.2Mi (52.7M)(8192MiB/163080msec)
write: IOPS=1930, BW=15.1Mi (15.8M)(8192MiB/543085msec)

The comparison that was being made in this bz in comment #0 is between configurations 1 and 2 (gluster-block vs a glusterfs volume tuned minimally for random I/O). This is how applications would perceive the performance of gluster-block and glusterfs-fuse. [The actual results here are different from those in comment #0, possibly because this is a completely different setup].

But it seems clear from this comparison that most of the poor write performance of gluster-block for this workload comes from the gluster-block group tuning profile. The main contributor seems to be eager-lock: if, for configuration 3, I apply the "group gluster-block" tuning profile and then turn on just eager-lock, write performance improves to 10.3k IOPS.

Not all the results here completely make sense to me yet, particularly the read results. But I am posting this while I continue with the analysis.
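
For reference, a sketch of the configuration-3 experiment described above, assuming the gluster-block group file is installed with glusterfs (volume name is a placeholder):

# apply the gluster-block tuning profile to the volume
gluster volume set <volname> group gluster-block
# then re-enable just eager-lock on top of it
gluster volume set <volname> cluster.eager-lock on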

Comment 17 Pranith Kumar K 2018-01-23 06:13:38 UTC
[global]
rw=randwrite
end_fsync=1
startdelay=0
ioengine=libaio
direct=1
bs=8k

[randwrite]
directory=/mnt/block
filename_format=f.$jobnum.$filenum
iodepth=16
numjobs=2
nrfiles=4
openfiles=4
filesize=256m
size=1g
io_size=256m


This is the fio job file I used.

Comment 18 Manoj Pillai 2018-02-20 06:40:25 UTC
Runs on a newly allocated setup. Only 2 of the 3 servers have an NVMe SSD, so I am doing replica-2 runs for the SSD case [discussed with Pranith]. But the results are similar to what we have seen earlier:

single client (iscsi over 10ge vs fuse over 10ge)
for gluster-block:
read: IOPS=13.7k, BW=53.4Mi (55.0M)(6406MiB/120003msec)
write: IOPS=3839, BW=14.0Mi (15.7M)(1800MiB/120010msec)

for glusterfs-fuse:
read: IOPS=14.1k, BW=55.2Mi (57.9M)(6624MiB/120001msec)
write: IOPS=9973, BW=38.0Mi (40.9M)(4675MiB/120006msec)

Collected profile for the randread test as well as randwrite test. For reference, the script for the run looks like this:
ssh $RHS_LEAD_SERV "gluster volume profile ${RHS_VOL} info > /tmp/gprof.rread_start.txt"
    sleep 30
    ${FIO} --output=${FIO_OUT_PRFX}randread --client=${FIO_HOSTS} ${FIO_JOB_PRFX}randread
ssh $RHS_LEAD_SERV "gluster volume profile ${RHS_VOL} info > /tmp/gprof.rread_end.txt"

ssh $RHS_LEAD_SERV "gluster volume profile ${RHS_VOL} info > /tmp/gprof.rwrite_start.txt"
    sleep 30
    ${FIO} --output=${FIO_OUT_PRFX}randwrite --client=${FIO_HOSTS} ${FIO_JOB_PRFX}randwrite
ssh $RHS_LEAD_SERV "gluster volume profile ${RHS_VOL} info > /tmp/gprof.rwrite_end.txt"
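
A rough way to pull the WRITE/FSYNC call counts out of these profile dumps, assuming the standard profile table layout (fop name in the last column, number of calls in the one before it):

awk '$NF == "WRITE" || $NF == "FSYNC" {print $NF, $(NF-1)}' /tmp/gprof.rwrite_end.txt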

The jobfile is similar to before, but this time it is time_based, with a runtime of 120s. Job for randwrite:

[global]
rw=randwrite
end_fsync=1
startdelay=0
ioengine=libaio
direct=1
bs=4k
numjobs=2

[randwrite]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=16
nrfiles=4
openfiles=4
filesize=10g
size=40g
time_based=1
runtime=120

Comment 19 Manoj Pillai 2018-02-20 06:41:19 UTC
Created attachment 1398100 [details]
profile for fuse randread test

Comment 20 Manoj Pillai 2018-02-20 06:42:23 UTC
Created attachment 1398101 [details]
profile for fuse randwrite test

Comment 21 Manoj Pillai 2018-02-20 06:43:23 UTC
Created attachment 1398102 [details]
profile for gluster-block randread test

Comment 22 Manoj Pillai 2018-02-20 06:44:05 UTC
Created attachment 1398103 [details]
profile for gluster-block randwrite test

Comment 24 Pranith Kumar K 2018-02-27 13:50:24 UTC
https://review.gluster.org/19503

Comment 31 Pranith Kumar K 2018-03-23 12:25:41 UTC
https://code.engineering.redhat.com/gerrit/133659 storage/posix: Add active-fd-count option in gluster
https://code.engineering.redhat.com/gerrit/133660 cluster/afr: Switch to active-fd-count for open-fd checks
https://code.engineering.redhat.com/gerrit/131944 cluster/afr: Remove unused code paths
https://code.engineering.redhat.com/gerrit/131945 cluster/afr: Make AFR eager-locking similar to EC

Comment 32 Manoj Pillai 2018-04-12 08:46:59 UTC
fio random write test with bs=8K on glusterfs-3.12.2-5.el7rhgs.x86_64:
write: IOPS=2328, BW=18.2Mi (19.1M)(8192MiB/450264msec)

same test on glusterfs-3.12.2-6.el7rhgs.x86_64:
write: IOPS=5814, BW=45.4Mi (47.6M)(8192MiB/180340msec)

There is a 2.5x improvement in write IOPS on this test with the 3.12.2-6 build, compared to 3.12.2-5.

Read IOPS was unaffected in the test:
3.12.2-5: read: IOPS=10.7k, BW=83.6Mi (87.6M)(8192MiB/98010msec)
3.12.2-6: read: IOPS=10.8k, BW=84.1Mi (88.2M)(8192MiB/97426msec)

Comment 41 errata-xmlrpc 2018-09-04 06:36:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

