Description of problem:

Seeing poor write performance on gluster-block. This is true for both sequential and random writes, but random write is the more important workload for gluster-block to handle well.

Random write performance on an fio test:
fuse:          8434 iops
gluster-block: 4321 iops (51% of fuse)

The job file used in this test is below (some options like directory and filename_format have to do with the way I run the test; can be changed as needed).

[global]
rw=randwrite
end_fsync=1
startdelay=0
ioengine=libaio
direct=1
bs=8k

[randwrite]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=16
numjobs=2
nrfiles=4
openfiles=4
filesize=10g
size=40g
io_size=2048m

Version-Release number of selected component (if applicable):
glusterfs-libs-3.8.4-44.el7rhgs.x86_64
glusterfs-fuse-3.8.4-44.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-44.el7rhgs.x86_64
glusterfs-api-3.8.4-44.el7rhgs.x86_64
glusterfs-cli-3.8.4-44.el7rhgs.x86_64
glusterfs-3.8.4-44.el7rhgs.x86_64
glusterfs-server-3.8.4-44.el7rhgs.x86_64
gluster-block-0.2.1-10.el7rhgs.x86_64
tcmu-runner-1.2.0-12.el7rhgs.x86_64

How reproducible:
Consistently

Expected results:
With the number of jobs, files and iodepth used in the test above, random write performance should be close to what fuse can give us. [In bz#1455992 we see that gluster-block random read performance has improved to the point that it is close to fuse performance for an fio test with similar options].

Additional info:
Potentially related to bz #1480188
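As a quick sanity check on the headline numbers, the gap between the two runs can be expressed as a ratio. A minimal Python sketch, using only the IOPS values quoted above:

```python
# Random-write IOPS from the fio runs quoted above.
fuse_iops = 8434
block_iops = 4321

# gluster-block throughput as a percentage of fuse throughput.
ratio = block_iops / fuse_iops * 100
print(f"gluster-block reaches {ratio:.0f}% of fuse IOPS")  # prints 51%
```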
Reporting results from a random I/O write test with 3 clients.

3 gluster block devices created on a replica-3 gluster vol on 3 servers. Each client using 1 block device; each client using a different server as the target. Distributed fio test running on 3 clients. Other than that, the test is similar to the one in comment #0, same job file. Since there are 3 clients, the data set total size and accessed size are 3x what they are in comment #0:

data set size = 240g
accessed size = 12g

Comparing results between glusterfs-fuse runs and gluster-block:

glusterfs-fuse: 24400 iops
gluster-block:   7950 iops

In this case, gluster-block only gets 32.5% of glusterfs-fuse iops.
I tried out a smaller fio job with help from Manoj on my VMs and found that the profile shows around 1 fsync for every 3 writes in the case of fs-based I/O, whereas it is 1 fsync per write in the case of block-based I/O.

On FS:
 1.33     81453.41 us     371.58 us   693940.98 us    24939    WRITE
98.39  18539332.22 us     288.80 us 49637840.86 us     8128    FSYNC

On Block:
16.32     11502.98 us     340.74 us   229053.95 us    66345    WRITE
82.80     62249.78 us     276.28 us   443554.46 us    62204    FSYNC

I will do some more debugging to find out the reason for this difference and post my findings.
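The writes-per-fsync ratio can be read straight off the call counts in the two profiles. A small Python check of the numbers quoted above:

```python
# Call counts taken from the volume profile output above.
fs_writes, fs_fsyncs = 24939, 8128          # fs-based I/O
block_writes, block_fsyncs = 66345, 62204   # block-based I/O

print(f"fs:    {fs_writes / fs_fsyncs:.2f} writes per fsync")      # ~3 writes per fsync
print(f"block: {block_writes / block_fsyncs:.2f} writes per fsync")  # ~1 write per fsync
```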
Did more tests and found that both fuse and block give similar performance when O_DIRECT is enabled. We have been running fuse tests without enabling O_DIRECT correctly, i.e. mounting of the volume should happen with "direct-io-mode=yes":

mount -t glusterfs -o direct-io-mode=yes 192.168.122.184:/rep3 /mnt/fs

When I run I/O like that, even on fuse we see one fsync per write:

18.99     3190.52 us     248.82 us  204002.75 us    60836    WRITE
77.03    12939.99 us     224.67 us  725944.71 us    60845    FSYNC

In my tests block is performing slightly better for a similar workload.

Ravi/Du/Amar,
Do you know why direct-io-mode is not enabled by default for fuse mounts?

Pranith
(In reply to Pranith Kumar K from comment #6)
> Did more tests and found that both fuse and block are giving similar
> performance when O_DIRECT is enabled. We have been running fuse tests
> without enabling o-direct correctly. i.e. mounting of the volume should
> happen with "direct-io-mode=yes":
>
> mount -t glusterfs -o direct-io-mode=yes 192.168.122.184:/rep3 /mnt/fs
>
> When I run I/O like that even on fuse we are seeing one fsync per write:
> 18.99 3190.52 us 248.82 us 204002.75 us 60836 WRITE
> 77.03 12939.99 us 224.67 us 725944.71 us 60845 FSYNC
>
> In my tests block is performing slightly better for a similar workload.
>
> Ravi/Du/Amar,
> Do you know why direct-io-mode is not enabled by default for fuse-mounts?
>
> Pranith

I think I found the answer here:
http://lists.gluster.org/pipermail/gluster-devel/2016-February/048427.html

So direct-io-mode bypasses the page cache for all fds. My next question: why is the performance different when the file is opened with O_DIRECT, depending on whether direct-io-mode is enabled or disabled? Is there a bug somewhere in fuse?
If the application (FIO) opened the file(s) with O_DIRECT successfully, then I assume it is *not* the one sending the fsyncs (otherwise it would defeat the purpose of direct I/O, which bypasses the page cache). That means the fsyncs are coming from a client-side xlator.

> We have been running fuse tests without enabling o-direct correctly.

1. Did you also enable performance.strict-o-direct for write-behind?

2. Is it safe to assume that it is AFR that sends the 1:1 fsyncs for the writes (because it is random I/O)? If yes, then we need to figure out what is masking the fsyncs when the "direct-io-mode=yes" mount flag is not used.

3. Does disabling write-behind altogether re-introduce the 1:1 fsyncs irrespective of whether the "direct-io-mode=yes" mount flag is used?
(In reply to Ravishankar N from comment #8)
> If the application (FIO) opened the file(s) with O_DIRECT successfully, then
> I assume it is *not* the one who is sending fsyncs (otherwise it beats the
> purpose of direct io which bypasses the page cache). That means the fsyncs
> coming from a client side xlator.

The fsyncs are being sent by AFR. I verified it.

Side note: Sending writes with O_DIRECT doesn't mean the application doesn't need to send fsync; it may still have to do that if it wants the write to be placed on disk.

> > We have been running fuse tests without enabling o-direct correctly.
> 1. Did you also enable performance.strict-o-direct for write-behind?

Yes, we are using the gluster-block profile, so strict-o-direct is enabled and the client xlator is not filtering O_DIRECT. So that part is fine. Eager-lock is disabled, so only piggy-back is enabled. For the full set of options, refer to:
https://github.com/gluster/glusterfs/blob/master/extras/group-gluster-block

> 2. Is it safe to assume that it is AFR that sends the 1:1 fsyncs for the
> writes (because its random I/O)? If yes, then we need to figure out what is
> it that is masking the fyscs with not using "direct-io-mode=yes" mount flag.
>
> 3. Does disabling write-behind altogether re-introduce the 1:1 fsyncs
> irrespective of whether the "direct-io-mode=yes" mount flag is used?

From whatever I could gather, direct-io-mode, which is the option between the fuse kernel and fuse-bridge, changes the number of writes that can be piggy-backed in AFR. What ends up happening is that, on average, AFR gets 3 writes before it does an fsync; this is not the case either in fuse with direct-io-mode=yes or in tcmu-runner doing gfapi-based writes with O_DIRECT. I wanted to understand why fuse behaves differently when direct-io-mode is enabled vs disabled, even when the file is opened with O_DIRECT.
Also adding csaba to the thread.
Found that with just O_DIRECT, writes are sent asynchronously, whereas with direct-io-mode=yes, writes are sent synchronously. Ravi confirmed the same by looking at the code.

I checked some more on the gluster-block side and found that gluster-block is sending write-cmds in parallel. Will need to continue debugging to find out why they are not leading to piggy-backs.
Did more tests and found that the timing issue in gluster-block is probably happening because the writes are not getting pumped fast enough, as there is an extra network hop from initiator to target in the case of gluster-block. So it looks like fuse will be faster than gluster-block as long as this is the architecture.

I am attaching the IOPS comparison graphs, which show that the IOPS are closer when gluster-block and fuse run on one of the trusted storage pool machines instead of on separate machines.

I had a discussion with Manoj about this data and he said he will do one more round of this test and provide the new data. I'll resume working on this then. Based on all this data, it looks like the best we can do is to enable eager-lock to reduce xattrop/fsync.
Created attachment 1382238 [details] fuse-remote vs block-remote
Created attachment 1382239 [details] fuse-local vs block-local
Manoj,

Could you capture profile info of the new runs and attach them to the bz?

Pranith
I did a round of tests with the following configurations:

1. 1 client with an iscsi block device backed by gluster-block.

2. 1 client fuse-mounting the glusterfs volume with the following tunable parameters set:
performance.strict-o-direct: on
network.remote-dio: disable
performance.io-cache: off
performance.read-ahead: off

3. 1 client fuse-mounting the glusterfs volume on which the "group gluster-block" tuning profile has been applied.

In each case, the glusterfs volume is replica-3, on 3 servers, none co-located with the client. The gluster brick on each server is on an NVMe SSD.

Results for an fio random I/O test with bs=8K:

1. gluster-block performance:
   read:  IOPS=18.3k, BW=143Mi (150M)(8192MiB/57235msec)
   write: IOPS=2234, BW=17.5Mi (18.3M)(8192MiB/469356msec)

2. glusterfs-fuse with strict-o-direct, remote-dio, io-cache and read-ahead tuned:
   read:  IOPS=19.4k, BW=151Mi (159M)(8192MiB/54110msec)
   write: IOPS=11.6k, BW=90.4Mi (94.8M)(8192MiB/90652msec)

3. glusterfs-fuse with the gluster-block group profile applied on the volume:
   read:  IOPS=6429, BW=50.2Mi (52.7M)(8192MiB/163080msec)
   write: IOPS=1930, BW=15.1Mi (15.8M)(8192MiB/543085msec)

The comparison being made in this bz in comment #0 is between configurations 1 and 2 (gluster-block vs a glusterfs volume tuned minimally for random I/O). This is how applications would perceive the performance of gluster-block and glusterfs-fuse. [The actual results here are different from those in comment #0, possibly because this is a completely different setup.]

But it seems clear from this comparison that most of the poor write performance of gluster-block on this workload comes from the gluster-block group tuning profile. The main contributor seems to be eager-lock: if for configuration 3 I apply the "group gluster-block" tuning profile and then turn on just eager-lock, write performance improves to 10.3k IOPS.

Not all the results here completely make sense to me yet, particularly the read results.
But posting this while I continue with the analysis.
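To make the configuration comparison easier to scan, the write IOPS quoted above can be summarized relative to the tuned-fuse baseline. A small Python sketch (all values are from the results above; the labels are mine):

```python
# Write IOPS for the three configurations above, plus configuration 3
# with eager-lock turned back on.
write_iops = {
    "1. gluster-block": 2234,
    "2. fuse, minimal random-I/O tuning": 11600,
    "3. fuse, gluster-block group profile": 1930,
    "3b. group profile + eager-lock on": 10300,
}

baseline = write_iops["2. fuse, minimal random-I/O tuning"]
for config, iops in write_iops.items():
    print(f"{config:40s} {iops:6d} IOPS  ({iops / baseline:.0%} of tuned fuse)")
```

The summary makes the point of the comment concrete: fuse with the gluster-block group profile applied is even slower than gluster-block itself, and re-enabling eager-lock recovers most of the gap.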
This is the fio job file I used:

[global]
rw=randwrite
end_fsync=1
startdelay=0
ioengine=libaio
direct=1
bs=8k

[randwrite]
directory=/mnt/block
filename_format=f.$jobnum.$filenum
iodepth=16
numjobs=2
nrfiles=4
openfiles=4
filesize=256m
size=1g
io_size=256m
Runs on a newly allocated setup. Only 2 of the 3 servers have an NVMe SSD, so doing replica-2 runs for the SSD case [discussed with Pranith]. But results are similar to what we have seen earlier:

single client (iscsi over 10ge vs fuse over 10ge)

for gluster-block:
  read:  IOPS=13.7k, BW=53.4Mi (55.0M)(6406MiB/120003msec)
  write: IOPS=3839, BW=14.0Mi (15.7M)(1800MiB/120010msec)

for glusterfs-fuse:
  read:  IOPS=14.1k, BW=55.2Mi (57.9M)(6624MiB/120001msec)
  write: IOPS=9973, BW=38.0Mi (40.9M)(4675MiB/120006msec)

Collected profile for the randread test as well as the randwrite test. For reference, the script for the run looks like this:

ssh $RHS_LEAD_SERV "gluster volume profile ${RHS_VOL} info > /tmp/gprof.rread_start.txt"
sleep 30
${FIO} --output=${FIO_OUT_PRFX}randread --client=${FIO_HOSTS} ${FIO_JOB_PRFX}randread
ssh $RHS_LEAD_SERV "gluster volume profile ${RHS_VOL} info > /tmp/gprof.rread_end.txt"

ssh $RHS_LEAD_SERV "gluster volume profile ${RHS_VOL} info > /tmp/gprof.rwrite_start.txt"
sleep 30
${FIO} --output=${FIO_OUT_PRFX}randwrite --client=${FIO_HOSTS} ${FIO_JOB_PRFX}randwrite
ssh $RHS_LEAD_SERV "gluster volume profile ${RHS_VOL} info > /tmp/gprof.rwrite_end.txt"

The job file is similar to before, but this time it is time_based, with a runtime of 120s. Job for randwrite:

[global]
rw=randwrite
end_fsync=1
startdelay=0
ioengine=libaio
direct=1
bs=4k
numjobs=2

[randwrite]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=16
nrfiles=4
openfiles=4
filesize=10g
size=40g
time_based=1
runtime=120
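The write-side gap on this setup can again be expressed as a ratio. A quick Python check using only the single-client write IOPS from the fio output above:

```python
# Single-client random-write IOPS from the fio output above.
fuse_write = 9973
block_write = 3839

print(f"gluster-block gets {block_write / fuse_write:.0%} of fuse write IOPS")  # prints 38%
```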
Created attachment 1398100 [details] profile for fuse randread test
Created attachment 1398101 [details] profile for fuse randwrite test
Created attachment 1398102 [details] profile for gluster-block randread test
Created attachment 1398103 [details] profile for gluster-block randwrite test
https://review.gluster.org/19503
https://code.engineering.redhat.com/gerrit/133659 storage/posix: Add active-fd-count option in gluster
https://code.engineering.redhat.com/gerrit/133660 cluster/afr: Switch to active-fd-count for open-fd checks
https://code.engineering.redhat.com/gerrit/131944 cluster/afr: Remove unused code paths
https://code.engineering.redhat.com/gerrit/131945 cluster/afr: Make AFR eager-locking similar to EC
fio random write test with bs=8K on glusterfs-3.12.2-5.el7rhgs.x86_64:
  write: IOPS=2328, BW=18.2Mi (19.1M)(8192MiB/450264msec)

same test on glusterfs-3.12.2-6.el7rhgs.x86_64:
  write: IOPS=5814, BW=45.4Mi (47.6M)(8192MiB/180340msec)

There is a 2.5x improvement in write IOPS on this test with the 3.12.2-6 build, compared to 3.12.2-5. Read IOPS was unaffected in the test:

3.12.2-5:
  read: IOPS=10.7k, BW=83.6Mi (87.6M)(8192MiB/98010msec)
3.12.2-6:
  read: IOPS=10.8k, BW=84.1Mi (88.2M)(8192MiB/97426msec)
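The quoted 2.5x improvement factor follows directly from the two write results. A quick Python check using the IOPS values quoted above:

```python
# Write IOPS before and after the AFR eager-locking rework.
old_iops = 2328  # glusterfs-3.12.2-5
new_iops = 5814  # glusterfs-3.12.2-6

print(f"write IOPS improvement: {new_iops / old_iops:.1f}x")  # prints 2.5x
```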
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607