Bug 1328580 - [Perf] : Switching to "rhgs-random-io" profile makes random writes go really slow
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ravishankar N
QA Contact: Ambarish
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-19 18:26 UTC by Ambarish
Modified: 2018-04-16 18:17 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-16 18:17:17 UTC
Target Upstream Version:


Attachments

Description Ambarish 2016-04-19 18:26:48 UTC
Description of problem:
----------------------

As per the admin guide, the tuned-adm profile tailored for random workloads is "rhgs-random-io". I see that random write performance drops sharply when nodes are tuned to it.

Version-Release number of selected component (if applicable):
------------------------------------------------------------

glusterfs-3.7.9-1.el6rhs.x86_64


How reproducible:
----------------

2/2


Steps to Reproduce:
--------------------

1. Tune all the nodes in the gluster cluster to any profile, say "rhgs-sequential-io" or "throughput-performance". Run the random write workload thrice.

2. Clean the mount point. Switch to "rhgs-random-io". Restart glusterd. Remount the volume on the clients using FUSE.

3. Run the random write workload again, thrice.
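The steps above can be sketched as shell commands. This is only a sketch: the client mount point (/mnt/glustervol) is an assumption not stated in the report, and the server hostname and volume name are taken from the "Additional info" section below.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Adjust hostnames, volume name,
# and mount point for your setup.

# Step 1: on every server node, apply the baseline profile,
# then run the random write workload three times.
tuned-adm profile rhgs-sequential-io    # or: throughput-performance

# Step 2: switch profiles between runs.
umount /mnt/glustervol                  # clean the client mount point
tuned-adm profile rhgs-random-io        # on every server node
systemctl restart glusterd              # (service glusterd restart on RHEL 6)
mount -t glusterfs \
  gqas001.sbu.lab.eng.bos.redhat.com:/testvol /mnt/glustervol

# Step 3: re-run the random write workload three times and compare.
```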


Actual results:
--------------
I see a >50% performance hit on random writes with "rhgs-random-io" as compared to the other profiles on the same setup.


Expected results:
-----------------

RHGS should perform at least as well on random writes when nodes are tuned to the "rhgs-random-io" profile.


Additional info:
---------------

OS : RHEL 7.2

Iozone was used in a distributed, multithreaded manner with a 2G file size, a record size of 64K, and a total of 16 threads.

The setup consisted of 4 servers and 4 clients (1 mount per server) on a 10GbE network.


Volume Settings :


[root@gqas001 ~]# gluster v info

 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 2a668beb-7f26-48f9-8550-157108fe1a55
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas001.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas016.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
[root@gqas001 ~]# 
[root@gqas001 ~]#

Comment 2 Ambarish 2016-04-19 18:32:49 UTC
*****************
On RHEL 7.2 Setup 
*****************

The random write workload was run thrice against each tuned profile. Mean throughput for each profile is given below:

> rhgs-random-io : 133086.533333 KB/s
> rhgs-sequential-io : 337787.506667 KB/s
> throughput-performance : 356840.130000 KB/s
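A quick arithmetic check on the mean throughputs above confirms the ">50% hit" claim from the bug description; the percentages in the comments are computed, not from the report:

```python
# Mean throughputs reported above for the RHEL 7.2 setup, in KB/s.
random_io = 133086.533333
sequential_io = 337787.506667
throughput_perf = 356840.130000

# Fractional throughput loss of rhgs-random-io vs. the other profiles.
hit_vs_seq = 1 - random_io / sequential_io
hit_vs_tp = 1 - random_io / throughput_perf

print(f"hit vs rhgs-sequential-io:     {hit_vs_seq:.1%}")   # ~60.6%
print(f"hit vs throughput-performance: {hit_vs_tp:.1%}")    # ~62.7%
```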

Comment 3 Ambarish 2016-04-19 18:35:13 UTC
I see the same problem on a RHEL 6.x setup:

> rhgs-random-io : 146881.85 KB/s
> rhgs-sequential-io : 326301.74 KB/s
> throughput-performance :  366735.39 KB/s

Comment 4 Ambarish 2016-04-19 18:37:30 UTC
The exact workload :

3 * [iozone -+m <Your Iozone conf file containing all hostnames here> -+h <one of the hostnames> -C -w -c -e -i 2 -J 3 -+n -r 64k -s 2g -t 16 ]

Comment 5 Ambarish 2016-04-19 18:39:04 UTC
I'll update the BZ with server profiles soon.

Comment 6 Ambarish 2016-04-28 11:22:58 UTC
I'll check this with 3.1.2, just to see if it's a regression, and update my findings.

Comment 9 Manoj Pillai 2016-05-25 07:56:44 UTC
Here are the results of some fio tests varying vm.dirty* parameters. The fio tests have a sequential write test for which I used a jobfile like this:

[global]
rw=write
create_on_open=1
fsync_on_close=1
size=4g
bs=64k
openfiles=1
startdelay=0
ioengine=sync

[lgf-write]
directory=/mnt/glustervol/${HOSTNAME}
nrfiles=1
filename_format=f.$jobnum.$filenum
numjobs=8

And a random write test, with a jobfile like this:

[global]
rw=randwrite
fsync_on_close=1
io_size=1g
size=4g
bs=64k
openfiles=1
startdelay=0
ioengine=sync

[lgf-randwrite]
directory=/mnt/glustervol/${HOSTNAME}
nrfiles=1
filename_format=f.$jobnum.$filenum
numjobs=8

So, the random write test accesses only a portion of the file (determined by io_size) instead of the whole file (determined by size).

The tests are run from 4 clients (8 jobs on each client) to a 2x2 gluster volume on 4 servers.

Here are results for different values of size and io_size, and for different values of the vm.dirty* parameters. I'm only reporting results for the randwrite test.
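For reference, the two vm.dirty* settings compared below can be switched directly with sysctl on each server (a sketch; the tuned profiles normally set these, so direct sysctl use here is only for isolating the parameters):

```shell
# Values corresponding to rhgs-sequential-io (more dirty-page buffering):
sysctl -w vm.dirty_ratio=20 vm.dirty_background_ratio=10

# Values corresponding to rhgs-random-io (less buffering, earlier writeback):
sysctl -w vm.dirty_ratio=5 vm.dirty_background_ratio=2
```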

size=4g, io_size=1g
-------------------

vm.dirty_ratio=20; vm.dirty_background_ratio=10
write: io=32768MB, bw=79145K/s, iops=1236, runt=423963msec
clat (usec): min=97, max=115652K, avg=21957.24, stdev=637489.85

vm.dirty_ratio=5; vm.dirty_background_ratio=2
write: io=32768MB, bw=58347K/s, iops=911, runt=575089msec
clat (usec): min=116, max=14990K, avg=33738.72, stdev=108229.87

[In this case, the vm.dirty* values corresponding to rhgs-sequential-io give a big boost in throughput. But note that the max clat and clat stdev are much higher when the vm.dirty* values are higher.]

size=16g, io_size=1g
-------------------

vm.dirty_ratio=20; vm.dirty_background_ratio=10
write: io=32768MB, bw=46383K/s, iops=724, runt=723428msec
clat (usec): min=126, max=251072K, avg=36163.69, stdev=1386874.60

vm.dirty_ratio=5; vm.dirty_background_ratio=2
write: io=32768MB, bw=37976K/s, iops=593, runt=883569msec
clat (usec): min=110, max=44342K, avg=50008.68, stdev=232231.87

size=16g, io_size=0.5g
-------------------

vm.dirty_ratio=20; vm.dirty_background_ratio=10
write: io=16384MB, bw=42787K/s, iops=668, runt=392107msec
clat (usec): min=119, max=299196K, avg=21617.41, stdev=2337747.27

vm.dirty_ratio=5; vm.dirty_background_ratio=2
write: io=16384MB, bw=41841K/s, iops=653, runt=400973msec
clat (usec): min=118, max=25975K, avg=46318.03, stdev=222292.08

[In this case, there is hardly any difference in throughput, but as before, the clat max and stdev are much lower for vm.dirty* = 5,2.]

Comment 10 Manoj Pillai 2016-05-25 08:12:39 UTC
(In reply to Manoj Pillai from comment #9)
> Here are the results of some fio tests varying vm.dirty* parameters.

What I see from these results is that when the workload is truly random, there is not much difference in throughput between the two profiles. However, by reducing the amount of dirty data buffered in server memory, rhgs-random-io should reduce unpleasant effects like long delays (which appear like system freeze-ups) while dirty data is written back. The goal of the rhgs-random-io profile is not so much to boost throughput for random IO (compared to rhgs-sequential-io) as to smooth out the latency spikes.

My fio test with size=4g, io_size=1g and Ambarish's iozone test, where the entire 2g file is overwritten in the random IO phase, are not particularly random during writeback to disk. Individual iozone or fio writes get buffered in the server page cache, and multiple writes often get batched into larger writes before writeback. This effect is stronger with rhgs-sequential-io, since it allows more dirty data to be buffered and therefore more batching of writes before writeback.

