Bug 1480188 - writes amplified on brick with gluster-block
Summary: writes amplified on brick with gluster-block
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Pranith Kumar K
QA Contact: Sweta Anandpara
URL:
Whiteboard:
Depends On: 1499644
Blocks: 1503134
 
Reported: 2017-08-10 11:14 UTC by Manoj Pillai
Modified: 2018-09-17 11:10 UTC (History)
CC List: 12 users

Fixed In Version: glusterfs-3.12.2-6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:34:23 UTC
Embargoed:


Attachments
iostat output over an hour (223.60 KB, text/plain) - 2017-08-10 11:14 UTC, Manoj Pillai
iostat from client (79.61 KB, text/plain) - 2018-02-27 12:14 UTC, Manoj Pillai
iostat from a brick (184.37 KB, text/plain) - 2018-02-27 12:15 UTC, Manoj Pillai


Links
Red Hat Product Errata RHSA-2018:2607 (last updated 2018-09-04 06:35:54 UTC)

Description Manoj Pillai 2017-08-10 11:14:10 UTC
Created attachment 1311709 [details]
iostat output over an hour

Description of problem:

With gluster-block, iostat frequently shows the bytes written to the brick device to be much higher than the bytes written to the (iSCSI) gluster-block device. Consider the iostat output below:

<iostat output snippet on smerf03>
08/09/2017 12:37:50 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
nvme0n1           0.00     1.20    0.10  268.10     0.40 16377.70   122.13
sdc               0.00     1.70    0.00  198.20     0.00  3413.30    34.44
</iostat output snippet on smerf03>

In this case, the gluster brick is created on the nvme0n1 device. There are 2 other bricks, on 2 other systems, which together with this one form a 1x3 gluster volume on which the block device is created.
sdc is the iSCSI device corresponding to the gluster-block device; it is being accessed on smerf03, one of the nodes in the gluster cluster.

writes/s (w/s) on nvme0n1 is slightly higher than on sdc. That can probably be explained by the extra metadata writes needed in the gluster-block case.

wkB/s on nvme0n1 is much higher than on sdc, almost 5x. This is unexpected.

iostat on the other 2 nodes for the same period:

smerf02:
08/09/2017 12:37:51 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz
nvme0n1           0.00     1.40    0.30  300.30     1.20 18190.05   121.03

smerf04:
08/09/2017 12:37:56 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz
nvme0n1           0.00     0.80    0.00  271.30     0.00 16605.75   122.42


Detailed iostat output for about an hour is attached. The request sizes being sent to the nvme brick devices are much larger than the request sizes the sdc block device is receiving.
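
As a quick back-of-the-envelope check (a sketch added for clarity, not part of the original data collection; the numbers are copied from the smerf03 snippet above):

# Sketch: quantify the write amplification from the iostat figures above.
client_wkBps = 3413.30    # sdc, the iSCSI gluster-block device
brick_wkBps = 16377.70    # nvme0n1, the brick device on the same node

amplification = brick_wkBps / client_wkBps
print("write amplification: %.1fx" % amplification)  # ~4.8x, i.e. "almost 5x"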

Version-Release number of selected component (if applicable):

glusterfs-api-3.8.4-37.el7rhgs.x86_64
glusterfs-server-3.8.4-37.el7rhgs.x86_64
glusterfs-libs-3.8.4-37.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-37.el7rhgs.x86_64
glusterfs-fuse-3.8.4-37.el7rhgs.x86_64
glusterfs-3.8.4-37.el7rhgs.x86_64
glusterfs-cli-3.8.4-37.el7rhgs.x86_64

gluster-block-0.2.1-6.el7rhgs.x86_64

libtcmu-1.2.0-11.el7rhgs.x86_64
tcmu-runner-1.2.0-11.el7rhgs.x86_64


Additional info:

Volume Name: perfvol
Type: Replicate
Volume ID: 18c6ddde-0f69-4796-9c25-35f1ce7c3350
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: smerf02-10ge:/mnt/rhs_brick1
Brick2: smerf03-10ge:/mnt/rhs_brick1
Brick3: smerf04-10ge:/mnt/rhs_brick1
Options Reconfigured:
server.allow-insecure: on
user.cifs: off
features.shard-block-size: 64MB
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.quorum-type: auto
cluster.eager-lock: disable
network.remote-dio: disable
performance.readdir-ahead: off
performance.open-behind: off
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
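
(For reference -- illustrative, not from the original setup notes -- individual options like the ones above are applied with the standard CLI, e.g. "gluster volume set perfvol cluster.eager-lock disable".)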

Comment 4 Pranith Kumar K 2018-01-13 05:48:54 UTC
Manoj,
      I ran the test with smaller values adjusted for my environment, based on our earlier discussion on IRC. The number of writes larger than the 8KB I/O size appears to be very small in comparison (65482 writes in the 8k+ bucket versus only 28 in the 16k+ bucket), and I wonder if I am doing something wrong.

Interval 3 Stats:
   Block Size:                512b+                1024b+                2048b+ 
 No. of Reads:                    0                     0                     0 
No. of Writes:                    2                     8                    17 
 
   Block Size:               8192b+               16384b+ 
 No. of Reads:                    0                     0 
No. of Writes:                65482 <<<<-----           28  <<<<<------
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us             18  RELEASEDIR
      0.00       5.24 us       2.56 us       9.36 us             18     OPENDIR
      0.00      62.36 us      29.43 us      96.55 us             18      LOOKUP
      0.00     192.93 us      42.84 us     467.66 us             24    GETXATTR
      0.00    1239.89 us     567.02 us    2906.22 us             36     READDIR
      0.59      60.04 us      13.21 us   32508.10 us         131074    FINODELK
      1.95     212.93 us      46.80 us   46163.11 us         122842    FXATTROP
     14.64    2997.92 us     262.91 us  290958.54 us          65537       WRITE
     82.82   18092.43 us     217.65 us  339738.47 us          61433       FSYNC
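
(That is, only 28 of the 65,537 writes, roughly 0.04%, landed in the 16k+ bucket; essentially all writes stayed at the 8k request size of the job.)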

This is the job file used:
[global]
rw=randwrite
end_fsync=1
startdelay=0
ioengine=libaio
direct=1
bs=8k

[randwrite]
directory=/mnt/block
filename_format=f.$jobnum.$filenum
iodepth=16
numjobs=2
nrfiles=2
openfiles=4
filesize=256m
size=2g
io_size=256m
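
(For reference, assuming the job file above is saved as, say, randwrite.fio -- a filename assumed here for illustration -- it would be launched simply as "fio randwrite.fio".)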


Pranith

Comment 5 Manoj Pillai 2018-02-27 12:07:01 UTC
My recent observations are documented here:
https://bugzilla.redhat.com/show_bug.cgi?id=1491785#c18

The relevant info pertaining to this bz, i.e. write amplification:

From the gluster profile, it seems that the request sizes being received are 4K, as expected for the fio job file with bs=4k.

Interval 4 Stats:
   Block Size:               2048b+                4096b+
 No. of Reads:                    0                     0
No. of Writes:                    8                460811
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      7.64      28.07 us      10.00 us     818.00 us         921638    FINODELK
     15.04     110.50 us      61.00 us    5352.00 us         460819       WRITE
     32.34     119.16 us      63.00 us    5158.00 us         918726    FXATTROP
     44.97     331.33 us      23.00 us    6656.00 us         459385       FSYNC

However, iostat on the iSCSI client and on the gluster brick shows a big difference in request size and throughput:

iostat on client:
02/20/2018 12:36:13 AM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
sdc               0.00     0.00    0.00 3827.20     0.00 15308.80     8.00    

iostat on brick:
02/20/2018 12:36:12 AM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
nvme0n1           0.00     1.30    0.00 4774.40     0.00 135809.60    56.89  

avgrq-sz on the client is 8, i.e. 4K expressed in 512-byte sectors.
avgrq-sz on the server is 56+, i.e. 28K+.

Similarly, the wkB/s being written on the brick is much higher than the wkB/s the client is issuing.
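
As a worked conversion (a sketch using the numbers quoted above, not output from the run):

# avgrq-sz is reported in 512-byte sectors; convert to KB per request.
SECTOR_BYTES = 512
client_req_kb = 8.00 * SECTOR_BYTES / 1024.0   # sdc     -> 4.0 KB, matches bs=4k
brick_req_kb = 56.89 * SECTOR_BYTES / 1024.0   # nvme0n1 -> ~28.4 KB per request

throughput_ratio = 135809.60 / 15308.80        # brick wkB/s vs client wkB/s, ~8.9x
print(client_req_kb, brick_req_kb, throughput_ratio)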

Comment 6 Manoj Pillai 2018-02-27 12:14:27 UTC
Created attachment 1401305 [details]
iostat from client

Comment 7 Manoj Pillai 2018-02-27 12:15:13 UTC
Created attachment 1401307 [details]
iostat from a brick

Comment 8 Manoj Pillai 2018-02-27 12:18:20 UTC
Comment #6 has iostat output on the iSCSI client for a run that includes sequential write, random read, and random write, in that order.

Comment #7 has iostat output on one of the 2 bricks for the same run (I'm using a replica-2 config because I was unable to get an allocation with 3 SSDs).

gluster profile info for the randwrite phase is here: https://bugzilla.redhat.com/show_bug.cgi?id=1491785#c22.

Comment 9 Manoj Pillai 2018-02-27 12:20:31 UTC
One of the questions when we saw this write amplification with gluster-block was which component was causing it -- whether it was coming from tcmu or from above. These recent observations suggest that the problem is in gluster.

Comment 10 Manoj Pillai 2018-02-27 12:28:31 UTC
(In reply to Manoj Pillai from comment #9)
> One of the questions when we saw this write amplification with gluster-block
> was which component is causing it -- whether this was coming from tcmu or
> above. These recent observations suggest that the problem is in gluster.

Another point to add here: Nikhil Chawla, in his RHHI testing, is also reporting very similar behavior, where the throughput reported by iostat on the brick is much higher than what iostat reports inside the VM. So the issue seems to be something common to the RHHI and gluster-block configurations.

[The other similarity between the RHHI and gluster-block configurations is performance problems with eager-lock turned on: https://bugzilla.redhat.com/show_bug.cgi?id=1528566. Poor performance with eager-lock turned on is the reason it was turned off in the gluster-block group tuning profile].

Comment 11 Pranith Kumar K 2018-02-27 12:35:55 UTC
(In reply to Manoj Pillai from comment #5)
> My recent observations are documented here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1491785#c18
> 
> The relevant info pertaining to this bz i.e. write-amplification:
> 
> From the gluster profile, it seems that request sizes being received are 4K,
> as expected for the fio job file with bs=4k.
> 
> Interval 4 Stats:
>    Block Size:               2048b+                4096b+
>  No. of Reads:                    0                     0
> No. of Writes:                    8                460811
>  %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls        
> Fop
>  ---------   -----------   -----------   -----------   ------------       
> ----
>       7.64      28.07 us      10.00 us     818.00 us         921638   
> FINODELK
>      15.04     110.50 us      61.00 us    5352.00 us         460819      
> WRITE
>      32.34     119.16 us      63.00 us    5158.00 us         918726   
> FXATTROP
>      44.97     331.33 us      23.00 us    6656.00 us         459385      
> FSYNC
> 
> However, iostat on iscsi client and gluster brick show a big difference in
> request size and throughput:
> 
> iostat on client:
> 02/20/2018 12:36:13 AM
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> sdc               0.00     0.00    0.00 3827.20     0.00 15308.80     8.00  
> 
> 
> iostat on brick:
> 02/20/2018 12:36:12 AM
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> nvme0n1           0.00     1.30    0.00 4774.40     0.00 135809.60    56.89  
> 
> avgrq-sz on client is 8 i.e. 4K expressed in 512b sectors.
> avgrq-sz on server is 56+ i.e. 28K+
> 
> Similarly wkB/s being written on brick is much higher than wkB/s that client
> is getting.

How are fxattrops, i.e. getxattr+setxattr, translated into the I/O that iostat sees?
There are 918726 fxattrops, i.e. 918726 getxattrs and 918726 setxattrs.

Comment 12 Manoj Pillai 2018-02-27 13:22:21 UTC
There is XFS sitting between the gluster brick-side xlators and the block device, and the translation will happen there -- iostat only deals with block-device reads and writes. I'm not exactly sure of the details of xattr handling in XFS.

I'll do a run with cluster.ensure-durability off to see if that gives some insight into whether fsync forcing a write of metadata, in addition to data, is what is at work here.

Comment 13 Pranith Kumar K 2018-02-27 13:56:19 UTC
(In reply to Manoj Pillai from comment #12)
> There is xfs sitting between the gluster brick-side xlators and the block
> device. The translation will happen there -- iostat only deals with block
> device reads and writes. not exactly sure of details of xattr handling in
> xfs.
> 
> I'll do a run with cluster.ensure-durability off to see if that gives some
> insight, whether fsync forcing write of metadata in addition to data is what
> is at work here.

Hey,
    I just realized that the fsyncs that come from AFR are fdatasyncs. Based on the flags in the FSYNC fop, it gets converted to fdatasync at the posix layer. So maybe this will not give us the data we need.

Pranith
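
To illustrate the behaviour described above (purely a sketch; the function name and flag here are hypothetical, not the actual posix xlator code), a datasync-flagged FSYNC ends up as fdatasync() rather than fsync():

import os

def flush(fd, datasync):
    # Hypothetical illustration of the posix-layer choice: with the datasync
    # flag set, only the file data (and size) is forced out; other metadata
    # such as timestamps is not necessarily flushed.
    if datasync:
        os.fdatasync(fd)
    else:
        os.fsync(fd)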

Comment 14 Manoj Pillai 2018-02-28 03:03:09 UTC
Will wait for the eager-locking changes and then evaluate their impact:
https://bugzilla.redhat.com/show_bug.cgi?id=1491785#c24.

Comment 16 Pranith Kumar K 2018-03-23 13:16:09 UTC
https://code.engineering.redhat.com/gerrit/133659 storage/posix: Add active-fd-count option in gluster
https://code.engineering.redhat.com/gerrit/133660 cluster/afr: Switch to active-fd-count for open-fd checks
https://code.engineering.redhat.com/gerrit/131944 cluster/afr: Remove unused code paths
https://code.engineering.redhat.com/gerrit/131945 cluster/afr: Make AFR eager-locking similar to EC

Comment 17 Manoj Pillai 2018-04-12 07:14:21 UTC
iostat from fio random write test:

Before AFR eager-lock fixes: glusterfs-3.12.2-5.el7rhgs.x86_64

on client
03/28/2018 09:25:11 AM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
sdc               0.00     0.00    0.00 2294.40     0.00 18355.20    16.00 

on one of the bricks:
03/28/2018 09:25:10 AM 
nvme0n1           0.00     9.90    0.00 2775.30     0.00 76901.65    55.42

After AFR eager-lock fixes: glusterfs-3.12.2-6.el7rhgs.x86_64
on client
03/28/2018 03:27:57 PM
sdc               0.00     0.00    0.00 5791.70     0.00 46332.60    16.00

on one of the bricks:
nvme0n1           0.00     4.60    0.00 5799.70     0.00 48393.45    16.69

Write amplification is not seen with glusterfs-3.12.2-6.el7rhgs.x86_64: wkB/s is almost the same on the client and the brick in this build.
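
(As a rough check from the numbers above: before the fix, the brick-to-client wkB/s ratio is roughly 76901.65 / 18355.20 ≈ 4.2x; after the fix it is roughly 48393.45 / 46332.60 ≈ 1.04x.)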

This bz can be moved to verified.

Comment 21 errata-xmlrpc 2018-09-04 06:34:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

