Created attachment 1311709 [details]
iostat output over an hour

Description of problem:

With gluster-block, iostat frequently shows bytes written to the brick device to be much higher than the bytes written on the (iSCSI) gluster-block device. Consider the iostat output below:

<iostat output snippet on smerf03>
08/09/2017 12:37:50 PM
Device:   rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s  avgrq-sz
nvme0n1     0.00    1.20   0.10  268.10    0.40  16377.70    122.13
sdc         0.00    1.70   0.00  198.20    0.00   3413.30     34.44
</iostat output snippet on smerf03>

In this case, the gluster brick is created on the nvme0n1 device. There are 2 other bricks, on 2 other systems, that together form a 1x3 gluster volume on which the block device is created. sdc is the iSCSI device corresponding to the gluster-block device and is being accessed on one of the nodes, smerf03, in the gluster cluster.

writes/s (w/s) on nvme0n1 is slightly higher than on sdc. That can probably be explained by the extra metadata writes needed in the gluster-block case.

wkB/s on nvme0n1 is much higher than on sdc, almost 5x. This is unexpected.

iostat on the other 2 nodes for the same period:

smerf02:
08/09/2017 12:37:51 PM
Device:   rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s  avgrq-sz
nvme0n1     0.00    1.40   0.30  300.30    1.20  18190.05    121.03

smerf04:
08/09/2017 12:37:56 PM
Device:   rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s  avgrq-sz
nvme0n1     0.00    0.80   0.00  271.30    0.00  16605.75    122.42

Detailed iostat output for about an hour is attached. The request sizes being sent to the nvme brick devices are much larger than the request sizes that the sdc block device is receiving.
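As a quick sanity check on the "almost 5x" claim, the ratio follows directly from the wkB/s columns in the smerf03 snippet above. A minimal sketch (numbers copied from that snippet):

```python
# Write amplification implied by the iostat sample above: bytes hitting the
# brick device divided by bytes written on the iSCSI client device.
brick_wkbs = 16377.70   # nvme0n1 (gluster brick) wkB/s
client_wkbs = 3413.30   # sdc (iSCSI device backed by gluster-block) wkB/s

amplification = brick_wkbs / client_wkbs
print(f"write amplification: {amplification:.2f}x")  # ~4.80x
```

Replication alone cannot explain this, since each brick holds one copy of the data; each of the three bricks individually shows this inflated write rate.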
Version-Release number of selected component (if applicable):

glusterfs-api-3.8.4-37.el7rhgs.x86_64
glusterfs-server-3.8.4-37.el7rhgs.x86_64
glusterfs-libs-3.8.4-37.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-37.el7rhgs.x86_64
glusterfs-fuse-3.8.4-37.el7rhgs.x86_64
glusterfs-3.8.4-37.el7rhgs.x86_64
glusterfs-cli-3.8.4-37.el7rhgs.x86_64
gluster-block-0.2.1-6.el7rhgs.x86_64
libtcmu-1.2.0-11.el7rhgs.x86_64
tcmu-runner-1.2.0-11.el7rhgs.x86_64

Additional info:

Volume Name: perfvol
Type: Replicate
Volume ID: 18c6ddde-0f69-4796-9c25-35f1ce7c3350
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: smerf02-10ge:/mnt/rhs_brick1
Brick2: smerf03-10ge:/mnt/rhs_brick1
Brick3: smerf04-10ge:/mnt/rhs_brick1
Options Reconfigured:
server.allow-insecure: on
user.cifs: off
features.shard-block-size: 64MB
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.quorum-type: auto
cluster.eager-lock: disable
network.remote-dio: disable
performance.readdir-ahead: off
performance.open-behind: off
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
Manoj, I did the test with smaller values adjusted for my environment, based on our earlier discussion on IRC. The number of writes larger than the 8KB I/O size seems very low in comparison (65482 writes in the 8k+ bucket versus 28 in the 16k+ bucket), and I wonder if I am doing something wrong.

Interval 3 Stats:
   Block Size:       512b+       1024b+       2048b+
 No. of Reads:           0            0            0
No. of Writes:           2            8           17

   Block Size:      8192b+      16384b+
 No. of Reads:           0            0
No. of Writes:       65482 <<<<-----     28 <<<<<------

 %-latency  Avg-latency  Min-Latency   Max-Latency  No. of calls  Fop
 ---------  -----------  -----------   -----------  ------------  ----
      0.00      0.00 us      0.00 us       0.00 us            18  RELEASEDIR
      0.00      5.24 us      2.56 us       9.36 us            18  OPENDIR
      0.00     62.36 us     29.43 us      96.55 us            18  LOOKUP
      0.00    192.93 us     42.84 us     467.66 us            24  GETXATTR
      0.00   1239.89 us    567.02 us    2906.22 us            36  READDIR
      0.59     60.04 us     13.21 us   32508.10 us        131074  FINODELK
      1.95    212.93 us     46.80 us   46163.11 us        122842  FXATTROP
     14.64   2997.92 us    262.91 us  290958.54 us         65537  WRITE
     82.82  18092.43 us    217.65 us  339738.47 us         61433  FSYNC

This is the job file used:

[global]
rw=randwrite
end_fsync=1
startdelay=0
ioengine=libaio
direct=1
bs=8k

[randwrite]
directory=/mnt/block
filename_format=f.$jobnum.$filenum
iodepth=16
numjobs=2
nrfiles=2
openfiles=4
filesize=256m
size=2g
io_size=256m

Pranith
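As a rough cross-check of the profile numbers against the fio job file above (assuming fio splits io_size evenly into bs-sized requests), the expected write count can be derived from the job parameters:

```python
# Expected number of WRITE fops from the job file above: each of the
# numjobs=2 jobs issues io_size=256m of I/O in bs=8k requests.
numjobs = 2
io_size_kib = 256 * 1024   # io_size=256m, in KiB
bs_kib = 8                 # bs=8k

expected_writes = numjobs * io_size_kib // bs_kib
print(expected_writes)  # 65536
```

That lines up closely with the 65537 WRITE fops in the profile, so the 8k+ bucket dominating is what the job should produce; only writes that XFS or the shard layer happens to merge or split would land in other buckets.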
My recent observations are documented here:
https://bugzilla.redhat.com/show_bug.cgi?id=1491785#c18

The relevant info pertaining to this bz, i.e. write-amplification:

From the gluster profile, it seems that the request sizes being received are 4K, as expected for the fio job file with bs=4k.

Interval 4 Stats:
   Block Size:      2048b+       4096b+
 No. of Reads:           0            0
No. of Writes:           8       460811

 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
      7.64     28.07 us     10.00 us    818.00 us        921638  FINODELK
     15.04    110.50 us     61.00 us   5352.00 us        460819  WRITE
     32.34    119.16 us     63.00 us   5158.00 us        918726  FXATTROP
     44.97    331.33 us     23.00 us   6656.00 us        459385  FSYNC

However, iostat on the iSCSI client and the gluster brick show a big difference in request size and throughput:

iostat on client:
02/20/2018 12:36:13 AM
Device:   rrqm/s  wrqm/s    r/s      w/s   rkB/s      wkB/s  avgrq-sz
sdc         0.00    0.00   0.00  3827.20    0.00   15308.80      8.00

iostat on brick:
02/20/2018 12:36:12 AM
Device:   rrqm/s  wrqm/s    r/s      w/s   rkB/s      wkB/s  avgrq-sz
nvme0n1     0.00    1.30   0.00  4774.40    0.00  135809.60     56.89

avgrq-sz on the client is 8, i.e. 4K expressed in 512b sectors. avgrq-sz on the server is 56+, i.e. 28K+. Similarly, the wkB/s being written on the brick is much higher than the wkB/s that the client is getting.
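The avgrq-sz conversion above can be made explicit. iostat reports avgrq-sz in 512-byte sectors, so a minimal sketch using the two samples:

```python
# avgrq-sz in iostat is in 512-byte sectors; convert the two samples above
# to bytes to make the per-request inflation on the brick explicit.
SECTOR_BYTES = 512

client_avgrq = 8.00    # sdc (iSCSI client)
brick_avgrq = 56.89    # nvme0n1 (gluster brick)

client_req_bytes = client_avgrq * SECTOR_BYTES   # 4096.0 -> the expected 4K
brick_req_bytes = brick_avgrq * SECTOR_BYTES     # ~29128 -> roughly 28K
print(client_req_bytes, brick_req_bytes)
```

So each request reaching the brick device is roughly 7x the size of the 4K requests the client is issuing, which matches the ~8.9x gap in wkB/s once the slightly higher w/s on the brick is factored in.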
Created attachment 1401305 [details] iostat from client
Created attachment 1401307 [details] iostat from a brick
Comment #6 has iostat output on the iSCSI client for a run that includes sequential write, random read, and random write, in that order. Comment #7 has iostat output on one of the 2 bricks for the same run (I'm using a replica-2 config because I was unable to get an allocation with 3 SSDs). gluster profile info for the randwrite phase is here: https://bugzilla.redhat.com/show_bug.cgi?id=1491785#c22.
One of the questions when we saw this write amplification with gluster-block was which component is causing it -- whether this was coming from tcmu or above. These recent observations suggest that the problem is in gluster.
(In reply to Manoj Pillai from comment #9)
> One of the questions when we saw this write amplification with gluster-block
> was which component is causing it -- whether this was coming from tcmu or
> above. These recent observations suggest that the problem is in gluster.

Another point to add here: Nikhil Chawla, in his RHHI testing, is also reporting very similar behavior, where he sees throughput reported by iostat on the brick to be much higher than what iostat reports in the VM. So the issue seems to be something common to the RHHI and gluster-block configurations.

[The other similarity between the RHHI and gluster-block configurations is performance problems with eager-lock turned on: https://bugzilla.redhat.com/show_bug.cgi?id=1528566. Poor performance with eager-lock turned on is the reason it was turned off in the gluster-block group tuning profile.]
(In reply to Manoj Pillai from comment #5)
> My recent observations are documented here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1491785#c18
>
> The relevant info pertaining to this bz i.e. write-amplification:
>
> From the gluster profile, it seems that request sizes being received are 4K,
> as expected for the fio job file with bs=4k.
>
> [gluster profile and iostat output snipped; see comment #5]
>
> avgrq-sz on client is 8 i.e. 4K expressed in 512b sectors.
> avgrq-sz on server is 56+ i.e. 28K+
>
> Similarly wkB/s being written on brick is much higher than wkB/s that client
> is getting.

How do FXATTROPs, i.e. getxattr+setxattr, translate into iostat numbers? There are 918726 FXATTROPs, i.e. 918726 getxattrs and 918726 setxattrs.
There is xfs sitting between the gluster brick-side xlators and the block device; the translation will happen there. iostat only deals with block device reads and writes, and I'm not exactly sure of the details of xattr handling in xfs.

I'll do a run with cluster.ensure-durability off to see if that gives some insight into whether fsync forcing a write of metadata in addition to data is what is at work here.
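For reference, the experiment described above would be toggled with something like the following (a sketch, assuming the perfvol volume name from the volume info earlier in this bz; verify the option is supported on your build with `gluster volume get`):

```shell
# Hypothetical steps for the test run described above: turn off
# cluster.ensure-durability (AFR then skips issuing fsyncs on writes),
# rerun the workload, and compare client vs. brick iostat.
gluster volume get perfvol cluster.ensure-durability
gluster volume set perfvol cluster.ensure-durability off
# ... rerun the fio randwrite job, capture iostat on client and bricks ...
gluster volume set perfvol cluster.ensure-durability on
```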
(In reply to Manoj Pillai from comment #12)
> There is xfs sitting between the gluster brick-side xlators and the block
> device. The translation will happen there -- iostat only deals with block
> device reads and writes. not exactly sure of details of xattr handling in
> xfs.
>
> I'll do a run with cluster.ensure-durability off to see if that gives some
> insight, whether fsync forcing write of metadata in addition to data is what
> is at work here.

Hey, I just realized that the fsyncs that come from AFR are fdatasyncs. Based on the flags in the FSYNC fop, it gets converted to an fdatasync at the posix layer. So maybe this will not give us the data we need.

Pranith
Will wait for the eager-locking changes and then evaluate their impact: https://bugzilla.redhat.com/show_bug.cgi?id=1491785#c24.
https://code.engineering.redhat.com/gerrit/133659 storage/posix: Add active-fd-count option in gluster
https://code.engineering.redhat.com/gerrit/133660 cluster/afr: Switch to active-fd-count for open-fd checks
https://code.engineering.redhat.com/gerrit/131944 cluster/afr: Remove unused code paths
https://code.engineering.redhat.com/gerrit/131945 cluster/afr: Make AFR eager-locking similar to EC
iostat from fio random write test:

Before AFR eager-lock fixes: glusterfs-3.12.2-5.el7rhgs.x86_64

on client:
03/28/2018 09:25:11 AM
Device:   rrqm/s  wrqm/s    r/s      w/s   rkB/s     wkB/s  avgrq-sz
sdc         0.00    0.00   0.00  2294.40    0.00  18355.20     16.00

on one of the bricks:
03/28/2018 09:25:10 AM
nvme0n1     0.00    9.90   0.00  2775.30    0.00  76901.65     55.42

After AFR eager-lock fixes: glusterfs-3.12.2-6.el7rhgs.x86_64

on client:
03/28/2018 03:27:57 PM
sdc         0.00    0.00   0.00  5791.70    0.00  46332.60     16.00

on one of the bricks:
nvme0n1     0.00    4.60   0.00  5799.70    0.00  48393.45     16.69

Write amplification is not seen in glusterfs-3.12.2-6.el7rhgs.x86_64: wkB/s is almost the same on the client and the brick in this build. This bz can be moved to verified.
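Putting the before/after samples above into ratio form (brick wkB/s divided by client wkB/s) makes the fix visible at a glance:

```python
# Write amplification before and after the AFR eager-lock fixes, computed
# from the wkB/s columns in the iostat samples above.
before = 76901.65 / 18355.20   # glusterfs-3.12.2-5: ~4.2x amplification
after = 48393.45 / 46332.60    # glusterfs-3.12.2-6: ~1.04x, i.e. gone
print(f"before: {before:.2f}x, after: {after:.2f}x")
```

The residual few percent on the brick side is plausibly the gluster/XFS metadata overhead noted earlier in this bz, not amplification of the data writes themselves.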
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607