+++ This bug was initially created as a clone of Bug #1377193 +++

Description of problem:

The expectation was that smallfile read performance on an Arbiter volume would match replica 3 smallfile read performance. The observation is that Arbiter volume read performance is 30% of replica 3 read performance.

Version-Release number of selected component (if applicable):

glusterfs-cli-3.8.2-1.el7.x86_64
glusterfs-3.8.2-1.el7.x86_64
glusterfs-api-3.8.2-1.el7.x86_64
glusterfs-libs-3.8.2-1.el7.x86_64
glusterfs-fuse-3.8.2-1.el7.x86_64
glusterfs-client-xlators-3.8.2-1.el7.x86_64
glusterfs-server-3.8.2-1.el7.x86_64

How reproducible:
Every time.

gluster v info (Replica 3 volume):

Volume Name: rep3
Type: Distributed-Replicate
Volume ID: e7a5d84d-31da-40a8-85d0-2b94b95c3b28
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 172.17.40.13:/bricks/b/g
Brick2: 172.17.40.14:/bricks/b/g
Brick3: 172.17.40.15:/bricks/b/g
Brick4: 172.17.40.16:/bricks/b/g
Brick5: 172.17.40.22:/bricks/b/g
Brick6: 172.17.40.24:/bricks/b/g
Options Reconfigured:
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on

gluster v info (Arbiter volume):

Volume Name: arb
Type: Distributed-Replicate
Volume ID: e7a5d84d-31da-40a8-85d0-2b94b95c3b28
Status: Started
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: 172.17.40.13:/bricks/b01/g
Brick2: 172.17.40.14:/bricks/b01/g
Brick3: 172.17.40.15:/bricks/b02/g (arbiter)
Brick4: 172.17.40.15:/bricks/b01/g
Brick5: 172.17.40.16:/bricks/b01/g
Brick6: 172.17.40.22:/bricks/b02/g (arbiter)
Brick7: 172.17.40.22:/bricks/b01/g
Brick8: 172.17.40.24:/bricks/b01/g
Brick9: 172.17.40.13:/bricks/b02/g (arbiter)
Options Reconfigured:
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on

Steps to Reproduce:
For both the Replica 3 volume and the Arbiter volume, do the following:
1. Creation of files. Drop caches on the server and client side, then create the smallfile files using:
   /root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile --threads 4 --file-size 256 --files 6554 --record-size 32 --fsync Y --operation create
2. Reading of files. Again drop caches on the server and client side, then read the files using:
   /root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile --threads 4 --file-size 256 --files 6554 --record-size 32 --operation read
3. Compare the read performance of the Replica 3 and Arbiter volumes.

Actual results:
Arbiter read performance is 30% of replica 3 read performance for the smallfile workload.

Expected results:
Smallfile read performance of the Arbiter volume and the Replica 3 volume should ideally be the same.

--Shekhar
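For reference, the cache drop plus smallfile run from steps 1 and 2 could be scripted roughly as shown below. This is a minimal sketch, assuming: the script is run as root on the test driver, the gluster servers are listed one per line in a hypothetical servers.txt file reachable over passwordless SSH, the volume is FUSE-mounted at /mnt/glusterfs, and smallfile is checked out at /root/smallfile as in the commands above.

#!/bin/bash
# Sketch of the reproduction steps; servers.txt is a hypothetical helper file
# listing the gluster server hostnames, one per line.

drop_caches() {
    # Drop page/dentry/inode caches on every gluster server, then on the local client.
    while read -r host; do
        ssh -n "$host" 'sync; echo 3 > /proc/sys/vm/drop_caches'
    done < servers.txt
    sync; echo 3 > /proc/sys/vm/drop_caches
}

# Step 1: drop caches, then create the files (with fsync).
drop_caches
/root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile \
    --threads 4 --file-size 256 --files 6554 --record-size 32 \
    --fsync Y --operation create

# Step 2: drop caches again, then read the files back.
drop_caches
/root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile \
    --threads 4 --file-size 256 --files 6554 --record-size 32 \
    --operation read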
--- Additional comment from Ravishankar N on 2016-09-19 03:31:50 EDT ---

Note to self: workload used: https://github.com/bengland2/smallfile

--- Additional comment from Shekhar Berry on 2016-09-19 04:07:56 EDT ---

Smallfile performance numbers:

Create performance for 256KiB file size
---------------------------------------
Replica 2 volume : 407 files/sec/server
Arbiter volume   : 317 files/sec/server
Replica 3 volume : 306 files/sec/server

Read performance for 256KiB file size
-------------------------------------
Replica 2 volume : 380 files/sec/server
Arbiter volume   : 132 files/sec/server
Replica 3 volume : 329 files/sec/server

--Shekhar

--- Additional comment from Ravishankar N on 2016-09-22 05:55:55 EDT ---

I was able to get similar results in my testing, where the 'files/sec' was almost half for a 1x (2+1) setup compared to a 1x3 setup for a 256KB write size. A summary of the cumulative brick profile info from one such run is given below for some FOPs:

Replica 3 vol
-------------
No. of calls:   Brick1    Brick2    Brick3
Lookup          28,544    28,545    28,552
Read            17,695    17,507    17,228
FSTAT           17,714    17,535    17,247
Inodelk              8         8         8

Arbiter vol
-----------
No. of calls:   Brick1    Brick2    Arbiter brick
Lookup          56,241    56,246    56,245
Read            34,920    17,508    -
FSTAT           34,995    17,533    -
Inodelk         52,442    52,442    52,442

The sum total of reads across all bricks is similar for both the replica and arbiter setups. In the arbiter volume, zero reads are served from the arbiter brick, so the read load is spread across the first two bricks; likewise for FSTAT. The problem seems to be in the number of lookups: for the arbiter volume the count is roughly double that of replica-3, and I'm guessing this is what is slowing things down. I also see a lot of Inodelks for the arbiter volume, which is unexpected because the I/O was a read operation. I need to figure out why these two things are happening.

--- Additional comment from Ravishankar N on 2016-09-23 01:43:42 EDT ---

Pranith suggested that the extra lookups and inodelks could be due to spurious heals being triggered for some reason. Indeed, disabling client-side heals brings the read performance numbers close to replica-3. On debugging, it was found that the lookups were triggering metadata heals due to a mismatching count in the dict, as explained in the patch (BZ 1378684). Here are the profile numbers with the fix on the arbiter vol:

No. of calls:   Brick1    Brick2    Arbiter brick
Lookup          28,805    28,809    28,817
Read            34,920    17,507    -
FSTAT           34,991    17,547    -
Inodelk              8         8         8
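For reference, per-brick FOP counts like the ones above can be gathered with gluster's volume profile commands, and the client-side heal workaround mentioned can be toggled with volume set. A minimal sketch follows, assuming the arbiter volume name 'arb' from the description; the three cluster.*-self-heal options are an assumption about what "disabling client side heals" maps to, not something stated in the comments.

# Gather cumulative per-brick FOP counts around the read workload.
gluster volume profile arb start
# ... run the smallfile read workload here ...
gluster volume profile arb info cumulative
gluster volume profile arb stop

# Diagnostic workaround referenced above: turn off client-side (mount-side)
# self-heal so lookups no longer kick off spurious metadata heals. These exact
# option names are an assumption; the comment only says "client side heals".
gluster volume set arb cluster.metadata-self-heal off
gluster volume set arb cluster.data-self-heal off
gluster volume set arb cluster.entry-self-heal off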
Upstream mainline patch http://review.gluster.org/15548 posted.
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/85739/
Performed the same steps as in comment #1 (Shekhar's steps) on a 1x3 replicate volume and a 1x (2+1) arbiter volume. The issue is fixed and file creation on the arbiter volume is very similar to the replicate volume.

For both the Replica 3 volume and the Arbiter volume, do the following:
1. Creation of files. Drop caches on the server and client side, then create the smallfile files using:
   /root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile --threads 4 --file-size 256 --files 6554 --record-size 32 --fsync Y --operation create
2. Reading of files. Again drop caches on the server and client side, then read the files using:
   /root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile --threads 4 --file-size 256 --files 6554 --record-size 32 --operation read
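For reference, volumes of the two layouts used in this verification could be created roughly as follows. This is a sketch only; the hostnames and brick paths are illustrative placeholders, not the ones from the report.

# 1 x 3 replicate volume (illustrative hostnames and brick paths).
gluster volume create rep3 replica 3 \
    server1:/bricks/b/g server2:/bricks/b/g server3:/bricks/b/g
gluster volume start rep3

# 1 x (2 + 1) arbiter volume: the last brick listed in the set acts as the arbiter.
gluster volume create arb replica 3 arbiter 1 \
    server1:/bricks/b01/g server2:/bricks/b01/g server3:/bricks/b02/g
gluster volume start arb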
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html
*** Bug 1413021 has been marked as a duplicate of this bug. ***