Hello, could you try the experiment on a volume without sharding enabled and see if there is a difference in write performance between 2x2 and 2x(2+1) volumes?
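For reference, a minimal sketch of the two volume layouts being compared; the host names and brick paths below are placeholders, not taken from the reporter's setup:

# 2x2: distributed-replicate, replica 2, four bricks
gluster volume create testvol replica 2 \
    host1:/data/brick1 host2:/data/brick1 \
    host3:/data/brick1 host4:/data/brick1

# 2x(2+1): the same layout plus one arbiter brick per replica set
gluster volume create testvol-arb replica 3 arbiter 1 \
    host1:/data/brick1 host2:/data/brick1 host3:/data/brick1-arbiter \
    host3:/data/brick1 host4:/data/brick1 host1:/data/brick1-arbiter

gluster volume start testvol
gluster volume start testvol-arb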
Hi, I think I have the experiment you were looking for. Indeed, it seems that setting features.shard to off brings the performance back up again.

Regards,
Max

Experiment:

[root@localhost ~]# gluster volume create storage replica 3 arbiter 1 192.168.122.14:/data/brick1 192.168.122.15:/data/brick1 192.168.122.167:/data/brick1-arbiter 192.168.122.167:/data/brick1 192.168.122.230:/data/brick1 192.168.122.14:/data/brick1-arbiter force
volume create: storage: success: please start the volume to access data
[root@localhost ~]# gluster volume start storage
volume start: storage: success
[root@localhost ~]# gluster volume info

Volume Name: storage
Type: Distributed-Replicate
Volume ID: c601be67-2857-4bfd-a226-504e8d1f3c5b
Status: Started
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 192.168.122.14:/data/brick1
Brick2: 192.168.122.15:/data/brick1
Brick3: 192.168.122.167:/data/brick1-arbiter (arbiter)
Brick4: 192.168.122.167:/data/brick1
Brick5: 192.168.122.230:/data/brick1
Brick6: 192.168.122.14:/data/brick1-arbiter (arbiter)
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@localhost ~]# gluster volume set storage features.shard on
volume set: success
[root@localhost ~]# gluster volume set storage features.shard-block-size 16MB
volume set: success
[root@localhost ~]# mkdir /srv/storage
[root@localhost ~]# mount -t glusterfs 127.0.0.1:storage /srv/storage/
[root@localhost ~]# cd /srv/storage/
[root@localhost storage]# df -h /srv/storage/
Filesystem         Size  Used  Avail  Use%  Mounted on
127.0.0.1:storage  20G   2,0G  18G    11%   /srv/storage
[root@localhost storage]# dd if=/dev/zero of=testfile count=1 bs=10M
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 11,6287 s, 902 kB/s
[root@localhost storage]# gluster volume set storage features.shard off
volume set: success
[root@localhost storage]# dd if=/dev/zero of=testfile count=1 bs=10M
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0,0328133 s, 320 MB/s
[root@localhost storage]# gluster volume set storage features.shard on
volume set: success
[root@localhost storage]# dd if=/dev/zero of=testfile count=1 bs=10M
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 11,2339 s, 933 kB/s
[root@localhost storage]# gluster volume remove-brick storage replica 2 192.168.122.167:/data/brick1-arbiter 192.168.122.14:/data/brick1-arbiter force
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: success
[root@localhost storage]# dd if=/dev/zero of=testfile count=1 bs=10M
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0,0843147 s, 124 MB/s
[root@localhost storage]# gluster volume set storage features.shard off
[root@localhost storage]# dd if=/dev/zero of=testfile count=1 bs=10M
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0,0365119 s, 287 MB/s
[root@localhost storage]#
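For what it's worth, the toggle test above can be made repeatable with a small script. This is only a sketch: it assumes the "storage" volume and /srv/storage mount from the transcript, and it overwrites "testfile" on each pass. Note also the caveat in the next comment about toggling features.shard off on a volume that already holds sharded data you care about.

#!/bin/bash
# Compare write throughput with sharding on vs. off on the same volume.
VOL=storage
MNT=/srv/storage
for state in on off; do
    gluster volume set "$VOL" features.shard "$state"
    echo "features.shard=$state:"
    # dd prints its throughput summary on stderr; keep only the last line
    dd if=/dev/zero of="$MNT/testfile" count=1 bs=10M 2>&1 | tail -1
    rm -f "$MNT/testfile"
done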
According to the experiment, it seems that it is a bad idea to run

[root@localhost storage]# gluster volume set storage features.shard off

because then all existing sharded files appear shrunk to the shard-block-size.
(In reply to Max Raba from comment #3)
> According to the experiment, it seems that it is a bad idea to run
>
> [root@localhost storage]# gluster volume set storage features.shard off
>
> because then all existing sharded files appear shrunk to the shard-block-size.

Yes, I was only trying to isolate the cause. I'm able to recreate the issue. I'll update once I find out what the issue is.

Note: making the BZ description private upon the reporter's request, as it contains some sensitive IP information.
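The shrinking effect described above is consistent with how sharding stores data: everything past the first block lives as separate shard files under the hidden .shard directory on the bricks, so with the shard translator disabled the client sees only the base file. A quick way to check on one of the brick hosts (a sketch; the brick path is the one from the transcript, and the shard pieces are named after the base file's GFID):

# On a brick host: the base file holds only the first shard-block-size bytes
ls -lh /data/brick1/testfile
# The remaining pieces, if any, sit in the hidden .shard directory
ls -lh /data/brick1/.shard/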
REVIEW: http://review.gluster.org/15647 (afr: Take full locks in arbiter only for data transactions) posted (#1) for review on release-3.8 by Ravishankar N (ravishankar)
COMMIT: http://review.gluster.org/15647 committed in release-3.8 by Pranith Kumar Karampuri (pkarampu)
------
commit a1ee61051c6d284b9e632b975227c07cb4dda93d
Author: Ravishankar N <ravishankar>
Date:   Fri Oct 14 16:09:08 2016 +0530

    afr: Take full locks in arbiter only for data transactions

    Problem:
    Sharding exposed a bug in arbiter config. where `dd` throughput was
    extremely slow. Shard xlator was sending a fxattrop to update the file
    size immediately after a writev. Arbiter was incorrectly overriding the
    LLONG_MAX-1 start offset (for metadata domain locks) for this fxattrop,
    causing the inodelk to be taken on the data domain. And since the
    preceding writev hadn't released the lock (afr does a 'lazy' unlock if
    write succeeds on all bricks), this degraded to a blocking lock causing
    extra lock/unlock calls and delays.

    Fix:
    Modify flock.l_len and flock.l_start to take full locks only for data
    transactions.

    > Reviewed-on: http://review.gluster.org/15641
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    (cherry picked from commit 3a97486d7f9d0db51abcb13dcd3bc9db935e3a60)

    Change-Id: I906895da2f2d16813607e6c906cb4defb21d7c3b
    BUG: 1375125
    Signed-off-by: Ravishankar N <ravishankar>
    Reported-by: Max Raba <max.raba>
    Reviewed-on: http://review.gluster.org/15647
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
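For anyone wanting to see the behaviour described in the commit message on their own setup, the volume profiler can make the extra lock traffic visible. A sketch, assuming the "storage" volume and mount point from the earlier transcript; on an affected build, FINODELK/FXATTROP counts and latencies should be disproportionately high relative to WRITE:

gluster volume profile storage start
dd if=/dev/zero of=/srv/storage/testfile count=1 bs=10M
gluster volume profile storage info
gluster volume profile storage stop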
Hi, I installed the glusterfs nightly build RPMs (2016-10-25), downloaded from http://artifacts.ci.centos.org/gluster/nightly/release-3.8/7/x86_64/?C=M;O=D, and created a replica 3 arbiter 1 volume with features.shard enabled (set to "enable" or "on"). Info as follows:

[root@horeba ~]# gluster --version
glusterfs 3.8.5 built on Oct 25 2016 02:09:23
[root@horeba ~]# gluster volume info data_volume3

Volume Name: data_volume3
Type: Distributed-Replicate
Volume ID: cd5f4322-11e3-4f18-a39d-f0349b8d2a0c
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: 192.168.10.71:/data_sdaa/brick
Brick2: 192.168.10.72:/data_sdaa/brick
Brick3: 192.168.10.73:/data_sdaa/brick (arbiter)
Brick4: 192.168.10.71:/data_sdc/brick
Brick5: 192.168.10.73:/data_sdc/brick
Brick6: 192.168.10.72:/data_sdc/brick (arbiter)
Brick7: 192.168.10.72:/data_sde/brick
Brick8: 192.168.10.73:/data_sde/brick
Brick9: 192.168.10.71:/data_sde/brick (arbiter)
Brick10: 192.168.10.71:/data_sde/brick1
Brick11: 192.168.10.72:/data_sdc/brick1
Brick12: 192.168.10.73:/data_sdaa/brick1 (arbiter)
Options Reconfigured:
server.allow-insecure: on
features.shard: enable
features.shard-block-size: 512MB
storage.owner-gid: 36
storage.owner-uid: 36
nfs.disable: on
cluster.data-self-heal-algorithm: full
auth.allow: *
network.ping-timeout: 10
performance.low-prio-threads: 32
performance.io-thread-count: 32
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on

I mounted the volume with the glusterfs client on one host and ran a dd test. The results are:

[root@horebb test6]# for i in `seq 3`; do dd if=/dev/zero of=./file bs=1G count=1 oflag=direct ; done
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 55.9329 s, 19.2 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 54.8481 s, 19.6 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 57.9079 s, 18.5 MB/s

Then I disabled the features.shard option and tested again:

[root@horeba ~]# gluster volume reset data_volume3 features.shard
volume reset: success: reset volume successful
[root@horebb test6]# for i in `seq 3`; do dd if=/dev/zero of=./filetest bs=1G count=1 oflag=direct ; done
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.25607 s, 855 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.18359 s, 907 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.29374 s, 830 MB/s

I also cloned the master source from 2016-10-25 (git clone https://github.com/gluster/glusterfs), built RPMs from it, and installed them; the test result was the same as with the nightly build. So the problem of bad performance with the shard feature enabled still exists. Please look into how to resolve it; as we know, sharding is important for glusterfs usage.
Hi humaorong, can you try the same test on a normal replica-3 volume with and without sharding enabled, and check whether you see similar performance differences?
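In case it helps, a minimal sketch of the requested comparison volume; the brick paths are placeholders, and any three empty brick directories on the three hosts from the previous comment would do:

# Plain replica 3, no arbiter
gluster volume create data_volume replica 3 \
    192.168.10.71:/data_sdc/brick3 \
    192.168.10.72:/data_sdc/brick3 \
    192.168.10.73:/data_sdc/brick3
gluster volume start data_volume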
Hi Ravishankar N, I tested as follows. The gluster volume is replica 3, no arbiter:

[root@horeba ~]# gluster volume info data_volume

Volume Name: data_volume
Type: Replicate
Volume ID: 48d74735-db85-44e8-b0d2-1c8cf651418c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.10.71:/data_sdc/brick3
Brick2: 192.168.10.72:/data_sdc/brick3
Brick3: 192.168.10.73:/data_sdc/brick3

With features.shard disabled, performance is OK:

[root@horebc mnt]# for i in `seq 3`; do dd if=/dev/zero of=./file bs=1G count=1 oflag=direct ; done
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.85306 s, 579 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.85131 s, 580 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.85037 s, 580 MB/s

With features.shard enabled, performance is also OK:

[root@horebc mnt]# for i in `seq 3`; do dd if=/dev/zero of=./filetest bs=1G count=1 oflag=direct ; done
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.84995 s, 580 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.87079 s, 574 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.85104 s, 580 MB/s

So at this point it looks like only a replica 3 arbiter 1 volume with features.shard enabled has bad performance.
I am sorry, I made a mistake in comment 9 ("humaorong 2016-10-26 01:41:39 EDT"): none of those results actually had sharding enabled. Now with sharding enabled:

[root@horeba ~]# gluster volume set data_volume features.shard on
volume set: success
[root@horeba ~]# gluster volume set data_volume features.shard-block-size 512MB
volume set: success
[root@horebc mnt]# for i in `seq 3`; do dd if=/dev/zero of=./filetest2 bs=1G count=1 oflag=direct ; done
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 5.91316 s, 182 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 6.00505 s, 179 MB/s
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 5.92659 s, 181 MB/s
So, per these tests, write performance is bad whenever sharding is enabled; both arbiter and non-arbiter volumes show this problem.
Right, can you raise a separate bug with the component set to replicate? We can take it from there. Also, for O_DIRECT writes to be honoured, you will need to disable network.remote-dio and enable performance.strict-o-direct.
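For reference, a sketch of those two option changes, assuming the data_volume name from the earlier comments:

gluster volume set data_volume network.remote-dio disable
gluster volume set data_volume performance.strict-o-direct on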
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.6, please open a new bug report. glusterfs-3.8.6 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution. [1] http://www.gluster.org/pipermail/packaging/2016-November/000217.html [2] https://www.gluster.org/pipermail/gluster-users/