Description of problem:
The expectation was that smallfile read performance on an Arbiter volume would match replica 3 smallfile read performance. The observation is that Arbiter volume read performance is only 30% of replica 3 read performance.

Version-Release number of selected component (if applicable):
glusterfs-cli-3.8.2-1.el7.x86_64
glusterfs-3.8.2-1.el7.x86_64
glusterfs-api-3.8.2-1.el7.x86_64
glusterfs-libs-3.8.2-1.el7.x86_64
glusterfs-fuse-3.8.2-1.el7.x86_64
glusterfs-client-xlators-3.8.2-1.el7.x86_64
glusterfs-server-3.8.2-1.el7.x86_64

How reproducible:
Every time.

gluster v info (Replica 3 volume):

Volume Name: rep3
Type: Distributed-Replicate
Volume ID: e7a5d84d-31da-40a8-85d0-2b94b95c3b28
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 172.17.40.13:/bricks/b/g
Brick2: 172.17.40.14:/bricks/b/g
Brick3: 172.17.40.15:/bricks/b/g
Brick4: 172.17.40.16:/bricks/b/g
Brick5: 172.17.40.22:/bricks/b/g
Brick6: 172.17.40.24:/bricks/b/g
Options Reconfigured:
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on

gluster v info (Arbiter volume):

Volume Name: arb
Type: Distributed-Replicate
Volume ID: e7a5d84d-31da-40a8-85d0-2b94b95c3b28
Status: Started
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: 172.17.40.13:/bricks/b01/g
Brick2: 172.17.40.14:/bricks/b01/g
Brick3: 172.17.40.15:/bricks/b02/g (arbiter)
Brick4: 172.17.40.15:/bricks/b01/g
Brick5: 172.17.40.16:/bricks/b01/g
Brick6: 172.17.40.22:/bricks/b02/g (arbiter)
Brick7: 172.17.40.22:/bricks/b01/g
Brick8: 172.17.40.24:/bricks/b01/g
Brick9: 172.17.40.13:/bricks/b02/g (arbiter)
Options Reconfigured:
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on

Steps to Reproduce:
For both the Replica 3 volume and the Arbiter volume, do the following:
1. Create the files. Drop caches on both the server and client side, then create smallfile files using:
   /root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile --threads 4 --file-size 256 --files 6554 --record-size 32 --fsync Y --operation create
2. Read the files. Again drop caches on both the server and client side, then read the smallfiles using:
   /root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile --threads 4 --file-size 256 --files 6554 --record-size 32 --operation read
3. Compare the read performance of the replica 3 and Arbiter volumes.

Actual results:
Arbiter read performance is 30% of replica 3 read performance for the smallfile workload.

Expected results:
Smallfile read performance of the Arbiter volume and the Replica 3 volume should ideally be the same.

--Shekhar
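For reference, the two-phase run above can be sketched as a small driver script. This is a minimal sketch assuming the paths and flags quoted in the report; drop_caches() is the usual `sync; echo 3 > /proc/sys/vm/drop_caches` and must be run (as root) on servers and clients alike:

```python
# Sketch of a reproduction driver for the steps above; the smallfile path,
# mount point, and flags are copied from the report.
import subprocess

SMALLFILE = "/root/smallfile/smallfile_cli.py"
MOUNT = "/mnt/glusterfs"

def drop_caches():
    """Flush dirty pages, then drop the page/dentry/inode caches."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def smallfile_cmd(operation):
    """Build the smallfile command line for one phase ('create' or 'read')."""
    cmd = [SMALLFILE, "--top", MOUNT, "--host-set", "clientfile",
           "--threads", "4", "--file-size", "256",
           "--files", "6554", "--record-size", "32"]
    if operation == "create":
        cmd += ["--fsync", "Y"]   # only the create phase fsyncs each file
    cmd += ["--operation", operation]
    return cmd

# Intended use (not executed here): for each phase, drop caches on every
# node, then run: subprocess.run(smallfile_cmd(op), check=True)
```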
Note to self: workload used: https://github.com/bengland2/smallfile
Smallfile Performance numbers:

Create Performance for 256KiB file size
---------------------------------------
Replica 2 Volume : 407 files/sec/server
Arbiter Volume   : 317 files/sec/server
Replica 3 Volume : 306 files/sec/server

Read Performance for 256KiB file size
-------------------------------------
Replica 2 Volume : 380 files/sec/server
Arbiter Volume   : 132 files/sec/server
Replica 3 Volume : 329 files/sec/server

--Shekhar
I was able to get similar results in my testing, where the 'files/sec' was almost half for a 1x (2+1) setup compared to a 1x3 setup for a 256KB write size. A summary of the cumulative brick profile info on one such run is given below for some FOPs:

Replica 3 vol
-------------
No. of calls:  Brick1   Brick2   Brick3
Lookup         28,544   28,545   28,552
Read           17,695   17,507   17,228
FSTAT          17,714   17,535   17,247
Inodelk             8        8        8

Arbiter vol
-----------
No. of calls:  Brick1   Brick2   Arbiter brick
Lookup         56,241   56,246   56,245
Read           34,920   17,508   -
FSTAT          34,995   17,533   -
Inodelk        52,442   52,442   52,442

I see that the sum total of the reads on all bricks is similar for both the replica and arbiter setups. In the arbiter vol, zero reads are served from the arbiter brick, so the read load is spread between the first 2 bricks. Likewise for FSTAT. But the problem seems to be in the number of lookups: for the arbiter volume, the count is roughly double that of replica-3. I'm guessing this is what is slowing things down. I also see a lot of Inodelks for the arbiter volume, which is unexpected because the I/O was a read operation. I need to figure out why these 2 things are happening.
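The observations above can be sanity-checked with a few lines of arithmetic on the quoted profile counts (numbers copied from the tables; this is just illustration, not gluster code):

```python
# Per-brick FOP counts from the profile tables above.
replica3 = {"Lookup": [28544, 28545, 28552],
            "Read":   [17695, 17507, 17228],
            "Inodelk": [8, 8, 8]}
arbiter = {"Lookup": [56241, 56246, 56245],
           "Read":   [34920, 17508, 0],      # arbiter brick serves no reads
           "Inodelk": [52442, 52442, 52442]}

# Total reads are nearly identical across the two volumes...
print("reads:", sum(replica3["Read"]), "vs", sum(arbiter["Read"]))
# ...while arbiter issues roughly twice as many lookups.
print("lookup ratio:", sum(arbiter["Lookup"]) / sum(replica3["Lookup"]))
```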
Pranith suggested that the extra lookups and inodelks could be due to spurious heals triggered for some reason. Indeed, disabling client-side heals brings the read performance numbers into proximity with replica-3. On debugging, it was found that the lookups were triggering metadata heals due to a mismatching element count in the dict, as explained in the patch (BZ 1378684). Here are the profile numbers with the fix on the arbiter vol:

No. of calls:  Brick1   Brick2   Arbiter brick
Lookup         28,805   28,809   28,817
Read           34,920   17,507   -
FSTAT          34,991   17,547   -
Inodelk             8        8        8
REVIEW: http://review.gluster.org/15578 (afr: Ignore gluster internal (virtual) xattrs in metadata heal check) posted (#1) for review on release-3.8 by Ravishankar N (ravishankar)
COMMIT: http://review.gluster.org/15578 committed in release-3.8 by Pranith Kumar Karampuri (pkarampu)
------
commit 44dbec60a2cd8fe6a68ff30cb6b8a1cf67b717be
Author: Ravishankar N <ravishankar>
Date:   Tue Sep 27 10:39:58 2016 +0530

    afr: Ignore gluster internal (virtual) xattrs in metadata heal check

    Backport of http://review.gluster.org/#/c/15548/

    Problem:
    In arbiter configuration, the posix-xlator in the arbiter brick always
    sets the GF_CONTENT_KEY in the response dict with a value 0. If the
    file size on the data bricks is more than quick-read's max-file-size
    (64kb default), those bricks don't set the key. Because of this
    difference in the no. of dict elements, afr triggers metadata heal
    in the lookup code path, in turn leading to extra lookups+inodelks.

    Fix:
    Changed afr dict comparison logic to ignore all virtual xattrs and
    the on-disk ones that we should not be healing.

    Change-Id: I05730bdd39d8fb0b9a49a5fc9c0bb01f0d3bb308
    BUG: 1377193
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/15578
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
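The Problem/Fix in the commit message above can be illustrated with a toy model of the dict comparison. This is a Python sketch, not the actual afr C code; the `user.foo` key is made up, and GF_CONTENT_KEY stands for the virtual xattr the commit describes:

```python
# Toy model of afr's metadata-heal dict comparison. Before the fix, a mere
# difference in the brick replies (here, an extra key on the arbiter)
# flagged a spurious metadata mismatch; the fix strips virtual xattrs such
# as GF_CONTENT_KEY before comparing.
GF_CONTENT_KEY = "glusterfs.content"
VIRTUAL_XATTRS = {GF_CONTENT_KEY}

def needs_heal_old(reply_a, reply_b):
    # Old behaviour: any difference between the dicts triggers a heal.
    return reply_a != reply_b

def needs_heal_fixed(reply_a, reply_b):
    # Fixed behaviour: compare only the non-virtual xattrs.
    strip = lambda d: {k: v for k, v in d.items() if k not in VIRTUAL_XATTRS}
    return strip(reply_a) != strip(reply_b)

# Data brick: file > quick-read's max-file-size (64KB), so no GF_CONTENT_KEY.
data_brick = {"user.foo": "bar"}
# Arbiter brick: posix always sets GF_CONTENT_KEY with value 0.
arbiter_brick = {"user.foo": "bar", GF_CONTENT_KEY: 0}
```

With these replies, the old comparison reports a mismatch (hence the extra heals, lookups, and inodelks seen in the profiles), while the fixed one does not.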
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.5, please open a new bug report. glusterfs-3.8.5 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/announce/2016-October/000061.html
[2] https://www.gluster.org/pipermail/gluster-users/