Description of problem:
Using a replica 2 + arbiter 1 configuration for an oVirt storage domain. When the arbiter node is up, client IO drops by ~30%. Monitoring bandwidth on the arbiter node shows significant rx traffic, in this case 50-60 MB/s, but the brick's local path on the arbiter node shows no significant disk usage (< 1 MB).

Version-Release number of selected component (if applicable):
CentOS 6.7 / 7.1, Gluster 3.7.3

How reproducible:
Always. If the arbiter node is killed, client IO is higher. Initially discovered using oVirt, but also easily reproduced by writing to a client fuse mount point.

Steps to Reproduce:
1. Create a gluster volume with replica 2 + arbiter 1
2. Write data to a client fuse mount point
3. Watch realtime network bandwidth on the arbiter node

Actual results:
The client sends IO writes to the arbiter node, decreasing expected performance.

Expected results:
No heavy IO should go to the arbiter node, since it has no reason to receive data when it holds no data bricks. This considerably slows client IO, as the client is writing to 3 nodes instead of 2. I would assume this is the same performance penalty as replica 3 vs replica 2.
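For reference, the reproduction steps above look roughly like the following sketch. The hostnames, brick paths, and volume name are hypothetical placeholders; in Gluster 3.7 the arbiter configuration is created with `replica 3 arbiter 1`, where the last brick listed becomes the arbiter.

```shell
# On one of the nodes: create and start a volume whose third brick is
# the arbiter (hostnames/paths are placeholders).
gluster volume create testvol replica 3 arbiter 1 \
    node1:/bricks/b1 node2:/bricks/b1 arb1:/bricks/b1
gluster volume start testvol

# On a client: mount the volume over FUSE and write to it.
mount -t glusterfs node1:/testvol /mnt/testvol
dd if=/dev/zero of=/mnt/testvol/testfile bs=1M count=200 oflag=direct

# Meanwhile, on the arbiter node: compare rx bandwidth against brick
# disk usage (rx grows by tens of MB/s, brick usage stays tiny).
watch -n1 'ifconfig eth1 | grep "RX packets"; du -sh /bricks/b1'
```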
Additional info:
During a disk migration from an NFS storage domain to a gluster storage domain, the arbiter node interface shows 37 GB of data received:

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.231.62  netmask 255.255.255.0  broadcast 10.0.231.255
        inet6 fe80::5054:ff:fe61:a934  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:61:a9:34  txqueuelen 1000  (Ethernet)
        RX packets 5874053  bytes 39820122925 (37.0 GiB)
        RX errors 0  dropped 650  overruns 0  frame 0
        TX packets 4793230  bytes 4387154708 (4.0 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

BUT the arbiter node has very little storage space available to it (the 'brick' mount point is on /):

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda3       8.6G  1.2G  7.4G  14% /
devtmpfs        912M     0  912M   0% /dev
tmpfs           921M     0  921M   0% /dev/shm
tmpfs           921M  8.4M  912M   1% /run
tmpfs           921M     0  921M   0% /sys/fs/cgroup
/dev/vda1       497M  157M  341M  32% /boot
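As a quick sanity check on the counters above, the raw byte counts from the interface can be converted to GiB with shell arithmetic (byte values taken from the ifconfig output above):

```shell
# RX/TX byte counters copied from the eth1 ifconfig output above.
rx_bytes=39820122925
tx_bytes=4387154708

# Integer GiB (1 GiB = 1024^3 bytes): ~37 GiB received vs ~4 GiB sent,
# a heavily asymmetric pattern for a node that stores no file data.
echo "RX GiB: $((rx_bytes / 1024 / 1024 / 1024))"
echo "TX GiB: $((tx_bytes / 1024 / 1024 / 1024))"
```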
Furthermore, if I do some write tests to a fuse mount point for the volume I get:

Active nodes: node1, node2, arbiter:
# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
^C34+0 records in
34+0 records out
35651584 bytes (36 MB) copied, 33.394 s, 1.1 MB/s

Active nodes: node1, node2:
# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 3.19871 s, 65.6 MB/s

Active nodes: node2, arbiter:
# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 5.86619 s, 35.7 MB/s

# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 7.74412 s, 27.1 MB/s

# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 6.77487 s, 31.0 MB/s
Hi Steve,

1. You're correct in observing that we send the writes even to the brick process on the arbiter node (even though the writes are not written to disk there). We do need the write to be sent to that brick for things to work (the AFR changelog xattrs depend on it). What we could do is send only one byte to the arbiter instead of the entire payload. I'll work on the fix.

2. FWIW, I'm not able to recreate the drastic difference in 'dd' throughput (1.1 vs 65.6 MB/s?) described in comment #1. I notice only a marginal difference on my test setup. Can you check whether you hit the same behaviour on a normal replica 3 volume?
Sorry it took a while to get back to this. My arbiter node was a VM before; I decided to get another physical host running and added some storage. I should also mention all my bricks are SSDs and the network is 10GbE.

I added a 3rd disk, replica 3, no arbiter. Fuse mount point:

dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.360235 s, 582 MB/s

I then rebuilt the volume with replica 3 arbiter 1. Fuse mount point:

dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.75525 s, 76.1 MB/s

Obviously there was some performance impact from the virt environment, although I have no idea how much. But in either case there is a significant gap here.

Also, I'm noticing that an oVirt VM backed by replica 3 arbiter 1 gets significantly lower write speed:

dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 16.5165 s, 12.7 MB/s

With replica 3, no arbiter, the VM gets:

dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.392785 s, 534 MB/s
Forgot to mention: with the arbiter node down, throughput inside a VM is very similar to the fuse mount point.

VM:
dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.80659 s, 74.7 MB/s

fuse:
dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.55501 s, 82.1 MB/s
REVIEW: http://review.gluster.org/12095 (afr: Do not wind the full writev payload to arbiter brick) posted (#1) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/12104 (afr: Do not wind the full writev payload to arbiter brick) posted (#1) for review on release-3.7 by Ravishankar N (ravishankar)
COMMIT: http://review.gluster.org/12104 committed in release-3.7 by Pranith Kumar Karampuri (pkarampu)
------
commit c87cef1763aac974bcd335965ba44fab46e4326d
Author: Ravishankar N <ravishankar>
Date:   Thu Sep 3 09:49:56 2015 +0530

    afr: Do not wind the full writev payload to arbiter brick

    ...because the arbiter xlator just unwinds it without passing it
    down till posix anyway. Instead, send a one-byte vector so that
    afr write transaction works as expected.

    Backport of http://review.gluster.org/#/c/12095/

    Change-Id: I52913ca51dfee0c8472cbadb62c5d39b7badef77
    BUG: 1255110
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/12104
    Tested-by: Gluster Build System <jenkins.com>
    Tested-by: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Hi Steve,

1. Would you be able to compile gluster from source (either the master or release-3.7 branch) and see if the fix in comment #7 (which just got merged) makes any improvement?

2. If it doesn't, could you attach the volume profile information so we can see where the latency is? The steps are documented here: http://www.gluster.org/community/documentation/index.php/Gluster_3.2:_Running_Gluster_Volume_Profile_Command

You basically need to do these steps for both volume types (normal 3-way and arbiter):
- start profiling,
- run the dd command (in a loop for a few iterations),
- while the dd is going on, run the profile info command 2 times, with an interval of say 10 seconds,
- stop profiling,
- send me the results of profile info.

Thanks,
Ravi
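The profiling steps above can be sketched as follows. This is a minimal sketch: the volume name "testvol" and the mount path are hypothetical placeholders, and it should be repeated for both the plain replica 3 volume and the arbiter volume.

```shell
# Start collecting per-brick latency/FOP statistics for the volume.
gluster volume profile testvol start

# Run the dd workload in a loop, in the background.
for i in 1 2 3 4 5; do
    dd if=/dev/zero of=/mnt/testvol/proftest bs=1M count=200 oflag=direct
done &

# While dd is running, capture profile info twice, ~10 seconds apart.
gluster volume profile testvol info > profile-1.txt
sleep 10
gluster volume profile testvol info > profile-2.txt
wait

# Stop profiling; attach profile-1.txt and profile-2.txt to the bug.
gluster volume profile testvol stop
```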
Hi Ravi, unfortunately not; I've gone into production with 3.6 replica 3 and don't have the hardware for a test setup.
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.5, please open a new bug report. glusterfs-3.7.5 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution. [1] http://www.gluster.org/pipermail/gluster-users/2015-October/023968.html [2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user