Description of problem:
We have several clusters running in a simple configuration: 3 servers with 1 brick each in Replicate mode (1x3). After upgrading from 3.7.6 to 3.7.8 (which fixed many memory leaks, thanks!) our write performance dropped to almost nothing. Where we would get 60-100 MB/sec we are now getting 1-4 MB/sec. This seems to happen when using the Gluster FUSE filesystem; if I mount the volume via NFS it seems to work correctly. Unfortunately we have experienced stability issues using NFS in our environment, so I cannot use this as a workaround.

Version-Release number of selected component (if applicable):
3.7.8

How reproducible:
I have created, from scratch, two separate three-node systems (yay for automation) and installed/created the gluster volume. In addition, I have three other clusters that were upgraded (Softlayer, Online-Tech and Azure) which are exhibiting the same problem.

Steps to Reproduce:
1. Provision three servers and create a gluster volume with:
   gluster volume create VOLUME_NAME replica 3 transport tcp $NODES

Actual results:
Incredible, unexplained poor write performance (read performance is okay).

Expected results:
Reasonable write performance.

Additional info:
Let me know if you would like any additional information from my environment.
It was suggested by hagarth in the #gluster freenode IRC channel to disable write-behind:

gluster volume set VOLUME performance.write-behind off

This seemed to bring the write performance of the volume back to normal levels.
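The workaround can be applied and verified as follows (a sketch, not from the original report; "myvol" is a placeholder volume name):

```shell
# Disable write-behind on the volume; clients pick up the change
# when they receive the regenerated volfile.
gluster volume set myvol performance.write-behind off

# Confirm the option is recorded under "Options Reconfigured".
gluster volume info myvol | grep -i write-behind
```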
I confirm the issue with a 3.7.8 client working with a replicated volume. However, if the volume is not replicated, the issue does not arise. Testing a freshly created pure distributed volume shows 60-90 MB/s. Adding a brick for replica 2 with add-brick lowers throughput to 1-6 MB/s. Removing the replica brick with remove-brick brings throughput back to 60-90 MB/s.
Testing environment: 3.7.6 server; 3.7.6 and 3.7.8 clients.
Benchmark: dd if=/dev/zero of=bwtest bs=1M count=64

=== replica 2, performance.write-behind on, storage.linux-aio off ===
3.7.6 client: 56.5 MB/s
3.7.8 client: 54.4 MB/s

=== replica 2, performance.write-behind on, storage.linux-aio on ===
3.7.6 client: 57.3 MB/s
3.7.8 client: 6.7 MB/s

=== replica 2, performance.write-behind off, storage.linux-aio on ===
3.7.6 client: 27.1 MB/s
3.7.8 client: 27.5 MB/s

=== replica 2, performance.write-behind off, storage.linux-aio off ===
3.7.6 client: 40.3 MB/s
3.7.8 client: 41.5 MB/s
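The dd benchmark above can be wrapped into a small reusable script (a sketch; MNT defaults to /tmp here only so the script runs anywhere — point it at your GlusterFS FUSE mount to reproduce the numbers above):

```shell
# Sequential-write throughput test: write 64 MiB with a final fsync so the
# reported rate reflects data actually flushed to the volume, then clean up.
MNT=${MNT:-/tmp}
dd if=/dev/zero of="$MNT/bwtest" bs=1M count=64 conv=fsync 2>&1 | tail -n 1
rm -f "$MNT/bwtest"
```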
Also, while copying files with Midnight Commander, the 3.7.8 client is always slow regardless of which options are set/unset.
I am also seeing a severe performance hit with 3.7.8. See my email to the users and devel mailing lists below. Note that setting "performance.write-behind off" did not change my results.

The 3.7.8 FUSE client is significantly slower than 3.7.6. Is this related to some of the fixes that were done to correct memory leaks? Is there anything that I can do to recover the performance of 3.7.6?

My testing involved creating a "bigfile" that is 20 GB. I then installed the 3.6.6 FUSE client and tested copying the bigfile from one gluster machine to another. The test was repeated twice to make sure caching wasn't affecting performance. Using CentOS 7.1:

FUSE 3.6.6 took 47 seconds and 38 seconds.
FUSE 3.7.6 took 43 seconds and 34 seconds.
FUSE 3.7.8 took 205 seconds and 224 seconds.

I repeated the test on another machine that is running CentOS 6.7 and the results were even worse: 98 seconds for FUSE 3.6.6 versus 575 seconds for FUSE 3.7.8.

My server setup is:

Volume Name: gfsbackup
Type: Distribute
Volume ID: 29b8fae9-dfbf-4fa4-9837-8059a310669a
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: ffib01bkp:/data/brick01/gfsbackup
Brick2: ffib01bkp:/data/brick02/gfsbackup
Options Reconfigured:
performance.readdir-ahead: on
cluster.rebal-throttle: aggressive
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
changelog.changelog: off
client.event-threads: 8
server.event-threads: 8
@David, Could you check if turning off performance.write-behind improves the write-performance?
(In reply to Soumya Koduri from comment #6)
> @David,
>
> Could you check if turning off performance.write-behind improves the
> write-performance?

It didn't have any effect. Let me know if you want me to try anything else.

David
CCing Ravishankar & Poornima, who have been actively looking at it.
So it looks like there are different parts to the perf degradation between 3.7.6 and 3.7.8:

1. http://review.gluster.org/12953 went in for 3.7.7. With this fuse patch, before every writev(), fuse sends a getxattr for 'security.capability' to the bricks. That is an extra FOP for every writev, which was not there in 3.7.6, where fuse returned the call with ENODATA without winding it to the bricks.

2. When the brick returns ENODATA for the getxattr, AFR does an inode refresh, which again triggers lookups and also seems to trigger data self-heal (need to figure out why). It also goes ahead and winds the getxattr to the other bricks of the replica (all of them fail with ENODATA). All of these are extra FOPs which add to the latency.

Potential code fixes needed:

a) We need to figure out if 12953 can be reverted/modified, or whether we explicitly want to wind the getxattr for security.capability. Poornima will work with Michael on that.

b) In AFR, if getxattr fails with ENODATA, we do not need to wind it to the other replica bricks (which would anyway fail with ENODATA).

c) Currently, before winding the getxattr to other bricks, any triggered self-heal has to be completed. This is anyway going to be fixed with http://review.gluster.org/#/c/13207/, where all client self-heals will be run in the background instead of blocking FOPs.
While there is no workaround in 3.7.8 for not winding the getxattr, disabling client-side heals (data self-heal in particular) seems to improve things a bit in our setup:

# gluster volume set <VOLNAME> data-self-heal off

Tavis, Oleksandr, David - could you check whether the above command improves the performance?
data-self-heal on: 1 MB/s data-self-heal off: 11 MB/s
"data-self-heal off" doesn't have any effect on throughput for me
Tavis, have you restarted and remounted the volume?
data-self-heal did not change the results for me. Both runs took a little over 3 minutes when using 3.7.8 and 30-45 seconds when using 3.7.6.

David
I un-mounted, stopped the volume, disabled self-heal, started it and remounted.
No effect on throughput for my test cluster.
(In reply to Tavis Paquette from comment #15)
> I un-mounted, stopped, disabled self-heal, started and remounted
> No effect on throughput for my test cluster

Same here. un-mount/stop/disable/start/remount had no effect for my test case.
If the performance didn't improve with the suggested workaround, then can you provide us with the profile info of the volume for the I/O workload you are testing? The volume profile can be taken using the following commands:

1. gluster vol profile <vol-name> start
2. Run the I/O from the mount point
3. gluster vol profile <vol-name> info   /* server profile info */
4. setfattr -n trusted.io-stats-dump -v /tmp/io-stats-pre.txt /your/mountpoint   /* client profile info */
5. gluster vol profile <vol-name> stop
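The steps above can be sketched as one script (an illustration, not from the original comment; "myvol", the mount point, the dd workload and the output paths are all placeholders):

```shell
VOL=myvol
MNT=/mnt/myvol

gluster volume profile "$VOL" start

# Run the workload you want profiled, e.g. the dd benchmark from this thread.
dd if=/dev/zero of="$MNT/bwtest" bs=1M count=64 conv=fsync

# Server-side profile: per-brick fop counts and latencies.
gluster volume profile "$VOL" info > /tmp/server-profile.txt

# Client-side io-stats dump, written out by the client process.
setfattr -n trusted.io-stats-dump -v /tmp/io-stats-pre.txt "$MNT"

gluster volume profile "$VOL" stop
```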
I have the output from the profile; however, it contains private information that I would not like released publicly. Can I mail it to you privately at pgurusid?
Sure, you can mail it to me. You could also remove the private information like hostnames, brick names, file names etc. from the output, as the information we need is:

1. In the server profile: the table of fops and latency for each brick, and which bricks are the replicas
2. In the client profile: the table of fops and latency
3. The exact workload, as in how many file creates/writes of what block sizes etc.
Sent
REVIEW: http://review.gluster.org/13540 (fuse: forbid access to security.selinux and security.capability xattr if not mounted with 'selinux') posted (#1) for review on master by Poornima G (pgurusid)
@Vijay Bellur 13540 definitely makes things better for me, yielding up to 50 MB/s.
Oleksandr Natalenko, good to hear that, thank you.

Tavis Paquette, I looked at the profile info, and it doesn't seem to be the case we fixed. In the case we fixed, the INODELK fop was consuming 99% of the time on the brick side; that doesn't seem to be the case in the profile info you sent. I see a lot of GETXATTR calls, but these shouldn't be reducing the perf as drastically as you have mentioned. Also, was this with write-behind off or on?

So we could try 2 things:
1. If you generate the profile info as mentioned earlier with the 3.7.6 client, for the same workload, we will be able to compare the fops and latency.
2. http://review.gluster.org/13540 reduces the number of getxattrs that are called. This will increase the perf, though maybe not by multiple folds. If you have a test system you could try this patch; it will in any case be part of the 3.7.9 release.
REVIEW: http://review.gluster.org/13540 (fuse: forbid access to security.selinux and security.capability xattr if not mounted with 'selinux') posted (#2) for review on master by Poornima G (pgurusid)
write-behind was enabled in the scenario where I was experiencing low write performance; turning write-behind off seemed to bring my write throughput back to 3.7.6 levels. The profile was generated on a cluster that had write-behind disabled.

I'll try to find time to generate a profile with 3.7.6 in the same configuration, as well as 3.7.8 with write-behind enabled and then 3.7.8 with write-behind disabled. It may take me a few days though; I'll follow up as soon as I can.
REVIEW: http://review.gluster.org/13540 (fuse: forbid access to security.selinux and security.capability xattr if not mounted with 'selinux') posted (#3) for review on master by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/13567 (distribute/tests: Use a different mount instead of reusing a Mount.) posted (#1) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/13540 (fuse: Add a new mount option capability) posted (#4) for review on master by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/13595 (afr: misc performance improvements) posted (#1) for review on master by Ravishankar N (ravishankar)
Note: 13595 aims to address problem #2 in comment#5
(In reply to Ravishankar N from comment #30)
> Note: 13595 aims to address problem #2 in comment#5

Ugh. I meant comment#9 :-/
REVIEW: http://review.gluster.org/13595 (afr: misc performance improvements) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/13626 (fuse: Add a new mount option capability) posted (#1) for review on release-3.7 by Poornima G (pgurusid)
COMMIT: http://review.gluster.org/13540 committed in master by Raghavendra G (rgowdapp)
------
commit 5b5f03d2665687ab717f123da1266bcd3a83da0f
Author: Poornima G <pgurusid>
Date: Fri Feb 26 06:42:14 2016 -0500

fuse: Add a new mount option capability

Originally all security.* xattrs were forbidden if selinux is disabled, which was causing Samba's acl_xattr module to not work, as it would store the NTACL in security.NTACL. To fix this, http://review.gluster.org/#/c/12826/ was sent, which forbid only security.selinux. This opened up a getxattr call on security.capability before every write fop and others. Capabilities can be used without selinux, hence if selinux is disabled, security.capability cannot be forbidden. Hence adding a new mount option called capability.

Only when the "--capability" or "--selinux" mount option is used is security.capability sent to the brick; else it is forbidden.

Change-Id: I77f60e0fb541deaa416159e45c78dd2ae653105e
BUG: 1309462
Signed-off-by: Poornima G <pgurusid>
Reviewed-on: http://review.gluster.org/13540
Smoke: Gluster Build System <jenkins.com>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.com>
Reviewed-by: Raghavendra G <rgowdapp>
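For client usage, a hedged sketch of the new behavior (server and volume names are placeholders, and the `-o capability` spelling assumes the option is plumbed through mount.glusterfs analogously to the existing `-o selinux`):

```shell
# Default mount: security.capability getxattrs are answered locally with
# ENODATA, avoiding the extra per-write FOP that caused this regression.
mount -t glusterfs server1:/myvol /mnt/myvol

# Opt in only if you actually store file capabilities or SELinux labels
# on the bricks (e.g. Samba's acl_xattr module):
mount -t glusterfs -o capability server1:/myvol /mnt/myvol
mount -t glusterfs -o selinux    server1:/myvol /mnt/myvol
```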
REVIEW: http://review.gluster.org/13644 (afr: misc performance improvements) posted (#1) for review on release-3.7 by Ravishankar N (ravishankar)
COMMIT: http://review.gluster.org/13595 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit d1d364634dce0c3dcfe9c2efc883c21af0494d0d
Author: Ravishankar N <ravishankar>
Date: Thu Mar 3 23:17:17 2016 +0530

afr: misc performance improvements

1. In afr_getxattr_cbk, consider the errno value before blindly launching an inode refresh and a subsequent retry on other children.

2. We want to accuse small files only when we know for sure that there is no IO happening on that inode. Otherwise, the ia_sizes obtained in the post-inode-refresh replies may mismatch due to a race between inode-refresh and ongoing writes, causing spurious heal launches.

Change-Id: Ife180f4fa5e584808c1077aacdc2423897675d33
BUG: 1309462
Signed-off-by: Ravishankar N <ravishankar>
Reviewed-on: http://review.gluster.org/13595
Smoke: Gluster Build System <jenkins.com>
Tested-by: Pranith Kumar Karampuri <pkarampu>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.com>
Reviewed-by: Krutika Dhananjay <kdhananj>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
COMMIT: http://review.gluster.org/13644 committed in release-3.7 by Vijay Bellur (vbellur)
------
commit e9fa7aeb1a32e22ff0749d67995e689028ca5a19
Author: Ravishankar N <ravishankar>
Date: Tue Mar 8 16:43:12 2016 +0530

afr: misc performance improvements

Backport of http://review.gluster.org/#/c/13595/

1. In afr_getxattr_cbk, consider the errno value before blindly launching an inode refresh and a subsequent retry on other children.

2. We want to accuse small files only when we know for sure that there is no IO happening on that inode. Otherwise, the ia_sizes obtained in the post-inode-refresh replies may mismatch due to a race between inode-refresh and ongoing writes, causing spurious heal launches.

Change-Id: I9858485d1061db67353ccf99c59530731649c847
BUG: 1309462
Signed-off-by: Ravishankar N <ravishankar>
Reviewed-on: http://review.gluster.org/13644
Smoke: Gluster Build System <jenkins.com>
NetBSD-regression: NetBSD Build System <jenkins.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Reviewed-by: Krutika Dhananjay <kdhananj>
CentOS-regression: Gluster Build System <jenkins.com>
REVIEW: http://review.gluster.org/13626 (fuse: Add a new mount option capability) posted (#2) for review on release-3.7 by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/13626 (fuse: Add a new mount option capability) posted (#3) for review on release-3.7 by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/13653 (fuse: Address the review comments in the backport) posted (#1) for review on master by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/13626 (fuse: Add a new mount option capability) posted (#4) for review on release-3.7 by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/13626 (fuse: Add a new mount option capability) posted (#5) for review on release-3.7 by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/13626 (fuse: Add a new mount option capability) posted (#6) for review on release-3.7 by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/13653 (fuse: Address the review comments in the backport) posted (#2) for review on master by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/13653 (fuse: Address the review comments in the backport) posted (#3) for review on master by Vijay Bellur (vbellur)
COMMIT: http://review.gluster.org/13626 committed in release-3.7 by Vijay Bellur (vbellur)
------
commit a8a8feb25216db2fa426b09d778f61c0f89d514c
Author: Poornima G <pgurusid>
Date: Fri Feb 26 06:42:14 2016 -0500

fuse: Add a new mount option capability

Originally all security.* xattrs were forbidden if selinux is disabled, which was causing Samba's acl_xattr module to not work, as it would store the NTACL in security.NTACL. To fix this, http://review.gluster.org/#/c/12826/ was sent, which forbid only security.selinux. This opened up a getxattr call on security.capability before every write fop and others. Capabilities can be used without selinux, hence if selinux is disabled, security.capability cannot be forbidden. Hence adding a new mount option called capability.

Only when the "--capability" or "--selinux" mount option is used is security.capability sent to the brick; else it is forbidden.

Backport of: http://review.gluster.org/#/c/13540/ & http://review.gluster.org/#/c/13653/

BUG: 1309462
Change-Id: Ib8d4f32d9f1458f4d71a05785f92b526aa7033ff
Signed-off-by: Poornima G <pgurusid>
Reviewed-on: http://review.gluster.org/13626
Tested-by: Vijay Bellur <vbellur>
Smoke: Gluster Build System <jenkins.com>
CentOS-regression: Gluster Build System <jenkins.com>
NetBSD-regression: NetBSD Build System <jenkins.org>
Reviewed-by: Vijay Bellur <vbellur>
Moving the BZ to MODIFIED as both the fuse and afr patches have been merged in the master and release-3.7 branches. The fixes should be available in the 3.7.9 release.
This issue also occurred for us. We were able to regain the expected write performance for our replica-2 volumes using the suggested workaround (performance.write-behind off, data-self-heal off).

However, the problem remains for replica-3 arbiter-1 volumes, and we are currently considering downgrading to 3.7.6. Do you have any further suggestions?
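For reference, a replica-3 arbiter-1 volume like the one described can be created as follows (a sketch; hostnames and brick paths are placeholders, and this create syntax is available in the 3.7 series):

```shell
# The third brick is the arbiter: it stores only metadata, no file data,
# giving replica-3 quorum semantics at replica-2 storage cost.
gluster volume create myvol replica 3 arbiter 1 \
    host1:/bricks/b1 host2:/bricks/b2 host3:/bricks/arb
gluster volume start myvol
```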
(In reply to Robert Rauch from comment #48)
> This issue also occurred for us. We were able to regain the expected write
> performance for our replica-2 volumes using the suggested workaround
> (performance.write-behind off, data-self-heal off).
>
> However, the problem remains for replica-3 arbiter-1 volumes and we
> currently consider downgrading to 3.7.6. Do you have any further suggestions?

It shouldn't matter whether it is an arbiter volume or not. I would suggest you upgrade to 3.7.9, which I think is going to be out in a couple of days.
(In reply to Ravishankar N from comment #49)
> (In reply to Robert Rauch from comment #48)
> > This issue also occurred for us. We were able to regain the expected write
> > performance for our replica-2 volumes using the suggested workaround
> > (performance.write-behind off, data-self-heal off).
> >
> > However, the problem remains for replica-3 arbiter-1 volumes and we
> > currently consider downgrading to 3.7.6. Do you have any further suggestions?
>
> It shouldn't matter if it is an arbiter volume or not. I would suggest you
> upgrade to 3.7.9 which I think is going to be out in a couple of days.

It seems it does matter. I have also upgraded to 3.7.9 today and still see sequential write performance on a 1x(2+1) volume stuck at 40 MB/s (tested with dd and block size 1M).
(In reply to Robert Rauch from comment #50)
> It seems it does matter. I have also upgraded to 3.7.9 today and still see
> sequential write performance on 1x(2+1) volume stuck at 40MB/s (tested with
> dd and block size 1M).

Hmm, interesting. Robert, could you raise a bug on 3.7.9 for this? The 'Component' of the BZ has to be 'arbiter'. Please also provide the volume profile info in the bug. Thanks.
(In reply to Ravishankar N from comment #51)
> > It seems it does matter. I have also upgraded to 3.7.9 today and still see
> > sequential write performance on 1x(2+1) volume stuck at 40MB/s (tested with
> > dd and block size 1M).
>
> Hmm interesting. Robert, could you raise bug on 3.7.9 for this? The
> 'Component' of the BZ has to be 'arbiter'. Please also provide the volume
> profile info in the bug. Thanks.

Robert, I've gone ahead and created BZ 1324004 on 'master'. Posting it here in case you want to track the fix.
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.9, please open a new bug report. glusterfs-3.7.9 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution. [1] https://www.gluster.org/pipermail/gluster-users/2016-March/025922.html [2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report. glusterfs-3.8.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution. [1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/ [2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user