This fix is needed for 3.2.3.
CHANGE: http://review.gluster.com/242 (Change-Id: Ia11c3ced4bec5959a5f0d8fcd4c6070b2ead220a) merged in release-3.2 by Anand Avati (avati)
CHANGE: http://review.gluster.com/248 (Change-Id: Ia11c3ced4bec5959a5f0d8fcd4c6070b2ead220a) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.com/254 (cherrypicked patch did not have the logic to reset port number to) merged in master by Anand Avati (avati)
*** Bug 3528 has been marked as a duplicate of this bug. ***
CHANGE: http://review.gluster.com/240 (This patch is a change in the way write transactions hold a lock) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.com/243 (This patch is a change in the way write transactions hold a lock) merged in release-3.2 by Anand Avati (avati)
Reducing severity.
Engineering: the customer wants to know if we can meet a 300-325 MB/sec throughput target. I need to get back to the customer.
Adding AB as he was involved in this case as well.
I think you *might* have some options for improving performance, but it would help us if we all had some more information about the configuration and workload:

- Can you reproduce the problem with iozone, dd, or a similar standard sequential-write benchmark? What command are you using? (A sample command is sketched after this list.)
- Is the workload sequential?
- What is the size range of the files?
- What is the average application write request size (i.e. bytes per write() system call)?
- How many threads are concurrently writing files? On how many clients?
- Are any special file open options being used by the application?
- How much RAM is in the servers?
- What kernel tunings, if any, have already been applied to the servers? To the client(s)? Have we followed the suggestions, particularly the deadline scheduler and dirty_ratio, in http://community.gluster.org/p/linux-kernel-tuning-for-glusterfs ?
- Can you supply /etc/glusterfs/*.vol from a server, or give me some idea what it looks like? What is the storage config underneath the Gluster bricks? What is the throughput of the underlying filesystems (they can be tested directly)?

Please respond if you need details about how to do any of this.
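A minimal sketch of the kind of sequential-write check being asked for, assuming the volume is mounted at /mnt/glusterfs (mount point, file name, and sizes are placeholders):

    # sequential write of a 16 GB file through the Gluster mount, including a final flush
    dd if=/dev/zero of=/mnt/glusterfs/ddtest.bin bs=1M count=16384 conv=fdatasync
    # equivalent iozone run, in the same style as the results later in this bug
    iozone -w -c -e -i 0 -+n -r 1024k -s 16g -f /mnt/glusterfs/x.ioz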
My latest result is 400 MB/s consistently on iozone write (really 800 MB/s network throughput) and 640 MB/s on iozone cached read, using a 16-MB record size and a 16-GB file. iozone uncached read gets 440 MB/s. This is without RDMA; I haven't gotten that working yet.

We might want to add this tuning to the Gluster kernel tuning page:

    sysctl -w net.core.{r,w}mem_max=4096000

The default is 131072. The purpose of this tuning is to enable a bigger TCP transport window. It has worked consistently for me in the past with RHEL 5 on InfiniBand or 10-GbE, and it boosted read performance significantly. (An expanded form of this tuning is sketched after this comment.)

I also use kernel vm.dirty tuning, the deadline scheduler, read_ahead_kb = 512, a 4-way 256-KB LVM stripe, ext4, and jumbo frames. I do not know whether all of these are needed; that's step 2. The first priority was to get write throughput (and read throughput) up. I'm using RHEL 5.7 with gluster 3.2.4.1 rpms.

[root@perf56 stat-collect]# grep gluster /proc/mounts
glusterfs#perf66-10ge:testfs /mnt/glusterfs fuse rw,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072 0 0

[root@perf56 stat-collect]# iozone -w -c -e -i 0 -+n -r 1024k -s 16g -f /mnt/glusterfs/x.ioz
    Iozone: Performance Test of File I/O
            Version $Revision: 3.392 $
            Compiled for 64 bit mode.
            Build: linux-AMD64
    Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
                  Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
                  Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                  Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                  Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                  Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                  Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
                  Ben England.

    Run began: Wed Oct 19 23:04:23 2011

    Setting no_unlink
    Include close in write timing
    Include fsync in write timing
    No retest option selected
    Record Size 1024 KB
    File size set to 16777216 KB
    Command line used: iozone -w -c -e -i 0 -+n -r 1024k -s 16g -f /mnt/glusterfs/x.ioz
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
                                                          random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
        16777216    1024  421798       0

    iozone test complete.

[root@perf64 ~]# gluster volume info

Volume Name: testfs
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.16.64:/vol
Brick2: 192.168.16.66:/vol
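A hedged sketch of applying the socket-buffer tuning above explicitly; the 4096000 value comes from the comment itself, but whether it is appropriate for a given server's RAM is an assumption to verify:

    # raise the maximum TCP socket buffer sizes (runtime only)
    sysctl -w net.core.rmem_max=4096000
    sysctl -w net.core.wmem_max=4096000
    # to persist across reboots, the same keys can be added to /etc/sysctl.conf:
    #   net.core.rmem_max = 4096000
    #   net.core.wmem_max = 4096000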
(In reply to comment #15)
> My latest result is 400 MB/s consistently on iozone write (really 800 MB/s
> network throughput) and 640 MB/s on iozone cached read using 16-MB record size
> and 16-GB file. iozone uncached read gets 440 MB/s. This is without RDMA, I
> haven't gotten that working yet.
>
> We might want to add this tuning to the Gluster kernel tuning page:
>
> sysctl -w net.core.{r,w}mem_max = 4096000

Are these applicable to RHEL 6 too? BTW, does changing this for the TCP transport window also affect overall memory pressure on the server side? Would it be safe to assume it has to be implemented together with fine-tuned values for the 'vm.dirty_*' parameters? Can you run a 'dd' test with a file twice the RAM size? This is a really interesting result; we have apparently always limited ourselves to 250-300 MB/sec for replicate over 10gig on almost every setup. It looks like we missed a lot in tuning after all, and also the performance we needed. It is also interesting to observe that changing the TCP window has helped here, rather than changing the window-size on the write-behind module, which is still the default 1MB in your configuration.

> I also use kernel vm.dirty tuning, deadline scheduler, read_ahead_kb = 512,
> 4-way 256-KB LVM stripe, ext4, jumbo frames. I do not know that all of these
> are needed, that's step 2. First priority was to get write throughput (and
> read throughput) up.

Jumbo frames is interesting; there have been problems with that usage many times in the past. While assuming it would help, users have constantly complained about packet drops, and we figured it to be a network-driver + jumbo-frames interaction.

Also, I had a question: would it be better to use RAID at the software level rather than the hardware level? I have read elsewhere that, since the kernel scheduler doesn't know the RAID geometry in the hardware case, it always results in inefficient block allocation, while software RAID lets the kernel scheduler allocate blocks efficiently.
Are these applicable to RHEL 6 too?

ben> I don't know whether this result holds for RHEL 6; I only tried RHEL 5.7. I'm not sure what RHEL release is being used at customer sites, including B&N, but the appliance documentation seemed to say CentOS 5.

BTW does changing this for the TCP transport window also affect overall memory pressure on the server side? Would it be safe to assume that it has to be implemented with fine-tuned values for the 'vm.dirty_*' parameters?

ben> I think it can be made safe, because AFAIK there are relatively few TCP connections active at the same time on a gluster server. If this is a concern, we can set net.ipv4.tcp_mem to get a tighter limit on TCP buffer usage; the default value may be insanely large. See the "tcp" and "socket" Linux man pages for details. (A sketch of these tunings follows this comment.)

Can you run 'dd' with a file twice the RAM size?

ben> For a 128-GB file, iozone reports 370 MB/s; remember that I've limited dirty pages, so writes don't fill memory with dirty pages anyway. Remember this is just one workload, and we need to try a wide variety of workloads before concluding that this configuration + tuning is right.

Also it is interesting to observe that changing the TCP window has helped here rather than changing the window-size on the write-behind module, which is still the default 1MB in your configuration.

ben> Changing the TCP transport window affected read performance, not write performance; I should have been clear about this. Cached read performance was at 500 MB/s before tuning, 640 MB/s after tuning. I am still learning about Gluster and did not know about write-behind, so perhaps I'm missing something there.
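A minimal sketch of the tunings being discussed here; all values and the device name are illustrative assumptions rather than tested recommendations:

    # limit how much dirty page cache can accumulate before writeback kicks in
    sysctl -w vm.dirty_ratio=10
    sysctl -w vm.dirty_background_ratio=5
    # deadline I/O scheduler and larger readahead on a brick device (device name is a placeholder)
    echo deadline > /sys/block/sdb/queue/scheduler
    echo 512 > /sys/block/sdb/queue/read_ahead_kb
    # optionally cap overall TCP buffer memory (three values: min, pressure, max, in pages);
    # see tcp(7) for details, since the defaults can be very large
    # sysctl -w net.ipv4.tcp_mem="196608 262144 393216"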
> but the appliance documentation seemed to say CentOS 5

Oh yes, the current appliance version is CentOS 5.6.

> ben> I think it can be made safe because AFAIK there are fewer TCP connections
> active at the same time for a gluster server. If this is a concern we can set
> net.ipv4.tcp_mem to get a tighter limit on TCP buffer usage -- the default
> value may be insanely large. See "tcp" and "socket" Linux man pages

Okay.

> ben> For a 128-GB file, iozone reports 370 MB/s, remember that I've limited
> dirty pages so writes don't fill memory with dirty pages anyway. Remember
> this is just one workload and we need to try a wide variety of workloads before
> concluding that this configuration+tuning is right.

Yep, sure.

> ben> Changing TCP transport window affected read performance, not write
> performance, I should have been clear about this. Cached read performance was
> at 500 MB/s before tuning, 640 after tuning. I am still learning about Gluster

write-behind -> solely for writes. Write-behind has a cache-size (window-size) which is 1MB by default; values above 32MB have been seen to cause performance degradation.
read-ahead -> solely for reads. Read-ahead has a page-count which is 4 by default, each page being 128 KBytes in size; the page count can be increased to 8 or 16, but we have still never seen a significant boost. (A sketch of setting these options from the CLI follows this comment.)
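For reference, a sketch of adjusting those translator options from the CLI. The option names are the ones I believe apply to 3.2-era GlusterFS and the values are illustrative; confirm with 'gluster volume set help' before relying on them:

    # write-behind window (default 1MB; very large values have shown regressions)
    gluster volume set testfs performance.write-behind-window-size 4MB
    # read-ahead page count (default 4 pages of 128KB each)
    gluster volume set testfs performance.read-ahead-page-count 8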
Adding excerpts from Vikas:

Done today:

We changed the kernel on the SSA nodes to the default CentOS 5.6 kernel (2.6.18-238) and installed Mellanox OFED (http://www.mellanox.com/downloads/ofed/MLNX_OFED_LINUX-1.5.2-2.1.0-rhel5.6.iso). This solved the IB speed problem at the link level. ibv_srq_pingpong now shows ~20,000 Mbit/s across all nodes, both clients and servers (a sketch of these link-level checks follows this comment).

However, there is still a problem. After experimenting with various configurations, the following was observed. All results are for a single-server, single-client setup; the server is always SSA.

Client OS   | Throughput (MB/s)
------------+----------------------------------------------
SSA         | ~600
CentOS 6.0  | 130-220 (tested with two different machines)
CentOS 5.6  | ~600

A pure-distribute volume shows similar results. A replicate volume never goes above 150 MB/s in any configuration. All machines have Mellanox OFED and the same firmware version. All performance translators except write-behind were disabled/enabled, with no change. The write-behind window size was set to 16MB, with no effect.

The problem is that the customer would rather have CentOS 6.0 on their clients. The customer also raised the point that performance degradation due to a different client OS does not inspire confidence: "what if we tomorrow find out that a XenServer client also gets bad performance?"

Next steps:

VS: Can someone in India today set up a simple one-server, one-client IB volume with the server being SSA and the client CentOS 6.0, and see if we observe a similar performance drop? We've promised the customer that we'll do this.

Avati/Harsha: Any theories about the problem on CentOS 6? Any particular GlusterFS IB options that might help?

Plan for tomorrow:
- Try changing IB options in GlusterFS to see if they help.
- Run the Intel MPI Benchmark (http://software.intel.com/en-us/articles/intel-mpi-benchmarks/) to measure IB throughput between the nodes. This will at least shift the blame away from Gluster.
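A hedged sketch of the link-level checks mentioned above, assuming the standard libibverbs example utilities are installed; host names are placeholders:

    # report the HCA's link rate, width, and state
    ibv_devinfo
    # raw verbs throughput test: start with no arguments on the server,
    # then point the client at the server's hostname or IP
    ibv_srq_pingpong                 # on the server
    ibv_srq_pingpong ib-server-01    # on the client (placeholder hostname)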
If you are using RDMA, the TCP tuning I suggested has no effect; I did not take that into account, sorry. There is a qperf utility that works in RHEL 6.2 with RDMA to test performance; perhaps this could help isolate the problem (a sample invocation is sketched after this comment).

Are you sure RDMA is being used? The 150 MB/s number sounds suspiciously like TCP without jumbo frames (too many interrupts).

What kind of hardware RAID config is being used there? I was using just JBOD (no RAID) at the hardware level, depending on Gluster write replication instead. Can you do some simple write tests to the underlying filesystem used by Gluster on the server side? Make sure it's not an I/O bottleneck.

Can you try doing this on a client during a test:

# vmstat 2 5 > vmstat.log

On a server, during a test:

# yum install sysstat
# iostat -kx /dev/sd? 2 > iostat.log

Lastly, use "netstat -i 2" during a test; with RDMA you should see very little traffic, because the network card does all the transmits/receives without the host knowing about it.

My apologies if you did some of this already ;-)
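A possible qperf invocation for separating network problems from Gluster problems, assuming qperf is installed on both ends; the host name is a placeholder and the exact test names should be checked against qperf(1) for the installed version:

    # on one node, start the qperf server (no arguments)
    qperf
    # on the other node, measure TCP and RDMA bandwidth/latency against it
    qperf ib-server-01 tcp_bw tcp_lat rc_rdma_write_bw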
(In reply to comment #20)
> If you are using RDMA, TCP tuning I suggested has no effect. I did not take
> that into account, sorry. There is a qperf utility that works in RHEL6.2

Ben, the problem does not seem to be 'rdma' per se, since a pure 'distribute' volume gives around 650 MB/sec. So what we get with RDMA and 'replicate' is an internal bug rather than a system-level tuning issue. There is a patch for 'eager transaction updates' in replicate which is not yet part of the 3.2.x releases and which can supposedly help improve performance; that needs to be tested.
This is the priority for the immediate future (before the 3.3.0 GA release). Will bump the priority up once we take up the RDMA-related tasks.
These are all 'GlusterFS-Commercial' bugs, mostly related to customers from a year or so back. It would be good to have a resolution on these issues. Moving the component, considering the visibility of the RHS component :-)
CHANGE: http://review.gluster.com/269 (performance/write-behind: preserve lk-owner while syncing writes.) merged in master by Anand Avati (avati)
This bug is not seen in the current master branch (which will soon be branched as RHS 2.1.0). To consider it for fixing, we want to make sure the bug still exists on RHS servers. If it cannot be reproduced, we would like to close this.