Bug 765141 (GLUSTER-3409) - replicate performance drops drastically on RDMA transport when changelog is enabled
Keywords:
Status: CLOSED WORKSFORME
Alias: GLUSTER-3409
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: 1.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Bug Updates Notification Mailing List
QA Contact:
URL:
Whiteboard:
Duplicates: GLUSTER-3528
Depends On:
Blocks:
 
Reported: 2011-08-12 14:04 UTC by Vikas Gorur
Modified: 2016-09-17 12:10 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-17 09:35:07 UTC
Target Upstream Version:



Comment 1 Vidya Sakar 2011-08-17 09:23:06 UTC
This fix is needed for 3.2.3.

Comment 2 Anand Avati 2011-08-18 01:01:07 UTC
CHANGE: http://review.gluster.com/242 (Change-Id: Ia11c3ced4bec5959a5f0d8fcd4c6070b2ead220a) merged in release-3.2 by Anand Avati (avati)

Comment 3 Anand Avati 2011-08-18 01:01:58 UTC
CHANGE: http://review.gluster.com/248 (Change-Id: Ia11c3ced4bec5959a5f0d8fcd4c6070b2ead220a) merged in master by Anand Avati (avati)

Comment 4 Anand Avati 2011-08-18 03:03:19 UTC
CHANGE: http://review.gluster.com/254 (cherrypicked patch did not have the logic to reset port number to) merged in master by Anand Avati (avati)

Comment 5 Anand Avati 2011-09-08 07:24:19 UTC
*** Bug 3528 has been marked as a duplicate of this bug. ***

Comment 6 Anand Avati 2011-09-08 11:07:32 UTC
CHANGE: http://review.gluster.com/240 (This patch is a change in the way write transactions hold a lock) merged in master by Anand Avati (avati)

Comment 7 Anand Avati 2011-09-08 11:07:47 UTC
CHANGE: http://review.gluster.com/243 (This patch is a change in the way write transactions hold a lock) merged in release-3.2 by Anand Avati (avati)

Comment 8 Amar Tumballi 2011-09-26 03:24:14 UTC
reducing severity.

Comment 12 Renee 2011-10-12 17:10:40 UTC
Engineering: the customer wants to know if we can meet 300-325 MB/sec. I need to get back to the customer.

Comment 13 Renee 2011-10-12 17:15:42 UTC
Adding AB as he was involved in this case as well.

Comment 14 Ben England 2011-10-19 11:33:15 UTC
I think you *might* have some options for improving performance, but it would help us if we all had some more information about the configuration and workload. 

- can you reproduce the problem with iozone, dd, or a similar standard sequential write benchmark?  What is the command that you are using?

- is workload sequential?

- what is size range of the files?

- what is average application write request size (i.e. # bytes/write() system call)?

- how many threads are concurrently writing files?  On how many clients? 

- are any special file open options being used by application?

- How much RAM in servers?

- What kernel tunings, if any, have already been applied to the servers?  The client(s)?  Have we followed the suggestions, particularly the deadline scheduler and dirty_ratio, in:

http://community.gluster.org/p/linux-kernel-tuning-for-glusterfs

- Can you supply /etc/glusterfs/*.vol from a server, or give me some idea what this looks like?  What is the storage config underneath the gluster bricks?  What is the throughput of the underlying filesystems (you can test them directly)?

Please respond if you need details about how to do any of this; a quick way to gather some of it is sketched below.
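
For reference, something along these lines collects most of the above (block device and volume names are placeholders):

# cat /sys/block/sdX/queue/scheduler               # active scheduler is shown in [brackets]
# sysctl vm.dirty_ratio vm.dirty_background_ratio
# cat /sys/block/sdX/queue/read_ahead_kb
# gluster volume info                              # volume layout and any non-default options
# free -g; uname -r                                # RAM and kernel version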

Comment 15 Ben England 2011-10-20 00:46:06 UTC
My latest result is 400 MB/s consistently on iozone write (really 800 MB/s of network throughput) and 640 MB/s on iozone cached read, using a 16-MB record size and a 16-GB file.  iozone uncached read gets 440 MB/s.  This is without RDMA; I haven't gotten that working yet.

We might want to add this tuning to the Gluster kernel tuning page:

sysctl -w net.core.{r,w}mem_max=4096000

The default for both net.core.rmem_max and net.core.wmem_max is 131072.  The purpose of this tuning is to enable a bigger TCP transport window.  This has worked consistently for me in the past with RHEL5 on Infiniband or 10-GbE.  It boosted read performance significantly.

I also use kernel vm.dirty tuning, deadline scheduler, read_ahead_kb = 512, 4-way 256-KB LVM stripe, ext4, jumbo frames.  I do not know that all of these are needed, that's step 2.  First priority was to get write throughput (and read throughput) up. 
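
Roughly, applying the non-TCP tunings above looks like this (the vm.dirty values and the sdX device name are placeholders/examples, not necessarily the exact values used here):

sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20
echo deadline > /sys/block/sdX/queue/scheduler
echo 512 > /sys/block/sdX/queue/read_ahead_kb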

I'm using RHEL5.7 with gluster 3.2.4.1 rpms.

[root@perf56 stat-collect]# grep gluster /proc/mounts
glusterfs#perf66-10ge:testfs /mnt/glusterfs fuse rw,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072 0 0

[root@perf56 stat-collect]# iozone -w -c -e -i 0 -+n -r 1024k -s 16g -f /mnt/glusterfs/x.ioz
        Iozone: Performance Test of File I/O
                Version $Revision: 3.392 $
                Compiled for 64 bit mode.
                Build: linux-AMD64 

        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                     Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                     Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                     Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                     Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
                     Ben England.

        Run began: Wed Oct 19 23:04:23 2011

        Setting no_unlink
        Include close in write timing
        Include fsync in write timing
        No retest option selected
        Record Size 1024 KB
        File size set to 16777216 KB
        Command line used: iozone -w -c -e -i 0 -+n -r 1024k -s 16g -f /mnt/glusterfs/x.ioz
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                                   random   random     bkwd    record    stride                                       
               KB   reclen    write  rewrite     read     reread     read    write     read   rewrite      read    fwrite  frewrite    fread   freread
        16777216    1024  421798       0                                                                                            

iozone test complete.

[root@perf64 ~]# gluster volume info

Volume Name: testfs
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.16.64:/vol
Brick2: 192.168.16.66:/vol

Comment 16 Harshavardhana 2011-10-20 01:13:00 UTC
(In reply to comment #15)
> My latest result is 400 MB/s consistently on iozone write (really 800 MB/s
> network throughput) and 640 MB/s on iozone cached read using 16-MB record size
> and 16-GB file.  iozone uncached read gets 440 MB/s.  This is without RDMA, I
> haven't gotten that working yet.  
> 
> We might want to add this tuning to the Gluster kernel tuning page:
> 
> sysctl -w net.core.{r,w}mem_max = 4096000
> 

Are these applicable to RHEL 6 too? BTW, does changing this for the TCP transport window also affect overall memory pressure on the server side? Would it be safe to assume that it has to be implemented along with fine-tuned values for the 'vm.dirty_*' parameters?

Can you run a 'dd' throughput test with a file twice the RAM size? This is a really interesting result, since we have apparently always limited ourselves to 250-300 MB/sec for replicate over 10GigE on almost every setup. It looks like we missed a lot in tuning after all, and with it the performance we needed.
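
Something along these lines, for example (size the count to roughly twice the server RAM; 131072 x 1MB = 128GB here is just an example, and the mount point is a placeholder):

dd if=/dev/zero of=/mnt/glusterfs/ddtest bs=1M count=131072 conv=fsync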

Also, it is interesting to observe that changing the TCP window has helped here, rather than changing the window-size on the write-behind module, which is still the default 1MB in your configuration.

> 
> I also use kernel vm.dirty tuning, deadline scheduler, read_ahead_kb = 512,
> 4-way 256-KB LVM stripe, ext4, jumbo frames.  I do not know that all of these
> are needed, that's step 2.  First priority was to get write throughput (and
> read throughput) up. 

Jumbo frames is interesting; there have been problems with that usage many times in the past. Although one would assume it helps, users have constantly complained about packet drops, which we figured was an interaction between network drivers and jumbo frames.

Also, I had a question: would it be better to use RAID at the software level rather than the hardware level? I have read elsewhere that since the kernel scheduler doesn't know the RAID geometry in the hardware case, it always results in inefficient block allocation, while software RAID lets the kernel schedule more efficiently.

Comment 17 Ben England 2011-10-20 08:51:42 UTC
> Are these applicable to RHEL 6 too?

ben> I don't know whether this result holds for RHEL6; I just tried RHEL5.7.  I'm not sure what RHEL release is being used at customer sites, including B&N, but the appliance documentation seemed to say CentOS 5.


> BTW, does changing this for the TCP transport window also affect overall memory
> pressure on the server side? Would it be safe to assume that it has to be
> implemented along with fine-tuned values for the 'vm.dirty_*' parameters?

ben> I think it can be made safe because AFAIK there are fewer TCP connections active at the same time for a gluster server.   If this is a concern we can set net.ipv4.tcp_mem to get a tighter limit on TCP buffer usage -- the default value may be insanely large.  See "tcp" and "socket" Linux man pages for details.
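
ben> A sketch of that, if we need it -- the three values below (min/pressure/max, in pages) are just example numbers, not recommendations:

sysctl -w net.ipv4.tcp_mem="196608 262144 393216"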


> Can you run a 'dd' throughput test with a file twice the RAM size? This is a really
> interesting result, since we have apparently always limited ourselves to
> 250-300 MB/sec for replicate over 10GigE on almost every setup. It looks like
> we missed a lot in tuning after all, and with it the performance we needed.

ben> For a 128-GB file, iozone reports 370 MB/s; remember that I've limited dirty pages, so writes don't fill memory with dirty pages anyway.  Remember this is just one workload, and we need to try a wide variety of workloads before concluding that this configuration + tuning is right.


> Also, it is interesting to observe that changing the TCP window has helped here,
> rather than changing the window-size on the write-behind module, which is still
> the default 1MB in your configuration.

ben> Changing the TCP transport window affected read performance, not write performance; I should have been clear about this.  Cached read performance was at 500 MB/s before tuning and 640 MB/s after tuning.  I am still learning about Gluster and did not know about write-behind; perhaps I'm missing something there.

Comment 18 Harshavardhana 2011-10-20 15:33:07 UTC
> but the appliance documentation seemed to say CentOS 5.
>

Oh yes appliance current version is of CentOS 5.6 

> ben> I think it can be made safe because AFAIK there are fewer TCP connections
> active at the same time for a gluster server.   If this is a concern we can set
> net.ipv4.tcp_mem to get a tighter limit on TCP buffer usage -- the default
> value may be insanely large.  See "tcp" and "socket" Linux man pages for

Okay. 

> ben>  For a 128-GB file, iozone reports 370 MB/s, remember that I've limited
> dirty pages so writes don't fill memory with dirty pages anyway.   Remember
> this is just one workload and we need to try a wide variety of workloads before
> concluding that this configuration+tuning is right.

Yep sure.

> 
> 
> ben> Changing TCP transport window affected read performance, not write
> performance, I should have been clear about this.  Cached read performance was
> at 500 MB/s before tuning, 640 after tuning.  I am still learning about Gluster

write-behind -> is solely for writes (write-behind has a cache-size or window-size, which is 1MB by default, but values above 32MB have been seen to cause performance degradation)

read-ahead -> is solely for reads (read-ahead has a page-count, which is 4 by default -- each page is 128 KBytes in size; the page count can be increased to 8 or 16, but we have still never seen a significant boost)
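
Both can be tuned per volume if someone wants to experiment; roughly like this (volume name and values are just examples, and the option names may vary slightly by release):

gluster volume set testfs performance.write-behind-window-size 16MB
gluster volume set testfs performance.read-ahead-page-count 8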

Comment 19 Harshavardhana 2011-11-09 17:07:34 UTC
Adding excerpts from Vikas :-

Done today:

We changed the kernel on the SSA nodes to the default CentOS 5.6 kernel (2.6.18-238) and installed Mellanox OFED (http://www.mellanox.com/downloads/ofed/MLNX_OFED_LINUX-1.5.2-2.1.0-rhel5.6.iso). This solved the IB speed problem at the link level. ibv_srq_pingpong now shows ~20,000Mbit/s across all nodes, both clients and servers.
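
(Roughly, that check looks like this -- run with no arguments on one node to act as the server, then point a second node at it; the hostname is a placeholder:)

ibv_srq_pingpong
ibv_srq_pingpong <server-hostname>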

However, there is still a problem. After experimenting with various configurations, the following was observed:

All results are for a single-server, single-client setup. The server is always SSA.

Client OS   | Throughput (MB/s)
------------|------------------
SSA         | ~600
CentOS 6.0  | 130-220 (tested with two different machines)
CentOS 5.6  | ~600

A pure-distribute volume shows similar results. A replicate volume never goes above 150 MB/s in any configuration.

All machines have Mellanox OFED and the same firmware version. All performance translators except write-behind were tried both disabled and enabled -- no change. The write-behind window size was set to 16MB -- no effect.

The problem is that the customer would rather have CentOS 6.0 on their clients. Also, the customer raised the point that performance degradation due to a different client OS does not inspire confidence --- "what if tomorrow we find out that a XenServer client also gets bad performance?"

Next steps:

VS: Can someone in India today set up a simple one-server, one-client IB volume, with the server being SSA and the client CentOS 6.0, and see if we see a similar performance drop? We've promised the customer that we'll do this.

Avati/Harsha: Any theories about the problem on CentOS 6? Any particular GlusterFS IB options that might help?

Plan for tomorrow:

- Try changing IB options in GlusterFS to see if they help.
- Run the Intel MPI Benchmark (http://software.intel.com/en-us/articles/intel-mpi-benchmarks/) to measure IB throughput between the nodes. This will at least shift the blame away from Gluster.

Comment 20 Ben England 2011-11-09 18:07:43 UTC
If you are using RDMA, the TCP tuning I suggested has no effect.  I did not take that into account, sorry.  There is a qperf utility that works in RHEL6.2 with RDMA to test performance; perhaps this could help isolate the problem.
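
Rough usage (server hostname is a placeholder): start qperf with no arguments on the server, then from the client run the RDMA and TCP streaming bandwidth tests:

# qperf
# qperf <server-hostname> rc_bw tcp_bw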

Are you sure RDMA is being used?  The 150 MB/s number sounds suspiciously like TCP without jumbo frames (too many interrupts).  

What kind of hardware RAID config is being used there?  I was using just JBOD (no RAID) at the hardware level, depending on Gluster write replication instead.

Can you do some simple write tests to the underlying filesystem used by Gluster on the server side?  Make sure it's not an I/O bottleneck.
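
For example, something like this directly against the brick filesystem (path and size are placeholders):

# dd if=/dev/zero of=/path/to/brick/ddtest bs=1M count=16384 conv=fsync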

Can you try doing this on a client:

# vmstat 2 5 > vmstat.log 

during a test?  On server, try:

# yum install sysstat
# iostat -kx /dev/sd? 2 > iostat.log

during a test.  Lastly, try:

# netstat -i 2

during a test; with RDMA you should see very little traffic, because the network card does all the transmits/receives without the host knowing about it.

My apologies if you did some of this already ;-)

Comment 21 Harshavardhana 2011-11-09 18:34:18 UTC
(In reply to comment #20)
> If you are using RDMA, TCP tuning I suggested has no effect.  I did not take
> that into account, sorry.   There is a qperf utility that works in RHEL6.2 

Ben, the problem seems not to be 'rdma' per se, since a pure 'distribute' volume gives around 650 MB/sec.

So what we are seeing with RDMA plus 'replicate' is an internal bug, rather than a system-level tuning issue.

There is a patch for 'eager transaction updates' in replicate, not yet part of the 3.2.x releases, which can supposedly help improve the performance; that needs to be tested.
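
Once that lands, it would presumably be toggled per volume along these lines (the option name is an assumption on my part, and the volume name is a placeholder):

gluster volume set <volname> cluster.eager-lock on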

Comment 22 Amar Tumballi 2012-02-27 10:35:59 UTC
This is the priority for the immediate future (before the 3.3.0 GA release). We will bump the priority up once we take up RDMA-related tasks.

Comment 23 Amar Tumballi 2012-06-08 10:04:29 UTC
These are all 'GlusterFS-Commercial' bugs, mostly related to customers from a year or so back. It would be good to have a resolution on these issues. Moving the component, considering the visibility in the RHS component :-)

Comment 25 Vijay Bellur 2012-07-25 23:22:53 UTC
CHANGE: http://review.gluster.com/269 (performance/write-behind: preserve lk-owner while syncing writes.) merged in master by Anand Avati (avati)

Comment 26 Amar Tumballi 2012-08-23 06:45:07 UTC
This bug is not seen in the current master branch (which will get branched as RHS 2.1.0 soon). To consider it for fixing, we want to make sure this bug still exists on RHS servers. If it is not reproduced, we would like to close this.

