Bug 1028631

Summary: NFS reads bottlenecked by NFS translator RPC processing
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ben England <bengland>
Component: glusterd
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED CURRENTRELEASE
QA Contact: Sudhir D <sdharane>
Severity: medium
Priority: medium
Docs Contact:
Version: 2.1
CC: bengland, shaines, spradhan, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-12-13 20:26:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ben England 2013-11-08 22:11:06 UTC
Description of problem:

The NFS translator's processing of incoming RPC messages is single-threaded, which bottlenecks NFS reads.  You can see this by running the top utility and pressing "H": one glusterfs thread sits at 100% of one core.  Since there is only one glusterfs NFS server process per node, this single thread bottlenecks the entire node.
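
For example, a minimal observation sketch (VOLNAME and the PID are placeholders; "gluster volume status" reports the NFS server PID on each node):

# find the glusterfs NFS server PID on this node
gluster volume status VOLNAME nfs

# watch per-thread CPU for that PID; one thread stays pinned near 100% of a core
top -H -p <NFS-server-PID>

# or sample per-thread CPU over time (sysstat package)
pidstat -t -p <NFS-server-PID> 2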

For now, the single 10-GbE link per server also limits reads in most cases.  However, we typically have two 10-GbE NICs per server, and if it weren't for this CPU bottleneck we could increase NFS read (and write) throughput by using the second NIC, either in bonded mode or by creating a separate Gluster subnet on NIC 2 for replication traffic.

This problem was seen earlier in the glusterfsd brick process as well.  However, we can configure multiple bricks per volume on a server, whereas we cannot configure multiple glusterfs NFS processes per server.



Version-Release number of selected component (if applicable):

RHS 2.1 GA; the bottleneck has always been present.

How reproducible:

Every time.

Steps to Reproduce:
1.  Configure 4 BAGL servers or equivalent with a Gluster volume, mounted over NFS on 8 clients (mount sketch below)
2.  Run iozone -+m to make the 8 NFS clients read large files sequentially over the 10-GbE network
3.  Run top, press "H", wait for the display to stabilize, and look for a hot thread at the top of the display
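
For reference, a minimal client-side setup sketch for step 1 (server and volume names are placeholders; the Gluster NFS server speaks NFSv3 over TCP):

# on each client: mount the volume through the glusterfs NFS server
mkdir -p /mnt/glnfs
mount -t nfs -o vers=3,proto=tcp <server>:/VOLNAME /mnt/glnfs

# drop the client page cache before each read pass so reads go over the wire
echo 1 > /proc/sys/vm/drop_caches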

Actual results:

You see a single glusterfs thread pegged at 99% of one core.  That means glusterfs can't run any faster.

Expected results:

No single thread should be CPU-bottlenecked while other cores sit idle.  If all cores become saturated, that's acceptable; it means the hardware is fully utilized.

Additional info:

workload:

[root@ben-test-driver2-10ge gluster_test]# par-for-all.sh clients.list 'echo 1 > /proc/sys/vm/drop_caches' ; iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -C -i 1 -+n -r 64k -t 8 -s 16g

# head -8 /var/tmp/iozone_cluster.config
gprfc089 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc090 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc091 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc092 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc094 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc095 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc096 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc078 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone



solutions:

Here's why we should make the effort: it will be at least a year before NFS-Ganesha can handle multi-server configurations and is supported by RHS for that purpose.  Until then, the NFS translator is the only scalable NFS solution we have.

Presumably Avati's multi-threaded epoll patch would help here, even if we could only use it for NFS RPCs.  An alternative is to let one thread do the epoll dispatch but hand the file descriptor off to a worker thread that reads the RPCs from it, allowing the dispatcher thread to move on immediately to the next RPC socket.

Another solution for large-file sequential I/O workloads is to enable larger NFS RPCs.  This is already upstream, but there are some complications with NFS flow control for writes; see bz 1008301.  We should allow the maximum RPC size (the "nfs.read-size" parameter) to be set as high as 1 MB, while defaulting it to 128 KB, which would still double the current size.
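
For illustration, assuming the upstream option names nfs.read-size / nfs.write-size and a byte-valued setting (VOLNAME is a placeholder):

# raise the NFS read RPC size to 1 MB on the volume
gluster volume set VOLNAME nfs.read-size 1048576

# clients must remount; rsize/wsize are negotiated at NFS mount time
umount /mnt/glnfs
mount -t nfs -o vers=3,proto=tcp <server>:/VOLNAME /mnt/glnfs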

Comment 1 santosh pradhan 2013-11-14 16:34:03 UTC
The default I/O size has been made 1 MB, the same as kernel NFS. Snippet of the commit:

==================


Author: Santosh Kumar Pradhan <spradhan>
Date:   Thu Oct 17 16:17:54 2013 +0530

    gNFS: Make NFS I/O size to 1MB by default
    
    For better NFS performance, make the default I/O size to 1MB, same as
    kernel NFS. Also refactor the description for read-size, write-size
    and readdir-size (i.e. it must be a multiple of 1KB but min value
    is 4KB and max supported value is 1MB). On slower network, rsize/wsize
    can be adjusted to 16/32/64-KB through nfs.read-size or nfs.write-size
    respectively.
    
    Change-Id: I142cff1c3644bb9f93188e4e890478177c9465e3
    BUG: 1009223
    Signed-off-by: Santosh Kumar Pradhan <spradhan>
    Reviewed-on: http://review.gluster.org/6103
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
    Reviewed-by: Anand Avati <avati>
===============
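
Per the commit text, on a slower network the sizes can be tuned back down, for example (sketch; VOLNAME is a placeholder):

# limit NFS read/write RPCs to 64 KB
gluster volume set VOLNAME nfs.read-size 65536
gluster volume set VOLNAME nfs.write-size 65536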

Comment 2 santosh pradhan 2013-12-05 07:41:17 UTC
> Another solution for large-file sequential I/O workloads is to enable larger
> NFS RPCs, this solution is already upstream but there are some complications
> associated with NFS flow control for writes, see bz 1008301.  We should
> enable setting max RPC  "nfs.read-size" parameter to 1 MB, while defaulting
> it to 128 KB, which would still double current size.

The I/O size supported by the NFS server now defaults to 1 MB (not 64 KB anymore). The commit message is in comment #1.

Comment 3 Ben England 2013-12-13 20:26:45 UTC
You are correct. I'm using glusterfs-3.4.0.43.1u2rhs-1.el6rhs.x86_64, which is RHS 2.1 Update 2, and the NFS client does negotiate up to a 1 MB RPC size on reads automatically; performance is excellent.  Closing it.
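
For anyone re-verifying, one way to confirm the negotiated RPC size from a client (mount point as used above; 1048576 = 1 MB):

# rsize/wsize in the mount options show the negotiated RPC size
grep /mnt/glnfs /proc/mounts

# or, with nfs-utils installed
nfsstat -m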