Red Hat Bugzilla – Bug 1028631
NFS reads bottlenecked by NFS translator RPC processing
Last modified: 2014-03-19 13:35:38 EDT
Description of problem:
The NFS translator's processing of incoming RPC messages is single-threaded, and this bottlenecks NFS reads. You can see it by running the top utility and pressing "H": one thread in glusterfs takes 100% of one core. Since there is only one glusterfs NFS server per node, this one thread bottlenecks the entire node.
For now, the single 10-GbE link per server also bottlenecks reads in most cases, but we typically have two 10-GbE NICs per server. If it weren't for this CPU bottleneck, we could increase NFS read (and write) throughput by adding 10-GbE NICs, either in bonded mode or by creating a separate Gluster subnet on NIC 2 for replication traffic.
This problem was seen earlier in the glusterfsd brick process as well. However, we can configure multiple bricks per volume on a server, whereas we cannot configure multiple NFS glusterfs processes per server.
Version-Release number of selected component (if applicable):
RHS 2.1 GA; the problem has always been present.
Steps to Reproduce:
1. Configure 4 BAGL servers or equivalent with a Gluster volume
2. Run iozone with -+m so that 8 NFS clients read large files sequentially over the 10-GbE network
3. Run top, press "H", wait for the display to stabilize, and look for a hot thread at the top of the display
You will see a single glusterfs thread at 99% of one core, which means glusterfs cannot run any faster.
No single thread should be CPU-bottlenecked while the other cores sit idle. If all threads become bottlenecked, that's ok: it means you are utilizing all the hardware.
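The per-thread observation above can be scripted instead of read off an interactive top screen. This is a diagnostic sketch only; the process-name pattern `glusterfs` and the commented-out `pgrep` expression are assumptions about how the NFS server process is named on the node.

```shell
# Sketch: spot a hot glusterfs thread without interactive top.
# ps -L lists one row per thread (TID); sort hottest-first.
ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head -5

# Interactive alternative: top with the thread view enabled from the
# start (equivalent to pressing "H" inside top), limited to the NFS
# server process. The pgrep pattern is an assumption.
# top -H -p "$(pgrep -f 'glusterfs.*nfs' | head -1)"
```

A thread pinned near 100 %CPU in the first column of output, while sibling threads and other cores are mostly idle, is the single-threaded RPC bottleneck described above.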
[root@ben-test-driver2-10ge gluster_test]# par-for-all.sh clients.list 'echo 1 > /proc/sys/vm/drop_caches' ; iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -C -i 1 -+n -r 64k -t 8 -s 16g
# head -8 /var/tmp/iozone_cluster.config
gprfc089 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc090 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc091 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc092 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc094 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc095 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc096 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc078 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
Here's why we should make the effort: it will be at least a year before NFS-Ganesha is equipped to handle multi-server configurations and is supported by RHS for this. Until then, the only scalable NFS solution we have is the NFS translator.
Presumably Avati's multi-threaded epoll patch would help here, even if we could only use it for NFS RPCs. An alternative is to let one thread do the epoll but hand the file descriptor off to a worker thread, which reads the RPCs from the descriptor, allowing the dispatcher thread to move on immediately to the next RPC socket.
Another solution for large-file sequential I/O workloads is to enable larger NFS RPCs; this is already upstream, but there are some complications associated with NFS flow control for writes (see bz 1008301). We should allow the max-RPC "nfs.read-size" parameter to be set to 1 MB, while defaulting it to 128 KB, which would still double the current size.
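For reference, the tunables above are set per volume with the gluster CLI. A hedged sketch, assuming a volume named "testvol" (the name is illustrative) and the constraints from the upstream commit below (multiple of 1 KB, minimum 4 KB, maximum 1 MB):

```shell
# Raise the gNFS RPC sizes on a volume; "testvol" is an assumed name.
gluster volume set testvol nfs.read-size  1048576   # 1 MB read RPCs
gluster volume set testvol nfs.write-size 1048576   # 1 MB write RPCs
```

On a slower network the same options can be turned back down to 16/32/64 KB, per the commit message quoted below.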
The default I/O size has been made 1 MB, the same as kernel NFS. Snippet of the commit:
Author: Santosh Kumar Pradhan <email@example.com>
Date: Thu Oct 17 16:17:54 2013 +0530
gNFS: Make NFS I/O size to 1MB by default
For better NFS performance, make the default I/O size to 1MB, same as
kernel NFS. Also refactor the description for read-size, write-size
and readdir-size (i.e. it must be a multiple of 1KB but min value
is 4KB and max supported value is 1MB). On slower network, rsize/wsize
can be adjusted to 16/32/64-KB through nfs.read-size or nfs.write-size
Signed-off-by: Santosh Kumar Pradhan <firstname.lastname@example.org>
Tested-by: Gluster Build System <email@example.com>
Reviewed-by: Shyamsundar Ranganathan <firstname.lastname@example.org>
Reviewed-by: Anand Avati <email@example.com>
> Another solution for large-file sequential I/O workloads is to enable larger
> NFS RPCs, this solution is already upstream but there are some complications
> associated with NFS flow control for writes, see bz 1008301. We should
> enable setting max RPC "nfs.read-size" parameter to 1 MB, while defaulting
> it to 128 KB, which would still double current size.
The current I/O size supported by the NFS server defaults to 1 MB (not 64 KB anymore). The commit message is in comment #1.
You are correct; I'm using glusterfs-184.108.40.206.1u2rhs-1.el6rhs.x86_64, which is RHS 2.1 Update 2, and in fact the NFS client does negotiate up to a 1 MB RPC size on reads automatically, and performance is excellent. Closing this bug.
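The negotiated RPC size can be confirmed on the client side rather than inferred from throughput. A sketch, assuming the NFS mount point `/mnt/glnfs` from the iozone config above; `nfsstat -m` requires nfs-utils:

```shell
# Show the rsize/wsize the NFS client actually negotiated for the
# mount (mount point /mnt/glnfs is an assumption from the test setup).
grep /mnt/glnfs /proc/mounts | grep -o 'rsize=[0-9]*'

# Alternative, with nfs-utils installed:
# nfsstat -m
```

A value of `rsize=1048576` confirms the 1 MB read RPC size described in this comment.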