Red Hat Bugzilla – Bug 1028631
NFS reads bottlenecked by NFS translator RPC processing
Last modified: 2014-03-19 13:35:38 EDT
Description of problem:
The NFS translator's processing of incoming RPC messages is single-threaded, and this bottlenecks NFS reads. You can see it by running the top utility and pressing "H": one thread in glusterfs takes 100% of one core. Since there is only one glusterfs NFS server per node, this one thread bottlenecks the entire node.
For now, the single 10-GbE link per server also bottlenecks reads in most cases, but we typically have two 10-GbE NICs per server. If it weren't for this CPU bottleneck, we could increase NFS read (and write) throughput by adding 10-GbE NICs, either in bonded mode or by creating a separate Gluster subnet on NIC 2 for replication traffic.
This problem was seen earlier in the glusterfsd brick process as well. However, we can configure multiple bricks per volume on a server, whereas we cannot configure multiple NFS glusterfs processes per server.
Version-Release number of selected component (if applicable):
RHS 2.1 GA; the problem has always been present.
Steps to Reproduce:
1. Configure 4 BAGL servers or equivalent with a Gluster volume
2. Run iozone with -+m so that 8 NFS clients read large files sequentially over the 10-GbE network
3. Run top, press "H", wait for the display to stabilize, and look for a hot thread at the top of the display
You will see a single glusterfs thread at 99% of one core, which means glusterfs cannot run any faster.
No single thread should be CPU-bottlenecked while the other cores sit idle. If all threads become bottlenecked, that's ok: it means you are utilizing all the hardware.
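The per-thread observation above can be scripted instead of read off an interactive top screen. This is a diagnostic sketch only; the process-name pattern `glusterfs` and the commented-out `pgrep` expression are assumptions about how the NFS server process is named on the node.

```shell
# Sketch: spot a hot glusterfs thread without interactive top.
# ps -L lists one row per thread (TID); sort hottest-first.
ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head -5

# Interactive alternative: top with the thread view enabled from the
# start (equivalent to pressing "H" inside top), limited to the NFS
# server process. The pgrep pattern is an assumption.
# top -H -p "$(pgrep -f 'glusterfs.*nfs' | head -1)"
```

A thread pinned near 100 %CPU in the first column of output, while sibling threads and other cores are mostly idle, is the single-threaded RPC bottleneck described above.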
[root@ben-test-driver2-10ge gluster_test]# par-for-all.sh clients.list 'echo 1 > /proc/sys/vm/drop_caches' ; iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -C -i 1 -+n -r 64k -t 8 -s 16g
# head -8 /var/tmp/iozone_cluster.config
gprfc089 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc090 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc091 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc092 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc094 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc095 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc096 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
gprfc078 /mnt/glnfs/iozone.d/13-10-24-18-22-25 /usr/local/bin/iozone
Here's why we should make the effort: it will be at least a year before NFS-Ganesha is equipped to handle multi-server configurations and is supported by RHS for this. Until then, the only scalable NFS solution we have is the NFS translator.
Presumably Avati's multi-threaded epoll patch would help here, even if we could only use it for NFS RPCs. An alternative is to let one thread do the epoll but hand the file descriptor off to a worker thread, which reads the RPCs from the descriptor, allowing the dispatcher thread to move on immediately to the next RPC socket.
Another solution for large-file sequential I/O workloads is to enable larger NFS RPCs; this is already upstream, but there are some complications associated with NFS flow control for writes (see bz 1008301). We should allow the max-RPC "nfs.read-size" parameter to be set to 1 MB, while defaulting it to 128 KB, which would still double the current size.
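For reference, the tunables above are set per volume with the gluster CLI. A hedged sketch, assuming a volume named "testvol" (the name is illustrative) and the constraints from the upstream commit below (multiple of 1 KB, minimum 4 KB, maximum 1 MB):

```shell
# Raise the gNFS RPC sizes on a volume; "testvol" is an assumed name.
gluster volume set testvol nfs.read-size  1048576   # 1 MB read RPCs
gluster volume set testvol nfs.write-size 1048576   # 1 MB write RPCs
```

On a slower network the same options can be turned back down to 16/32/64 KB, per the commit message quoted below.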
The default I/O size has been made 1 MB, the same as kernel NFS. Snippet of the commit:
Author: Santosh Kumar Pradhan <email@example.com>
Date: Thu Oct 17 16:17:54 2013 +0530
gNFS: Make NFS I/O size to 1MB by default
For better NFS performance, make the default I/O size to 1MB, same as
kernel NFS. Also refactor the description for read-size, write-size
and readdir-size (i.e. it must be a multiple of 1KB but min value
is 4KB and max supported value is 1MB). On slower network, rsize/wsize
can be adjusted to 16/32/64-KB through nfs.read-size or nfs.write-size
Signed-off-by: Santosh Kumar Pradhan <firstname.lastname@example.org>
Tested-by: Gluster Build System <email@example.com>
Reviewed-by: Shyamsundar Ranganathan <firstname.lastname@example.org>
Reviewed-by: Anand Avati <email@example.com>
> Another solution for large-file sequential I/O workloads is to enable larger
> NFS RPCs, this solution is already upstream but there are some complications
> associated with NFS flow control for writes, see bz 1008301. We should
> enable setting max RPC "nfs.read-size" parameter to 1 MB, while defaulting
> it to 128 KB, which would still double current size.
The current I/O size supported by the NFS server defaults to 1 MB (not 64 KB anymore). The commit message is in comment #1.
You are correct; I'm using glusterfs-184.108.40.206.1u2rhs-1.el6rhs.x86_64, which is RHS 2.1 Update 2, and in fact the NFS client does negotiate up to a 1 MB RPC size on reads automatically, and performance is excellent. Closing this bug.
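The negotiated RPC size can be confirmed on the client side rather than inferred from throughput. A sketch, assuming the NFS mount point `/mnt/glnfs` from the iozone config above; `nfsstat -m` requires nfs-utils:

```shell
# Show the rsize/wsize the NFS client actually negotiated for the
# mount (mount point /mnt/glnfs is an assumption from the test setup).
grep /mnt/glnfs /proc/mounts | grep -o 'rsize=[0-9]*'

# Alternative, with nfs-utils installed:
# nfsstat -m
```

A value of `rsize=1048576` confirms the 1 MB read RPC size described in this comment.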