Description of problem:

If you run a very large NFS write test (128 GB/server) to an RHS volume, throughput slows and eventually stops. It appears to be a bug related to the write-behind translator.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.8 (now trying with glusterfs-3.4.0.11)
kernel 2.6.32-358.6.2.el6.x86_64

How reproducible: every time

Steps to Reproduce:
1. Construct the Gluster volume as described below.
2. Run iozone as described below.
3. Watch what happens once about half of the data has been written.

Actual results:

iostat shows that throughput starts out very strong, but slows until it effectively ceases; the attachment will show this. "top" shows that the glusterfs NFS server process consumes all of 1 core, and "perf top" shows that all of its time is spent in the wb_requests_overlap() routine in the write-behind translator (see screenshots in attachment). I disabled the write-behind translator by creating a separate volfile for the NFS server process and running it manually, and the test then completes.

Expected results:

No significant degradation in throughput, at least none due to Gluster.

Additional info:

How to construct the Gluster volume (see attachment for details):
- 10-GbE, jumbo frames
- usual XFS bricks
- gluster volume create perftest replica 2 gprfs045-10ge:/mnt/brick0/brick gprfs046-10ge:/mnt/brick0/brick gprfs047-10ge:/mnt/brick0/brick gprfs048-10ge:/mnt/brick0/brick
- gluster volume start perftest
- on each NFS client: mount -v -t nfs -o nfsvers=3,tcp,actimeo=1 %s:/perftest /mnt/glnfs

We spread the NFS clients evenly across the NFS servers in the volume.

How to run iozone: change 172.17.10.32 to the IP address that results from resolving the test driver's hostname. The file /var/tmp/iozone_cluster.config contains this:

h1 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
h2 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
...
h8 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
h1 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
h2 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
...
h8 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone

So there are two threads per NFS client.

[root@gprfc032-10ge gluster_test]# /usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.10.32 -i 0 -C -+n -s 32g -r 128 -c -e -t 16

I got a core dump using "kill -BUS" on the glusterfs process. The core dump was 8 GB. My hypothesis is that the list of uncommitted writes is so long that it accounts for the large memory size of the process, and the act of checking for overlaps in these writes takes all of our CPU time.

[root@gprfs048 ~]# ls -l /core.30528
-rw------- 1 root root 9895763968 Jun 24 15:39 /core.30528
[root@gprfs048 ~]# gdb /usr/sbin/glusterfs /core.30528
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
...
Reading symbols from /usr/sbin/glusterfs...(no debugging symbols found)...done.
[New Thread 30528]
...
(gdb) where
#0  0x00007fabc8022db7 in wb_requests_overlap () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#1  0x00007fabc8022f08 in wb_wip_has_conflict () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#2  0x00007fabc802326b in __wb_pick_winds () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#3  0x00007fabc80262d9 in wb_process_queue () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#4  0x00007fabc8026c08 in wb_writev () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#5  0x00007fabc393b3d7 in io_stats_writev () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/debug/io-stats.so
#6  0x00007fabc36f0c47 in nfs_fop_write () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#7  0x00007fabc36fe95b in __nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#8  0x00007fabc3706423 in nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#9  0x00007fabc370ea3b in nfs3_fh_resolve_inode_done () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#10 0x00007fabc370eb0b in nfs3_fh_resolve_inode () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#11 0x00007fabc370ebb5 in nfs3_fh_resolve_resume () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#12 0x00007fabc370ecc8 in nfs3_fh_resolve_root () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#13 0x00007fabc370cc01 in nfs3_write () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#14 0x00007fabc370cd62 in nfs3svc_write () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#15 0x0000003863808655 in rpcsvc_handle_rpc_call () from /usr/lib64/libgfrpc.so.0
#16 0x00000038638087a3 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0
#17 0x00000038638098d8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#18 0x00007fabc92ab3a4 in socket_event_poll_in () from /usr/lib64/glusterfs/3.4.0.8rhs/rpc-transport/socket.so
#19 0x00007fabc92ab4fd in socket_event_handler () from /usr/lib64/glusterfs/3.4.0.8rhs/rpc-transport/socket.so
#20 0x000000386305a9b7 in ?? () from /usr/lib64/libglusterfs.so.0
#21 0x0000000000406776 in main ()
(gdb) up
#1  0x00007fabc8022f08 in wb_wip_has_conflict () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
For files documenting the details of this bug, such as the core dump and perf top screenshots, see http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/bigbend/nfs-write-behind-prob.tgz
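A minimal sketch (hypothetical code, not the GlusterFS source) of the cost that "perf top" is pointing at: if the write-behind translator compares each incoming request against every queued uncommitted request, a backlog of N writes costs O(N) comparisons per request and O(N^2) overall, which matches all CPU time landing in wb_requests_overlap().

```python
# Sketch of an interval-overlap conflict scan over a queue of pending
# writes.  The function and variable names here are illustrative only.

def requests_overlap(a, b):
    """a and b are (offset, size) byte ranges; True if they intersect."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

def conflict_count(queue, req):
    # Full linear scan of the queue for every new request.
    return sum(1 for q in queue if requests_overlap(q, req))

# 100,000 queued, non-overlapping 128 KB writes: every incoming request
# triggers 100,000 overlap comparisons.
queue = [(i * 131072, 131072) for i in range(100000)]
conflicts = conflict_count(queue, (131072 * 50000, 65536))  # hits exactly one range
```

With hundreds of thousands of uncommitted writes queued, this scan alone can saturate a core, which is consistent with the "perf top" observation above.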
In my setup, out of my 4 servers (2x2 volume), on one of the brick servers I am seeing a different problem, where two of the threads are consuming a lot of CPU time with the following stack traces. I don't have debug symbols installed on the machine; I'll do that to get the line numbers.

1. ps -L -p <NFS pid>

  PID   LWP TTY        TIME   CMD
25563 25563 ?      1-05:01:17 glusterfs
25563 25565 ?      00:00:00   glusterfs
25563 25566 ?      00:00:00   glusterfs
25563 25567 ?      00:00:00   glusterfs
25563 25574 ?      1-02:19:19 glusterfs
25563 25589 ?      00:00:00   glusterfs

LWP 25563 and 25574 are consuming more time.

2. top output:

  PID USER  PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+   COMMAND
25563 root  20  0 4493m 2.9g 7732 R 200.1 37.5 3326:24  glusterfs

3. stack traces:

Thread 1 (Thread 0x7f34dcc6c700 (LWP 25563)):
#0  0x00007f34d86242a0 in afr_transaction_eager_lock_init () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#1  0x00007f34d862688d in afr_transaction () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#2  0x00007f34d861f427 in afr_do_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#3  0x00007f34d861fa3c in afr_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#4  0x00007f34d83e17d9 in dht_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/distribute.so
#5  0x00007f34d81a8c22 in wb_writev_helper () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so
#6  0x00000030b6e2f3a4 in call_resume () from /usr/lib64/libglusterfs.so.0
#7  0x00007f34d81a9318 in wb_do_winds () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so
#8  0x00007f34d81aa302 in wb_process_queue () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so
#9  0x00007f34d81aac08 in wb_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so
#10 0x00007f34d3df23d7 in io_stats_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/debug/io-stats.so
#11 0x00007f34d3ba7027 in nfs_fop_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#12 0x00007f34d3bb4d3b in __nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#13 0x00007f34d3bbc913 in nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#14 0x00007f34d3bc4f8b in nfs3_fh_resolve_inode_done () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#15 0x00007f34d3bc505b in nfs3_fh_resolve_inode () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#16 0x00007f34d3bc5105 in nfs3_fh_resolve_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#17 0x00007f34d3bc5218 in nfs3_fh_resolve_root () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#18 0x00007f34d3bc3101 in nfs3_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#19 0x00007f34d3bc32b2 in nfs3svc_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#20 0x00000030b7208655 in rpcsvc_handle_rpc_call () from /usr/lib64/libgfrpc.so.0
#21 0x00000030b72087a3 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0
#22 0x00000030b72098d8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#23 0x00007f34d96c5bd6 in ?? () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/rpc-transport/socket.so
#24 0x00007f34d96c74ed in ?? () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/rpc-transport/socket.so
#25 0x00000030b6e5b387 in ?? () from /usr/lib64/libglusterfs.so.0
#26 0x00000000004067c6 in main ()

Thread 3 (Thread 0x7f34d94a4700 (LWP 25574)):
#0  0x00000035c9e0c182 in ?? () from /lib64/libpthread.so.0
#1  0x00007f34d86241d5 in afr_fd_ctx_get () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#2  0x00007f34d86253dd in afr_delayed_changelog_post_op () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#3  0x00000030b6e2a230 in gf_timer_proc () from /usr/lib64/libglusterfs.so.0
#4  0x00000035c9e07851 in start_thread () from /lib64/libpthread.so.0
#5  0x00000035c96e890d in clone () from /lib64/libc.so.6

Will continue looking into this.
I reproduced the problem with just 2 servers instead of 4, so you don't need a huge configuration to do this. I'll try to cut the configuration down even more than this.

It appears that the client/server ratio matters here. With a 1:1 ratio (2 servers, 2 clients), the problem does not occur, but with 2:1 (2 servers, 4 clients), the problem does occur. I think NFS clients are bursty: they may overload the server with a large set of writes, and the server can't keep up with the long write-behind list. This would explain why O_DIRECT writes do not encounter the problem.

I reproduced with /proc/sys/vm/nr_hugepages=0 on all systems, eager-lock off, and no special tuning of any kind, so there is no reason why QE can't reproduce. I'm going to try increasing vm.min_free_kbytes to see if it's some sort of kernel memory issue.

As before, once the problem occurs, you CANNOT CLEAN UP the processes on the client; they hang. You have to do a hard reset on the clients (I use ipmitool); reboot won't work. For example, you see:

[root@gprfc032-10ge gluster_test]# ssh gprfc089
ssh: connect to host gprfc089 port 22: Connection refused

This means that reboot hung while trying to unmount NFS, so the client just becomes a useless machine. The server volume can be stopped, but it took 2 minutes for that to happen while write-behind.so worked its way through the backlog.

Be sure your clients are not all mounting the same server; this may prevent the problem from occurring by limiting data flow from the clients. You can get it to happen much more quickly by restricting the memory available to the kernel as follows:

echo N > /proc/sys/vm/nr_hugepages

where N is the number of 2-MB pages you want to reserve. This effectively steals memory away from the Linux buffer cache, so the system reaches an out-of-physical-memory state much more quickly. Delete old files before writing new ones; overwriting an existing file may not give the same behavior.
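The hugepage arithmetic above can be checked with a quick sketch; the 32 GB target below is just an illustrative number, not a value from the bug report:

```python
# Each x86_64 hugepage on this kernel is 2 MB, so reserving N pages
# removes N * 2 MB from the buffer cache.  To pin down, say, 32 GB:
HUGEPAGE_MB = 2
target_gb = 32                              # illustrative target, pick your own
n_pages = target_gb * 1024 // HUGEPAGE_MB   # 16384 pages

# Equivalent command on the server:
print(f"echo {n_pages} > /proc/sys/vm/nr_hugepages")
```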
Thanks Ben for all your input. I tried the following config/workload:

1. 1x2 dist-rep volume (2-node cluster: sev1, sev2; one brick in each server)
2. 4 clients (c1, c2, c3, c4: c1/c2 mount from sev1 and c3/c4 mount from sev2)
3. iozone workload: when I tried "iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 64g -r 128 -c -e -t 4", I could hit the issue within 30 minutes or so. Maybe this generates more I/O than 128g per server, so I modified the command as below (for 128g per server):

iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 32g -r 128 -c -e -t 4

iozone.cfg:
===========
10.70.43.16 /mnt/nfsdir /usr/local/bin/iozone
10.70.43.17 /mnt/nfsdir /usr/local/bin/iozone
10.70.42.243 /mnt/nfsdir /usr/local/bin/iozone
10.70.42.252 /mnt/nfsdir /usr/local/bin/iozone

Per thread 32g, so total 32*4 = 128g; with replica 2, the total I/O workload for gluster = 256g; with a 2-node/brick cluster, the workload per brick = 256/2 = 128g.

Is my calculation for the 128g write workload per server correct? The test is running, still no issue with this workload. Please correct me if my understanding of generating the workload is wrong.

Thanks,
Santosh
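The per-brick arithmetic in the comment above can be written out as a quick check (4 iozone threads of 32 GB each, replica 2, two bricks):

```python
# Verify the workload calculation: client-visible data, total brick I/O
# after replication, and the share landing on each brick.
threads = 4
gb_per_thread = 32
replica = 2
bricks = 2

client_gb = threads * gb_per_thread     # data written by clients: 128 GB
gluster_gb = client_gb * replica        # replica 2 doubles the brick I/O: 256 GB
gb_per_brick = gluster_gb // bricks     # 128 GB written to each brick
```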
(In reply to Ben England from comment #5)
> I reproduced the problem with just 2 servers instead of 4. So you don't
> need a huge configuration to do this.
>
> I'll try to cut down configuration even more than this.
>
> It appears that the client/server ratio matters here. With 1:1 ratio (2
> servers, 2 clients), the problem does not occur, but with 2:1 (2 servers, 4
> clients), the problem does occur. I think that NFS clients are bursty,
> they may overload the server with a large set of writes, and the server
> can't keep up with the long write-behind list. This would explain why
> O_DIRECT writes do not encounter the problem.

Could you specify the iozone command line you used for the (2 server : 4 client) setup? I tried the following command with a 1x2 (dist-repl) volume spread across a 2-node cluster (one brick in each server) and 4 clients (mount ratio: 2 clients per server):

iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 32g -r 128 -c -e -t 4

It just worked. But when I changed the size to 64g per thread (-s 64g), I could reproduce the issue, e.g.

iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 64g -r 128 -c -e -t 4

I am not clear on how to calculate the exact iozone workload (say 128g) per server in 1x2 or 2x2 dist-rep volumes. Could you clarify this part?

Thanks,
Santosh
Forgot to mention that I turned eager-locking OFF in my setup.
Some more updates:

Setup: 2-node cluster with 4 clients
iozone command used: iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 64g -r 128 -c -e -t 4

Case 1 (simple distributed volume on the 2-node cluster), eager-locking ON: iozone completed successfully.
Case 2 (distributed-replicated 1x2 volume), eager-locking ON: iozone completed successfully.
Case 3 (dist-rep 1x2 volume), eager-locking OFF: already tried once, and the issue was reproduced. Will try again tomorrow.
Today I tried a couple of times (on the 1x2 volume with eager lock OFF and ON) but could not reproduce the issue in my setup. Also, we (Anand and I) had a discussion with Ben on how to take this forward. I asked Ben if he can reproduce the issue in my setup, so that debugging would be easier; otherwise I may need to debug in his setup. We will also follow up tomorrow evening at 5pm (IST). I tried with the latest build (3.4.0.12rhs.beta3).
Have you tried disabling the write-behind translator? Why do we need it for NFS? The NFS client will already aggregate write requests up to the 64-KB RPC size; isn't that good enough?
Memory footprint of one of the servers came down from 14 GB to 1.6 GB; the other server is still running high at ~11 GB. All figures are RSS.

I profiled the server whose NFS process memory usage came down from 14 GB to 1.6 GB RSS. This is the perf data:

Samples: 29M of event 'cycles', Event count (approx.): 21872713449727
+  81.49%  glusterfs  write-behind.so     [.] wb_requests_overlap
+   4.46%  glusterfs  write-behind.so     [.] wb_wip_has_conflict
+   3.89%  glusterfs  libpthread-2.12.so  [.] 0x000000000000ed31
+   1.99%  glusterfs  write-behind.so     [.] wb_requests_overlap@plt
+   0.85%  glusterfs  write-behind.so     [.] __wb_pick_unwinds
+   0.81%  glusterfs  [kernel.kallsyms]   [k] copy_user_generic_string
+   0.54%  glusterfs  [kernel.kallsyms]   [k] compaction_alloc
+   0.48%  glusterfs  afr.so              [.] afr_transaction_eager_lock_init
+   0.28%  glusterfs  write-behind.so     [.] __wb_pick_winds
+   0.22%  glusterfs  libc-2.12.so        [.] __memset_sse2
+   0.21%  glusterfs  libc-2.12.so        [.] _int_malloc
+   0.18%  glusterfs  write-behind.so     [.] __wb_preprocess_winds
+   0.17%  glusterfs  libgfrpc.so.0.0.0   [.] __saved_frame_copy
+   0.16%  glusterfs  libc-2.12.so        [.] memcpy
+   0.15%  glusterfs  write-behind.so     [.] wb_fulfill_err

There are a lot of errors/warnings in the NFS logs (on both NFS servers); these need to be looked at as well. iozone looks hung (not finishing). Will keep you updated.
(In reply to Ben England from comment #13)
> have you tried disabling the write-behind translator?

[Santosh] Yes, I tried disabling the write-behind xlator. The behavior was really bad; memory usage went up very quickly (I saw up to 24+ GB before I killed the processes).

> Why do we need it for NFS?

[Santosh] The wb-xlator really helps, though we see a problem with huge writes from multiple clients. Maybe Vijay/Avati will have a better answer for this question; I would defer it to Vijay.

> the NFS client will aggregate write requests up to 64-KB RPC size,
> good enough?

[Santosh] I don't think that will help (from my experiment with write-behind off). That's why you see the huge memory footprint of the NFS server in 2.0, i.e. BZ 965924 (in fact, that's not a memory leak, as RSS is still huge, which means the NFS server without write-behind is too loaded to handle the I/O burst).
Santosh, you are right: turning write-behind off does not have the desired effect. Throughput was about 50% of what it was with write-behind on, and the memory leak still happens. What confused me is that the bottleneck appears to be in the write-behind translator, but if you turn write-behind off, the bottleneck moves somewhere else (libgfrpc saved_frame_{copy,get}).

What strikes me as strange is that the NFS server just takes in data from the NFS client without any flow control. The NFS client should be asking for a COMMIT of data to persistent storage; this should cause clients to refrain from pushing additional write data, hence there should be no massive overcommit of memory to the glusterfs process.

So I captured a protocol trace on c089, and now I'm really confused. There are no NFS headers in this trace:

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/bigbend/glnfs-c089-wr.tcpdump.gz

There are a lot of packets, but none of them are NFS; they are all TCP. It looks like some kind of TCP flow-control problem where TCP never makes progress on the transmit window, but I'm not really sure yet. I also see lots of these:

[root@gprfc089 ~]# tail /var/log/messages
Jul 16 09:15:58 gprfc089 kernel: nfs: server gprfs017-10ge not responding, still trying

I'll try starting tracing before the iozone run goes too far.
What looks weird to me: the perf data points at the wb_requests_overlap() routine, which just fetches the passed parameters (a couple of pointers to structures of type wb_request_t) and does some arithmetic along with a comparison. perf annotate confuses me more:

Disassembly of section .text:

0000000000002db0 <wb_requests_overlap>:
  1.25   mov 0x4b8(%rdi),%rax
 85.37   mov 0x4c0(%rdi),%rcx   <===== This looks like loading ordering.size
  5.51   test %rax,%rax
  0.00   jne 48

I installed the debug symbols, but perf annotate still shows assembly instead of source lines. Tomorrow I will continue looking into it.
My only hypothesis is that the NFS server does not handle the COMMIT request from the NFS client properly. Otherwise the NFS server could not accumulate so many I/O requests to scan in the write-behind translator code.

re .-1: I think all this perf output means is that it's executing this instruction millions of times, and this instruction is perhaps just the place where the cache miss occurs. It doesn't mean that the program is doing something wrong there. It's a hint that your list of requests is getting excessively long.

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0 -+n -r 16384k -o -s 32g -t 16

This does not hang or leak memory at all. "-o" opens the file with O_SYNC, so after every 16-MB record is written by iozone, the write blocks until the data actually reaches storage. This perhaps means the amount of uncommitted write data stays much smaller.

I tried capturing protocol traces on an NFS server, gprfs017, while running a single-threaded write to make trace analysis easier, but it did not capture the NFS write traffic except for a couple of early requests. It did capture lots of gluster traffic, so protocol tracing isn't going to help much here.
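An illustrative sketch (not iozone itself) of what the "-o" flag changes: opening with O_SYNC makes every write() block until the data has reached stable storage, so the writer is throttled to storage speed and can never build up a large uncommitted backlog.

```python
import os
import tempfile

# Hypothetical minimal example of an O_SYNC write path.
path = os.path.join(tempfile.mkdtemp(), "x.sync")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
try:
    # With O_SYNC this call does not return until the 64 KB is committed,
    # unlike a buffered write that returns as soon as the page cache has it.
    written = os.write(fd, b"\0" * 65536)
finally:
    os.close(fd)
```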
(In reply to Ben England from comment #18)
> My only hypothesis is that NFS server does not handle COMMIT request from
> NFS client properly. Otherwise NFS server could not accumulate so many I/O
> requests to scan in write-behind translator code.
>
> re .-1: I think all this perf output means is that it's executing this
> instruction millions of times and this instruction is just the place where
> the cache miss occurs perhaps. It doesn't mean that the program is doing
> something wrong there. It's a hint that your list of requests is getting
> excessively long.

[Santosh] Yes, you are probably right. I was planning to give perf another go with L1-cache event analysis to check that behaviour.

> iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0
> -+n -r 16384k -o -s 32g -t 16
>
> Does not hang or leak memory at all. "-o" opens file with O_SYNC, so after
> every 16-MB record is written to file by iozone, the write will block until
> the data actually is written to storage.

[Santosh] Wow, great. At least some relief :) Thanks for the experiment.

> This means the amount of write
> requests that are uncommitted is much smaller perhaps?

[Santosh] Yes, probably. It does not accumulate too many outstanding I/O requests.

> I tried capturing protocol traces on an NFS server gprfs017, while running
> single-threaded write to make trace analysis easier, but it did not capture
> the NFS write traffic except for a couple of early requests. It did capture
> lots of gluster traffic. So protocol tracing isn't going to help much here.

[Santosh] Thanks again for all your effort. I will keep looking at it today and keep you posted about the progress.

Best,
Santosh
Vijay mentioned that there is a way to get a dump of a specific xlator, in this case the wb xlator only. There is supposedly a filter for this; he couldn't recall it, as it has been a while since he used it, but we can try to see if we can use that to get all the wb stats.

Regarding Ben's comment:
> re .-1: I think all this perf output means is that it's executing this
> instruction millions of times and this instruction is just the place where
> the cache miss occurs perhaps. It doesn't mean that the program is doing
> something wrong there. It's a hint that your list of requests is getting
> excessively long.

(gdb) p *wb_inode
$2 = {window_conf = 1048576, window_current = 92311552, transit = 673583104,
  all = {next = 0xd12ec60, prev = 0x190cfc6a0}, todo = {next = 0x1339a7800, prev = 0x190cfc6b0},
  liability = {next = 0xd12ec80, prev = 0xd940340}, temptation = {next = 0xd9b7d00, prev = 0x190cfc6c0},
  wip = {next = 0xd12ecb0, prev = 0x127cfe160}, gen = 17, lock = 0, this = 0x2578230}

The transit count seems to be the total size of the write fops in transit; the size from one of the runs, in the problem state, is 673583104. Also, from the state dump, callcount was ~129000, which points to a huge backlog of pending writes (uncommitted to disk and unacknowledged to the client/iozone).

Can one of you confirm whether the write-behind option flush_behind is on or off, please?
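Converting the wb_inode numbers above into more readable units shows the scale of the backlog relative to the configured write-behind window:

```python
# Values copied from the gdb dump of wb_inode in the comment above.
window_conf = 1048576        # configured write-behind window: 1 MiB
window_current = 92311552    # bytes currently accounted in the window
transit = 673583104          # bytes of write fops in transit

mib = 2 ** 20
current_mib = window_current / mib   # ~88 MiB in the window
transit_mib = transit / mib          # ~642 MiB in transit
ratio = transit // window_conf       # in-flight data is ~642x the configured window
```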
By default flush_behind is on; I just wanted to make sure it was not turned off for the last couple of runs. Thanks.
Santosh,

I am not able to attach to the VNC. If you are running a test, could you run nfsstat -rc on one of the clients where iozone is running? If nfsstat shows retrans > 0, it probably implies the server is not able to handle all the client requests. Just wanted to check this.

Anand
Does it make sense to tweak the defaults for the receive and send socket buffers and try a run with increased values? The defaults seem quite low:

[root@gprfs017 blr-team]# cat /proc/sys/net/core/rmem_max
229376

Same for wmem_max.
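A back-of-envelope check of why that default looks small: the socket buffer should hold at least one bandwidth-delay product to keep the pipe full. Both the 10 GbE link speed and the 1 ms round-trip time below are assumed numbers for this testbed, not measured values:

```python
# Bandwidth-delay product vs. the rmem_max default shown above.
link_bytes_per_sec = 10 * 10**9 / 8   # assumed 10 Gb/s link, in bytes/s
rtt_sec = 0.001                       # assumed ~1 ms round-trip time
bdp = link_bytes_per_sec * rtt_sec    # 1,250,000 bytes needed in flight
default_max = 229376                  # default rmem_max on the servers

ratio = bdp / default_max             # default buffer is ~5.4x too small
```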
I think we should try network subsystem tuning (on the whole setup, i.e. gprfs017, gprfs018, gprfc089, gprfc090, gprfc091 and gprfc092). This is a nice link on tuning a 10GbE NIC for better performance: http://fasterdata.es.net/host-tuning/linux/

I am able to attach to the VNC. I am not running anything, but I am going to run another instance of iozone (with the gluster-debug bundles installed) so that perf annotate will point to the source lines; currently it just maps to assembly. I have not touched the flush_behind parameter of the wb-xlator.
1. I discussed the flush_behind parameter with Raghavendra Gowdappa; he says setting it to OFF will make performance worse.

2. I ran iozone with a smaller load (16g per thread with 16 threads), i.e.

iozone -+m /var/blr-team/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0 -+n -r 32k -s 16g -t 16

I saw different behaviour this time: both servers went up to ~15 GB (RSS) and then came down to 2.5 GB RSS, but iozone did not finish. I am not sure, but the problem seems to be somewhere else. The NFS client might be doing another layer of write-behind for its own performance (which might be the reason wb_requests_overlap() is hit so often).

3. I am not able to map the perf data to source lines this time either, even though the debug symbols are already installed on gprfs017 (server).

4. On both servers (17 and 18), in /var/log/messages I saw something like this; not sure if it is related:

Jul 17 08:56:52 gprfs018 GlusterFS[22756]: [2013-07-17 08:56:52.375118] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-0: cannot lookup the saved frame corresponding to xid (450331)
Jul 17 08:56:53 gprfs018 GlusterFS[22756]: [2013-07-17 08:56:53.366492] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-1: cannot lookup the saved frame corresponding to xid (463375)

5. On one of the clients, gprfc092, in /var/log/messages:

Jul 17 07:50:17 gprfc092 kernel: INFO: task iozone:3160 blocked for more than 120 seconds.
Jul 17 07:50:17 gprfc092 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 17 07:50:17 gprfc092 kernel: iozone D 0000000000000000 0 3160 1 0x00000080
Jul 17 07:50:17 gprfc092 kernel: ffff880831b3fe08 0000000000000082 ffff880830566aa0 ffff880830566aa0
Jul 17 07:50:17 gprfc092 kernel: ffff880830566aa0 ffff88083fee6140 ffff880830566aa0 0000001081167330
Jul 17 07:50:17 gprfc092 kernel: ffff880830567058 ffff880831b3ffd8 000000000000fb88 ffff880830567058
Jul 17 07:50:17 gprfc092 kernel: Call Trace:
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8118f221>] ? path_put+0x31/0x40
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150efbe>] __mutex_lock_slowpath+0x13e/0x180
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150ee5b>] mutex_lock+0x2b/0x50
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191e39>] do_unlinkat+0xa9/0x1b0
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff810dc8f7>] ? audit_syscall_entry+0x1d7/0x200
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191f56>] sys_unlink+0x16/0x20
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jul 17 07:50:17 gprfc092 kernel: INFO: task iozone:3167 blocked for more than 120 seconds.
Jul 17 07:50:17 gprfc092 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 17 07:50:17 gprfc092 kernel: iozone D 0000000000000000 0 3167 3166 0x00000080
Jul 17 07:50:17 gprfc092 kernel: ffff8807eb81fdd8 0000000000000086 0000000000000000 ffff8807eb81fd68
Jul 17 07:50:17 gprfc092 kernel: ffff8807eb81fd88 ffffffffa0275688 ffff8807eb81fdb8 0000000000000286
Jul 17 07:50:17 gprfc092 kernel: ffff880831d2e5f8 ffff8807eb81ffd8 000000000000fb88 ffff880831d2e5f8
Jul 17 07:50:17 gprfc092 kernel: Call Trace:
Jul 17 07:50:17 gprfc092 kernel: [<ffffffffa0275688>] ? nfs_refresh_inode_locked+0x468/0xce0 [nfs]
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150efbe>] __mutex_lock_slowpath+0x13e/0x180
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8121cd6f>] ? security_inode_permission+0x1f/0x30
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150ee5b>] mutex_lock+0x2b/0x50
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8118fabc>] vfs_unlink+0x5c/0xe0
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8118e84a>] ? lookup_hash+0x3a/0x50
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191e85>] do_unlinkat+0xf5/0x1b0
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff810dc8f7>] ? audit_syscall_entry+0x1d7/0x200
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191f56>] sys_unlink+0x16/0x20
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jul 17 07:59:51 gprfc092 kernel: nfsacl: server gprfs018-10ge not responding, still trying

I think for this kind of workload, we need to tune:
1. On the client side:
   a) the options passed to the mount command
   b) network tuning for 10GbE and this high a workload
2. On the server side: network tuning to handle the high workload
3. Not sure, but we may also need to look at the window size in the wb-xlator.

Thanks,
Santosh
1. With RHS 2.0 Update 4, using the same testbed from Ben, we see the issue (reported for Big Bend) both with and without write-behind. Ben: please check the testbed you gave us. The last run was with the write-behind xlator enabled for a U4 run. It would be great if you could take a few minutes to look and see if anything is missing in the configuration compared to what you tested U4 with (speaking of the successful iozone run(s)).

2. Do you also remember the configuration (gluster conf mainly) when you ran iozone successfully? We did go through your iozone results, but we are unsure whether write-behind was enabled for those runs. Which version of RHEL was in your config?

3. Please note that in the testbed you shared with us, all of the following seem to hang:
(a) U4 with write-behind
(b) U4 without write-behind (glusterfs-nfs and glusterfsd are both OOM-killed)
(c) Big Bend release 3.4.0.12rhs.beta3

4. In our U4 runs, we noticed from iostat output that initially, when the iozone workload started, the queue lengths and await times (not looking at svctm) were bad at times:

Device: rrqm/s  wrqm/s  r/s   w/s      rsec/s  wsec/s     avgrq-sz avgqu-sz await  svctm  %util
sda     0.00    1.67    0.00  1.33     0.00    21.33      16.00    0.00     0.75   0.75   0.10
sdb     0.00    0.00    0.00  951.00   0.00    795400.00  836.38   83.86    88.18  0.77   73.67
dm-0    0.00    0.00    0.00  2.67     0.00    21.33      8.00     0.00     0.75   0.38   0.10
dm-1    0.00    0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00
scd0    0.00    0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00
dm-2    0.00    698.67  0.00  951.00   0.00    795400.00  836.38   83.96    88.29  0.78   73.77
dm-3    0.00    0.00    0.00  1649.67  0.00    795400.00  482.16   146.38   88.73  0.45   73.77
dm-4    0.00    0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00

Device: rrqm/s  wrqm/s  r/s   w/s      rsec/s  wsec/s     avgrq-sz avgqu-sz await  svctm  %util
sda     0.00    0.67    8.67  1.00     298.67  13.33      32.28    0.05     5.24   4.79   4.63
sdb     0.00    0.00    0.00  1.33     0.00    1365.33    1024.00  0.15     737.50 43.75  5.83
dm-0    0.00    0.00    8.33  1.67     296.00  13.33      30.93    0.05     4.97   4.63   4.63
dm-1    0.00    0.00    0.33  0.00     2.67    0.00       8.00     0.00     9.00   9.00   0.30
scd0    0.00    0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00
dm-2    0.00    0.00    0.00  1.33     0.00    1365.33    1024.00  0.15     737.50 43.75  5.83
dm-3    0.00    0.00    0.00  0.00     0.00    0.00       0.00     0.30     0.00   0.00   5.83

Device: rrqm/s  wrqm/s  r/s   w/s      rsec/s  wsec/s     avgrq-sz avgqu-sz await  svctm  %util
sda     0.00    0.67    0.00  1.33     0.00    13.33      10.00    0.00     0.00   0.00   0.00
sdb     0.00    0.00    0.00  820.00   0.00    685968.00  836.55   69.78    85.10  0.68   55.37
dm-0    0.00    0.00    0.00  1.67     0.00    13.33      8.00     0.00     0.00   0.00   0.00
dm-1    0.00    0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00
scd0    0.00    0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00
dm-2    0.00    602.67  0.00  820.00   0.00    685968.00  836.55   69.86    85.19  0.68   55.37
dm-3    0.00    0.00    0.00  1422.67  0.00    685968.00  482.17   121.43   85.30  0.39   55.37

If the latency is ~43 ms, then backpressure could build on the wb xlator as well. We are trying to look at the code and figure out the explanation you wanted wrt NFS commits.

5. It would be great if you could share your RHS-2.0-U5 run results.
I'm not happy: I just reproduced the problem with the Anshi glusterfs-3.3.0.7 RPMs. I tried to run everything the same way I did before, when it clearly did not hang. I changed my test script so it always logs "rpm -qa" and "gluster volume info"; hopefully in the future this will answer questions like this. So it appears that we have always had the problem. Are you sure that no customers have encountered this already?

You don't need iozone to reproduce the problem; a single dd command will do it. This is the simplest write application I know of, so every Gluster user will see this problem when writing large files. This command hung after only 9 GB of data. The replicas were not on the NFS server. You only need 2 servers and 1 client to reproduce this:

ssh gprfc081 'f=/mnt/glnfs/x.dd; rm -fv $f ; dd if=/dev/zero of=$f bs=16384k count=1024'

But this next command never hung and never leaked memory -- the glusterfs process did not grow its virtual memory at all. It is the dd equivalent of iozone's "-o" option, which uses the O_SYNC flag when opening the file: every write request is flushed to storage before it completes.

ssh gprfc081 'f=/mnt/glnfs/x.dd; rm -fv $f ; dd if=/dev/zero of=$f bs=16384k count=1k oflag=sync'
removed `/mnt/glnfs/x.dd'
1024+0 records in
1024+0 records out
17179869184 bytes (17 GB) copied, 137.944 s, 125 MB/s

Again this suggests that NFS COMMIT processing is not working.
gluster version: Big Bend (3.4.0.12rhs.beta5-2)

Tuning:
1. Server: vm.min_free_kbytes = 1048576
2. Client: mount NFS with bigger rsize/wsize:
   mount -t nfs -o nfsvers=3,tcp,rsize=32768,wsize=32768 gprfs017-10ge:/perftest /mnt/glnfs
3. In iozone, set the record size smaller to avoid huge RPC retransmissions (-r 4k with 16 threads).

Verified on the client side (all clients): no retransmissions happen. The glusterfs(d) processes also came down in memory size. With this, iozone finished successfully; here is the output:

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 32g -t 16

Iozone: Performance Test of File I/O
Version $Revision: 3.392 $
Compiled for 64 bit mode.
Build: linux-AMD64

Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins, Al Slater, Scott Rhine, Mike Wisner, Ken Goss, Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner, Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone, Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root, Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer, Ben England.

Run began: Tue Jul 23 05:27:36 2013

Network distribution mode enabled.
Hostname = 10.16.159.187
Setting no_unlink
Include close in write timing
Include fsync in write timing
No retest option selected
Record Size 4 KB
File size set to 33554432 KB
Command line used: iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 32g -t 16
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 16 processes
Each process writes a 33554432 Kbyte file in 4 Kbyte records

Children see throughput for 16 initial writers = 416480.96 KB/sec
Min throughput per process = 24609.73 KB/sec
Max throughput per process = 27953.86 KB/sec
Avg throughput per process = 26030.06 KB/sec
Min xfer = 29644520.00 KB
Child[0]  xfer count = 30135464.00 KB, Throughput = 24888.27 KB/sec
Child[1]  xfer count = 29906884.00 KB, Throughput = 24827.08 KB/sec
Child[2]  xfer count = 32807456.00 KB, Throughput = 27329.24 KB/sec
Child[3]  xfer count = 32005172.00 KB, Throughput = 26662.56 KB/sec
Child[4]  xfer count = 30035532.00 KB, Throughput = 24817.52 KB/sec
Child[5]  xfer count = 29735420.00 KB, Throughput = 24685.27 KB/sec
Child[6]  xfer count = 32017976.00 KB, Throughput = 26673.22 KB/sec
Child[7]  xfer count = 31608052.00 KB, Throughput = 26331.74 KB/sec
Child[8]  xfer count = 30064660.00 KB, Throughput = 24827.14 KB/sec
Child[9]  xfer count = 30332952.00 KB, Throughput = 25183.77 KB/sec
Child[10] xfer count = 33040404.00 KB, Throughput = 27521.22 KB/sec
Child[11] xfer count = 33084820.00 KB, Throughput = 27562.14 KB/sec
Child[12] xfer count = 30635224.00 KB, Throughput = 25307.52 KB/sec
Child[13] xfer count = 29644520.00 KB, Throughput = 24609.73 KB/sec
Child[14] xfer count = 32774976.00 KB, Throughput = 27300.69 KB/sec
Child[15] xfer count = 33554432.00 KB, Throughput = 27953.86 KB/sec

iozone test complete.

The DRC patch is yet to be tested.

Thanks, Santosh
vm.min_free_kbytes = 131072 (reduced to 128 MB from 1 GB)
mount -t nfs -o nfsvers=3,tcp,rsize=32768,wsize=32768
iozone record size = 16k

This time, it passed as well:

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 16k -s 32g -t 16

Iozone: Performance Test of File I/O
Version $Revision: 3.392 $
Compiled for 64 bit mode.
Build: linux-AMD64

Run began: Tue Jul 23 06:20:13 2013

Network distribution mode enabled.
Hostname = 10.16.159.187
Setting no_unlink
Include close in write timing
Include fsync in write timing
No retest option selected
Record Size 16 KB
File size set to 33554432 KB
Command line used: iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 16k -s 32g -t 16
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 16 processes
Each process writes a 33554432 Kbyte file in 16 Kbyte records

Children see throughput for 16 initial writers = 425322.62 KB/sec
Min throughput per process = 26312.52 KB/sec
Max throughput per process = 26887.80 KB/sec
Avg throughput per process = 26582.66 KB/sec
Min xfer = 31964480.00 KB
Child[0]  xfer count = 32900256.00 KB, Throughput = 26468.06 KB/sec
Child[1]  xfer count = 31976400.00 KB, Throughput = 26323.77 KB/sec
Child[2]  xfer count = 33413312.00 KB, Throughput = 26750.23 KB/sec
Child[3]  xfer count = 33157936.00 KB, Throughput = 26559.38 KB/sec
Child[4]  xfer count = 33115824.00 KB, Throughput = 26636.75 KB/sec
Child[5]  xfer count = 31964480.00 KB, Throughput = 26317.89 KB/sec
Child[6]  xfer count = 33021152.00 KB, Throughput = 26450.86 KB/sec
Child[7]  xfer count = 33114976.00 KB, Throughput = 26529.80 KB/sec
Child[8]  xfer count = 33304128.00 KB, Throughput = 26887.80 KB/sec
Child[9]  xfer count = 32330704.00 KB, Throughput = 26619.55 KB/sec
Child[10] xfer count = 33297312.00 KB, Throughput = 26662.14 KB/sec
Child[11] xfer count = 33161376.00 KB, Throughput = 26564.73 KB/sec
Child[12] xfer count = 32718272.00 KB, Throughput = 26312.52 KB/sec
Child[13] xfer count = 32231040.00 KB, Throughput = 26529.39 KB/sec
Child[14] xfer count = 33499104.00 KB, Throughput = 26827.74 KB/sec
Child[15] xfer count = 33554432.00 KB, Throughput = 26882.01 KB/sec

iozone test complete.
Sorry, I can't reproduce your result. I tried all your suggestions; it hangs the same way it did before. However, there are no stack traces, which is consistent with the fact that RHEL 6.4 isn't as vulnerable to the memory-fragmentation problem.

For example, gprfc090's "nfsstat -c" shows the output below; gprfc090 was rebooted immediately before this test was run, so the counters were all zeroed. The counters stop changing when it hangs, and /var/log/messages says the NFS server is not responding. The NFS server shows glusterfs at 7 GB, and it's using up one core.

Jul 23 13:31:28 gprfc090 kernel: nfs: server gprfs018-10ge not responding, still trying

If we analyze the NFS counters, we have ~2.6 million writes x 32 KB/write = ~83 GB of writes, which makes sense -- iozone has 4 threads running on this client, each trying to write 32 GB, for a total of 128 GB. The number of COMMIT requests is 68, so we have roughly 1 GB per COMMIT request! This may be the problem. If we issue a COMMIT while the system is busy, or perhaps multiple COMMITs arrive from different clients, could it time out? I don't know of a way to increase the COMMIT frequency on the client side. I thought fs.nfs.nfs_congestion_kb would do it, but it doesn't work.

Client rpc stats:
calls      retrans    authrefrsh
2657537    25597      2539831

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0  0%      12 0%      0  0%      20 0%      20 0%      0  0%
read       write      create     mkdir      symlink    mknod
0  0%      2658483 99%  12 0%    0  0%      0  0%      0  0%
remove     rmdir      rename     link       readdir    readdirplus
0  0%      0  0%      0  0%      0  0%      0  0%      0  0%
fsstat     fsinfo     pathconf   commit
0  0%      4  0%      2  0%      68 0%
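The GB-per-COMMIT arithmetic above can be checked directly from the nfsstat counters. A minimal sketch, using the write and COMMIT counts from the nfsstat output in this comment and assuming wsize=32768 from the mount options used in these runs:

```shell
# Data flushed per NFS COMMIT, from client-side nfsstat counters.
writes=2658483          # NFSv3 "write" count from nfsstat -c
wsize=32768             # bytes per write, per the rsize/wsize mount options
commits=68              # NFSv3 "commit" count from nfsstat -c
total_gib=$(awk -v w="$writes" -v s="$wsize" \
    'BEGIN { printf "%.1f", w * s / (1024^3) }')
per_commit=$(awk -v w="$writes" -v s="$wsize" -v c="$commits" \
    'BEGIN { printf "%.2f", w * s / c / (1024^3) }')
echo "total written: $total_gib GiB, data per COMMIT: $per_commit GiB"
```

This lands at roughly 81 GiB written and ~1.2 GiB per COMMIT, consistent with the "~83 GB of writes, roughly 1 GB per commit" estimate above.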
I now have some proof that this problem is in the Gluster NFS implementation and needs to be fixed there. Jeff Layton helped me figure out how to increase the NFS COMMIT frequency from the client -- I set this on all the clients:

# sysctl -w vm.dirty_bytes=262144000 vm.dirty_background_bytes=65536000

When I do this, the excessive glusterfs memory consumption largely does not occur. By default the kernel uses vm.dirty_ratio and vm.dirty_background_ratio, which default to 20% and 10% of RAM respectively, so for a 64-GB system that is roughly 12 GB and 6 GB. So the Gluster NFS translator is in fact flushing when a COMMIT arrives; the COMMITs just weren't arriving often enough to keep the write-behind list from getting excessively long.

With this setting, I saw glusterfs grow to 2.7 GB VM size, but it still seemed able to function, and the test completed, even with a record size of 16384 KB. Now I'm going to remove some of the other tuning and verify that it still works, to confirm the hypothesis. Sorry that it takes a long time.

But you cannot expect NFS clients to tune vm.dirty_bytes and vm.dirty_background_bytes; this was just a workaround. What has to happen is that Gluster automatically flushes uncommitted writes to storage if they exceed some size threshold. So instead of calling nfs_flush from just the nfs3_commit routine, you have to trigger a call to it from somewhere else (maybe during write processing) when this threshold is reached. Can we make the threshold configurable, please?
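For context, here is what the kernel's percentage-based defaults work out to on a client of this size. A small illustrative calculation (assuming exactly 64 GB of RAM; the kernel truncates to whole pages, not whole GiB, so real values are slightly higher):

```shell
# What the default vm.dirty_ratio / vm.dirty_background_ratio mean in bytes
# on a 64-GB NFS client (illustrative arithmetic, not read from a live box).
ram_kib=$((64 * 1024 * 1024))       # 64 GB of client RAM, in KiB
dirty_ratio=20                      # vm.dirty_ratio default (percent)
dirty_background_ratio=10           # vm.dirty_background_ratio default
dirty_gib=$(( ram_kib * dirty_ratio / 100 / 1024 / 1024 ))
bg_gib=$(( ram_kib * dirty_background_ratio / 100 / 1024 / 1024 ))
echo "dirty threshold: ~${dirty_gib} GiB, background threshold: ~${bg_gib} GiB"
```

Against the ~250 MB / ~65 MB byte-based settings used as the workaround, the defaults let the client accumulate two orders of magnitude more uncommitted data before a COMMIT is forced.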
Now that I've increased the NFS COMMIT frequency as Jeff Layton suggested, with vm.dirty_background_bytes=32768000 and vm.dirty_bytes=65536000 on the NFS clients to help the Gluster NFS implementation flush writes to the bricks sooner, I have successfully backed out the tunings that were introduced earlier when we weren't sure of the cause -- no hangs occur anymore. The glusterfs NFS server process never exceeds 1 GB, and is closer to 600 MB now.

- Removing NFS mount options rsize=32k,wsize=32k does not cause a hang.
- Removing the 16 KB record size (back to 64 KB) does not cause a hang.
- Removing the kernel network tunings in /etc/sysctl.conf does not cause a hang (TCP autotuning should be handling some of them).
- No volume tuning parameters were required -- but eager-lock may help performance.
- vm.min_free_kbytes=1GB is not required (on Big Bend = RHEL 6.4) -- but may help performance.

But as soon as I set vm.dirty_background_bytes=0 and vm.dirty_bytes=0 (the default values), glusterfs becomes huge and hangs again. This bz is not fixed until NFS clients don't have to be tuned.
Not a regression. Requires kernel tunables. Now targeting Corbett.
Working on the DRC patch.
Update:
=======
1. I tried the DRC patch (without any tuning on the client/server side); same behaviour, i.e. iozone hangs.

My findings after all this testing:
===================================
2. Whenever iozone hangs, we see these behaviours on the client/server side:

a. Client side: nfsstat shows a lot of retransmissions, e.g.

Client rpc stats:
calls      retrans    authrefrsh
418122     31379      396229

which is huge. What does it tell us? The NFS server is not able to handle the workload. Gluster NFS is single-threaded (and that cannot be changed now). [When this is seen with kernel NFS, the advice is to increase the number of NFS kernel threads.]

b. Server side: "netstat -s" shows a lot of packet-overrun issues in the network, e.g.

1884 packets pruned from receive queue because of socket buffer overrun
42 packets pruned from receive queue
14 packets dropped from out-of-order queue because of socket buffer overrun
1629267 packets collapsed in receive queue due to low socket buffer

This demands network tuning on the server side for better NFS performance.

3. What can be tuned for better NFS performance:
a. Network (server side)
b. System, e.g. the VM subsystem on the server
c. Client, e.g. the VM subsystem on the client

It's recommended to tune the client VM subsystem for better NFS performance, especially the dirty page-cache parameters, which can congest the network during a high-I/O workload (a high volume of writes) on a system with a lot of RAM (more than 8 GB). In our case, the clients have more RAM (64 GB) than the servers (~48 GB). The default vm params on the client machines:

vm.dirty_background_ratio = 10 (which is ~6.4 GB)
vm.dirty_ratio = 20 (which is ~12.8 GB)

These are really big for an NFS client and need tuning. The byte-based equivalents default to:

vm.dirty_background_bytes = 0
vm.dirty_bytes = 0

The tuning can be made through either set of parameters, but only one set is taken into account; whichever is set last takes precedence, I believe (not sure).
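The server-side symptoms in 2(b) can be spotted mechanically. A small sketch that filters for the pressure counters; a canned sample stands in here for live "netstat -s" output:

```shell
# Count socket-buffer pressure lines in netstat -s output (canned sample;
# on a live server, pipe the real command instead).
netstat_sample='    1884 packets pruned from receive queue because of socket buffer overrun
    42 packets pruned from receive queue
    14 packets dropped from out-of-order queue because of socket buffer overrun
    1629267 packets collapsed in receive queue due to low socket buffer'
pressure=$(printf '%s\n' "$netstat_sample" | grep -cE 'pruned|dropped|collapsed')
echo "socket-buffer pressure lines: $pressure"
```

A nonzero and growing count across samples is the signal that the server's receive path is shedding load.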
http://www.novell.com/support/kb/doc.php?id=7010287

Even if we enforce flushing in Gluster NFS, I'm not sure how much it is going to help. I am not sure, but I would guess the kernel NFS server also does not do any such enforcement. I would request Anand/Ben/Sayan/Vivek for their input to take it forward from here.

Thanks, Santosh
One more thing I forgot to mention: a possible performance problem because of RAID (which might need some tuning). The glusterfs brick is cut from a RAID disk (on both servers), and that might impose a performance bottleneck (as mentioned by Anand in one of the email threads).

What is the RAID level for the bricks?
What is the chunk size of the RAID?
What is the block size of the underlying filesystem (XFS in this case)?

I don't know much about RAID tuning for performance, but I think XFS/RAID needs some tuning for better performance. Reference: http://www.redwax.co.uk/blog-reader/items/linux-ext2-ext3-and-ext4-stride-stripe-calculation-on-raid.html

(1) Ideally, a 10GbE setup should provide better NFS performance than a 1GbE interface. If NFS is going to write 64 KB of data (GlusterFS supports NFSv3, which auto-negotiates to 64 KB rsize/wsize), then it should generate more TCP packets on 1GbE (1500 MTU) than on 10GbE (MTU >= 9000). [Exception: 1GbE can also support jumbo frames if configured.] But in our case the reverse happens: I tested a 1GbE setup where the iozone test passed with all record sizes. In fact, the iozone record size should not matter; what matters is the NFS (r/w)size.

(2) In the past, Ben tested the same thing with RHS 2.0 U4 (per the shared test results) and the iozone test passed. But with the same version of the GlusterFS software, the test fails in this setup.

From both of the above points, it looks like the disk plays some part in the performance degradation. I will check with Anand/the system admin (here in Bangalore) whether jumbo frames can be enabled on the switch where both of the BLR servers are connected. Both servers have 10GbE interfaces. Both machines are connected to SATA disks (Anand, please correct me if the disks are SAS or otherwise).
I tried using vm.dirty_ratio=2 vm.dirty_background_ratio=1 and it hung, so no reasonable vm.dirty tuning is going to make Gluster NFS work. I also saw glusterfs fail to stop when I stopped the volume. This makes it impossible to recover without killing the glusterfs NFS process, which is not a good thing.

re: .-1 These clients are not abnormally sized. If anything, you could see clients with far more memory than this in the field.

RAID level is 6. Chunk size is 256 KB or 64 KB, not sure which right now. Block size is the default (directory block size is 8192). The XFS filesystem is made according to the RHS best-practices document, as follows, by the test harness used by the perf team:

mkfs -t xfs -f -i size=512 -n size=8192 -d agcount=31,su=256k,sw=10 -L RHSbrick0 /dev/vg_brick0/lv && mount -v -t xfs -o inode64,noatime -L RHSbrick0 /mnt/brick0

agcount=31 should not be needed if you have the latest xfsprogs RPM in Big Bend. noatime is irrelevant for this workload.

Typically jumbo frames can be enabled in the switch. A jumbo frame size of 7000 is almost as good as 9000 but is much easier for the memory allocator to find, so if you are seeing kernel stack traces with jumbo frames, try that.

As for disks, the BAGL hosts use SAS 7200-RPM drives behind a PERC H700 (MegaRAID) controller with writeback cache enabled.
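For the RAID-alignment question, the su/sw values in the mkfs command above encode the array geometry. A sketch of the arithmetic, on the assumption (not stated in this bug) of a 12-disk RAID-6 set, i.e. 10 data + 2 parity disks, with a 256-KB chunk:

```shell
# XFS stripe-unit / stripe-width geometry for a RAID-6 brick.
total_disks=12                      # assumed array size
parity_disks=2                      # RAID 6 uses two parity disks
chunk_kb=256                        # RAID chunk size = XFS su
data_disks=$(( total_disks - parity_disks ))
stripe_kb=$(( chunk_kb * data_disks ))
echo "sw=$data_disks (data disks), full stripe = $stripe_kb KB"
```

This reproduces the sw=10 in the mkfs command; writes that are multiples of the 2560-KB full stripe avoid RAID-6 read-modify-write cycles.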
(In reply to Ben England from comment #40)
> I tried using vm.dirty_ratio=2 vm.dirty_background_ratio=1 and it hung, so
> no reasonable vm.dirty tuning is going to make Gluster-NFS work.

[Santosh] dirty_background_ratio=1, i.e. 1% of RAM, is still ~640 MB, which looks high. Maybe something around 300 MB would work for our RAM size, as suggested by the Novell KB article (yet to be tried out).

> I also saw glusterfs fail to stop when I stopped the volume. THis makes it
> impossible to fix without killing the glusterfs NFS process, not a good
> thing.

[Santosh] Agreed, this does not look good. :-( I am not actively working on this bug, as it is targeted for Corbett. Thanks for all the experimentation.

Thanks, Santosh
I have a hang now even with the kernel tuning. Performance was bad with the prior vm.dirty_bytes=65536000, so I tried multiplying by 4:

-- gprfc077 --
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 32768000
vm.dirty_ratio = 0
vm.dirty_bytes = 262144000

and with this workload:

/usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.70.252 -i 0 -C -w -+n -s 67108864 -r 128 -c -e -t 8

I got multiple of these messages on multiple servers:

Aug 5 15:09:53 gprfs017 GlusterFS[17954]: [2013-08-05 15:09:53.438281] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-0: cannot lookup the saved frame corresponding to xid (1618709)
..
Aug 5 15:45:04 gprfs019 GlusterFS[16293]: [2013-08-05 15:45:04.467323] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-2: cannot lookup the saved frame corresponding to xid (1130351)
...

and in /var/log/glusterfs/nfs.log:

[2013-08-05 15:45:09.733087] I [client-handshake.c:450:client_set_lk_version_cbk] 0-perftest-client-2: Server lk version = 1
[2013-08-05 16:15:18.775003] E [rpc-clnt.c:207:call_bail] 0-perftest-client-2: bailing out frame type(GlusterFS 3.3) op(FXATTROP(34)) xid = 0x1138745x sent = 2013-08-05 15:45:08.199970. timeout = 1800
[2013-08-05 16:15:18.775040] W [client-rpc-fops.c:1811:client3_3_fxattrop_cbk] 0-perftest-client-2: remote operation failed: Transport endpoint is not connected
[2013-08-05 18:30:09.639496] E [rpc-clnt.c:207:call_bail] 0-perftest-client-2: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x1167741x sent = 2013-08-05 18:00:07.870139. timeout = 1800
...

My theory is that some gluster process crashes and loses NFS state, such that NFS cannot make progress because it can't get COMMITs to succeed.
Ben, I don't think a gluster process crashes. They become unresponsive, mostly because of TCP/IP stack pressure (on the server side).

1. From our experimentation, one thing looks clear: if we issue NFS COMMIT more frequently with less data cached on the client, the test PASSes. dd/iozone with the O_DIRECT/sync flag PASS because they bypass/avoid the client cache; no RPC retransmissions and no TCP/IP packet drops/data loss are seen.

2. Conversely, if we increase the cache amount (vm.dirty_ratio/vm.dirty_background_ratio), the test FAILs. In the FAIL case, RPC retransmissions go up like crazy, along with TCP/IP-layer pressure on the server side.

So my understanding is that no single tunable is going to help NFS performance.

Client side:
============
1. Tune vm.dirty_ratio and vm.dirty_background_ratio (or the _bytes equivalents) on machines with huge RAM. The sizes need to be decided based on experiments/various tests.
2. Increase the number of RPC connections to the server, i.e. /proc/sys/sunrpc/tcp_slot_table_entries, especially for 10GbE interfaces. [Ref: http://researcher.watson.ibm.com/researcher/view_person_subpage.php?id=4427]

Server side:
============
1. Tune the VM subsystem (already in place).
2. a) Tune the TCP/IP/socket-layer buffers -- needs to be done.
   b) Tune at the NIC level, i.e. txqueuelen.
3. Tune performance.write-behind-window-size (not sure what value would be good; need to check).
4. Tune at the I/O-scheduler level for disk I/O performance. For an NFS server machine, I guess deadline is the scheduler to go with. It is already SET on /dev/sdb, which is the brick dir (if I am not wrong). [Ref: https://docspace.corp.redhat.com/docs/DOC-92040]

In our test bed, it looks like the tcp/socket(core)/ip/NIC-layer tuning is needed on the SERVER side, and the VM cache size and RPC tcp_slot_table_entries need to be tuned on the CLIENT side.

Thanks, Santosh
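To make the checklist above concrete, here is a dry-run sketch of the knobs it names. Every value, and the interface name p2p1, is an illustrative assumption to be validated by experiment on this testbed, not a recommendation from this bug:

```shell
# Candidate tunables from the client/server lists above, printed rather than
# applied so the starting values can be reviewed first.
tuning_cmds='sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.core.rmem_max=16777216
ip link set dev p2p1 txqueuelen 10000
echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
gluster volume set perftest performance.write-behind-window-size 1MB'
printf '%s\n' "$tuning_cmds"
```

The first three belong on the server, the fourth on the clients (before the NFS mount), and the last is the Gluster-side window size; after each change, re-check "netstat -s" and "netstat -in" for drops.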
I disagree that the problem is the single-threadedness of NFS. I reduced the number of clients from 8 to 4, so there was only 1 NFS client per NFS server, and the same problem still occurs with sufficiently large files. Throughput is better, though, until the hang occurs. Sayan has said repeatedly that a solution which requires tuning the NFS client is not acceptable, because no one will implement it. We need to fix the NFS server to avoid this problem. I have not been able to tune my way out of the problem by adjusting TCP buffering kernel parameters on the server, for example. Please do not close this bug on the basis that client tuning is a solution.
This is added as a known issue in the 2.1 release notes.
Created attachment 797446 [details] known issue
The GA link is here for the known issue: https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/2.1_Release_Notes/index.html#chap-Documentation-2.1_Release_Notes-Known_Issues
Moving it back to assigned state: https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/2.1_Release_Notes/index.html#chap-Documentation-2.1_Release_Notes-Known_Issues
See bz 1008301 comment 4 for workaround
I tested with minimal changes (from the patch posted for review) against downstream; the iozone load is passing.

Volume info:
============
Volume Name: perftest
Type: Distributed-Replicate
Volume ID: c052bfd8-fd21-4140-8059-f3d2c5993b76
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Brick3: gprfs019-10ge:/mnt/brick0/brick
Brick4: gprfs020-10ge:/mnt/brick0/brick

How the volume is mounted with 8 clients:
=========================================
All mount points on gprfs017.sbu.lab.eng.bos.redhat.com:
172.17.10.89:/perftest
172.17.10.94:/perftest
172.17.70.252:/perftest
All mount points on gprfs018.sbu.lab.eng.bos.redhat.com:
172.17.10.90:/perftest
172.17.10.95:/perftest
All mount points on gprfs019.sbu.lab.eng.bos.redhat.com:
172.17.10.91:/perftest
172.17.10.96:/perftest
All mount points on gprfs020.sbu.lab.eng.bos.redhat.com:
172.17.10.77:/perftest
172.17.10.92:/perftest

IOZONE workload:
================
Iozone: Performance Test of File I/O
Version $Revision: 3.392 $
Compiled for 64 bit mode.
Build: linux-AMD64

Run began: Mon Oct 7 05:46:00 2013

Network distribution mode enabled.
Hostname = 172.17.70.252
Setting no_unlink
Include close in write timing
Include fsync in write timing
No retest option selected
Record Size 128 KB
File size set to 67108864 KB
Command line used: /usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.70.252 -C -w -c -e -i 0 -+n -r 128 -s 64g -t 8
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

Throughput test with 8 processes
Each process writes a 67108864 Kbyte file in 128 Kbyte records

Children see throughput for 8 initial writers = 984939.92 KB/sec
Min throughput per process = 119135.94 KB/sec
Max throughput per process = 128661.16 KB/sec
Avg throughput per process = 123117.49 KB/sec
Min xfer = 59963392.00 KB
Child[0] xfer count = 59963392.00 KB, Throughput = 119135.94 KB/sec
Child[1] xfer count = 64541440.00 KB, Throughput = 128661.16 KB/sec
Child[2] xfer count = 67108864.00 KB, Throughput = 122749.62 KB/sec
Child[3] xfer count = 60181632.00 KB, Throughput = 123384.07 KB/sec
Child[4] xfer count = 60026624.00 KB, Throughput = 119214.86 KB/sec
Child[5] xfer count = 60088448.00 KB, Throughput = 126651.59 KB/sec
Child[6] xfer count = 66225664.00 KB, Throughput = 121826.70 KB/sec
Child[7] xfer count = 60401152.00 KB, Throughput = 123315.98 KB/sec

iozone test complete.
manoj says some of his tests still fail with the patch. I think we need to understand whether the original problem is solved. In the original configuration we used 8 clients, 4 servers, and 1/2/4 threads per server, writing a total of 128 GB of data per server in the test. I'd love to think this solves the problem, but until I see the original test case pass I'm unconvinced. The reason the original test case matters is that with 4 servers there is a 50% chance that any given file's replicas are not on its NFS server; this induces more network traffic and latency.
(In reply to Ben England from comment #57)
> manoj says some of his tests still fail with the patch. I think we need

[Santosh] I don't know what workload he is trying. In fact, he was present while I ran the iozone I/O workload (with the 1x2 volume, 4 clients, 2 clients mounted per server), which PASSED. :-)

> to understand whether the original problem is solved. In the original
> configuration we used 8 clients, 4 servers, and 1/2/4 threads/server,
> writing a total of 128 GB data per server in the test. I'd love to think
> this solves the problem but until I see the original test case pass I'm
> unconvinced. The reason that the original test case matters is that with 4
> servers, you have 50% chance that any file will replicas not on the NFS
> servers, this induces more network traffic and latency.

[Santosh] This is exactly what was being tested in comment #55.
Avati's suggestions in today's discussion:

- Turn off write-behind; we don't need it, particularly with the NFS client aggregating I/O.
- Have enough RPC slots to not starve clients (the above patch, right?).
- Implement fairness at the RPC socket-polling level (round-robin scheduling) so one client cannot monopolize resources on the NFS server (glusterfs). This last change has not been coded yet.
After so many experiments, it looks like the number of in-flight RPC requests is not the real issue. The real issue is network-subsystem tuning (at the driver level).

While the clients are sending/bursting huge amounts of I/O, TCP auto-tuning grows the receive buffers to higher values to absorb the I/O packets (NFS I/O gets broken into TCP-frame-sized packets, i.e. a 1-MB NFS I/O will be broken into 9000-byte jumbo frames to send across). If the Gluster NFS process (through the kernel) cannot process the packets fast enough for some reason -- especially slow disk bandwidth (high avgqu-sz and await times) -- then back pressure builds up to NFS and its memory footprint grows. Then the real problem builds: the TCP/IP-stack receive queue (netdev_max_backlog) gets FULL and starts dropping packets, and eventually the connection gets RESET.

This really matters. If the connection between the NFS server and a brick server gets reset, the NFS server's memory footprint blows up and it is finally OOM-killed or similar. If instead the connection between the NFS client and the server gets reset, the NFS server finishes its pending I/O to its disk and the other bricks (cleaning up its outstanding backlog) and its memory footprint comes back down to normal -- but it stops getting I/O packets from the client. Hence the problem.

Solution:
1. A faster disk subsystem, i.e. lower avgqu-sz and await times.
2. Tune txqueuelen (Tx queue size) and netdev_max_backlog (Rx queue size). There is no standard value; tune per the requirement, until there are no drops in "netstat -in" output.

I am going to modify the FIX which was up for review. I may abandon it, or push a smaller change to keep the NFS fop size in sync with the RPC library. I need Ben's/Manoj's view on this.
I followed up on a suggestion from Sayan in the RHS-perf call: look at kernel NFS server behavior under the same workload, on the same setup. The kernel NFS server exported the same XFS filesystem that served as the brick FS in the gluster NFS runs; gprfs017 served as the NFS server in these runs. All kernel NFS runs completed smoothly.

Variations tried:
- iozone with 4 clients: -s 16g, -t 4
- iozone with 8 clients: -s 16g, -t 8
- iozone with 8 clients and 1 NFS server thread instead of the default 8: -s 16g, -t 8

[For comparison, gluster NFS iozone runs were failing with -s 8g, -t 4 (4 clients), single-head mount of an RHS replicated volume.]

I/O completion times reported by iostat (await) are fairly high for the kernel NFS runs as well (typically 70+ ms), but neither that nor single-threadedness results in any serious problems.
Note that this is with the same NFS client, same network, and same type of storage that Gluster's NFS implementation is using; the replication traffic is absent, of course. This result is consistent with the hypothesis I have been pursuing: that there is a bug in the NFS translator preventing (a) fairness and (b) limiting the size of the write-request list. Kernel NFS does not have this problem; I think that is because vm.dirty_ratio on the kNFS server will eventually force it to flush pages in the buffer cache even if the client does not issue a COMMIT RPC. Gluster does not have a similar mechanism AFAIK, but it should. Just issuing a flush when the list exceeds a few hundred MB would be reasonable, because that would keep the list small enough.

If you write the same amount of data with each thread writing a series of 1-GB files with an fsync after each file, instead of one big 16-GB file, I bet it doesn't hang. When a file is closed and fsync'ed, you can see the glusterfs process giving back the memory.

Avati suggested this, but it's not clear to me why Gluster even needs a write-behind translator in the NFS stack, since the NFS client is itself performing the same function (i.e. it aggregates writes before issuing the RPC). Now that you added the RPC size tuning, the NFS client can aggregate much more data before issuing the RPC -- up to 1 MB, right? That's the same as the write-behind translator's default tuning.
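The per-file-fsync experiment above can be sketched with plain dd. Sizes here are deliberately tiny so the sketch runs anywhere; the real test would use 1-GiB files (bs=1M count=1024) on the NFS mount /mnt/glnfs, and the temp path is illustrative:

```shell
# Write a series of files, flushing each before moving on, instead of one
# big file, so the server can release its uncommitted-write state per file.
target="${TMPDIR:-/tmp}/flushtest"
mkdir -p "$target"
for i in 1 2 3 4; do
    # conv=fsync makes dd fsync the file before exiting
    dd if=/dev/zero of="$target/f$i" bs=64k count=4 conv=fsync 2>/dev/null
done
ls "$target" | wc -l
```

The contrast with the single-large-file case is the point: same total data, but the fsync boundaries bound how long the server's write list can grow.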
(In reply to Ben England from comment #64)
> if you write the same amount of data with each thread writing a series of
> 1-GB files with fsync after each file, instead of 1 big 16-GB file, I bet it
> doesn't hang. When file is closed and fsync'ed, you can see glusterfs
> process giving back the memory.

Here are the runs for this case, with the smallfile benchmark. The configuration is 4 clients, single-head NFS mount of a replica volume.

# 1st run: each client writes 8 files, each 1 GB; fsync after each file.
# file-size is in KB
for op in cleanup create ; do
  ./smallfile_cli.py --top /mnt/glnfs/smallfile --threads 1 \
    --host-set gprfc089,gprfc090,gprfc091,gprfc092 \
    --file-size 1048576 --record-size 256 --files 8 --fsync y --operation $op
done

This run completes successfully.

# 2nd run: each client writes one 8-GB file, then fsync
for op in cleanup create ; do
  ./smallfile_cli.py --top /mnt/glnfs/smallfile --threads 1 \
    --host-set gprfc089,gprfc090,gprfc091,gprfc092 \
    --file-size 8388608 --record-size 256 --files 1 --fsync y --operation $op
done

This run failed.
Patch up for review: https://code.engineering.redhat.com/gerrit/#/c/14685/
Merged the fix.
See bz 1008301, which is about the same thing, right? I don't understand why we clone bzs. To close this bug, you need to fix nfs.drc=on so it does not cause a massive memory leak, and you need to change the default for nfs.outstanding-rpc-limit to 8 from 64.
(In reply to Ben England from comment #70)
> see bz 1008301, which is about the same thing right? I don't understand why
> we clone bzs.

[Santosh] We clone bugs because we maintain 2 different code bases (one downstream/RHS and one upstream/community).

> To close this bug, you need to fix nfs.drc=on

[Santosh] In downstream/RHS, nfs.drc is now going to be OFF by default. See https://bugzilla.redhat.com/show_bug.cgi?id=1047782

> so it does not cause massive
> memory leak, and you need to change default for nfs.outstanding-rpc-limit to
> 8 from 64.

[Santosh] I don't think this change is needed. Could you please confirm that the performance goes down with value 64?
(In reply to santosh pradhan from comment #71)
[snip]
> > memory leak, and you need to change default for nfs.outstanding-rpc-limit to
> > 8 from 64.
>
> [Santosh] I dont think this change is needed. Could you please confirm that
> the performance goes down with value 64?

Per comment #71, needinfo on Ben England.
Performance doesn't go down -- because the NFS clients hang! The whole point of Avati's patch was to let us prevent the scenario where NFS clients write more data than Gluster can accept. With this tuning, the numbers are good. I published numbers for an RHS 2.1 U2 nightly build a while back; see the NFS tab in this:

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/bigbend/rhs21u2/rhs21u2-stats.ods

It's in the storage bootcamp TOC section on performance.

Measured with:

RPM glusterfs-3.4.0.43.1u2rhs-1.el6rhs.x86_64, from
http://download.lab.bos.redhat.com/nightly/RHSS-2.1u2-20131124.n.0/2.1u2/RHS/x86_64/os/
with tuning nfs.outstanding-rpc-limit: 16

Raw data at http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/
glnfs-rhs21u2-nfsorl16.csv - samples
glnfs-rhs21u2-nfsorl16-4svr-13-12-01-15-47-33 - run with 4 servers
glnfs-rhs21u2-nfsorl16-2svr-1repl-13-12-02-08-18-09 - run with 2 servers

When I ran tests without nfs.outstanding-rpc-limit=16, the NFS clients would hang, and I'd have to reboot them all to get them back.

Do you have any evidence that this tuning will cause problems for anyone? Has anyone else tried to reproduce this problem and observed whether or not this fixes it? If not, why not just change it and be done with it?
The default value of outstanding rpc limit is changed to 16 as per the suggestion from Ben. Code is posted for review. https://code.engineering.redhat.com/gerrit/#/c/18326/
(In reply to Ben England from comment #73)
[snip]
> Do you have any evidence that this tuning will cause problems for anyone?
> Has anyone else tried to reproduce this problem and observed whether or not
> this fixes it? If not, why not just change it and be done with it?

I did the change and posted it for review. With this, I think we are done and the bug can be closed.
Since the fix was abandoned, I am moving this bug back to ASSIGNED. Please move it to ON_QA with the correct fixed-in version.
https://code.engineering.redhat.com/gerrit/#/c/18388/
What does comment 77 mean? Is it in RHS 2.1 Update 2 or is it in Corbett? If you need doc text: "Default Gluster volume parameter values changed from nfs.outstanding-rpc-limit=64 and nfs.drc=on to nfs.outstanding-rpc-limit=16 and nfs.drc=off. These changes improve the performance and stability of the Gluster NFS implementation for multi-GB file writes."
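The doc text corresponds to per-volume settings a user could also apply by hand on a build that still ships the old defaults. A sketch, using the option names and values from the comments above ("perftest" is the volume name from this report):

```shell
# Limit the number of outstanding RPCs the Gluster NFS server will accept
# per client connection (the old default was 64):
gluster volume set perftest nfs.outstanding-rpc-limit 16

# Turn off the duplicate request cache, which caused the memory growth
# tracked in bz 1047782:
gluster volume set perftest nfs.drc off
```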
Update 2 and Corbett are the same. Thanks for the doc text.
Vivek, Can you please review the doc text for technical accuracy and sign off?
Verified on glusterfs-3.4.0.58rhs-1.el6rhs.x86_64. I tested several different configurations; this one seemed to be a good repro from earlier comments, so I am documenting it:

[root@gqas003 ~]# gluster volume status
Status of volume: testvol
Gluster process					Port	Online	Pid
------------------------------------------------------------------------------
Brick 192.168.123.4:/bricks/brick1		49152	Y	4369
Brick 192.168.123.2:/bricks/brick1		49152	Y	4373
Brick 192.168.123.4:/bricks/brick2		49153	Y	4380
Brick 192.168.123.2:/bricks/brick2		49153	Y	4384
Brick 192.168.123.4:/bricks/brick3		49154	Y	4391
Brick 192.168.123.2:/bricks/brick3		49154	Y	4395
Brick 192.168.123.4:/bricks/brick4		49155	Y	4402
Brick 192.168.123.2:/bricks/brick4		49155	Y	4406
NFS Server on localhost				2049	Y	4416
Self-heal Daemon on localhost			N/A	Y	4421
NFS Server on 192.168.123.2			2049	Y	4420
Self-heal Daemon on 192.168.123.2		N/A	Y	4424

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@gqas011 ~]# cat /clients.ioz
192.168.123.7 /gluster-mount /usr/bin/iozone
192.168.123.6 /gluster-mount /usr/bin/iozone
192.168.123.3 /gluster-mount /usr/bin/iozone
192.168.123.5 /gluster-mount /usr/bin/iozone

[root@gqas015 ~]# /opt/iozone/bin/iozone -+m /clients.ioz -+h 192.168.123.7 -i 0 -C -+n -s 64g -r 128 -c -e -t 4
	Iozone: Performance Test of File I/O
		Version $Revision: 3.420 $
		Compiled for 32 bit mode.
		Build: linux

	Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins
	              Al Slater, Scott Rhine, Mike Wisner, Ken Goss
	              Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	              Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	              Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
	              Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
	              Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
	              Vangel Bojaxhi, Ben England, Vikentsi Lapa.

	Run began: Wed Feb 5 13:58:26 2014

	Network distribution mode enabled.
	Hostname = 192.168.123.7
	No retest option selected
	File size set to 67108864 KB
	Record Size 128 KB
	Include close in write timing
	Include fsync in write timing
	Command line used: /opt/iozone/bin/iozone -+m /clients.ioz -+h 192.168.123.7 -i 0 -C -+n -s 64g -r 128 -c -e -t 4
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Throughput test with 4 processes
	Each process writes a 67108864 Kbyte file in 128 Kbyte records

	Test running:
	Children see throughput for 4 initial writers	=  471090.27 KB/sec
	Min throughput per process			=   78883.48 KB/sec
	Max throughput per process			=  166030.09 KB/sec
	Avg throughput per process			=  117772.57 KB/sec
	Min xfer					= 32556032.00 KB
	Child[0] xfer count = 32556032.00 KB, Throughput =  78883.48 KB/sec
	Child[1] xfer count = 40002944.00 KB, Throughput =  95576.45 KB/sec
	Child[2] xfer count = 53048960.00 KB, Throughput = 130600.24 KB/sec
	Child[3] xfer count = 67108864.00 KB, Throughput = 166030.09 KB/sec

	Test cleanup:

	iozone test complete.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html