Red Hat Bugzilla – Attachment 915863 Details for Bug 1073441: large NFS writes to Gluster slow down then stop
Attachment: comment_6888089.txt (text/plain), 112.42 KB, created by Kaleb KEITHLEY on 2014-03-06 12:57:41 UTC

Description: Comment
Filename:    comment_6888089.txt
MIME Type:   text/plain
Creator:     Kaleb KEITHLEY
Created:     2014-03-06 12:57:41 UTC
Size:        112.42 KB
>+++ This bug was initially created as a clone of Bug #1008301 +++ > >+++ This bug was initially created as a clone of Bug #977492 +++ > >Description of problem: > >If you run a very large NFS write test (128 GB/server) to RHS volume, throughput slows and eventually stops. It appears to be a bug related to write-behind translator. > >Version-Release number of selected component (if applicable): > >glusterfs-3.4.0.8 (am now trying with glusterfs-3.4.0.11) >kernel 2.6.32-358.6.2.el6.x86_64 > >How reproducible: > >every time > >Steps to Reproduce: >1.Construct Gluster volume as described below >2.run iozone as described below >3.watch what happens as we reach about half of the data written > >Actual results: > >iostat shows that throughput starts out very strong, but slows eventually until it effectively ceases. Attachment will show this. > >"top" shows that glusterfs NFS server process consumes all of 1 core. > >"perf top" shows that all of its time is spent in wb_requests_overlap() routine in write-behind translator, see screenshots in attachment. > >I disabled write-behind translator by creating a separate volfile for NFS server process and running it manually, and the test then completes. > >Expected results: > >No significant degradation in throughput, at least none due to Gluster. > >Additional info: > >how to construct Gluster volume: See attachment for details: > >- 10-GbE, jumbo frames >- usual XFS bricks >- gluster volume create perftest replica 2 gprfs045-10ge:/mnt/brick0/brick gprfs046-10ge:/mnt/brick0/brick gprfs047-10ge:/mnt/brick0/brick gprfs048-10ge:/mnt/brick0/brick >- gluster volume start perftest >- on each NFS client: mount -v -t nfs -o nfsvers=3,tcp,actimeo=1 %s:/perftest /mnt/glnfs > >We spread the NFS clients evenly across the NFS servers in the volume. > >how to run iozone: > >change 172.17.10.32 to the IP address that results from resolving the test driver's hostname. The file /var/tmp/iozone_cluster.config contains this: > >h1 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone >h2 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone >... >h8 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone >h1 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone >h2 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone >... >h8 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone > >So there are two threads per NFS client. > >root@gprfc032-10ge gluster_test]# /usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.10.32 -i 0 -C -+n -s 32g -r 128 -c -e -t 16 > >I got a core dump using "kill -BUS " on the glusterfs process. The coredump was 8 GB. My hypothesis is that the list of uncommitted writes is so long that this accounts for the large memory size of the process, and the act of checking for overlaps in these writes takes all of our CPU time. > >[root@gprfs048 ~]# ls -l /core.30528 >-rw------- 1 root root 9895763968 Jun 24 15:39 /core.30528 >[root@gprfs048 ~]# gdb /usr/sbin/glusterfs /core.30528 >GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1) >... >Reading symbols from /usr/sbin/glusterfs...(no debugging symbols found)...done. >[New Thread 30528] >... 
>(gdb) where >#0 0x00007fabc8022db7 in wb_requests_overlap () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so >#1 0x00007fabc8022f08 in wb_wip_has_conflict () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so >#2 0x00007fabc802326b in __wb_pick_winds () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so >#3 0x00007fabc80262d9 in wb_process_queue () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so >#4 0x00007fabc8026c08 in wb_writev () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so >#5 0x00007fabc393b3d7 in io_stats_writev () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/debug/io-stats.so >#6 0x00007fabc36f0c47 in nfs_fop_write () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so >#7 0x00007fabc36fe95b in __nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so >#8 0x00007fabc3706423 in nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so >#9 0x00007fabc370ea3b in nfs3_fh_resolve_inode_done () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so >#10 0x00007fabc370eb0b in nfs3_fh_resolve_inode () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so >#11 0x00007fabc370ebb5 in nfs3_fh_resolve_resume () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so >#12 0x00007fabc370ecc8 in nfs3_fh_resolve_root () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so >#13 0x00007fabc370cc01 in nfs3_write () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so >#14 0x00007fabc370cd62 in nfs3svc_write () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so >#15 0x0000003863808655 in rpcsvc_handle_rpc_call () from /usr/lib64/libgfrpc.so.0 >#16 0x00000038638087a3 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0 >#17 0x00000038638098d8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0 >#18 0x00007fabc92ab3a4 in socket_event_poll_in () from /usr/lib64/glusterfs/3.4.0.8rhs/rpc-transport/socket.so >#19 0x00007fabc92ab4fd in socket_event_handler () from /usr/lib64/glusterfs/3.4.0.8rhs/rpc-transport/socket.so >#20 0x000000386305a9b7 in ?? () from /usr/lib64/libglusterfs.so.0 >#21 0x0000000000406776 in main () >(gdb) up >#1 0x00007fabc8022f08 in wb_wip_has_conflict () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so > >--- Additional comment from RHEL Product and Program Management on 2013-06-24 13:22:59 EDT --- > >Since this issue was entered in bugzilla, the release flag has been >set to ? to ensure that it is properly evaluated for this release. > >--- Additional comment from Ben England on 2013-06-24 13:41:17 EDT --- > >For files documenting details of bug, such as core dump and perf top screenshots, see > >http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/bigbend/nfs-write-behind-prob.tgz > >--- Additional comment from RHEL Product and Program Management on 2013-07-04 00:41:13 EDT --- > >This bug report previously had all acks and release flag approved. >However since at least one of its acks has been changed, the >release flag has been reset to ? by the bugbot (pm-rhel). The >ack needs to become approved before the release flag can become >approved again. 
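For anyone repeating this reproduction, here is a minimal sketch of how the write-behind hot spot quoted above can be confirmed with perf. It assumes perf and the matching glusterfs debuginfo packages are installed; the volume name (perftest) and the parsing of the status output are taken from this report and may need adjusting on another setup.

    # locate the Gluster NFS server process (the glusterfs instance serving NFS)
    NFS_PID=$(gluster volume status perftest nfs | awk '/NFS Server on localhost/ {print $NF}')

    # live view of the hottest symbols while throughput degrades
    perf top -p "$NFS_PID"

    # or record 60 seconds with call graphs for offline analysis
    perf record -g -p "$NFS_PID" -- sleep 60
    perf report --sort symbol,dso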
> >--- Additional comment from santosh pradhan on 2013-07-04 12:34:28 EDT --- > > >In my setup, out of my 4 servers (2x2 volume), In one of the brick server, I am seeing a different problem where two of the threads are consuming a lot of CPU time with following stack trace: > >I dont have debug symbols installed in the machine, ll do that to get the line no. > >1. ps -L -p <NFS pid> > > PID LWP TTY TIME CMD >25563 25563 ? 1-05:01:17 glusterfs >25563 25565 ? 00:00:00 glusterfs >25563 25566 ? 00:00:00 glusterfs >25563 25567 ? 00:00:00 glusterfs >25563 25574 ? 1-02:19:19 glusterfs >25563 25589 ? 00:00:00 glusterfs > >LWP 25563 and 25574 are consuming more time. > >2. top output: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >25563 root 20 0 4493m 2.9g 7732 R 200.1 37.5 3326:24 glusterfs > >3. stack trace: > >Thread 1 (Thread 0x7f34dcc6c700 (LWP 25563)): >#0 0x00007f34d86242a0 in afr_transaction_eager_lock_init () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so >#1 0x00007f34d862688d in afr_transaction () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so >#2 0x00007f34d861f427 in afr_do_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so >#3 0x00007f34d861fa3c in afr_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so >#4 0x00007f34d83e17d9 in dht_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/distribute.so >#5 0x00007f34d81a8c22 in wb_writev_helper () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so >#6 0x00000030b6e2f3a4 in call_resume () from /usr/lib64/libglusterfs.so.0 >#7 0x00007f34d81a9318 in wb_do_winds () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so >#8 0x00007f34d81aa302 in wb_process_queue () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so >#9 0x00007f34d81aac08 in wb_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so >#10 0x00007f34d3df23d7 in io_stats_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/debug/io-stats.so >#11 0x00007f34d3ba7027 in nfs_fop_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so >#12 0x00007f34d3bb4d3b in __nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so >#13 0x00007f34d3bbc913 in nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so >#14 0x00007f34d3bc4f8b in nfs3_fh_resolve_inode_done () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so >#15 0x00007f34d3bc505b in nfs3_fh_resolve_inode () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so >#16 0x00007f34d3bc5105 in nfs3_fh_resolve_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so >#17 0x00007f34d3bc5218 in nfs3_fh_resolve_root () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so >#18 0x00007f34d3bc3101 in nfs3_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so >#19 0x00007f34d3bc32b2 in nfs3svc_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so >#20 0x00000030b7208655 in rpcsvc_handle_rpc_call () from /usr/lib64/libgfrpc.so.0 >#21 0x00000030b72087a3 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0 >#22 0x00000030b72098d8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0 >#23 0x00007f34d96c5bd6 in ?? () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/rpc-transport/socket.so >#24 0x00007f34d96c74ed in ?? 
() from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/rpc-transport/socket.so >#25 0x00000030b6e5b387 in ?? () from /usr/lib64/libglusterfs.so.0 >#26 0x00000000004067c6 in main () > > >Thread 3 (Thread 0x7f34d94a4700 (LWP 25574)): >#0 0x00000035c9e0c182 in ?? () from /lib64/libpthread.so.0 >#1 0x00007f34d86241d5 in afr_fd_ctx_get () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so >#2 0x00007f34d86253dd in afr_delayed_changelog_post_op () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so >#3 0x00000030b6e2a230 in gf_timer_proc () from /usr/lib64/libglusterfs.so.0 >#4 0x00000035c9e07851 in start_thread () from /lib64/libpthread.so.0 >#5 0x00000035c96e890d in clone () from /lib64/libc.so.6 > > >Will continue looking into this. > >--- Additional comment from Ben England on 2013-07-08 12:50:16 EDT --- > >I reproduced the problem with just 2 servers instead of 4. So you don't need a huge configuration to do this. > >I'll try to cut down configuration even more than this. > >It appears that the client/server ratio matters here. With 1:1 ratio (2 servers, 2 clients), the problem does not occur, but with 2:1 (2 servers, 4 clients), the problem does occur. I think that NFS clients are bursty, they may overload the server with a large set of writes, and the server can't keep up with the long write-behind list. This would explain why O_DIRECT writes do not encounter the problem. > >I reproduced with /proc/sys/vm/nr_hugepages=0 on all systems, eager-lock off, and no special tuning of any kind. So there is no reason why QE can't reproduce. > >I'm going to try increasing vm.min_free_kbytes to see if it's some sort of kernel memory issue. > >As before, once the problem occurs, you CANNOT CLEANUP the processes on the client, they hang. YOu have to do a hard reset on the clients (I use ipmitool), reboot won't work, for example, you see: > >[root@gprfc032-10ge gluster_test]# ssh gprfc089 >ssh: connect to host gprfc089 port 22: Connection refused > >Which means that reboot hung while trying to unmount NFS, so it just becomes a useless machine. > >THe server volume can be stopped but it took 2 min for that to happen while write-behind.so worked its way through the backlog. > >Be sure your clients are not all mounting same server, this may prevent problem from occurring by limiting data flow from clients. > >You can get it to happen much quicker by restricting memory available to kernel as follows: > >echo N > /proc/sys/vm/nr_hugepages > >where N is number of 2-MB pages you want to reserve. This effectively steals memory away from the Linux buffer cache so the system enters a state where it's out of physical memory much more quickly. > >Delete old files before writing new ones, overwriting existing file may not give same behavior. > >--- Additional comment from santosh pradhan on 2013-07-09 08:44:15 EDT --- > >Thanks Ben for all your input. > >I tried with following config/workload: > >1. 1x2 dist-rep volume (2 node cluster: sev1 sev2: one brick in each server) > >2. 4 clients (c1, c2, c3, c4: c1/c2 mounting from sev1 and c3/c4 mounts from sev2) > > >3. iozone workload: > >When I tried with >"iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 64g -r 128 -c -e -t 4" > >I could hit the issue within 30mins or so. May be this generates more I/O than 128g per server. 
So I modified the command as below (for 128g per server): > >iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 32g -r 128 -c -e -t 4 > >iozone.cfg: >=========== >10.70.43.16 /mnt/nfsdir /usr/local/bin/iozone >10.70.43.17 /mnt/nfsdir /usr/local/bin/iozone >10.70.42.243 /mnt/nfsdir /usr/local/bin/iozone >10.70.42.252 /mnt/nfsdir /usr/local/bin/iozone > >per thread 32g, so total 32*4 = 128g >as replica total i/o workload for gluster = 256g >as 2 node/brick cluster, workload per brick = 256/2 = 128g > >Is my calculation for 128g write workload per sever correct? > >The test is running, still no issue with this workload. > >Please correct me if my understanding of generating the workload is wrong. > >Thanks, >Santosh > >--- Additional comment from santosh pradhan on 2013-07-09 10:56:04 EDT --- > >(In reply to Ben England from comment #5) >> I reproduced the problem with just 2 servers instead of 4. So you don't >> need a huge configuration to do this. >> >> I'll try to cut down configuration even more than this. >> >> It appears that the client/server ratio matters here. With 1:1 ratio (2 >> servers, 2 clients), the problem does not occur, but with 2:1 (2 servers, 4 >> clients), the problem does occur. I think that NFS clients are bursty, >> they may overload the server with a large set of writes, and the server >> can't keep up with the long write-behind list. This would explain why >> O_DIRECT writes do not encounter the problem. >> > >Could you specify the iozone command line you used for (2 server : 4 client)-setup. > >I tried following command with a 1 x 2 (dist-repl) vol spread across 2 node cluster (one brick in each server) and 4 clients (mount ratio: 2 clients per server): > >iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 32g -r 128 -c -e -t 4 > >It just worked. > >But when I changed the size to 64g per thread (-s 64g), I could reproduce the issue e.g. > >iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 64g -r 128 -c -e -t 4 > >I am not clear how to calculate the exact iozone workload (say 128g) per server in 1x2 or 2x2 dist-rep volumes. > >Could you clarify this part? > >Thanks, >Santosh > >--- Additional comment from santosh pradhan on 2013-07-09 10:59:36 EDT --- > > >Forgot to mention that, I turned the eager-locking OFF in my setup. > >--- Additional comment from santosh pradhan on 2013-07-09 13:29:42 EDT --- > > >Some more update: > >set up: 2 node cluster with 4 clients > >iozone command used: >iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 64g -r 128 -c -e -t 4 > >case: 1 >(simple distributed vol on 2 node cluster) >cluster eager-locking ON > >iozone completed successfully. > >case: 2 >(distributed replicated 1x2 volume) >cluster eager locking ON > >iozone completed successfully. > >case: 3 >(dist-rep 1x2 vol) >cluster eager locking OFF >Already tried once where the issue was reproduced. Will try again tomorrow. > >--- Additional comment from santosh pradhan on 2013-07-10 13:52:00 EDT --- > > >Today I tried couple of times (on 1x2 vol with eager lock OFF/ON) but could not reproduce the issue in my setup. Also we (Anand and me) had a discussion with Ben how to take it forward. I requested Ben if he can reproduce the issue in my setup so that debugging would be easier. Otherwise I may need to debug in his setup. Also we ll follow up tomorrow evening 5pm (IST). > >I tried with latest build (3.4.0.12rhs.beta3). 
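To make the per-server workload calculation in Santosh's comment above explicit, a small sketch using only the numbers already stated there (4 iozone threads at 32 GB each, replica 2, one brick per server on a 2-node cluster):

    THREADS=4          # iozone -t 4
    PER_THREAD_GB=32   # iozone -s 32g
    REPLICA=2          # replica count of the volume
    BRICKS=2           # one brick per server, two servers

    TOTAL_GB=$(( THREADS * PER_THREAD_GB ))      # 128 GB written by the clients
    ON_BRICK_GB=$(( TOTAL_GB * REPLICA ))        # 256 GB landing on the bricks
    PER_BRICK_GB=$(( ON_BRICK_GB / BRICKS ))     # 128 GB per brick / per server
    echo "client data ${TOTAL_GB}G, per brick ${PER_BRICK_GB}G"

On that arithmetic, doubling -s to 64g doubles the per-brick load to 256 GB, which is consistent with the 64g runs hitting the problem sooner.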
> >--- Additional comment from Ben England on 2013-07-12 08:15:03 EDT --- > >Santosh and Anand S have reproduced the problem on BAGL, we discussed it just now. I created a dedicated BAGL config using test driver VM at 10.16.159.187, 8 clients gprfc0{89,90,91,92,94,95,96.97} and servers gprfs0{17,18,19,20}. I sent Santosh and Anand S instructions on how to restart the system after it hangs. So it appears that 10-GbE network is required to overdrive the servers to the point where this problem occurs. This configuration is assigned to them and anyone else working on this problem for as long as they need it, see BAGL page. > >--- Additional comment from santosh pradhan on 2013-07-15 07:51:31 EDT --- > > >1. Stacktrace for thread consuming the most CPU time: >================================================== > >Thread 1 (Thread 0x7f8bfb7e1700 (LWP 5989)): >#0 0x00007f8bf6afcdb7 in wb_requests_overlap () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/performance/write-behind.so >#1 0x00007f8bf6afcf08 in wb_wip_has_conflict () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/performance/write-behind.so >#2 0x00007f8bf6afd26b in __wb_pick_winds () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/performance/write-behind.so >#3 0x00007f8bf6b002d9 in wb_process_queue () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/performance/write-behind.so >#4 0x00007f8bf6b00c08 in wb_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/performance/write-behind.so >#5 0x00007f8bf68ec3d7 in io_stats_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/debug/io-stats.so >#6 0x00007f8bf66a1007 in nfs_fop_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/nfs/server.so >#7 0x00007f8bf66aecbb in __nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/nfs/server.so >#8 0x00007f8bf66b6893 in nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/nfs/server.so >#9 0x00007f8bf66bef0b in nfs3_fh_resolve_inode_done () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/nfs/server.so >#10 0x00007f8bf66befdb in nfs3_fh_resolve_inode () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/nfs/server.so >#11 0x00007f8bf66bf085 in nfs3_fh_resolve_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/nfs/server.so >#12 0x00007f8bf66bf198 in nfs3_fh_resolve_root () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/nfs/server.so >#13 0x00007f8bf66bd081 in nfs3_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/nfs/server.so >#14 0x00007f8bf66bd232 in nfs3svc_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/nfs/server.so >#15 0x0000003556408655 in rpcsvc_handle_rpc_call () from /usr/lib64/libgfrpc.so.0 >#16 0x00000035564087a3 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0 >#17 0x00000035564098d8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0 >#18 0x00007f8bf823abd6 in ?? () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/rpc-transport/socket.so >#19 0x00007f8bf823c4ed in ?? () from /usr/lib64/glusterfs/3.4.0.12rhs.beta3/rpc-transport/socket.so >#20 0x000000355505b407 in ?? () from /usr/lib64/libglusterfs.so.0 >#21 0x00000000004067c6 in main () > >2. 
The code flow is: >====================== > >__wb_pick_winds() => > wb_liability_has_conflict() => > => wb_requests_conflict() > > >void >__wb_pick_winds (wb_inode_t *wb_inode, list_head_t *tasks, > list_head_t *liabilities) >{ > wb_request_t *req = NULL; > wb_request_t *tmp = NULL; > > list_for_each_entry_safe (req, tmp, &wb_inode->todo, todo) { > if (wb_liability_has_conflict (wb_inode, req)) > continue; <================== LOOP pt. 2 > >2. > >gf_boolean_t >wb_liability_has_conflict (wb_inode_t *wb_inode, wb_request_t *req) >{ > wb_request_t *each = NULL; > > list_for_each_entry (each, &wb_inode->liability, lie) { > if (wb_requests_conflict (each, req)) > return _gf_true; <================= LOOP pt. 1 > } > >I guess something goes wrong here and control goes back-n-forth producing a loop and hugging one complete CPU and never recovers. > >Need to find out, under what condition, it ll hit the overlapping condition, needs to be discussed with Vijay/Avati. > >Once it hits the overlapping condition (wb_requests_conflict() returns TRUE), it never bails out from there, which I guess is one of the major problem now. > >Nest step: >========= >I am going to instrument the code in wb-xlator to dump some of the data data structures of wb-xlator. > >Review the code in wb-xlator. > >Thanks, >Santosh > >--- Additional comment from Ben England on 2013-07-16 05:56:54 EDT --- > >have you tried disabling the write-behind translator? Why do we need it for NFS? the NFS client will aggregate write requests up to 64-KB RPC size, good enough? > >--- Additional comment from santosh pradhan on 2013-07-16 07:51:33 EDT --- > >Memory footprint of one of the server came down from 14G to 1.6G. Other server is still running high with ~ 11G. All are in RSS. > >I profiled one of the server (where memory usage of NFS process came down from 14G to 1.6G RSS) > >This is perf data: > >Samples: 29M of event 'cycles', Event count (approx.): 21872713449727 >+ 81.49% glusterfs write-behind.so [.] wb_requests_overlap >+ 4.46% glusterfs write-behind.so [.] wb_wip_has_conflict >+ 3.89% glusterfs libpthread-2.12.so [.] 0x000000000000ed31 >+ 1.99% glusterfs write-behind.so [.] wb_requests_overlap@plt >+ 0.85% glusterfs write-behind.so [.] __wb_pick_unwinds >+ 0.81% glusterfs [kernel.kallsyms] [k] copy_user_generic_string >+ 0.54% glusterfs [kernel.kallsyms] [k] compaction_alloc >+ 0.48% glusterfs afr.so [.] afr_transaction_eager_lock_init >+ 0.28% glusterfs write-behind.so [.] __wb_pick_winds >+ 0.22% glusterfs libc-2.12.so [.] __memset_sse2 >+ 0.21% glusterfs libc-2.12.so [.] _int_malloc >+ 0.18% glusterfs write-behind.so [.] __wb_preprocess_winds >+ 0.17% glusterfs libgfrpc.so.0.0.0 [.] __saved_frame_copy >+ 0.16% glusterfs libc-2.12.so [.] memcpy >+ 0.15% glusterfs write-behind.so [.] wb_fulfill_err > >There are a lot of Errors/Warnings in NFS log (both the NFS servers). Needs to be looked at as well. iozone looks like hung (not finishing). > >Will keep updated. > >--- Additional comment from santosh pradhan on 2013-07-16 07:57:37 EDT --- > >(In reply to Ben England from comment #13) >> have you tried disabling the write-behind translator? > >[Santosh] Yes, I tried disabling write-behind xlator. The behavior was really bad, the memory usage is going up very quickly (I saw upto 24+ GB before I killed them). > > > Why do we need it for > >[Santosh] wb-xlator really helps, though we see a problem with huge write from multiple clients. May be Vijay/Avati ll have a better answer for this question. I would defer it for Vijjay. > >> NFS? 
the NFS client will aggregate write requests up to 64-KB RPC size, >> good enough? > >[Santosh] I dont this it ll help. (from my experiment with write-behind off). Thats why you see huge memory footprint of the NFS server in 2.0 i.e. BZ 965924 (in fact that's not memory leak as RSS is still huge which means NFS server without write-behind is too loaded to handle the I/O burst). > >--- Additional comment from Ben England on 2013-07-16 09:25:29 EDT --- > >Santosh, you are right, turning write-behind off does not have desired effect. Throughput was about 50% of what it was with write-behind on. And the memory leak still happens. What confused me is that the bottleneck appears to be in the write behind translator, but if you turn write-behind off, the bottleneck moves to somewhere else (libgfrpc saved_frame_{copy,get}). > >What strikes me as strange is that it just takes in data from NFS client without any flow control. NFS client should be asking for a COMMIT of data to persistent storage, this should cause clients to refrain from pushing additional write data, hence there should be no massive overcommit of memory to glusterfs process. > >So I captured a protocol trace on c089, and now I'm really confused. There are no NFS headers in this trace. > >http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/bigbend/glnfs-c089-wr.tcpdump.gz > >There are a lot of packets but none of them are NFS, they are all TCP, looks like some kind of TCP flow control problem where TCP never makes progress on the transmit window? But I'm not really sure yet. I also see lots of these: > >[root@gprfc089 ~]# tail /var/log/messages >Jul 16 09:15:58 gprfc089 kernel: nfs: server gprfs017-10ge not responding, still trying > >I'll try starting tracing before iozone run goes too far. > >--- Additional comment from santosh pradhan on 2013-07-16 10:55:50 EDT --- > > > >what looks weird to me is: perf data i.e. wb_requests_overlap() routine which is just fetching the passed parameters (couple ptrs to structures of type wb_request_t), does some arithmetic operation along with a comparison. > >perf annotate confuses me more: > > Disassembly of section .text: > > 0000000000002db0 <wb_requests_overlap>: > 1.25 mov 0x4b8(%rdi),%rax > 85.37 mov 0x4c0(%rdi),%rcx <===== This looks like loading ordering.size > > 5.51 test %rax,%rax > 0.00 jne 48 > >I installed the debug symbols, but still perf annotate shows the assembly instead of source lines. > >Tomorrow, will continue looking into it. > >--- Additional comment from Ben England on 2013-07-16 18:25:04 EDT --- > >My only hypothesis is that NFS server does not handle COMMIT request from NFS client properly. Otherwise NFS server could not accumulate so many I/O requests to scan in write-behind translator code. > >re .-1: I think all this perf output means is that it's executing this instruction millions of times and this instruction is just the place where the cache miss occurs perhaps. It doesn't mean that the program is doing something wrong there. It's a hint that your list of requests is getting excessively long. > >iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0 -+n -r 16384k -o -s 32g -t 16 > >Does not hang or leak memory at all. "-o" opens file with O_SYNC, so after every 16-MB record is written to file by iozone, the write will block until the data actually is written to storage. This means the amount of write requests that are uncommitted is much smaller perhaps? 
> >I tried capturing protocol traces on an NFS server gprfs017, while running single-threaded write to make trace analysis easier, but it did not capture the NFS write traffic except for a couple of early requests. It did capture lots of gluster traffic. So protocol tracing isn't going to help much here. > >--- Additional comment from santosh pradhan on 2013-07-16 23:26:34 EDT --- > >(In reply to Ben England from comment #18) >> My only hypothesis is that NFS server does not handle COMMIT request from >> NFS client properly. Otherwise NFS server could not accumulate so many I/O >> requests to scan in write-behind translator code. >> >> re .-1: I think all this perf output means is that it's executing this >> instruction millions of times and this instruction is just the place where >> the cache miss occurs perhaps. It doesn't mean that the program is doing >> something wrong there. It's a hint that your list of requests is getting >> excessively long. > > >[Santosh] Yes, probably you are right. I was planning to let the perf another go with L1-cache event analysis to check that behaviour. > >> >> iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0 >> -+n -r 16384k -o -s 32g -t 16 >> >> Does not hang or leak memory at all. "-o" opens file with O_SYNC, so after >> every 16-MB record is written to file by iozone, the write will block until >> the data actually is written to storage. > >[Santosh] Wow, great. At least some relief :) Thanks for the experiment. > >> This means the amount of write >> requests that are uncommitted is much smaller perhaps? > >[Santosh] Yes, probably. It does not accumulate too many outstanding I/O reqs. > >> >> I tried capturing protocol traces on an NFS server gprfs017, while running >> single-threaded write to make trace analysis easier, but it did not capture >> the NFS write traffic except for a couple of early requests. It did capture >> lots of gluster traffic. So protocol tracing isn't going to help much here. > >[Santosh] Thanks again for all your effort. > >I will keep looking at it today and keep you posted about the progress. > >Best, >Santosh > >--- Additional comment from Anand Subramanian on 2013-07-17 01:14:51 EDT --- > >Vijay mentioned that there is a way to get a specific xlator dump, in this case the wb xlator only dump. There is supposedly a filter for this. He couldn't recall that as its been a while he used it but we can try to see if we can use that to get all wb stats. > >Regarding Ben's comment: >> re .-1: I think all this perf output means is that it's executing this >> instruction millions of times and this instruction is just the place where >> the cache miss occurs perhaps. It doesn't mean that the program is doing >> something wrong there. It's a hint that your list of requests is getting >> excessively long. > >(gdb) p *wb_inode >$2 = {window_conf = 1048576, window_current = 92311552, transit = 673583104, all = {next = 0xd12ec60, prev = 0x190cfc6a0}, todo = {next = 0x1339a7800, prev = 0x190cfc6b0}, liability = {next = 0xd12ec80, prev = 0xd940340}, temptation = { > next = 0xd9b7d00, prev = 0x190cfc6c0}, wip = {next = 0xd12ecb0, prev = 0x127cfe160}, gen = 17, lock = 0, this = 0x2578230} > >The transit count seems to be the total size of the write fops in transit. The size from one of the runs is 673583104, when in the problem state. >Also, from the state dump, callcount was ~129000 which points to a huge backlog of writes pending (which would imply uncommitted to disk and unacknowledged to client/iozone). 
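A sketch of how state-dump figures like the ones quoted above (window_current, transit, call count) can be collected. It assumes this release's statedump command accepts the nfs target and writes to the default statedump directory, so the exact path, file name, and section labels may differ.

    # ask the Gluster NFS server process to dump its internal state,
    # including per-inode write-behind queues and the pending call pool
    gluster volume statedump perftest nfs

    # dumps are written as glusterdump.<pid>.dump.<timestamp> in the statedump dir
    ls -lt /var/run/gluster/ | head
    grep -n 'write-behind' /var/run/gluster/glusterdump.*.dump.* | head -20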
> >Can one of you confirm if write-behind option flush_behind is on/off please? > >--- Additional comment from Anand Subramanian on 2013-07-17 01:28:03 EDT --- > >By default flush_behind is on, just wanted to make sure it was not turned off for the last couple of runs. Thanks. > >--- Additional comment from Anand Subramanian on 2013-07-17 01:39:01 EDT --- > >Santosh, > >Am not able to attach to the Vnc. If you are running a test, could you run nfsstat -rc on one of the clients where iozone is running. If nfsstat outputs retrans > 0, it probably implies the server is not able to handle all the client requests. Just wanted to check this. > >Anand > >--- Additional comment from Anand Subramanian on 2013-07-17 01:52:33 EDT --- > >Does it make sense to tweak the defaults for the recv and write socket buffers and try a run with increased values? Defaults seem quite low: > >[root@gprfs017 blr-team]# cat /proc/sys/net/core/rmem_max >229376 > >same for the wmem_max. > >--- Additional comment from santosh pradhan on 2013-07-17 03:11:47 EDT --- > > >I think we should try with network subsystem tuning. (in whole setup e.g.gprfs017, gprfs018, gprfc089, gprfc090, gprfc091 and gprfc092). > >This is a nice link to tune the 10GE NIC for better performance: > >http://fasterdata.es.net/host-tuning/linux/ > >I am able to attach to vnc. I am not running anything but going to run another instance of iozone. (with the gluster-debug bundles installed). So, that perf annotate would point to the source line. Currently its just mapping to assembly. > >I have not touched the flush_behind parameter of wb-xlator. > >--- Additional comment from santosh pradhan on 2013-07-17 08:12:05 EDT --- > > >1. I discussed with Raghavendra Gowdappa about the flush_behind paramater, he says setting it to OFF will make the performance worse. > >2. I ran the iozone with smaller load (16g per thread with 16 threads) i.e. > >iozone -+m /var/blr-team/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0 -+n -r 32k -s 16g -t 16 > >I saw something different behaviour this time. Both the server went upto ~ 15G (RSS) and then came down to 2.5G RSS but iozone did not finish. I am not sure but problem seems to be somewhere else. The NFS client might be doing another layer of write-behind for its performance (which might be reason for wb_requests_overlap() to be hit more ). > >3. I am not able to map the perf data to source line this time as well. Already installed the debug symbols in gprfs017 (server). > >4. In both the servers (17 and 18), in /var/log/messages: I saw something like this, not sure if this is related. > >Jul 17 08:56:52 gprfs018 GlusterFS[22756]: [2013-07-17 08:56:52.375118] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-0: cannot lookup the saved frame corresponding to xid (450331) >Jul 17 08:56:53 gprfs018 GlusterFS[22756]: [2013-07-17 08:56:53.366492] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-1: cannot lookup the saved frame corresponding to xid (463375) > >5. In one of the clients i.e. gprfc092 in /var/log/messages: > >Jul 17 07:50:17 gprfc092 kernel: INFO: task iozone:3160 blocked for more than 120 seconds. >Jul 17 07:50:17 gprfc092 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
>Jul 17 07:50:17 gprfc092 kernel: iozone D 0000000000000000 0 3160 1 0x00000080 >Jul 17 07:50:17 gprfc092 kernel: ffff880831b3fe08 0000000000000082 ffff880830566aa0 ffff880830566aa0 >Jul 17 07:50:17 gprfc092 kernel: ffff880830566aa0 ffff88083fee6140 ffff880830566aa0 0000001081167330 >Jul 17 07:50:17 gprfc092 kernel: ffff880830567058 ffff880831b3ffd8 000000000000fb88 ffff880830567058 >Jul 17 07:50:17 gprfc092 kernel: Call Trace: >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8118f221>] ? path_put+0x31/0x40 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150efbe>] __mutex_lock_slowpath+0x13e/0x180 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150ee5b>] mutex_lock+0x2b/0x50 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191e39>] do_unlinkat+0xa9/0x1b0 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff810dc8f7>] ? audit_syscall_entry+0x1d7/0x200 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191f56>] sys_unlink+0x16/0x20 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b >Jul 17 07:50:17 gprfc092 kernel: INFO: task iozone:3167 blocked for more than 120 seconds. >Jul 17 07:50:17 gprfc092 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >Jul 17 07:50:17 gprfc092 kernel: iozone D 0000000000000000 0 3167 3166 0x00000080 >Jul 17 07:50:17 gprfc092 kernel: ffff8807eb81fdd8 0000000000000086 0000000000000000 ffff8807eb81fd68 >Jul 17 07:50:17 gprfc092 kernel: ffff8807eb81fd88 ffffffffa0275688 ffff8807eb81fdb8 0000000000000286 >Jul 17 07:50:17 gprfc092 kernel: ffff880831d2e5f8 ffff8807eb81ffd8 000000000000fb88 ffff880831d2e5f8 >Jul 17 07:50:17 gprfc092 kernel: Call Trace: >Jul 17 07:50:17 gprfc092 kernel: [<ffffffffa0275688>] ? nfs_refresh_inode_locked+0x468/0xce0 [nfs] >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150efbe>] __mutex_lock_slowpath+0x13e/0x180 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8121cd6f>] ? security_inode_permission+0x1f/0x30 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150ee5b>] mutex_lock+0x2b/0x50 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8118fabc>] vfs_unlink+0x5c/0xe0 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8118e84a>] ? lookup_hash+0x3a/0x50 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191e85>] do_unlinkat+0xf5/0x1b0 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff810dc8f7>] ? audit_syscall_entry+0x1d7/0x200 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191f56>] sys_unlink+0x16/0x20 >Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b >Jul 17 07:59:51 gprfc092 kernel: nfsacl: server gprfs018-10ge not responding, still trying > > >I think for this kind of workload, we need to tune: > >1. In the client side: > a) the options passed to mount command > b) N/W tuning for 10GE and high workload >2. In the server side: >N/W tuning to handle the high workload > >3. Not sure but we may need to look the windows size in wb-xlator. > >Thanks, >Santosh > >--- Additional comment from RHEL Product and Program Management on 2013-07-17 22:32:17 EDT --- > >This bug report has Keywords: Regression or TestBlocker. > >Since no regressions or test blockers are allowed between releases, >it is also being identified as a blocker for this release. > >Please resolve ASAP. > >--- Additional comment from Ben England on 2013-07-18 12:32:56 EDT --- > >Rich Fortier suggested that if we've had the problem all along then we don't need to fix the problem, since no customer experienced it. It turns out we didn't have the problem all along. 
The same workload completed successfully many times during performance measurements I took for Anshi RHS 2.0 U4. So it appears that a problem was introduced between RHS 2.0U4 and Big Bend. You can see iozone logs and stats during Anshi runs in: > >http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-cluster/ben-test-driver/shared/benchmarks/gluster_test/logs/anshi2/ > >--- Additional comment from santosh pradhan on 2013-07-19 10:37:07 EDT --- > > > >1. Wrt RHS 2.0 Update 4, using the same testbed from Ben, we see the issue (reported for Big Bend) both with and without write-behind. > >Ben : please check the testbed you gave us. The last run was with the write-behind xlator enabled for a U4 run. It would be great if you could take a few minutes to take a look and see if there is anything missing in the configuration from what you tested U4 with (speaking of the successful iozone run(s)). > >2. Do you also remember the configuration (gluster conf mainly) when you ran iozone successfully. We did go through your iozone results but we are unsure if write-behind was enabled or not for those runs. Which version of RHEL was in your config? > >3. Please note that in the testbed you shared with us, all of the following seem to hang: >(a) U4 with write-behind >(b) U4 without write-behind (glusterfs-nfs and glusterfsd both are OOM killed) >(c) Big Bend release 3.4.0.12rhs.beta3/ > >4. In our U4 runs, we noticed from iostat output that initially when the iozone workload started, the queue-len await times (not looking at svctm) are bad at times: >Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util >sda 0.00 1.67 0.00 1.33 0.00 21.33 16.00 0.00 0.75 0.75 0.10 >sdb 0.00 0.00 0.00 951.00 0.00 795400.00 836.38 83.86 88.18 0.77 73.67 >dm-0 0.00 0.00 0.00 2.67 0.00 21.33 8.00 0.00 0.75 0.38 0.10 >dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >dm-2 0.00 698.67 0.00 951.00 0.00 795400.00 836.38 83.96 88.29 0.78 73.77 >dm-3 0.00 0.00 0.00 1649.67 0.00 795400.00 482.16 146.38 88.73 0.45 73.77 >dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > > >Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util >sda 0.00 0.67 8.67 1.00 298.67 13.33 32.28 0.05 5.24 4.79 4.63 >sdb 0.00 0.00 0.00 1.33 0.00 1365.33 1024.00 0.15 737.50 43.75 5.83 >dm-0 0.00 0.00 8.33 1.67 296.00 13.33 30.93 0.05 4.97 4.63 4.63 >dm-1 0.00 0.00 0.33 0.00 2.67 0.00 8.00 0.00 9.00 9.00 0.30 >scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >dm-2 0.00 0.00 0.00 1.33 0.00 1365.33 1024.00 0.15 737.50 43.75 5.83 >dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 0.00 5.83 > >Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util >sda 0.00 0.67 0.00 1.33 0.00 13.33 10.00 0.00 0.00 0.00 0.00 >sdb 0.00 0.00 0.00 820.00 0.00 685968.00 836.55 69.78 85.10 0.68 55.37 >dm-0 0.00 0.00 0.00 1.67 0.00 13.33 8.00 0.00 0.00 0.00 0.00 >dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >dm-2 0.00 602.67 0.00 820.00 0.00 685968.00 836.55 69.86 85.19 0.68 55.37 >dm-3 0.00 0.00 0.00 1422.67 0.00 685968.00 482.17 121.43 85.30 0.39 55.37 > > >If the latency is ~43ms, then the backpressure could build on the wb xlator as well. We are trying to look at the code and figure out the explanation you wanted wrt nfs commits. > >5. It would be great if you could share your RHS-2.0-U5 run results. 
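Because the runs above toggle several volume options (eager-lock on/off, flush-behind, write-behind window) and later comments question which configuration a given run actually used, a sketch for recording the configuration alongside each run; the option names are the standard Gluster ones and the set commands only make the documented defaults explicit.

    # snapshot the exact configuration next to the iozone/dd logs for this run
    LOG=run-$(date +%Y%m%d-%H%M%S).config
    { gluster volume info perftest
      gluster volume status perftest nfs
      rpm -qa | grep -i glusterfs
    } > "$LOG"

    # the options this thread keeps varying; anything not listed under
    # "Options Reconfigured" in 'volume info' is still at its default
    gluster volume set perftest cluster.eager-lock on
    gluster volume set perftest performance.flush-behind on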
> >--- Additional comment from Ben England on 2013-07-19 16:12:48 EDT --- > >I'm not happy, I just reproduced problem with Anshi glusterfs-3.3.0.7 RPMs. I tried to run everything the same way I did before, when it clearly did not hang. > >I changed my test script so it always logs "rpm -qa" and "gluster volume info", hopefully in the future this will answer questions like this. > >So it appears that we always had the problem. Are you sure that no customers have encountered this already? > >You don't need iozone to reproduce the problem. A single dd command will do it. This is the simplest write application that I know so every Gluster user will see this problem when writing large files. > >This command hung after only 9 GB of data. The replicas were not on the NFS server. You only need 2 servers, 1 client to reproduce this. > >ssh gprfc081 'f=/mnt/glnfs/x.dd; rm -fv $f ; dd if=/dev/zero of=$f bs=16384k count=1024' > >But this command never hung and never leaked memory - glusterfs process did not grow its virtual memory at all. This is dd utility equivalent of iozone's "-o" option to use the O_SYNC flag when opening the file. Every write request is flushed to storage before it completes. > >ssh gprfc081 'f=/mnt/glnfs/x.dd; rm -fv $f ; dd if=/dev/zero of=$f bs=16384k count=1k oflag=sync' >removed `/mnt/glnfs/x.dd' >1024+0 records in >1024+0 records out >17179869184 bytes (17 GB) copied, 137.944 s, 125 MB/s > >Again this suggests NFS COMMIT processing is not working. > >--- Additional comment from santosh pradhan on 2013-07-23 06:00:04 EDT --- > > >gluster version: Big bend (3.4.0.12rhs.beta5-2) > >Tuning: >1. Server: vm.min_free_kbytes = 1048576 > >2. client: >Mount nfs with bigger rsize/wsize: >mount -t nfs -o nfsvers=3,tcp,rsize=32768,wsize=32768 gprfs017-10ge:/perftest /mnt/glnfs > >3. In iozone, set the record size smaller to avoid huge rpc retransmissions: > >(-r 4k with 16 threads) > >Verified in the client side (al clients), no retransmission happens. glusterfs(d) processes also came down in memory size. > >With this, iozone finished succesfully, here is the output: > >iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 32g -t 16 > Iozone: Performance Test of File I/O > Version $Revision: 3.392 $ > Compiled for 64 bit mode. > Build: linux-AMD64 > > Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins > Al Slater, Scott Rhine, Mike Wisner, Ken Goss > Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, > Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner, > Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone, > Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root, > Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer. > Ben England. > > Run began: Tue Jul 23 05:27:36 2013 > > Network distribution mode enabled. > Hostname = 10.16.159.187 > Setting no_unlink > Include close in write timing > Include fsync in write timing > No retest option selected > Record Size 4 KB > File size set to 33554432 KB > Command line used: iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 32g -t 16 > Output is in Kbytes/sec > Time Resolution = 0.000001 seconds. > Processor cache size set to 1024 Kbytes. > Processor cache line size set to 32 bytes. > File stride size set to 17 * record size. 
> Throughput test with 16 processes > Each process writes a 33554432 Kbyte file in 4 Kbyte records > > Test running: > Children see throughput for 16 initial writers = 416480.96 KB/sec > Min throughput per process = 24609.73 KB/sec > Max throughput per process = 27953.86 KB/sec > Avg throughput per process = 26030.06 KB/sec > Min xfer = 29644520.00 KB > Child[0] xfer count = 30135464.00 KB, Throughput = 24888.27 KB/sec > Child[1] xfer count = 29906884.00 KB, Throughput = 24827.08 KB/sec > Child[2] xfer count = 32807456.00 KB, Throughput = 27329.24 KB/sec > Child[3] xfer count = 32005172.00 KB, Throughput = 26662.56 KB/sec > Child[4] xfer count = 30035532.00 KB, Throughput = 24817.52 KB/sec > Child[5] xfer count = 29735420.00 KB, Throughput = 24685.27 KB/sec > Child[6] xfer count = 32017976.00 KB, Throughput = 26673.22 KB/sec > Child[7] xfer count = 31608052.00 KB, Throughput = 26331.74 KB/sec > Child[8] xfer count = 30064660.00 KB, Throughput = 24827.14 KB/sec > Child[9] xfer count = 30332952.00 KB, Throughput = 25183.77 KB/sec > Child[10] xfer count = 33040404.00 KB, Throughput = 27521.22 KB/sec > Child[11] xfer count = 33084820.00 KB, Throughput = 27562.14 KB/sec > Child[12] xfer count = 30635224.00 KB, Throughput = 25307.52 KB/sec > Child[13] xfer count = 29644520.00 KB, Throughput = 24609.73 KB/sec > Child[14] xfer count = 32774976.00 KB, Throughput = 27300.69 KB/sec > Child[15] xfer count = 33554432.00 KB, Throughput = 27953.86 KB/sec > > > >iozone test complete. > > >DRC patch is yet to be tested. > > >Thanks, >Santosh > >--- Additional comment from santosh pradhan on 2013-07-23 08:03:40 EDT --- > > >vm.min_free_kbytes = 131072 (Reduced to 128MB instead of 1GB) >mount -t nfs -o nfsvers=3,tcp,rsize=32768,wsize=32768 >iozone record size = 16k > >This time, it passed as well: > >iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 16k -s 32g -t 16 > Iozone: Performance Test of File I/O > Version $Revision: 3.392 $ > Compiled for 64 bit mode. > Build: linux-AMD64 > > Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins > Al Slater, Scott Rhine, Mike Wisner, Ken Goss > Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, > Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner, > Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone, > Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root, > Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer. > Ben England. > > Run began: Tue Jul 23 06:20:13 2013 > > Network distribution mode enabled. > Hostname = 10.16.159.187 > Setting no_unlink > Include close in write timing > Include fsync in write timing > No retest option selected > Record Size 16 KB > File size set to 33554432 KB > Command line used: iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 16k -s 32g -t 16 > Output is in Kbytes/sec > Time Resolution = 0.000001 seconds. > Processor cache size set to 1024 Kbytes. > Processor cache line size set to 32 bytes. > File stride size set to 17 * record size. 
> Throughput test with 16 processes > Each process writes a 33554432 Kbyte file in 16 Kbyte records > > Test running: > Children see throughput for 16 initial writers = 425322.62 KB/sec > Min throughput per process = 26312.52 KB/sec > Max throughput per process = 26887.80 KB/sec > Avg throughput per process = 26582.66 KB/sec > Min xfer = 31964480.00 KB > Child[0] xfer count = 32900256.00 KB, Throughput = 26468.06 KB/sec > Child[1] xfer count = 31976400.00 KB, Throughput = 26323.77 KB/sec > Child[2] xfer count = 33413312.00 KB, Throughput = 26750.23 KB/sec > Child[3] xfer count = 33157936.00 KB, Throughput = 26559.38 KB/sec > Child[4] xfer count = 33115824.00 KB, Throughput = 26636.75 KB/sec > Child[5] xfer count = 31964480.00 KB, Throughput = 26317.89 KB/sec > Child[6] xfer count = 33021152.00 KB, Throughput = 26450.86 KB/sec > Child[7] xfer count = 33114976.00 KB, Throughput = 26529.80 KB/sec > Child[8] xfer count = 33304128.00 KB, Throughput = 26887.80 KB/sec > Child[9] xfer count = 32330704.00 KB, Throughput = 26619.55 KB/sec > Child[10] xfer count = 33297312.00 KB, Throughput = 26662.14 KB/sec > Child[11] xfer count = 33161376.00 KB, Throughput = 26564.73 KB/sec > Child[12] xfer count = 32718272.00 KB, Throughput = 26312.52 KB/sec > Child[13] xfer count = 32231040.00 KB, Throughput = 26529.39 KB/sec > Child[14] xfer count = 33499104.00 KB, Throughput = 26827.74 KB/sec > Child[15] xfer count = 33554432.00 KB, Throughput = 26882.01 KB/sec > > > >iozone test complete. > >--- Additional comment from Ben England on 2013-07-23 14:00:48 EDT --- > >sorry, I can't reproduce. I tried all your suggestions, it hangs the same way it did before. However, there are no stack traces, which is consistent with fact that RHEL6.4 isn't as vulnerable to memory fragmentation problem. > >For example, gprfc090 "nfsstat -c" shows this, and gprfc090 was rebooted immediately before this test was run so counters were all zeroed. >Counters stop changing when it hangs, and /var/log/messages says NFS server is not responding. NFS server shows 7 GB in glusterfs and it's using 1 core up. > >Jul 23 13:31:28 gprfc090 kernel: nfs: server gprfs018-10ge not responding, still trying > >If we analyze the nfs counters, we have ~2.6 million writes x 32 KB/write = ~83 GB of writes, which makes sense -- iozone has 4 threads running on this client, each is trying to write 32 GB for a total of 128 GB. The number of commit requests is 68 , so we have roughly 1 GB per commit request! This may be the problem. If we issue a COMMIT while system is busy, or perhaps multiple commits from different clients, could it timeout? > >I don't know of a way to increase the COMMIT frequency on the client side. I thought fs.nfs.nfs_congestion_kb would do it but it doesn't work. > >Client rpc stats: >calls retrans authrefrsh >2657537 25597 2539831 > >Client nfs v3: >null getattr setattr lookup access readlink >0 0% 12 0% 0 0% 20 0% 20 0% 0 0% >read write create mkdir symlink mknod >0 0% 2658483 99% 12 0% 0 0% 0 0% 0 0% >remove rmdir rename link readdir readdirplus >0 0% 0 0% 0 0% 0 0% 0 0% 0 0% >fsstat fsinfo pathconf commit >0 0% 4 0% 2 0% 68 0% > >--- Additional comment from Ben England on 2013-07-23 17:36:17 EDT --- > >I now have some proof that this problem is in Gluster-NFS implementation and needs to be fixed there. 
> >Jeff Layton helped me figure out how to get the NFS COMMIT frequency from the client to increase -- I set this on all the clients: > ># sysctl -w vm.dirty_bytes=262144000 vm.dirty_background_bytes=65536000 > >When I do this, the excessive glusterfs memory consumption does not occur as much. > >By default, they use vm.dirty_ratio and vm.dirty_background_ratio, which default to 20% and 10% of RAM respectively, so for a 64-GB system that would be 12 GB and 6 GB respectively. > >So Gluster NFS translator is in fact flushing when COMMIT arrives, they just weren't arriving often enough to keep the write-behind list from getting excessively long. With this setting, I saw glusterfs grow to 2.7 GB VM size, but it still seemed to be able to function, and the test completed, even with a record size of 16384 KB. Now I'm going to remove some of the other tuning and verify that it still works, to confirm the hypothesis. Sorry that it takes a long time. > >But you cannot expect NFS clients to tune vm.dirty_bytes and vm.dirty_background_bytes, this was just a workaround. What has to happen is that Gluster has to automatically flush uncommitted writes to storage if they exceed some size threshold. So instead of calling nfs_flush from just the nfs3_commit routine, you have to trigger a call to it from somewhere else (maybe during write processing) if this threshold is detected. Can we make the threshold configurable please? > >--- Additional comment from Ben England on 2013-07-24 00:08:54 EDT --- > >Now that I've increased NFS COMMIT frequency as Jeff Layton suggested with > >vm.dirty_background_bytes=32768000 >vm.dirty_bytes=65536000 > >on the NFS clients to help Gluster-NFS implementation flush writes to bricks sooner, I have successfully backed out tunings that were introduced earlier when we weren't sure what the cause was - no hangs are occurring anymore. glusterfs NFS server process never exceeds 1 GB, is closer to 600 MB now. > >- removing NFS mount options rsize=32k,wsize=32k does not cause hang >- Removing record size of 16 KB (back to 64 KB) does not cause hang >- removing kernel network tunings in /etc/sysctl.conf does not cause hang (TCP autotuning should be handling some of them). >- No volume tuning parameters were required - but eager-lock may help performance >- vm.min_free_kbytes=1GB is not required (on Big Bend = RHEL6.4) - but may help performance > >But as soon as I set vm.dirty_background_bytes=0 vm.dirty_bytes=0 (default values), glusterfs becomes huge and hangs again. > >This bz is not fixed until NFS clients don't have to be tuned. > >--- Additional comment from Scott Haines on 2013-07-24 03:14:50 EDT --- > >Not a regression. Requires kernel tunables. Now targeting Corbett. > >--- Additional comment from RHEL Product and Program Management on 2013-07-24 03:21:52 EDT --- > >This bug report has Keywords: Regression or TestBlocker. > >Since no regressions or test blockers are allowed between releases, >it is also being identified as a blocker for this release. > >Please resolve ASAP. > >--- Additional comment from santosh pradhan on 2013-07-25 06:29:11 EDT --- > > >Working on the DRC patch. > >--- Additional comment from santosh pradhan on 2013-07-25 08:58:30 EDT --- > > >Update: >======= > >1. I tried DRC patch (without any tuning in client/server side), same behaviour i.e iozone hangs. > >My findings after all these testing: >=================================== > >2. Whenever iozone hangs, these are the behaviours we see in client/server side: > >a. 
client side: nfsstat shows a lot of retransmission: > >e.g. >Client rpc stats: >calls retrans authrefrsh >418122 31379 396229 > >which is huge. > >What does it tell: NFS server is not able to handle the workload. Gluster NFS is single threaded (Now it can not be changed though). > >[Whenever it's seen, in case of kernel NFS, it's advised to increase the number of NFS kernel thread.] > >b. server side: >"netstat -s" shows a lot of packet overrun issues in network: > >e.g. > > 1884 packets pruned from receive queue because of socket buffer overrun > 42 packets pruned from receive queue > 14 packets dropped from out-of-order queue because of socket buffer overrun > 1629267 packets collapsed in receive queue due to low socket buffer > >It demands "Network Tuning" in server side for better NFS performance. > >3. What can be tuned for better NFS performance: >a. Network (server side) >b. System e.g. vm subsystem in server >c. Client e.g. VM subsys in client > >It's recommended to tune the client VM subsys for better NFS performance, especially the dirty page-cache parameter which may congest the network during the high I/O workload (high volume of write) in a system with bigger RAM (more than 8-GB). > >In our case, the amount of RAM in clients (64GB) are more than that of server ( ~48GB). The default vm params in client m/c: > >vm.dirty_background_ratio = 10 ( which is ~ 6.4GB) >vm.dirty_ratio = 20 ( which is ~26GB) > >These are really big for a NFS client which needs tuning. > >vm.dirty_background_bytes = 0 >vm.dirty_bytes = 0 > >The tuning can be made through either of the set but only one set ll be taken into a/c, later has the preference I guess (not sure). > >http://www.novell.com/support/kb/doc.php?id=7010287 > >Not sure if we enforce the write in gluster NFS, how much it is going to help. I am not sure but I guess NFS kernel version also does not do any such enforcement. > >I would request Anand/Ben/Sayan/Vivek for their input to take it forward from here. > >Thanks, >Santosh > >--- Additional comment from santosh pradhan on 2013-07-25 12:40:37 EDT --- > > >One more thing forgot to mention i.e. performance problem because of RAID: ( which might need some tuning) > >The glusterfs brick is cut from a RAID disk (in both the servers), they might be imposing some performance bottleneck (as mentioned by Anand in one of the email thread). > >What is the RAID level for the bricks ? >What is the chunk size of the RAID? >What is the block size of underlying filesystem (xfs in this case)? > >I dont know much about the RAID tuning for performance but I think xfs/RAID needs some tuning for better performance. > >Reference: >(http://www.redwax.co.uk/blog-reader/items/linux-ext2-ext3-and-ext4-stride-stripe-calculation-on-raid.html) > >(1) Ideally 10GE set-up should provide better NFS performance instead of 1GE interface. If the NFS is going to write 64KB data (GlusterFS supports NFS v3 which auto-negotiates to 64KB as rsize/wsize) then it should generate more TCP packets in case of 1GE (1500 MTU) than 10GE (MTU >= 9000). [Exception: 1GE also can support jumbo-frames if configured]. But in our case, the reverse happens. I tested with 1GE setup where iozone test passed with all record size. In fact, iozone record size should not matter, what matters is the NFS (r/w)size. > >(2) In the past, Ben tested the same with RHS 2.0 U4 (as per the shared test result) where iozone test passed. But with same version of GlusterFS s/w, the test fails in this setup. 
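A consolidated sketch of the tunings discussed in the last few comments: the client-side dirty-page limits Ben used as a workaround (values taken from his comment) and the server-side socket-buffer increases Santosh proposes (those values are illustrative, not validated here). As noted above, needing client-side tuning is a workaround, not a fix.

    # --- NFS clients: flush dirty pages (and so issue COMMITs) sooner ---
    sysctl -w vm.dirty_background_bytes=32768000   # ~32 MB before background writeback
    sysctl -w vm.dirty_bytes=65536000              # ~64 MB hard limit on dirty pages

    # --- Gluster NFS servers: larger socket buffers for bursty 10GbE clients ---
    # defaults on these hosts were 229376 bytes; the sizes below are examples only
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
    sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'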
> >From both of the above points, it looks like the disks play some role in the performance degradation. > >I will check with Anand/system admin (here in Bangalore) if jumbo frames can be enabled in the switch where both of the BLR servers are connected. Both of the servers have 10GE interfaces. Both of the machines are connected to SATA disks (Anand, please correct me if the disks are SAS or similar). > >--- Additional comment from Ben England on 2013-07-30 16:08:39 EDT --- > >I tried using vm.dirty_ratio=2 vm.dirty_background_ratio=1 and it hung, so no reasonable vm.dirty tuning is going to make Gluster-NFS work. > >I also saw glusterfs fail to stop when I stopped the volume. This makes it impossible to fix without killing the glusterfs NFS process, not a good thing. > >re: .-1 > >These clients are not abnormally sized. If anything you could see clients with far more memory than this in the field. > >RAID level is 6 >chunk size is 256 KB or 64 KB, not sure which right now. >block size is default (directory block size is 8192) >The XFS file system is made according to the RHS best practices document as follows by the test harness used by the perf team: > >mkfs -t xfs -f -i size=512 -n size=8192 -d agcount=31,su=256k,sw=10 -L RHSbrick0 /dev/vg_brick0/lv && mount -v -t xfs -o inode64,noatime -L RHSbrick0 /mnt/brick0 > >agcount=31 should not be needed if you have the latest xfsprogs rpm in Big Bend. >noatime is irrelevant for this workload. > >Typically jumbo frames can be enabled in the switch. A jumbo frame size of 7000 is almost as good as 9000 but is much easier for the memory allocator to find, so if you are seeing kernel stack traces with jumbo frames, try this. > >As for disks, BAGL hosts use SAS 7200-RPM drives with PERC H700 (MegaRAID) with writeback cache enabled. > >--- Additional comment from santosh pradhan on 2013-08-01 13:46:43 EDT --- > >(In reply to Ben England from comment #40) >> I tried using vm.dirty_ratio=2 vm.dirty_background_ratio=1 and it hung, so >> no reasonable vm.dirty tuning is going to make Gluster-NFS work. >> > >[Santosh] dirty_background_ratio is 1, i.e. 1% of RAM is ~640MB, which still looks high. Maybe something around 300MB would work for our RAM size, as suggested by the Novell KB article. (yet to be tried out) > > >> I also saw glusterfs fail to stop when I stopped the volume. This makes it >> impossible to fix without killing the glusterfs NFS process, not a good >> thing. >> > >[Santosh] Agree, this does not look good. :-( >I am not actively working on this bug as it is targeted for Corbett. Thanks for all the experimentation. > >Thanks, >Santosh > >--- Additional comment from Ben England on 2013-08-05 17:05:47 EDT --- > >I have a hang now even with the kernel tuning. Performance is bad with the prior vm.dirty_bytes=65536000, so I tried multiplying by 4... > > -- gprfc077 -- >vm.dirty_background_ratio = 0 >vm.dirty_background_bytes = 32768000 >vm.dirty_ratio = 0 >vm.dirty_bytes = 262144000 > >and with this workload > >/usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.70.252 -i 0 -C -w -+n -s 67108864 -r 128 -c -e -t 8 > >I got multiple of these messages on multiple servers: > >Aug 5 15:09:53 gprfs017 GlusterFS[17954]: [2013-08-05 15:09:53.438281] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-0: cannot lookup the saved frame corresponding to xid (1618709) >..
>Aug 5 15:45:04 gprfs019 GlusterFS[16293]: [2013-08-05 15:45:04.467323] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-2: cannot lookup the saved frame corresponding to xid (1130351) >... > >in /var/log/glusterfs/nfs.log: > >[2013-08-05 15:45:09.733087] I [client-handshake.c:450:client_set_lk_version_cbk] 0-perftest-client-2: Server lk version = 1 >[2013-08-05 16:15:18.775003] E [rpc-clnt.c:207:call_bail] 0-perftest-client-2: bailing out frame type(GlusterFS 3.3) op(FXATTROP(34)) xid = 0x1138745x sent = 2013-08-05 15:45:08.199970. timeout = 1800 >[2013-08-05 16:15:18.775040] W [client-rpc-fops.c:1811:client3_3_fxattrop_cbk] 0-perftest-client-2: remote operation failed: Transport endpoint is not connected >[2013-08-05 18:30:09.639496] E [rpc-clnt.c:207:call_bail] 0-perftest-client-2: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x1167741x sent = 2013-08-05 18:00:07.870139. timeout = 1800 >... > >my theory is that some gluster process crashes and loses state about NFS such that NFS cannot make progress because it can't get COMMITs to succeed. > >--- Additional comment from santosh pradhan on 2013-08-07 07:49:52 EDT --- > > >Ben, > >I dont think gluster process crashes. They become unresponsive, mostly because of TCP/IP stack pressure. (in server side) > >1. From our experimentation, one thing looks clear i.e. if we do NFS COMMIT more frequently with less data (cached in client), the test PASS. > >dd /iozone with O_DIRECT/sync flag PASS because it by-passes/avoids the client cache. > >No rpc retransmission and NO TCP/IP packet drops/data lost are seen. > >2. So, if we increase the cache amount (vm dirty_ratio/dirty_background_ratio), then the test FAILs. > >In FAIL case, rpc retransmission goes up crazy along with TCP/IP layer pressure in the server side. > >So my understanding is no single tunable is going to help for NFS performance. > >Client side: >=========== > >1. Tuning needed in vm.dirty_ratio and vm.dirty_background_ratio (or bytes) with m/c with huge RAM. Size needs to be decided based on the experiments/various tests. > >2. Increase the number of RPC connections to server i.e. /proc/sys/sunrpc/tcp_slot_table_entries, especially for 10GE interface. > >[ Ref: http://researcher.watson.ibm.com/researcher/view_person_subpage.php?id=4427] > >Server side: >========== > >1. Tuning vm (already there in place) >2. a) Tuning tcp/ip/socket layer buffers - Needs to be done. > b) Tune at NIC level i.e. txqueuelen >3. Tune performance.write-behind-window-size (Not sure how much would be good, need to check). >4. Tune at I/O scheduler level for Disk I/O performance: >For NFS server m/c, I guess deadline is the scheduler to go with. >It is already SET in /dev/sdb which is brick dir (if I am not wrong) >[Ref https://docspace.corp.redhat.com/docs/DOC-92040] > >In our test bed, looks like the tcp/socket(core)/ip/NIC layer tuning is needed in the SERVER side and vm-cache size and RPC-tcp_slot_table_entries needs to be tuned in the CLIENT side. > >Thanks, >Santosh > >--- Additional comment from Ben England on 2013-08-16 12:50:56 EDT --- > >Santosh, > >If any tuning is needed to make NFS work (i.e. not hang) IN THE NFS CLIENT, then it is not supportable. See 997140 for a related bz, showing additional evidence of this problem. > >I agree with first couple of conclusions, but disagree with idea that RPC retransmission is the problem, I think it is a symptom. See below for what I think is root cause. 
> >not sure why performance.write-behind-window-size would help here, it's already 1 MB by default. Increasing it would only cause more pages to be buffered in the NFS server process. > >we don't need more RPC connections in this case, we don't have that many clients (only 8, 2 per server). > >TCP has an "auto-tune" feature in RHEL6 that should adjust buffer space automatically, if this is not working then we need to bz in RHEL! In RHEL5 I used to adjust net.core.tcp_{r,w}mem up to 4 MB, but I was told this should not be necessary in RHEL6, and that's what we observed 2 years ago, I can provide data if this is of interest. > >txqueuelen is a possible performance tuning but I'm skeptical that it is necessary to prevent NFS hangs, since I see very high rates when I first start the tests. It is already 1000 jumbo frames, how many do we need? > >I suspect the issue is more in memory management since the problem does not happen right away or with smaller tests, am going to stick with that. I think the client runs out of memory, tries to do a COMMIT, and the server can't process the COMMIT because there are too many write requests queued, so it can't make progress and can't respond to the client (server is single-threaded), so the client cannot issue any more RPCs to the server successfully. This is the cause of the RPC retransmissions perhaps. > >--- Additional comment from Ben England on 2013-08-28 08:02:26 EDT --- > >Santosh thinks TCP buffer tuning is a way to prevent packet loss and RPC retransmissions, and that this would solve the problem. I'm not convinced but I'm willing to give it a try and see what happens, so I'll report back here when I have some data. If this approach works, it has the advantage that it can be implemented on the SERVER SIDE, without client tuning. > >So the alternative hypothesis that I heard from Santosh is that NFS server is single threaded and can't keep up with the multiple streams of NFS requests coming in from 2 clients/server. NFS is being re-implemented in a future release (possibly with Ganesha) so consequently there is not a lot of interest from the developers to invest in changing NFS implementation in Big Bend. I would agree if the main issue was performance, but this is a hang, not a performance problem. > >--- Additional comment from santosh pradhan on 2013-08-28 11:36:23 EDT --- > >(In reply to Ben England from comment #45) >> Santosh thinks TCP buffer tuning is a way to prevent packet loss and RPC >> retransmissions, and that this would solve the problem. I'm not convinced >> but I'm willing to give it a try and see what happens, so I'll report back >> here when I have some data. If this approach works, it has the advantage >> that it can be implemented on the SERVER SIDE, without client tuning. > >[Santosh] I am thinking TCP-auto tuning might be one of the reason for this issue. If the client machine has more memory installed and the VM is not tuned, then the quantity of I/O per commit would be high. Again if the write is happening from multiple clients, then amount of write would be even higher. If each of the client is writing with multiple threads, then the frequency of write burst would be high. When the SERVER (where NFS server runs), gets a lot of TCP packets (because each NFS write would be of size: 64k for NFS v3 and they would be converted into pmtu i.e. say 9k for 10GE). Because of "TCP-auto tuning" feature, the socket layer recv-buffer would keep increasing. 
The server has additional responsibility to send the packets to different bricks as well for which it needs that much send-buffer as well. It may happen that, recv/send buffer of socket layer would eat up all the memory allocated for the TCP stack i.e. /proc/sys/net/ipv4/tcp_mem. Because of this, ACKs for the incoming requests (write/ping) would be delayed. The senders would keep trying for the connection (by retransmitting the req's) for a certain period (limited no of retransmission) and they give up by resetting the connection. This can make the situation even worse. > >To check the theory: Under high load, when NFS server starts behaving badly: > >cat /proc/sys/net/ipv4/tcp_mem > >cat /proc/net/sockstat > >If the results are closer or match up, then the theory is correct. > >May be we can try this by turning OFF the TCP auto tune i.e. /proc/sys/net/ipv4/tcp_moderate_rcvbuf. > >I cant assure that it would work. > >So far, it's widely accepted solution to handle high-write-workload is to increase the NFS write commit frequency from client side by tuning vm.dirty_ratio/vm.dirty_background_ratio > >OR > >vm.dirty_bytes/vm.dirty_background_bytes > >[ KB article from Novell: >http://www.novell.com/support/kb/doc.php?id=7010287 ] > >> >> So the alternative hypothesis that I heard from Santosh is that NFS server >> is single threaded and can't keep up with the multiple streams of NFS >> requests coming in from 2 clients/server. > >[Santosh] Its already being told by Anand many times. With higher write workload, 8 threads (default) of kernel NFS may not be sufficient. And Gluster NFS is just single threaded, which is a limitation by itself. > > NFS is being re-implemented in a >> future release (possibly with Ganesha) so consequently there is not a lot of >> interest from the developers to invest in changing NFS implementation in Big >> Bend. I would agree if the main issue was performance, but this is a hang, >> not a performance problem. > >[Santosh] I just mentioned that NFS-ganesha is being worked which has scalable threads to handle more I/O. I dont know anything beyond unfortunately. That's a decision to be taken by management. > >BZ 965924 is same as this bug. I feel they should be tracked as one problem not two. > >Thanks, >Santosh > >--- Additional comment from Ben England on 2013-09-09 13:15:34 EDT --- > >I disagree that the problem is single-threaded-ness of NFS. I reduced number of clients from 8 to 4, so there was only 1 NFS client per NFS server, and the same problem still occurs with sufficiently large files. Throughput is better though until the hang occurs. > >Sayan has said repeatedly that a solution which requires tuning the NFS client is not acceptable, because no one will implement it. We need to fix the NFS server to avoid this problem. I have not been able to tune my way out of the problem by adjusting TCP buffering kernel parameters on the server, for example. Please do not close this bug on the basis that client tuning is a solution. > >--- Additional comment from santosh pradhan on 2013-09-13 04:39:40 EDT --- > > >I think I could understand the problem. > >The problem is: > >GlusterFS NFS code assumes that there will be 128 in-flight RPC requests at a time (max) from NFS client. Currently the NFS tunable mem-factor is set to 15 (GF_NFS_DEFAULT_MEMFACTOR) (default) which gets multiplied by 15 ( GF_NFS_CONCURRENT_OPS_MULT): which can serve at most 225 in-flight RPC requests. If the load is high, then it ll be a bottleneck for NFS server. 
This gets exposed easily with a parallel iozone run (with a high write workload). > >The value 128 was the hardcoded maximum for earlier RHEL (prior to 6.3). From 6.3 onwards, sunrpc.tcp_slot_table_entries gets tuned dynamically with the load (it can go up to sunrpc.tcp_max_slot_table_entries, i.e. 65536 now) based on the TCP window size. TCP autotuning is also enabled in newer RHEL releases, e.g. 6.4 onwards. > >I tried the iozone command in cluster mode (with a 2x2 volume and 8 clients) with a 32g write from 8 threads, and it PASSED. > >Server side tune: >================ >sysctl -w sunrpc.tcp_max_slot_table_entries=224 >(the max Gluster can support is 225 without mem-factor tuning) > >Or > >mem-factor can be tuned to a higher value. That will keep mem-factor * 6000 inodes in cache. > >So, I believe we need to modify the code to address this. > >Ben, >Could you please verify the theory by running the tests in your test bed? > >Pre-requisite: please make sure that there are no other instances of iozone running (maybe from some cron job) which might corrupt the data of the iozone files. I often see iozone still running in the test bed (the one you shared with me) but I don't know who started it. Do you have a cron job running in the test bed? > >Best, >Santosh > >--- Additional comment from santosh pradhan on 2013-09-13 04:41:11 EDT --- > > >Have a look at http://www.slideshare.net/sshaaf/red-hat-enterprise-linux-and-nfs, page 22. > >--- Additional comment from Pavithra on 2013-09-13 12:50:57 EDT --- > >This is added as a known issue in the 2.1 release notes. > >--- Additional comment from Pavithra on 2013-09-13 12:51:44 EDT --- > > > >--- Additional comment from Pavithra on 2013-09-16 01:33:35 EDT --- > >The GA link for the known issue is here: >https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/2.1_Release_Notes/index.html#chap-Documentation-2.1_Release_Notes-Known_Issues > >--- Additional comment from Anand Avati on 2013-09-16 03:16:44 EDT --- > >REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#2) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2013-09-16 06:24:36 EDT --- > >REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#3) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Ben England on 2013-09-16 10:57:25 EDT --- > > I'll try to apply the patch to the RHS 2.1 source RPM to minimize changes and see how that goes. > >--- Additional comment from Ben England on 2013-09-19 12:21:04 EDT --- > >At Sayan's request, here is the tuning I used to work around this problem on the NFS clients. In /etc/sysctl.conf: > >vm.dirty_background_bytes=32768000 >vm.dirty_bytes=65536000 > >This replaces the vm.dirty_ratio and vm.dirty_background_ratio settings that control how much write data can be buffered by the NFS client. It will significantly reduce throughput, by as much as 50%, but Gluster can sustain throughput this way. I'm not saying this fixes the problem, only that it is a workaround. > >One problem with this workaround is that it affects ALL filesystems, not just NFS. > >--- Additional comment from Ben England on 2013-10-02 14:33:11 EDT --- > >We need a build with this patch applied to RHS 2.1 so the perf team can test.
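>
>A concrete version of the client-side workaround from the 2013-09-19 comment above (the byte values are the ones used in that comment and are specific to this test bed; this remains only a workaround, not a fix):
>
># on every NFS client: persist the thresholds, then reload them
>echo 'vm.dirty_background_bytes=32768000' >> /etc/sysctl.conf
>echo 'vm.dirty_bytes=65536000' >> /etc/sysctl.conf
>sysctl -p
>
>As noted in that comment, this throttles dirty-page writeback for every filesystem on the client, not just the Gluster NFS mounts.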
> >--- Additional comment from Anand Avati on 2013-10-03 04:58:49 EDT --- > >REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#4) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from santosh pradhan on 2013-10-04 04:50:04 EDT --- > >Downstream build with minimal code changes available @ > >https://brewweb.devel.redhat.com/taskinfo?taskID=6366874 > >It does not need any setting in the volume. > >--- Additional comment from Anand Avati on 2013-10-04 05:04:33 EDT --- > >REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#5) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2013-10-07 08:28:20 EDT --- > >REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#6) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2013-10-07 08:32:31 EDT --- > >REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#7) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2013-10-07 08:34:20 EDT --- > >REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#8) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from santosh pradhan on 2013-10-09 01:52:39 EDT --- > >I tested with minimal changes (from patch posted for review) with downstream, iozone load is passing: > >Volume info: >=========== >Volume Name: perftest >Type: Distributed-Replicate >Volume ID: c052bfd8-fd21-4140-8059-f3d2c5993b76 >Status: Started >Number of Bricks: 2 x 2 = 4 >Transport-type: tcp >Bricks: >Brick1: gprfs017-10ge:/mnt/brick0/brick >Brick2: gprfs018-10ge:/mnt/brick0/brick >Brick3: gprfs019-10ge:/mnt/brick0/brick >Brick4: gprfs020-10ge:/mnt/brick0/brick > >How the volume is mounted with 8 clients: >========================================= > >All mount points on gprfs017.sbu.lab.eng.bos.redhat.com: >172.17.10.89:/perftest >172.17.10.94:/perftest >172.17.70.252:/perftest >All mount points on gprfs018.sbu.lab.eng.bos.redhat.com: >172.17.10.90:/perftest >172.17.10.95:/perftest >All mount points on gprfs019.sbu.lab.eng.bos.redhat.com: >172.17.10.91:/perftest >172.17.10.96:/perftest >All mount points on gprfs020.sbu.lab.eng.bos.redhat.com: >172.17.10.77:/perftest >172.17.10.92:/perftest > >IOZONE workload: >================= > > Iozone: Performance Test of File I/O > Version $Revision: 3.392 $ > Compiled for 64 bit mode. > Build: linux-AMD64 > > Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins > Al Slater, Scott Rhine, Mike Wisner, Ken Goss > Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, > Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner, > Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone, > Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root, > Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer. > Ben England. > > Run began: Mon Oct 7 05:46:00 2013 > > Network distribution mode enabled. 
> Hostname = 172.17.70.252 > Setting no_unlink > Include close in write timing > Include fsync in write timing > No retest option selected > Record Size 128 KB > File size set to 67108864 KB > Command line used: /usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.70.252 -C -w -c -e -i 0 -+n -r 128 -s 64g -t 8 > Output is in Kbytes/sec > Time Resolution = 0.000001 seconds. > Processor cache size set to 1024 Kbytes. > Processor cache line size set to 32 bytes. > File stride size set to 17 * record size. > Throughput test with 8 processes > Each process writes a 67108864 Kbyte file in 128 Kbyte records > > Test running: > Children see throughput for 8 initial writers = 984939.92 KB/sec > Min throughput per process = 119135.94 KB/sec > Max throughput per process = 128661.16 KB/sec > Avg throughput per process = 123117.49 KB/sec > Min xfer = 59963392.00 KB > Child[0] xfer count = 59963392.00 KB, Throughput = 119135.94 KB/sec > Child[1] xfer count = 64541440.00 KB, Throughput = 128661.16 KB/sec > Child[2] xfer count = 67108864.00 KB, Throughput = 122749.62 KB/sec > Child[3] xfer count = 60181632.00 KB, Throughput = 123384.07 KB/sec > Child[4] xfer count = 60026624.00 KB, Throughput = 119214.86 KB/sec > Child[5] xfer count = 60088448.00 KB, Throughput = 126651.59 KB/sec > Child[6] xfer count = 66225664.00 KB, Throughput = 121826.70 KB/sec > Child[7] xfer count = 60401152.00 KB, Throughput = 123315.98 KB/sec > > > >iozone test complete. > >--- Additional comment from santosh pradhan on 2013-10-09 07:02:11 EDT --- > > >Some more testing results (with fix of BZ 977492 and 1016608) >============================================================ > >1. 1x2 vol, 4 clients (2 clients mounts per server) > >gluster v set perftest nfs.write-size 1048576 > >iozone I/O workload(write): >-s 64g -t 4 - Worked fine. > >2. 1x2 vol, 4 clients (2 clients mounts per server) >gluster v reset nfs.write-size > >iozone I/O workload(write): >-s 16g -t 4 - Worked fine. > >--- Additional comment from Anand Avati on 2013-10-20 23:07:44 EDT --- > >REVIEW: http://review.gluster.org/6114 (rpcsvc: implement per-client RPC throttling) posted (#1) for review on master by Anand Avati (avati@redhat.com) > >--- Additional comment from Anand Avati on 2013-10-21 20:52:55 EDT --- > >REVIEW: http://review.gluster.org/6114 (rpcsvc: implement per-client RPC throttling) posted (#2) for review on master by Anand Avati (avati@redhat.com) > >--- Additional comment from Ben England on 2013-10-24 13:56:01 EDT --- > >good news bad news... > >At least 3 of my iozone write tests now complete successfully where before none did. > >the memory consumption goes away if I disable the write-behind translator, and throughput is if anything slightly better than with writebehind! > >One of the SAs at the RHS training this week said that there was a NFS timeout parameter that eliminated the write problem, I'm trying to track down the information on this. > >See run log and stats at http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/13-10-24-12-41-19/ > >bad news: with 4 servers and 8 clients I got the 7th client to hang, and saw this stack trace in /var/log/messages on that client multiple times today while running with RPMs built by Santosh containing the fix. 
At the same time, I see rpc.statd and sm-notify daemons restarting on the NFS server gprfs019 where gprfc077 is mounting, and I see some sort of gluster-NFS restart happening at this time also: > >QUESTION: is this a different bz? Am trying to repro with smaller config for easier debug. > >[root@gprfs019 ~]# tail /var/log/messages >... >Oct 24 17:22:26 gprfs019 rpc.statd[15604]: Version 1.2.3 starting >Oct 24 17:22:26 gprfs019 sm-notify[15605]: Version 1.2.3 starting >[root@gprfs019 ~]# tail -f /var/log/glusterfs/etc-glusterfs-glusterd.vol.log >... >[2013-10-24 17:22:26.083783] I [glusterd-utils.c:3847:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully >[2013-10-24 17:22:26.083962] I [glusterd-utils.c:3852:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully > > > Again some of the tests complete (and fast), but then this happens -- I took gprfc076 out of there so it has nothing to do with a single client being in a bad state. > >gprfc077: Oct 24 13:15:56 gprfc077 kernel: INFO: task bash:3969 blocked for more than 120 seconds. >gprfc077: Oct 24 13:15:56 gprfc077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >gprfc077: Oct 24 13:15:56 gprfc077 kernel: bash D 000000000000000f 0 3969 3737 0x00000080 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: ffff8807dd861ce8 0000000000000086 0000000000000000 ffff88083199f540 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: ffff88083199f578 0000000000000000 ffff8807dd861ca8 ffffffff81064a00 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: ffff88083199faf8 ffff8807dd861fd8 000000000000fb88 ffff88083199faf8 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: Call Trace: >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff81064a00>] ? pick_next_task_fair+0xd0/0x130 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150daca>] ? thread_return+0x1b8/0x76e >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150e7d5>] schedule_timeout+0x215/0x2e0 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff810669eb>] ? enqueue_rt_entity+0x6b/0x80 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150e453>] wait_for_common+0x123/0x180 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150e56d>] wait_for_completion+0x1d/0x20 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8118a0a0>] do_execve+0xe0/0x2c0 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80 >gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0 > >gprfc077: Oct 24 13:25:56 gprfc077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >gprfc077: Oct 24 13:25:56 gprfc077 kernel: sshd D 000000000000000f 0 3973 2129 0x00000080 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff88082fef5ce8 0000000000000082 0000000000000000 0000000000000000 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff88082fc95ab8 ffff88082fef5fd8 000000000000fb88 ffff88082fc95ab8 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: Call Trace: >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffffa0094bf0>] ? ext4_file_open+0x0/0x130 [ext4] >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e7d5>] schedule_timeout+0x215/0x2e0 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8117e514>] ? 
nameidata_to_filp+0x54/0x70 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81277349>] ? cpumask_next_and+0x29/0x50 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e453>] wait_for_common+0x123/0x180 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e56d>] wait_for_completion+0x1d/0x20 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8118a0a0>] do_execve+0xe0/0x2c0 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: INFO: task sshd:3974 blocked for more than 120 seconds. >gprfc077: Oct 24 13:25:56 gprfc077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >gprfc077: Oct 24 13:25:56 gprfc077 kernel: sshd D 000000000000000f 0 3974 2129 0x00000080 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff8806906dfce8 0000000000000082 0000000000000000 0000000000000000 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff880831d3a5f8 ffff8806906dffd8 000000000000fb88 ffff880831d3a5f8 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: Call Trace: >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffffa0094bf0>] ? ext4_file_open+0x0/0x130 [ext4] >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e7d5>] schedule_timeout+0x215/0x2e0 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8117e514>] ? nameidata_to_filp+0x54/0x70 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81277349>] ? cpumask_next_and+0x29/0x50 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e453>] wait_for_common+0x123/0x180 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e56d>] wait_for_completion+0x1d/0x20 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8118a0a0>] do_execve+0xe0/0x2c0 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80 >gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0 > >last thing attempted by the test harness was this: > >13-10-24-13-23-03 : running on hosts ['gprfc089', 'gprfc090', 'gprfc091', 'gprfc092', 'gprfc094', 'gprfc095', 'gprfc077', 'gprfc078'] cmd: sync > >Note that sync would hit the local ext4 filesystem as well as the NFS file system, so perhaps there is some sort of interaction there. > >Here's my volume: > >Volume Name: perftest >Type: Distributed-Replicate >Volume ID: 59fac3fb-9194-40b1-8bf8-ba5fde61ce77 >Status: Created >Number of Bricks: 2 x 2 = 4 >Transport-type: tcp >Bricks: >Brick1: gprfs017-10ge:/mnt/brick0/brick >Brick2: gprfs018-10ge:/mnt/brick0/brick >Brick3: gprfs019-10ge:/mnt/brick0/brick >Brick4: gprfs020-10ge:/mnt/brick0/brick >Options Reconfigured: >performance.nfs.write-behind: off >nfs.drc: 0 >nfs.acl: 0 > >Here's NFS mount command used: > ># mount -v -t nfs -o noacl,nfsvers=3,tcp %s:/perftest /mnt/glnfs > >where %s is replaced with whatever server the client is connecting to, there are 2 clients/server. 
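>
>For example, a client assigned to gprfs018 would end up running something like (the hostname here is purely illustrative; substitute whichever server this client mounts from):
>
># mount -v -t nfs -o noacl,nfsvers=3,tcp gprfs018-10ge:/perftest /mnt/glnfs
>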
> > >here's the volfile: > >gprfs018: +------------------------------------------------------------------------------+ >gprfs018: 1: volume perftest-client-0 >gprfs018: 2: type protocol/client >gprfs018: 8: end-volume >gprfs018: 9: >gprfs018: 10: volume perftest-client-1 >gprfs018: 11: type protocol/client >gprfs018: 17: end-volume >gprfs018: 18: >gprfs018: 19: volume perftest-replicate-0 >gprfs018: 20: type cluster/replicate >gprfs018: 21: subvolumes perftest-client-0 perftest-client-1 >gprfs018: 22: end-volume >gprfs018: 23: >gprfs018: 24: volume perftest-client-2 >gprfs018: 25: type protocol/client >gprfs018: 31: end-volume >gprfs018: 32: >gprfs018: 33: volume perftest-client-3 >gprfs018: 34: type protocol/client >gprfs018: 40: end-volume >gprfs018: 41: >gprfs018: 42: volume perftest-replicate-1 >gprfs018: 43: type cluster/replicate >gprfs018: 44: subvolumes perftest-client-2 perftest-client-3 >gprfs018: 45: end-volume >gprfs018: 46: >gprfs018: 47: volume perftest-dht >gprfs018: 48: type cluster/distribute >gprfs018: 49: subvolumes perftest-replicate-0 perftest-replicate-1 >gprfs018: 50: end-volume >gprfs018: 51: >gprfs018: 52: volume perftest >gprfs018: 53: type debug/io-stats >gprfs018: 56: subvolumes perftest-dht >gprfs018: 57: end-volume >gprfs018: 58: >gprfs018: 59: volume nfs-server >gprfs018: 60: type nfs/server >gprfs018: 74: subvolumes perftest >gprfs018: 75: end-volume >gprfs018: 76: > >This is the RPMs we're using, I believe Santosh built them: > >[root@gprfs019 ~]# rpm -q --info glusterfs >Name : glusterfs Relocations: (not relocatable) >Version : 3.4rhs Vendor: Red Hat, Inc. >Release : 1.el6rhs Build Date: Tue 22 Oct 2013 09:08:47 AM UTC >Install Date: Tue 22 Oct 2013 09:20:12 AM UTC Build Host: x86-028.build.eng.bos.redhat.com >Group : System Environment/Base Source RPM: glusterfs-3.4rhs-1.el6rhs.src.rpm >Size : 7316109 License: GPLv2 or LGPLv3+ >Signature : (none) >Packager : gluster-users@gluster.org >URL : http://www.gluster.org/docs/index.php/GlusterFS >Summary : Cluster File System > >--- Additional comment from Ben England on 2013-10-24 20:54:09 EDT --- > >ideas on why this happens? > >[root@gprfs018 ~]# gluster volume set perftest nfs.outstanding-rpc-limit 8 >volume set: failed: Staging failed on 172.17.40.17. Error: Required op_version (3) is not supported > >[root@gprfs018 ~]# gluster volume set perftest server.outstanding-rpc-limit 256 >volume set: failed: Staging failed on 172.17.40.17. Error: Required op_version (3) is not supported > >--- Additional comment from Ben England on 2013-10-24 23:29:43 EDT --- > >Maybe NFS doesn't handle the case where Gluster is backpressuring it. With 1 NFS client/server, it never hung. Manoj was able to make the old implementation hang with 4 clients/server. > >So I used write-behind translator to force NFS to not have as much max outstanding RPC data as Gluster layer under it. Since I can't currently set the throttle parameters (see last bz comment), I used NFS mount option wsize=32768. With writebehind translator, this will result in about 4 NFS RPCS per Gluster RPC, right? So even with 2 NFS clients/server, gluster should not run out of outstanding RPCs this way. It worked for 16-GB files and 64-GB files with 2 threads/client, 2 clients/server (1/2 TB of write data). > >It did NOT work with 4 clients mounting 1 server gprfs018-10ge. Maybe I needed wsize=16384 on client mounts for that. > >Do we need Gluster bricks to be able to successfully backpressure NFS? > >Here's the test that worked with 2 clients/server as shown above. 
> >[root@ben-test-driver2-10ge gluster_test]# iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0 -+n -r 64k -s 64g -C -t 4 > Run began: Thu Oct 24 22:14:26 2013 > Test running: > Children see throughput for 4 initial writers = 446187.80 KB/sec > Min throughput per process = 110588.89 KB/sec > Max throughput per process = 112947.85 KB/sec > Avg throughput per process = 111546.95 KB/sec > Min xfer = 63893888.00 KB > Child[0] xfer count = 67108864.00 KB, Throughput = 112947.85 KB/sec > Child[1] xfer count = 64317568.00 KB, Throughput = 110773.44 KB/sec > Child[2] xfer count = 64369536.00 KB, Throughput = 111877.62 KB/sec > Child[3] xfer count = 63893888.00 KB, Throughput = 110588.89 KB/sec > >gprfs018# gluster v i >Volume Name: perftest >Type: Replicate >Volume ID: f73c6470-54f8-4a6b-b427-1222774a318c >Status: Started >Number of Bricks: 1 x 2 = 2 >Transport-type: tcp >Bricks: >Brick1: gprfs017-10ge:/mnt/brick0/brick >Brick2: gprfs018-10ge:/mnt/brick0/brick >Options Reconfigured: >nfs.mem-factor: 8 >performance.nfs.write-behind: on >nfs.acl: 0 > >--- Additional comment from santosh pradhan on 2013-10-25 02:12:20 EDT --- > >(In reply to Ben England from comment #17) >> ideas on why this happens? >> >> [root@gprfs018 ~]# gluster volume set perftest nfs.outstanding-rpc-limit 8 >> volume set: failed: Staging failed on 172.17.40.17. Error: Required >> op_version (3) is not supported >> >> [root@gprfs018 ~]# gluster volume set perftest server.outstanding-rpc-limit >> 256 >> volume set: failed: Staging failed on 172.17.40.17. Error: Required >> op_version (3) is not supported > >I had sent an email on 22nd Oct, to install the binary from following link: > >https://brewweb.devel.redhat.com/taskinfo?taskID=6450281 > >I dont think the current binary is from the above build link. > >I have downloaded the correct binaries under gprfs0{17,18,19,20}:/root/rpms > >--- Additional comment from santosh pradhan on 2013-10-25 02:31:55 EDT --- > >(In reply to Ben England from comment #16) >> good news bad news... >> >> At least 3 of my iozone write tests now complete successfully where before >> none did. >> >> the memory consumption goes away if I disable the write-behind translator, >> and throughput is if anything slightly better than with writebehind! >> >> One of the SAs at the RHS training this week said that there was a NFS >> timeout parameter that eliminated the write problem, I'm trying to track >> down the information on this. > >[Santosh] When NFS client COMMITs I/O, it expects the reply in 120 seconds i.e. 2 minutes. If it does not get, then it dumps the stacktrace in syslog. I dont know if there is any tunable to increase the timeout value. If the amount of I/O in the COMMIT is bigger, then GlusterNFS throttling can make it delayed for more than 120 secs and subsequently the stacktrace in the client :( > > >> >> See run log and stats at >> http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability- >> cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/13-10-24- >> 12-41-19/ >> >> bad news: with 4 servers and 8 clients I got the 7th client to hang, and saw >> this stack trace in /var/log/messages on that client multiple times today >> while running with RPMs built by Santosh containing the fix. At the same >> time, I see rpc.statd and sm-notify daemons restarting on the NFS server >> gprfs019 where gprfc077 is mounting, and I see some sort of gluster-NFS >> restart happening at this time also: >> >> QUESTION: is this a different bz? 
Am trying to repro with smaller config >> for easier debug. > >[Santosh] Yes, whenever we change any option in the NFS server (basically anything that changes the NFS volfile and NFS graph), it restarts the NFS server. Gluster NFS restarts the NLM service daemons as well. This is tracked under BZ 929036. > >--- Additional comment from Anand Avati on 2013-10-28 03:33:26 EDT --- > >COMMIT: http://review.gluster.org/6114 committed in master by Vijay Bellur (vbellur@redhat.com) >------ >commit a4056292528db49a666422c7f8e0c032441cc83f >Author: Anand Avati <avati@redhat.com> >Date: Sun Oct 20 08:45:18 2013 -0700 > > rpcsvc: implement per-client RPC throttling > > Implement a limit on the total number of outstanding RPC requests > from a given cient. Once the limit is reached the client socket > is removed from POLL-IN event polling. > > Change-Id: I8071b8c89b78d02e830e6af5a540308199d6bdcd > BUG: 1008301 > Signed-off-by: Anand Avati <avati@redhat.com> > Reviewed-on: http://review.gluster.org/6114 > Reviewed-by: Santosh Pradhan <spradhan@redhat.com> > Reviewed-by: Rajesh Joseph <rjoseph@redhat.com> > Reviewed-by: Harshavardhana <harsha@harshavardhana.net> > Reviewed-by: Vijay Bellur <vbellur@redhat.com> > Tested-by: Gluster Build System <jenkins@build.gluster.com> > >--- Additional comment from Ben England on 2013-10-30 18:07:45 EDT --- > >Next step: get a good protocol trace of what is happening at the NFS server to see why congestion control doesn't always work. > >re comment 20: it's hard to see why NFS COMMIT would take > 120 sec given the RPC throttling change! But yes, it makes sense that the stack trace results from the NFS COMMIT not getting done. I suspect instead that Gluster is not acking the COMMIT request for some reason; a protocol trace might show this. > >Avati and I discussed NFS; we agree that we don't understand why the existing code isn't working and need a better analysis of what's going wrong (to Manoj's point, why isn't NFS congestion control working?). Is it a bug in the NFS client, or does kNFS behave differently at the protocol level so that NFS clients don't flood the server? It's not easy to go into the NFS client and analyze a crash dump, or examine it with systemtap, etc. The easiest way to see may be to look at a protocol trace if we can capture it around the time of the hang. > >--- Additional comment from Ben England on 2013-11-08 14:55:06 EST --- > > >I'm running the glusterfs-*-3.4rhs-1.el6rhs.x86_64 RPMs (built by Santosh?) that let me set the outstanding-rpc-limit parameters, thanks for that. > >The good news: it does complete writes where before it did not. > >Problem 1) On writes, it still slows to a crawl, but it takes longer to do it. For example, with the command to write 1 TB of data in 32-GB files: > >iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -C -i 0 -+n -r 64k -t 32 -s 32g > >after about 1/4 TB is written, glusterfs is consuming more CPU than glusterfsd, and this ratio keeps increasing as it slows to a crawl. > >It's still leaking memory because of NFS writes: the virtual address size of the glusterfs process is still 13 GB, 1/5 of RAM, even AFTER writes have completed, fsync has been issued and all files have been closed.
> >Tasks: 715 total, 2 running, 713 sleeping, 0 stopped, 0 zombie >Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >Mem: 49544696k total, 49211980k used, 332716k free, 88720k buffers >Swap: 24772600k total, 44616k used, 24727984k free, 38663592k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >15213 root 20 0 12.8g 8.3g 2984 S 0.0 17.5 4:29.97 glusterfs > 2902 vdsm 15 -5 1607m 9340 2328 S 0.3 0.0 9:06.48 vdsm >15096 root 20 0 1536m 25m 2780 S 0.0 0.1 8:06.92 glusterfsd > >Swap space is still being used. If I do reads, no virtual memory is consumed. > >Workaround: set an NFS-related volume parameter, it will try to restart of NFS server and free up its memory. > >Another workaround: rhs-virtualization tuned profile will set vm.swappiness to 10, reducing swapping, but eventually it will happen. > > >problem 2) self-heal: this really screws up performance, because now glustershd is reading all of these files as well as writing them. This wrecks sequential I/O performance. > >[root@gprfs020 ~]# gluster volume heal perftest info >Gathering list of entries to be healed on volume perftest has been successful > >Brick gprfs017-10ge:/mnt/brick0/brick >Number of entries: 17 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.29 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.18 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.19 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.31 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.15 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.1 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.13 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.5 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.25 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.14 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.20 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.2 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.12 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.6 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.8 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.28 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.17 > >Brick gprfs018-10ge:/mnt/brick0/brick >Number of entries: 17 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.29 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.18 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.19 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.31 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.15 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.1 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.13 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.5 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.25 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.14 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.20 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.2 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.12 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.6 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.8 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.28 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.17 > >Brick gprfs019-10ge:/mnt/brick0/brick >Number of entries: 14 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.23 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.10 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.11 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.3 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.27 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.7 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.21 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.9 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.22 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.26 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.30 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.16 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.0 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.4 > >Brick gprfs020-10ge:/mnt/brick0/brick >Number of 
entries: 14 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.23 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.10 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.11 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.3 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.27 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.7 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.21 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.9 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.22 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.26 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.30 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.16 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.0 >/iozone.d/13-10-24-18-22-25/iozone.DUMMY.4 > >--- Additional comment from Ben England on 2013-11-08 18:51:57 EST --- > >If this good result holds up, can we get Corbett changed to make nfs.outstanding-rpc-limit default to 8? > > >So I decided to try setting just this one tunable, and retest with writing 256 GB/server of NFS data with iozone, that's 1 TB total. It made it through 3 write passes with 4 servers, 8 clients, and 1, 2 and 4 threads/client. Self healing is either absent or greatly reduced. 4-server test is more stressful than 2-server test because there is a lot more replication traffic. Basically it's getting at least 150-200 MB/s/server of write throughput, and 500 MB/s/server on reads. While the write performance isn't near hardware speed, I think we know how to improve it by adjusting RPC size and outstanding-rpc-limit parameters. But stability comes first. > >I see this message continually logged to /var/log/glusterfs/nfs.log during writes, plz fix or you will fill up system disk quickly. > >[2013-11-08 23:50:38.155138] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates >[2013-11-08 23:50:38.155587] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates >[2013-11-08 23:50:38.155726] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates > >why limit=8? for 8 clients this is 64 writes in flight. The RHS bricks can accept 64 NFS RPCs at a time from each client for total of 512 in-flight writes of 64 KB each = 32 MB of data. On brick side, with replication I'm dividing by 2, so we have 32 x 2 = 64 RPCs of 128 KB each = 8 MB of data, which is less than NFS can push closer to it. Why is this important? Gluster and NFS are competing for the NIC, and if NFS pushes too hard it can interfere with Gluster getting data replicated. This is unlike kernel NFS server writing to an XFS filesystem, for example. Before I set this limit, some files were getting written when others weren't, since some files have 1 local replica my hypothesis is that Gluster files with a local replica on the NFS server were getting preferential treatment. When I set this limit, it appeared that the files all started growing at more or less the same rate. Need to get more data on this but at least it's a theory. Note that as we increase volume brick count, the probability of files having local replicas on the NFS server drops to zero. > >If this tuning works, I could try increasing server.outstanding-rpc-count and nfs.outstanding-rpc-count by same factor to improve write perf. > >BTW, I also filed a bz 1028631 on glusterfs NFS server being bottlenecked on reads by a single thread. This prevents us from utilizing multiple NICs to overcome this bottleneck (Would be nice to have nfs.write-size parameter that is upstream also). 
If it was fixed, we could do these things to speed up NFS: > >http://www.gluster.org/community/documentation/index.php/Network_Configuration_Techniques > ># ssh gprfs020 gluster v i > >Volume Name: perftest >Type: Distributed-Replicate >Volume ID: 1b5f085a-ded1-4902-bc4f-781404876d21 >Status: Started >Number of Bricks: 2 x 2 = 4 >Transport-type: tcp >Bricks: >Brick1: gprfs017-10ge:/mnt/brick0/brick >Brick2: gprfs018-10ge:/mnt/brick0/brick >Brick3: gprfs019-10ge:/mnt/brick0/brick >Brick4: gprfs020-10ge:/mnt/brick0/brick >Options Reconfigured: >diagnostics.count-fop-hits: on >diagnostics.latency-measurement: on >nfs.outstanding-rpc-limit: 8 > >immediately after 1-TB write completed: > >[root@ben-test-driver2-10ge gluster_test]# ssh gprfs020 gluster v heal perftest info >Gathering list of entries to be healed on volume perftest has been successful > >Brick gprfs017-10ge:/mnt/brick0/brick >Number of entries: 0 > >Brick gprfs018-10ge:/mnt/brick0/brick >Number of entries: 0 > >Brick gprfs019-10ge:/mnt/brick0/brick >Number of entries: 0 > >Brick gprfs020-10ge:/mnt/brick0/brick >Number of entries: 0 > >--- Additional comment from Ben England on 2013-11-09 09:13:37 EST --- > >All tests finished with config described in .-1, results are similar to first pass, 4 passes reading/writing 1 TB of data completed with 1, 2 and 4 threads/client. logs are in > >http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/ > >13-11-08-15-32-40/ >glnfs-throttle-patch-nfsrpc8.csv > >--- Additional comment from Ben England on 2013-12-03 18:30:18 EST --- > >I changed the status back to assigned from modified, because the memory leak problem is still not fixed and is a show-stopper. > >if I go past 1 TB = 256 GB/server, I'm still seeing a memory leak, am going to try to document this better, but bottom line is even with flow control fix above, the memory leak eventually causes glusterfs to swap and hang. It's using 15 GB of RAM out of 48 before I start running into problems. This is not acceptable -- I could see it using as much as 1-2 GB (although even this is unnecessary). The result: clients hang and you can't log into them. Example from "top" output: > >mem: 49544696k total, 48732460k used, 812236k free, 14476k buffers >Swap: 24772600k total, 0k used, 24772600k free, 42414060k cached >23861 root 20 0 15.3g 4.2g 2960 S 0.0 8.9 6:47.72 glusterfs > >Even if I just create and delete sets of 4-GB files, the memory leak occurs. There is NO POSSIBLE REASON for glusterfs to hang onto the memory if the files are deleted, but it does. workload to reproduce is: > >while [ 1 ] ; do rm -fv /mnt/glnfs/iozone.d/13-12-03-15-04-05/iozone.DUMMY.* ; sleep 2 ; iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 4g -t 16 ; done > >Worse yet, if I do a hard DRAC reset on all clients, when they come back up glusterfs NFS server does not give back the memory that it used. I have to stop and start the volume. Even then, the clients do not recover and I have to reset them with the DRAC. > >However, if I use direct I/O like this: > >iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 16384k -I -s 1g -t 16 > >I get terrible throughput but no significant memory leak that I can detect. > >Sorry but you have to run valgrind and figure out where the memory is going if you want NFS translator to become stable. 
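>
>As one rough sketch of how that could be done (the volfile path and options below are assumptions about this glusterfs 3.4 setup, not something verified in this thread): kill the glusterd-spawned gNFS process, re-run it in the foreground under valgrind, repeat the write workload, and read the leak summary from the log after stopping it, e.g.
>
>valgrind --leak-check=full --log-file=/var/tmp/gnfs-valgrind.log glusterfs -N -f /var/lib/glusterd/nfs/nfs-server.vol
>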
> >I'll attach data showing how the memory leak progresses; it's very steady and reproducible. Try it; you can borrow the BAGL NFS config to do this if you need to. > >--- Additional comment from santosh pradhan on 2013-12-06 06:13:35 EST --- > > >Could you try the same test case with DRC turned off (gluster v set <volname> nfs.drc off)? Just to verify whether the DRC is contributing to the memory footprint of the NFS server. > >Because we are throttling the in-flight RPC count, that shouldn't cause the memory leak. > >Also, which build did you use for your testing? > >Thanks, >Santosh > >--- Additional comment from Ben England on 2013-12-08 12:05:06 EST --- > >Santosh, I'll do as you suggest. The build was glusterfs-3.4.0.43.1u2rhs-1.el6rhs.x86_64. > >--- Additional comment from Ben Turner on 2013-12-11 20:15:11 EST --- > >I am pretty sure I am hitting this in fs sanity, as I recently switched iozone to run in distributed mode. iozone slows down and in the logs I see: > >[2013-12-11 23:51:38.934671] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates >[2013-12-11 23:51:38.988245] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates >[2013-12-11 23:51:39.041640] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates >[2013-12-11 23:51:39.097689] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates >[2013-12-11 23:51:39.151618] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates > >I will attempt to repro with: > >gluster v set <volname> nfs.drc off > >--- Additional comment from Ben England on 2013-12-13 11:35:08 EST --- > >With nfs.drc off and nfs.outstanding-rpc-limit 8, the leak appears to go away! I'll continue to run > >iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 4g -t 64 > >over and over again, but the glusterfs process stays tiny. However, when I turn nfs.drc on, the glusterfs process immediately blows up. The memory leak definitely depends on the value of this parameter. > >So I guess nfs.drc on is the culprit!
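>
>Pulling the last few comments together, the combination reported as stable on this test bed comes down to two volume settings (perftest is the volume name used here; the limit of 8 is simply the value that was tested, not a tuned recommendation):
>
># on any server in the trusted pool
>gluster volume set perftest nfs.drc off
>gluster volume set perftest nfs.outstanding-rpc-limit 8
># the options show up under "Options Reconfigured" in:
>gluster volume info perftest
>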
> >--- Additional comment from Anand Avati on 2014-01-14 00:56:50 EST --- > >REVIEW: http://review.gluster.org/6696 (rpc: Limit the default outstanding RPC limit to 16) posted (#1) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2014-01-14 12:00:27 EST --- > >REVIEW: http://review.gluster.org/6696 (rpc: Limit the default outstanding RPC limit to 16) posted (#2) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2014-01-14 12:30:18 EST --- > >REVIEW: http://review.gluster.org/6696 (rpc: Limit the default outstanding RPC limit to 16) posted (#3) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2014-01-14 12:52:13 EST --- > >REVIEW: http://review.gluster.org/6696 (gNFS: Set default outstanding RPC limit to 16) posted (#4) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2014-01-15 03:48:24 EST --- > >REVIEW: http://review.gluster.org/6696 (gNFS: Set default outstanding RPC limit to 16) posted (#5) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2014-01-15 05:08:19 EST --- > >REVIEW: http://review.gluster.org/6696 (gNFS: Set default outstanding RPC limit to 16) posted (#6) for review on master by Santosh Pradhan (spradhan@redhat.com) > >--- Additional comment from Anand Avati on 2014-01-15 16:47:40 EST --- > >COMMIT: http://review.gluster.org/6696 committed in master by Anand Avati (avati@redhat.com) >------ >commit 260e817b8a652ddb705808098da1e07e3660f4c7 >Author: Santosh Kumar Pradhan <spradhan@redhat.com> >Date: Tue Jan 14 11:21:44 2014 +0530 > > gNFS: Set default outstanding RPC limit to 16 > > With 64, NFS server hangs with large I/O load (~ 64 threads writing > to NFS server). The test results from Ben England (Performance expert) > suggest to set it as 16 instead of 64. > > Change-Id: I418ff5ba0a3e9fdb14f395b8736438ee1bbd95f4 > BUG: 1008301 > Signed-off-by: Santosh Kumar Pradhan <spradhan@redhat.com> > Reviewed-on: http://review.gluster.org/6696 > Tested-by: Gluster Build System <jenkins@build.gluster.com> > Reviewed-by: ben england <bengland@redhat.com>