Bug 1008301

Summary: large NFS writes to Gluster slow down then stop

Product: [Community] GlusterFS
Component: nfs
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: mainline
Hardware: x86_64
OS: Linux
Reporter: santosh pradhan <spradhan>
Assignee: santosh pradhan <spradhan>
CC: ansubram, bengland, bturner, gluster-bugs, jclift, mpillai, nsathyan, perfbz, pneedle, psriniva, rhs-bugs, saujain, shaines, surs, vagarwal, vbellur
Fixed In Version: glusterfs-3.5.0
Doc Type: Bug Fix
Type: Bug
Clone Of: 977492
Last Closed: 2014-04-17 11:48:00 UTC
Bug Depends On: 977492
Bug Blocks: 1073441, 1073442
REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#2) for review on master by Santosh Pradhan (spradhan)

REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#3) for review on master by Santosh Pradhan (spradhan)

I'll try to apply the patch to the RHS 2.1 source RPM to minimize changes and see how that goes.

At Sayan's request, here is the tuning I used to work around this problem on the NFS clients. In /etc/sysctl.conf:

vm.dirty_background_bytes=32768000
vm.dirty_bytes=65536000

This replaces the vm.dirty_ratio and vm.dirty_background_ratio settings that control how much write data the NFS client can buffer. It will significantly reduce throughput, by as much as 50%, but Gluster can sustain throughput this way. I'm not saying this fixes the problem, only that it is a workaround. One problem with this workaround is that it affects ALL filesystems, not just NFS.

Need a build with this patch applied to RHS 2.1 so the perf team can test.

REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#4) for review on master by Santosh Pradhan (spradhan)

Downstream build with minimal code changes available @ https://brewweb.devel.redhat.com/taskinfo?taskID=6366874
It does not need any setting in the volume.
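The workaround above can also be applied at runtime before making it persistent in /etc/sysctl.conf. A minimal sketch, assuming root on the NFS client and the standard Linux vm sysctl names used in the comment above:

```shell
# Cap the page cache's dirty-data thresholds in bytes rather than as a
# percentage of RAM, so the NFS client flushes early instead of building
# up a huge backlog of buffered writes. Note: affects ALL filesystems.
sysctl -w vm.dirty_background_bytes=32768000   # ~31 MB: background writeback starts
sysctl -w vm.dirty_bytes=65536000              # ~62 MB: writers block until flushed
# Setting the *_bytes knobs automatically zeroes the corresponding
# *_ratio knobs; this shows the ratios are no longer in effect:
sysctl vm.dirty_ratio vm.dirty_background_ratio
```

To persist across reboots, the two vm.dirty_*_bytes lines go into /etc/sysctl.conf as shown in the comment above.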
REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#5) for review on master by Santosh Pradhan (spradhan)

REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#6) for review on master by Santosh Pradhan (spradhan)

REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#7) for review on master by Santosh Pradhan (spradhan)

REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#8) for review on master by Santosh Pradhan (spradhan)

I tested with minimal changes (from the patch posted for review) with downstream; the iozone load is passing:

Volume info:
===========
Volume Name: perftest
Type: Distributed-Replicate
Volume ID: c052bfd8-fd21-4140-8059-f3d2c5993b76
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Brick3: gprfs019-10ge:/mnt/brick0/brick
Brick4: gprfs020-10ge:/mnt/brick0/brick

How the volume is mounted with 8 clients:
=========================================
All mount points on gprfs017.sbu.lab.eng.bos.redhat.com:
172.17.10.89:/perftest
172.17.10.94:/perftest
172.17.70.252:/perftest
All mount points on gprfs018.sbu.lab.eng.bos.redhat.com:
172.17.10.90:/perftest
172.17.10.95:/perftest
All mount points on gprfs019.sbu.lab.eng.bos.redhat.com:
172.17.10.91:/perftest
172.17.10.96:/perftest
All mount points on gprfs020.sbu.lab.eng.bos.redhat.com:
172.17.10.77:/perftest
172.17.10.92:/perftest

IOZONE workload:
================
Iozone: Performance Test of File I/O
Version $Revision: 3.392 $
Compiled for 64 bit mode.
Build: linux-AMD64
Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
              Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
              Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
              Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
              Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
              Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
              Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
              Ben England.

Run began: Mon Oct 7 05:46:00 2013

Network distribution mode enabled.
Hostname = 172.17.70.252
Setting no_unlink
Include close in write timing
Include fsync in write timing
No retest option selected
Record Size 128 KB
File size set to 67108864 KB
Command line used: /usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.70.252 -C -w -c -e -i 0 -+n -r 128 -s 64g -t 8
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 8 processes
Each process writes a 67108864 Kbyte file in 128 Kbyte records

Test running:
Children see throughput for 8 initial writers = 984939.92 KB/sec
Min throughput per process = 119135.94 KB/sec
Max throughput per process = 128661.16 KB/sec
Avg throughput per process = 123117.49 KB/sec
Min xfer = 59963392.00 KB
Child[0] xfer count = 59963392.00 KB, Throughput = 119135.94 KB/sec
Child[1] xfer count = 64541440.00 KB, Throughput = 128661.16 KB/sec
Child[2] xfer count = 67108864.00 KB, Throughput = 122749.62 KB/sec
Child[3] xfer count = 60181632.00 KB, Throughput = 123384.07 KB/sec
Child[4] xfer count = 60026624.00 KB, Throughput = 119214.86 KB/sec
Child[5] xfer count = 60088448.00 KB, Throughput = 126651.59 KB/sec
Child[6] xfer count = 66225664.00 KB, Throughput = 121826.70 KB/sec
Child[7] xfer count = 60401152.00 KB, Throughput = 123315.98 KB/sec

iozone test complete.

Some more testing results (with fix of BZ 977492 and 1016608)
============================================================
1.
1x2 vol, 4 clients (2 client mounts per server)
   gluster v set perftest nfs.write-size 1048576
   iozone I/O workload (write): -s 64g -t 4
   - Worked fine.

2. 1x2 vol, 4 clients (2 client mounts per server)
   gluster v reset nfs.write-size
   iozone I/O workload (write): -s 16g -t 4
   - Worked fine.

REVIEW: http://review.gluster.org/6114 (rpcsvc: implement per-client RPC throttling) posted (#1) for review on master by Anand Avati (avati)

REVIEW: http://review.gluster.org/6114 (rpcsvc: implement per-client RPC throttling) posted (#2) for review on master by Anand Avati (avati)

good news bad news...

At least 3 of my iozone write tests now complete successfully, where before none did.

The memory consumption goes away if I disable the write-behind translator, and throughput is, if anything, slightly better than with write-behind!

One of the SAs at the RHS training this week said there was an NFS timeout parameter that eliminated the write problem; I'm trying to track down the information on this.

See run log and stats at http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/13-10-24-12-41-19/

bad news: with 4 servers and 8 clients I got the 7th client to hang, and saw this stack trace in /var/log/messages on that client multiple times today while running with RPMs built by Santosh containing the fix. At the same time, I see the rpc.statd and sm-notify daemons restarting on the NFS server gprfs019 where gprfc077 is mounting, and I see some sort of gluster-NFS restart happening at this time also:

QUESTION: is this a different bz? Am trying to repro with a smaller config for easier debug.

[root@gprfs019 ~]# tail /var/log/messages
...
Oct 24 17:22:26 gprfs019 rpc.statd[15604]: Version 1.2.3 starting
Oct 24 17:22:26 gprfs019 sm-notify[15605]: Version 1.2.3 starting

[root@gprfs019 ~]# tail -f /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
...
[2013-10-24 17:22:26.083783] I [glusterd-utils.c:3847:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
[2013-10-24 17:22:26.083962] I [glusterd-utils.c:3852:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully

Again, some of the tests complete (and fast), but then this happens -- I took gprfc076 out of there, so it has nothing to do with a single client being in a bad state.

gprfc077: Oct 24 13:15:56 gprfc077 kernel: INFO: task bash:3969 blocked for more than 120 seconds.
gprfc077: Oct 24 13:15:56 gprfc077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gprfc077: Oct 24 13:15:56 gprfc077 kernel: bash D 000000000000000f 0 3969 3737 0x00000080
gprfc077: Oct 24 13:15:56 gprfc077 kernel: ffff8807dd861ce8 0000000000000086 0000000000000000 ffff88083199f540
gprfc077: Oct 24 13:15:56 gprfc077 kernel: ffff88083199f578 0000000000000000 ffff8807dd861ca8 ffffffff81064a00
gprfc077: Oct 24 13:15:56 gprfc077 kernel: ffff88083199faf8 ffff8807dd861fd8 000000000000fb88 ffff88083199faf8
gprfc077: Oct 24 13:15:56 gprfc077 kernel: Call Trace:
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff81064a00>] ? pick_next_task_fair+0xd0/0x130
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150daca>] ? thread_return+0x1b8/0x76e
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150e7d5>] schedule_timeout+0x215/0x2e0
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff810669eb>] ? enqueue_rt_entity+0x6b/0x80
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150e453>] wait_for_common+0x123/0x180
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150e56d>] wait_for_completion+0x1d/0x20
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8118a0a0>] do_execve+0xe0/0x2c0
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gprfc077: Oct 24 13:25:56 gprfc077 kernel: sshd D 000000000000000f 0 3973 2129 0x00000080
gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff88082fef5ce8 0000000000000082 0000000000000000 0000000000000000
gprfc077: Oct 24 13:25:56 gprfc077 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff88082fc95ab8 ffff88082fef5fd8 000000000000fb88 ffff88082fc95ab8
gprfc077: Oct 24 13:25:56 gprfc077 kernel: Call Trace:
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffffa0094bf0>] ? ext4_file_open+0x0/0x130 [ext4]
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e7d5>] schedule_timeout+0x215/0x2e0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8117e514>] ? nameidata_to_filp+0x54/0x70
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81277349>] ? cpumask_next_and+0x29/0x50
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e453>] wait_for_common+0x123/0x180
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e56d>] wait_for_completion+0x1d/0x20
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8118a0a0>] do_execve+0xe0/0x2c0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: INFO: task sshd:3974 blocked for more than 120 seconds.
gprfc077: Oct 24 13:25:56 gprfc077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gprfc077: Oct 24 13:25:56 gprfc077 kernel: sshd D 000000000000000f 0 3974 2129 0x00000080
gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff8806906dfce8 0000000000000082 0000000000000000 0000000000000000
gprfc077: Oct 24 13:25:56 gprfc077 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff880831d3a5f8 ffff8806906dffd8 000000000000fb88 ffff880831d3a5f8
gprfc077: Oct 24 13:25:56 gprfc077 kernel: Call Trace:
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffffa0094bf0>] ? ext4_file_open+0x0/0x130 [ext4]
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e7d5>] schedule_timeout+0x215/0x2e0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8117e514>] ? nameidata_to_filp+0x54/0x70
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81277349>] ? cpumask_next_and+0x29/0x50
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e453>] wait_for_common+0x123/0x180
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e56d>] wait_for_completion+0x1d/0x20
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8118a0a0>] do_execve+0xe0/0x2c0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0

The last thing attempted by the test harness was this:

13-10-24-13-23-03 : running on hosts ['gprfc089', 'gprfc090', 'gprfc091', 'gprfc092', 'gprfc094', 'gprfc095', 'gprfc077', 'gprfc078'] cmd: sync

Note that sync would hit the local ext4 filesystem as well as the NFS filesystem, so perhaps there is some sort of interaction there.

Here's my volume:

Volume Name: perftest
Type: Distributed-Replicate
Volume ID: 59fac3fb-9194-40b1-8bf8-ba5fde61ce77
Status: Created
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Brick3: gprfs019-10ge:/mnt/brick0/brick
Brick4: gprfs020-10ge:/mnt/brick0/brick
Options Reconfigured:
performance.nfs.write-behind: off
nfs.drc: 0
nfs.acl: 0

Here's the NFS mount command used:

# mount -v -t nfs -o noacl,nfsvers=3,tcp %s:/perftest /mnt/glnfs

where %s is replaced with whatever server the client is connecting to; there are 2 clients/server.
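For reference, the %s substitution above can be expanded per client with a tiny helper; a sketch only (the function name is mine, and the server value is an example taken from the brick list above):

```shell
# glnfs_mount_cmd: expand the %s template from the comment above into a
# concrete mount command for one NFS server. This only constructs the
# command string; it does not mount anything.
glnfs_mount_cmd() {
    printf 'mount -v -t nfs -o noacl,nfsvers=3,tcp %s:/perftest /mnt/glnfs\n' "$1"
}

glnfs_mount_cmd gprfs018-10ge
```

A test harness would run the printed command (as root) on each client, passing that client's assigned server.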
here's the volfile:

gprfs018: +------------------------------------------------------------------------------+
gprfs018: 1: volume perftest-client-0
gprfs018: 2: type protocol/client
gprfs018: 8: end-volume
gprfs018: 9:
gprfs018: 10: volume perftest-client-1
gprfs018: 11: type protocol/client
gprfs018: 17: end-volume
gprfs018: 18:
gprfs018: 19: volume perftest-replicate-0
gprfs018: 20: type cluster/replicate
gprfs018: 21: subvolumes perftest-client-0 perftest-client-1
gprfs018: 22: end-volume
gprfs018: 23:
gprfs018: 24: volume perftest-client-2
gprfs018: 25: type protocol/client
gprfs018: 31: end-volume
gprfs018: 32:
gprfs018: 33: volume perftest-client-3
gprfs018: 34: type protocol/client
gprfs018: 40: end-volume
gprfs018: 41:
gprfs018: 42: volume perftest-replicate-1
gprfs018: 43: type cluster/replicate
gprfs018: 44: subvolumes perftest-client-2 perftest-client-3
gprfs018: 45: end-volume
gprfs018: 46:
gprfs018: 47: volume perftest-dht
gprfs018: 48: type cluster/distribute
gprfs018: 49: subvolumes perftest-replicate-0 perftest-replicate-1
gprfs018: 50: end-volume
gprfs018: 51:
gprfs018: 52: volume perftest
gprfs018: 53: type debug/io-stats
gprfs018: 56: subvolumes perftest-dht
gprfs018: 57: end-volume
gprfs018: 58:
gprfs018: 59: volume nfs-server
gprfs018: 60: type nfs/server
gprfs018: 74: subvolumes perftest
gprfs018: 75: end-volume
gprfs018: 76:

These are the RPMs we're using; I believe Santosh built them:

[root@gprfs019 ~]# rpm -q --info glusterfs
Name        : glusterfs
Relocations : (not relocatable)
Version     : 3.4rhs
Vendor      : Red Hat, Inc.
Release     : 1.el6rhs
Build Date  : Tue 22 Oct 2013 09:08:47 AM UTC
Install Date: Tue 22 Oct 2013 09:20:12 AM UTC
Build Host  : x86-028.build.eng.bos.redhat.com
Group       : System Environment/Base
Source RPM  : glusterfs-3.4rhs-1.el6rhs.src.rpm
Size        : 7316109
License     : GPLv2 or LGPLv3+
Signature   : (none)
Packager    : gluster-users
URL         : http://www.gluster.org/docs/index.php/GlusterFS
Summary     : Cluster File System

ideas on why this happens?
[root@gprfs018 ~]# gluster volume set perftest nfs.outstanding-rpc-limit 8
volume set: failed: Staging failed on 172.17.40.17. Error: Required op_version (3) is not supported

[root@gprfs018 ~]# gluster volume set perftest server.outstanding-rpc-limit 256
volume set: failed: Staging failed on 172.17.40.17. Error: Required op_version (3) is not supported

Maybe NFS doesn't handle the case where Gluster is backpressuring it. With 1 NFS client/server, it never hung. Manoj was able to make the old implementation hang with 4 clients/server.

So I used the write-behind translator to force NFS not to have as much maximum outstanding RPC data as the Gluster layer under it. Since I can't currently set the throttle parameters (see last bz comment), I used the NFS mount option wsize=32768. With the write-behind translator, this will result in about 4 NFS RPCs per Gluster RPC, right? So even with 2 NFS clients/server, Gluster should not run out of outstanding RPCs this way. It worked for 16-GB files and 64-GB files with 2 threads/client, 2 clients/server (1/2 TB of write data). It did NOT work with 4 clients mounting 1 server, gprfs018-10ge. Maybe I needed wsize=16384 on the client mounts for that. Do we need Gluster bricks to be able to successfully backpressure NFS?

Here's the test that worked with 2 clients/server as shown above.
[root@ben-test-driver2-10ge gluster_test]# iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0 -+n -r 64k -s 64g -C -t 4

Run began: Thu Oct 24 22:14:26 2013

Test running:
Children see throughput for 4 initial writers = 446187.80 KB/sec
Min throughput per process = 110588.89 KB/sec
Max throughput per process = 112947.85 KB/sec
Avg throughput per process = 111546.95 KB/sec
Min xfer = 63893888.00 KB
Child[0] xfer count = 67108864.00 KB, Throughput = 112947.85 KB/sec
Child[1] xfer count = 64317568.00 KB, Throughput = 110773.44 KB/sec
Child[2] xfer count = 64369536.00 KB, Throughput = 111877.62 KB/sec
Child[3] xfer count = 63893888.00 KB, Throughput = 110588.89 KB/sec

gprfs018# gluster v i
Volume Name: perftest
Type: Replicate
Volume ID: f73c6470-54f8-4a6b-b427-1222774a318c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Options Reconfigured:
nfs.mem-factor: 8
performance.nfs.write-behind: on
nfs.acl: 0

(In reply to Ben England from comment #17)
> ideas on why this happens?
>
> [root@gprfs018 ~]# gluster volume set perftest nfs.outstanding-rpc-limit 8
> volume set: failed: Staging failed on 172.17.40.17. Error: Required
> op_version (3) is not supported
>
> [root@gprfs018 ~]# gluster volume set perftest server.outstanding-rpc-limit
> 256
> volume set: failed: Staging failed on 172.17.40.17. Error: Required
> op_version (3) is not supported

I had sent an email on 22nd Oct to install the binaries from the following link:
https://brewweb.devel.redhat.com/taskinfo?taskID=6450281

I don't think the current binary is from the above build link. I have downloaded the correct binaries under gprfs0{17,18,19,20}:/root/rpms

(In reply to Ben England from comment #16)
> good news bad news...
>
> At least 3 of my iozone write tests now complete successfully where before
> none did.
>
> the memory consumption goes away if I disable the write-behind translator,
> and throughput is if anything slightly better than with writebehind!
>
> One of the SAs at the RHS training this week said that there was a NFS
> timeout parameter that eliminated the write problem, I'm trying to track
> down the information on this.

[Santosh] When the NFS client COMMITs I/O, it expects the reply within 120 seconds, i.e. 2 minutes. If it does not get one, it dumps the stack trace in syslog. I don't know if there is any tunable to increase the timeout value. If the amount of I/O in the COMMIT is bigger, then GlusterNFS throttling can delay it for more than 120 secs, and subsequently the stack trace appears on the client :(

>
> See run log and stats at
> http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-
> cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/13-10-24-
> 12-41-19/
>
> bad news: with 4 servers and 8 clients I got the 7th client to hang, and saw
> this stack trace in /var/log/messages on that client multiple times today
> while running with RPMs built by Santosh containing the fix. At the same
> time, I see rpc.statd and sm-notify daemons restarting on the NFS server
> gprfs019 where gprfc077 is mounting, and I see some sort of gluster-NFS
> restart happening at this time also:
>
> QUESTION: is this a different bz? Am trying to repro with smaller config
> for easier debug.

[Santosh] Yes, whenever we change any option in the NFS server (basically anything which changes the NFS volfile and NFS graph), that restarts the NFS server. Gluster NFS restarts the NLM service daemons as well. This is tracked under BZ 929036.

COMMIT: http://review.gluster.org/6114 committed in master by Vijay Bellur (vbellur)
------
commit a4056292528db49a666422c7f8e0c032441cc83f
Author: Anand Avati <avati>
Date: Sun Oct 20 08:45:18 2013 -0700

rpcsvc: implement per-client RPC throttling

Implement a limit on the total number of outstanding RPC requests from a given client.
Once the limit is reached, the client socket is removed from POLL-IN event polling.

Change-Id: I8071b8c89b78d02e830e6af5a540308199d6bdcd
BUG: 1008301
Signed-off-by: Anand Avati <avati>
Reviewed-on: http://review.gluster.org/6114
Reviewed-by: Santosh Pradhan <spradhan>
Reviewed-by: Rajesh Joseph <rjoseph>
Reviewed-by: Harshavardhana <harsha>
Reviewed-by: Vijay Bellur <vbellur>
Tested-by: Gluster Build System <jenkins.com>

Next step - get a good protocol trace of what is happening at the NFS server to see why congestion control doesn't always work.

re cmt 20: it's hard to see why an NFS COMMIT would take > 120 sec given the RPC throttling change! But yes, it makes sense that the stack trace results from the NFS COMMIT not getting done. I suspect rather that Gluster is not acking the COMMIT request for some reason; a protocol trace might show this.

Avati and I discussed NFS; we have consensus that we don't understand why the existing code isn't working and need a better analysis of what's going wrong (to Manoj's point, why isn't NFS congestion control working?). Is it a bug in the NFS client, or does kNFS behave differently at a protocol level so that NFS clients don't flood the server? It's not easy to go into the NFS client and analyze a crash dump, or examine it with systemtap, etc. The easiest way to see may be to look at a protocol trace if we can capture one around the time of the hang.

I'm running the glusterfs-*-3.4rhs-1.el6rhs.x86_64 RPMs (built by Santosh?) that let me set the outstanding-rpc-limit parameters, thanks for that. The good news: it does complete writes where before it did not.

Problem 1) On writes, it still slows to a crawl, but it takes longer to do it. For example, with the command to write 1 TB of data in 32-GB files:

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -C -i 0 -+n -r 64k -t 32 -s 32g

after about 1/4 TB is written, glusterfs is consuming more CPU than glusterfsd, and this ratio keeps increasing as it slows to a crawl.
It's still leaking memory because of NFS writes; the virtual address size of the glusterfs process is still 13 GB, 1/5 of RAM, even AFTER writes have completed, fsync has been issued, and all files have been closed.

Tasks: 715 total, 2 running, 713 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49544696k total, 49211980k used, 332716k free, 88720k buffers
Swap: 24772600k total, 44616k used, 24727984k free, 38663592k cached

  PID USER PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
15213 root 20  0 12.8g 8.3g 2984 S  0.0 17.5 4:29.97 glusterfs
 2902 vdsm 15 -5 1607m 9340 2328 S  0.3  0.0 9:06.48 vdsm
15096 root 20  0 1536m  25m 2780 S  0.0  0.1 8:06.92 glusterfsd

Swap space is still being used. If I do reads, no virtual memory is consumed.

Workaround: set an NFS-related volume parameter; this triggers a restart of the NFS server and frees up its memory. Another workaround: the rhs-virtualization tuned profile will set vm.swappiness to 10, reducing swapping, but eventually it will happen.

Problem 2) self-heal: this really screws up performance, because now glustershd is reading all of these files as well as writing them. This wrecks sequential I/O performance.
[root@gprfs020 ~]# gluster volume heal perftest info
Gathering list of entries to be healed on volume perftest has been successful

Brick gprfs017-10ge:/mnt/brick0/brick
Number of entries: 17
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.29
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.18
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.19
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.31
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.15
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.1
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.13
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.5
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.25
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.14
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.20
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.2
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.12
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.6
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.8
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.28
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.17

Brick gprfs018-10ge:/mnt/brick0/brick
Number of entries: 17
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.29
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.18
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.19
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.31
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.15
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.1
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.13
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.5
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.25
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.14
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.20
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.2
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.12
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.6
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.8
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.28
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.17

Brick gprfs019-10ge:/mnt/brick0/brick
Number of entries: 14
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.23
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.10
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.11
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.3
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.27
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.7
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.21
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.9
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.22
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.26
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.30
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.16
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.0
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.4

Brick gprfs020-10ge:/mnt/brick0/brick
Number of entries: 14
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.23
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.10
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.11
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.3
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.27
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.7
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.21
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.9
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.22
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.26
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.30
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.16
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.0
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.4

If this good result holds up, can we get Corbett changed to make nfs.outstanding-rpc-limit default to 8?

So I decided to try setting just this one tunable and retest, writing 256 GB/server of NFS data with iozone -- that's 1 TB total. It made it through 3 write passes with 4 servers, 8 clients, and 1, 2 and 4 threads/client. Self-healing is either absent or greatly reduced. The 4-server test is more stressful than the 2-server test because there is a lot more replication traffic. Basically it's getting at least 150-200 MB/s/server of write throughput, and 500 MB/s/server on reads. While the write performance isn't near hardware speed, I think we know how to improve it by adjusting the RPC size and outstanding-rpc-limit parameters. But stability comes first.

I see this message continually logged to /var/log/glusterfs/nfs.log during writes; please fix, or it will fill up the system disk quickly.
[2013-11-08 23:50:38.155138] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-11-08 23:50:38.155587] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-11-08 23:50:38.155726] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates

Why limit=8? For 8 clients this is 64 writes in flight. The RHS bricks can accept 64 NFS RPCs at a time from each client, for a total of 512 in-flight writes of 64 KB each = 32 MB of data. On the brick side, with replication, I'm dividing by 2, so we have 32 x 2 = 64 RPCs of 128 KB each = 8 MB of data, which is less than what NFS can push toward it.

Why is this important? Gluster and NFS are competing for the NIC, and if NFS pushes too hard it can interfere with Gluster getting data replicated. This is unlike a kernel NFS server writing to an XFS filesystem, for example. Before I set this limit, some files were getting written while others weren't; since some files have 1 local replica, my hypothesis is that Gluster files with a local replica on the NFS server were getting preferential treatment. When I set this limit, it appeared that the files all started growing at more or less the same rate. Need to get more data on this, but at least it's a theory. Note that as we increase the volume brick count, the probability of files having local replicas on the NFS server drops to zero.

If this tuning works, I could try increasing server.outstanding-rpc-limit and nfs.outstanding-rpc-limit by the same factor to improve write perf. BTW, I also filed bz 1028631 on the glusterfs NFS server being bottlenecked on reads by a single thread; this prevents us from utilizing multiple NICs to overcome the bottleneck. (Would be nice to have the nfs.write-size parameter upstream also.)
If it was fixed, we could do these things to speed up NFS: http://www.gluster.org/community/documentation/index.php/Network_Configuration_Techniques

# ssh gprfs020 gluster v i
Volume Name: perftest
Type: Distributed-Replicate
Volume ID: 1b5f085a-ded1-4902-bc4f-781404876d21
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Brick3: gprfs019-10ge:/mnt/brick0/brick
Brick4: gprfs020-10ge:/mnt/brick0/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.outstanding-rpc-limit: 8

Immediately after the 1-TB write completed:

[root@ben-test-driver2-10ge gluster_test]# ssh gprfs020 gluster v heal perftest info
Gathering list of entries to be healed on volume perftest has been successful

Brick gprfs017-10ge:/mnt/brick0/brick
Number of entries: 0
Brick gprfs018-10ge:/mnt/brick0/brick
Number of entries: 0
Brick gprfs019-10ge:/mnt/brick0/brick
Number of entries: 0
Brick gprfs020-10ge:/mnt/brick0/brick
Number of entries: 0

All tests finished with the config described in .-1; results are similar to the first pass, with 4 passes reading/writing 1 TB of data completed at 1, 2 and 4 threads/client. Logs are in http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/13-11-08-15-32-40/ glnfs-throttle-patch-nfsrpc8.csv

I changed the status back to assigned from modified, because the memory leak problem is still not fixed and is a show-stopper. If I go past 1 TB = 256 GB/server, I'm still seeing a memory leak; I am going to try to document this better, but the bottom line is that even with the flow control fix above, the memory leak eventually causes glusterfs to swap and hang. It's using 15 GB of RAM out of 48 before I start running into problems. This is not acceptable -- I could see it using as much as 1-2 GB (although even this is unnecessary).
The result: clients hang and you can't log into them. Example from "top" output:

Mem:  49544696k total, 48732460k used,   812236k free,    14476k buffers
Swap: 24772600k total,        0k used, 24772600k free, 42414060k cached

23861 root      20   0 15.3g 4.2g 2960 S  0.0  8.9   6:47.72 glusterfs

Even if I just create and delete sets of 4-GB files, the memory leak occurs. There is NO POSSIBLE REASON for glusterfs to hang onto the memory if the files are deleted, but it does. The workload to reproduce it is:

while [ 1 ] ; do
  rm -fv /mnt/glnfs/iozone.d/13-12-03-15-04-05/iozone.DUMMY.*
  sleep 2
  iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 4g -t 16
done

Worse yet, if I do a hard DRAC reset on all clients, when they come back up the glusterfs NFS server does not give back the memory it used; I have to stop and start the volume. Even then, the clients do not recover and I have to reset them with the DRAC. However, if I use direct I/O like this:

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 16384k -I -s 1g -t 16

I get terrible throughput but no significant memory leak that I can detect. Sorry, but you have to run valgrind and figure out where the memory is going if you want the NFS translator to become stable. I'll attach data showing how the memory leak progresses; it's very steady and reproducible. Try it -- you can borrow the BAGL NFS config to do this if you need to.

Could you just try the same test case with DRC off (gluster v set <volname> nfs.drc off)? Just to verify whether DRC is contributing to the memory footprint of the NFS server, because we are throttling the in-flight RPC count, so that shouldn't cause the memory leak. Also, which build did you use for your testing? Thanks, Santosh

Santosh, I'll do as you suggest. The build was glusterfs-3.4.0.43.1u2rhs-1.el6rhs.x86_64.

I am pretty sure I am hitting this in fs sanity, as I recently switched iozone to run distributed.
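Santosh's isolation test can be written out as two volume-option changes; a sketch, using the volume name perftest from the volume info earlier in this bug (substitute your own volume name):

```shell
# Turn the duplicate-request cache off while keeping the RPC throttle in place,
# then re-run the iozone workload and watch the glusterfs NFS process's memory.
gluster volume set perftest nfs.drc off
gluster volume set perftest nfs.outstanding-rpc-limit 8
```

If the glusterfs process stays small with DRC off and grows with it on, the leak is in the DRC path rather than the RPC throttling.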
Iozone slows down, and in the logs I see:

[2013-12-11 23:51:38.934671] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-12-11 23:51:38.988245] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-12-11 23:51:39.041640] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-12-11 23:51:39.097689] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-12-11 23:51:39.151618] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates

I will attempt to repro with:

gluster v set <volname> nfs.drc off

With nfs.drc off and nfs.outstanding-rpc-limit 8, the leak appears to go away! I'll continue to run

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 4g -t 64

over and over again, but the glusterfs process stays tiny. However, when I turn nfs.drc on, the glusterfs process immediately blows up. The memory leak definitely depends on the value of this parameter, so I guess nfs.drc on is the culprit!
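A quick way to gauge how often the DRC cache-insert failure fires is to count the error line in the NFS server log. A sketch, with two sample log lines inlined so it is self-contained; on a real server you would point grep at the glusterfs NFS log (commonly under /var/log/glusterfs) instead:

```shell
# Count 'DRC failed to detect duplicates' occurrences in a (sample) nfs log.
cat > /tmp/nfs-sample.log <<'EOF'
[2013-12-11 23:51:38.934671] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-12-11 23:51:38.988245] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
EOF
grep -c 'DRC failed to detect duplicates' /tmp/nfs-sample.log
```

A steadily climbing count during a write-heavy run correlates with the DRC memory growth observed above.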
REVIEW: http://review.gluster.org/6696 (rpc: Limit the default outstanding RPC limit to 16) posted (#1) for review on master by Santosh Pradhan (spradhan)
REVIEW: http://review.gluster.org/6696 (rpc: Limit the default outstanding RPC limit to 16) posted (#2) for review on master by Santosh Pradhan (spradhan)
REVIEW: http://review.gluster.org/6696 (rpc: Limit the default outstanding RPC limit to 16) posted (#3) for review on master by Santosh Pradhan (spradhan)
REVIEW: http://review.gluster.org/6696 (gNFS: Set default outstanding RPC limit to 16) posted (#4) for review on master by Santosh Pradhan (spradhan)
REVIEW: http://review.gluster.org/6696 (gNFS: Set default outstanding RPC limit to 16) posted (#5) for review on master by Santosh Pradhan (spradhan)
REVIEW: http://review.gluster.org/6696 (gNFS: Set default outstanding RPC limit to 16) posted (#6) for review on master by Santosh Pradhan (spradhan)

COMMIT: http://review.gluster.org/6696 committed in master by Anand Avati (avati)
------
commit 260e817b8a652ddb705808098da1e07e3660f4c7
Author: Santosh Kumar Pradhan <spradhan>
Date:   Tue Jan 14 11:21:44 2014 +0530

    gNFS: Set default outstanding RPC limit to 16

    With 64, the NFS server hangs under large I/O load (~64 threads
    writing to the NFS server). Test results from Ben England
    (performance expert) suggest setting it to 16 instead of 64.

    Change-Id: I418ff5ba0a3e9fdb14f395b8736438ee1bbd95f4
    BUG: 1008301
    Signed-off-by: Santosh Kumar Pradhan <spradhan>
    Reviewed-on: http://review.gluster.org/6696
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: ben england <bengland>

Are we able to get this into 3.5.0? (Or 3.5.1 if not 3.5.0.)

The fix is already in the release-3.5 branch.

Thanks Santosh, somehow missed that. ;)

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.5.0, please reopen this bug report.
glusterfs-3.5.0 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
Created attachment 915767 [details] Comment (This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).