Bug 1008301

Summary: large NFS writes to Gluster slow down then stop
Product: [Community] GlusterFS
Component: nfs
Reporter: santosh pradhan <spradhan>
Assignee: santosh pradhan <spradhan>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: mainline
CC: ansubram, bengland, bturner, gluster-bugs, jclift, mpillai, nsathyan, perfbz, pneedle, psriniva, rhs-bugs, saujain, shaines, surs, vagarwal, vbellur
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.5.0
Doc Type: Bug Fix
Type: Bug
Clone Of: 977492
Bug Depends On: 977492
Bug Blocks: 1073441, 1073442
Last Closed: 2014-04-17 11:48:00 UTC
Attachments: Comment (attachment 915767; see Description)

Description santosh pradhan 2013-09-16 06:09:58 UTC
Created attachment 915767 [details]
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 1 Anand Avati 2013-09-16 07:16:44 UTC
REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#2) for review on master by Santosh Pradhan (spradhan)

Comment 2 Anand Avati 2013-09-16 10:24:36 UTC
REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#3) for review on master by Santosh Pradhan (spradhan)

Comment 3 Ben England 2013-09-16 14:57:25 UTC
I'll try to apply the patch to the RHS 2.1 source RPM to minimize changes and see how that goes.

Comment 4 Ben England 2013-09-19 16:21:04 UTC
At Sayan's request here is the tuning I used to work around this problem on the NFS clients.  In /etc/sysctl.conf:

vm.dirty_background_bytes=32768000
vm.dirty_bytes=65536000

These settings replace the vm.dirty_ratio and vm.dirty_background_ratio settings, which control how much write data the NFS client can buffer. They can reduce throughput by as much as 50%, but Gluster can sustain the resulting throughput. I'm not saying this fixes the problem, only that it is a workaround.

One problem with this workaround is that it affects ALL filesystems, not just NFS.
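
A minimal sketch of applying this workaround on a client, using the values above (sysctl -p reloads /etc/sysctl.conf, so no reboot is needed):

# Append the byte-based dirty-page limits to /etc/sysctl.conf
cat >> /etc/sysctl.conf <<'EOF'
vm.dirty_background_bytes=32768000
vm.dirty_bytes=65536000
EOF

# Apply immediately and confirm the running values
sysctl -p
sysctl vm.dirty_background_bytes vm.dirty_bytes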

Comment 5 Ben England 2013-10-02 18:33:11 UTC
Need a build of this patch applied to RHS 2.1 so perf team can test.

Comment 6 Anand Avati 2013-10-03 08:58:49 UTC
REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#4) for review on master by Santosh Pradhan (spradhan)

Comment 7 santosh pradhan 2013-10-04 08:50:04 UTC
Downstream build with minimal code changes available @

https://brewweb.devel.redhat.com/taskinfo?taskID=6366874

It does not need any setting in the volume.

Comment 8 Anand Avati 2013-10-04 09:04:33 UTC
REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#5) for review on master by Santosh Pradhan (spradhan)

Comment 9 Anand Avati 2013-10-07 12:28:20 UTC
REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#6) for review on master by Santosh Pradhan (spradhan)

Comment 10 Anand Avati 2013-10-07 12:32:31 UTC
REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#7) for review on master by Santosh Pradhan (spradhan)

Comment 11 Anand Avati 2013-10-07 12:34:20 UTC
REVIEW: http://review.gluster.org/5926 (gNFS: Slow write performance due to low RPC slots) posted (#8) for review on master by Santosh Pradhan (spradhan)

Comment 12 santosh pradhan 2013-10-09 05:52:39 UTC
I tested with minimal changes (from the patch posted for review) on the downstream build; the iozone load is passing:

Volume info:
===========
Volume Name: perftest
Type: Distributed-Replicate
Volume ID: c052bfd8-fd21-4140-8059-f3d2c5993b76
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Brick3: gprfs019-10ge:/mnt/brick0/brick
Brick4: gprfs020-10ge:/mnt/brick0/brick

How the volume is mounted with 8 clients:
=========================================

All mount points on gprfs017.sbu.lab.eng.bos.redhat.com:
172.17.10.89:/perftest
172.17.10.94:/perftest
172.17.70.252:/perftest
All mount points on gprfs018.sbu.lab.eng.bos.redhat.com:
172.17.10.90:/perftest
172.17.10.95:/perftest
All mount points on gprfs019.sbu.lab.eng.bos.redhat.com:
172.17.10.91:/perftest
172.17.10.96:/perftest
All mount points on gprfs020.sbu.lab.eng.bos.redhat.com:
172.17.10.77:/perftest
172.17.10.92:/perftest

IOZONE workload:
=================

	Iozone: Performance Test of File I/O
	        Version $Revision: 3.392 $
		Compiled for 64 bit mode.
		Build: linux-AMD64 

	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
	             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
	             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
	             Ben England.

	Run began: Mon Oct  7 05:46:00 2013

	Network distribution mode enabled.
	Hostname = 172.17.70.252
	Setting no_unlink
	Include close in write timing
	Include fsync in write timing
	No retest option selected
	Record Size 128 KB
	File size set to 67108864 KB
	Command line used: /usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.70.252 -C -w -c -e -i 0 -+n -r 128 -s 64g -t 8
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Throughput test with 8 processes
	Each process writes a 67108864 Kbyte file in 128 Kbyte records

	Test running:
	Children see throughput for  8 initial writers 	=  984939.92 KB/sec
	Min throughput per process 			=  119135.94 KB/sec 
	Max throughput per process 			=  128661.16 KB/sec
	Avg throughput per process 			=  123117.49 KB/sec
	Min xfer 					= 59963392.00 KB
	Child[0] xfer count = 59963392.00 KB, Throughput =  119135.94 KB/sec
	Child[1] xfer count = 64541440.00 KB, Throughput =  128661.16 KB/sec
	Child[2] xfer count = 67108864.00 KB, Throughput =  122749.62 KB/sec
	Child[3] xfer count = 60181632.00 KB, Throughput =  123384.07 KB/sec
	Child[4] xfer count = 60026624.00 KB, Throughput =  119214.86 KB/sec
	Child[5] xfer count = 60088448.00 KB, Throughput =  126651.59 KB/sec
	Child[6] xfer count = 66225664.00 KB, Throughput =  121826.70 KB/sec
	Child[7] xfer count = 60401152.00 KB, Throughput =  123315.98 KB/sec



iozone test complete.

Comment 13 santosh pradhan 2013-10-09 11:02:11 UTC
Some more testing results (with the fixes for BZ 977492 and 1016608)
============================================================

1. 1x2 vol, 4 clients (2 client mounts per server)

gluster v set perftest nfs.write-size 1048576

iozone I/O workload (write):
-s 64g -t 4 - Worked fine.

2. 1x2 vol, 4 clients (2 client mounts per server)
gluster v reset nfs.write-size

iozone I/O workload (write):
-s 16g -t 4 - Worked fine.
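
For reference, the full CLI forms of those commands would look roughly like this (a sketch; the nfs.write-size option is only available in the downstream test build from comment 7):

gluster volume set perftest nfs.write-size 1048576    # 1 MB gNFS write size
gluster volume reset perftest nfs.write-size          # revert to the default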

Comment 14 Anand Avati 2013-10-21 03:07:44 UTC
REVIEW: http://review.gluster.org/6114 (rpcsvc: implement per-client RPC throttling) posted (#1) for review on master by Anand Avati (avati)

Comment 15 Anand Avati 2013-10-22 00:52:55 UTC
REVIEW: http://review.gluster.org/6114 (rpcsvc: implement per-client RPC throttling) posted (#2) for review on master by Anand Avati (avati)

Comment 16 Ben England 2013-10-24 17:56:01 UTC
good news bad news...

At least 3 of my iozone write tests now complete successfully where before none did.

The memory consumption goes away if I disable the write-behind translator, and throughput is, if anything, slightly better than with write-behind!

One of the SAs at the RHS training this week said that there was an NFS timeout parameter that eliminated the write problem; I'm trying to track down the information on this.

See run log and stats at http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/13-10-24-12-41-19/

bad news: with 4 servers and 8 clients I got the 7th client to hang, and saw this stack trace in /var/log/messages on that client multiple times today while running with RPMs built by Santosh containing the fix. At the same time, I see rpc.statd and sm-notify daemons restarting on the NFS server gprfs019 where gprfc077 is mounting, and I see some sort of gluster-NFS restart happening at this time also:

QUESTION: is this a different bz?  Am trying to repro with smaller config for easier debug.

[root@gprfs019 ~]# tail /var/log/messages
...
Oct 24 17:22:26 gprfs019 rpc.statd[15604]: Version 1.2.3 starting
Oct 24 17:22:26 gprfs019 sm-notify[15605]: Version 1.2.3 starting
[root@gprfs019 ~]# tail -f /var/log/glusterfs/etc-glusterfs-glusterd.vol.log 
...
[2013-10-24 17:22:26.083783] I [glusterd-utils.c:3847:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
[2013-10-24 17:22:26.083962] I [glusterd-utils.c:3852:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully


Again, some of the tests complete (and fast), but then this happens. I took gprfc076 out of there, so it has nothing to do with a single client being in a bad state.

gprfc077: Oct 24 13:15:56 gprfc077 kernel: INFO: task bash:3969 blocked for more than 120 seconds.
gprfc077: Oct 24 13:15:56 gprfc077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gprfc077: Oct 24 13:15:56 gprfc077 kernel: bash          D 000000000000000f     0  3969   3737 0x00000080
gprfc077: Oct 24 13:15:56 gprfc077 kernel: ffff8807dd861ce8 0000000000000086 0000000000000000 ffff88083199f540
gprfc077: Oct 24 13:15:56 gprfc077 kernel: ffff88083199f578 0000000000000000 ffff8807dd861ca8 ffffffff81064a00
gprfc077: Oct 24 13:15:56 gprfc077 kernel: ffff88083199faf8 ffff8807dd861fd8 000000000000fb88 ffff88083199faf8
gprfc077: Oct 24 13:15:56 gprfc077 kernel: Call Trace:
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff81064a00>] ? pick_next_task_fair+0xd0/0x130
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150daca>] ? thread_return+0x1b8/0x76e
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150e7d5>] schedule_timeout+0x215/0x2e0
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff810669eb>] ? enqueue_rt_entity+0x6b/0x80
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150e453>] wait_for_common+0x123/0x180
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8150e56d>] wait_for_completion+0x1d/0x20
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8118a0a0>] do_execve+0xe0/0x2c0
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80
gprfc077: Oct 24 13:15:56 gprfc077 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0

gprfc077: Oct 24 13:25:56 gprfc077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gprfc077: Oct 24 13:25:56 gprfc077 kernel: sshd          D 000000000000000f     0  3973   2129 0x00000080
gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff88082fef5ce8 0000000000000082 0000000000000000 0000000000000000
gprfc077: Oct 24 13:25:56 gprfc077 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff88082fc95ab8 ffff88082fef5fd8 000000000000fb88 ffff88082fc95ab8
gprfc077: Oct 24 13:25:56 gprfc077 kernel: Call Trace:
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffffa0094bf0>] ? ext4_file_open+0x0/0x130 [ext4]
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e7d5>] schedule_timeout+0x215/0x2e0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8117e514>] ? nameidata_to_filp+0x54/0x70
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81277349>] ? cpumask_next_and+0x29/0x50
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e453>] wait_for_common+0x123/0x180
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e56d>] wait_for_completion+0x1d/0x20
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8118a0a0>] do_execve+0xe0/0x2c0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: INFO: task sshd:3974 blocked for more than 120 seconds.
gprfc077: Oct 24 13:25:56 gprfc077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gprfc077: Oct 24 13:25:56 gprfc077 kernel: sshd          D 000000000000000f     0  3974   2129 0x00000080
gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff8806906dfce8 0000000000000082 0000000000000000 0000000000000000
gprfc077: Oct 24 13:25:56 gprfc077 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
gprfc077: Oct 24 13:25:56 gprfc077 kernel: ffff880831d3a5f8 ffff8806906dffd8 000000000000fb88 ffff880831d3a5f8
gprfc077: Oct 24 13:25:56 gprfc077 kernel: Call Trace:
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffffa0094bf0>] ? ext4_file_open+0x0/0x130 [ext4]
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e7d5>] schedule_timeout+0x215/0x2e0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8117e514>] ? nameidata_to_filp+0x54/0x70
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81277349>] ? cpumask_next_and+0x29/0x50
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e453>] wait_for_common+0x123/0x180
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8150e56d>] wait_for_completion+0x1d/0x20
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8118a0a0>] do_execve+0xe0/0x2c0
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80
gprfc077: Oct 24 13:25:56 gprfc077 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0

last thing attempted by the test harness was this:

13-10-24-13-23-03 : running on hosts ['gprfc089', 'gprfc090', 'gprfc091', 'gprfc092', 'gprfc094', 'gprfc095', 'gprfc077', 'gprfc078'] cmd: sync

Note that sync would hit the local ext4 filesystem as well as the NFS file system, so perhaps there is some sort of interaction there.

Here's my volume:

Volume Name: perftest
Type: Distributed-Replicate
Volume ID: 59fac3fb-9194-40b1-8bf8-ba5fde61ce77
Status: Created
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Brick3: gprfs019-10ge:/mnt/brick0/brick
Brick4: gprfs020-10ge:/mnt/brick0/brick
Options Reconfigured:
performance.nfs.write-behind: off
nfs.drc: 0
nfs.acl: 0

Here's NFS mount command used:

# mount -v -t nfs -o noacl,nfsvers=3,tcp %s:/perftest /mnt/glnfs

where %s is replaced with whatever server the client is connecting to; there are 2 clients per server.


here's the volfile:  

gprfs018: +------------------------------------------------------------------------------+
gprfs018: 1: volume perftest-client-0
gprfs018: 2:     type protocol/client
gprfs018: 8: end-volume
gprfs018: 9:
gprfs018: 10: volume perftest-client-1
gprfs018: 11:     type protocol/client
gprfs018: 17: end-volume
gprfs018: 18:
gprfs018: 19: volume perftest-replicate-0
gprfs018: 20:     type cluster/replicate
gprfs018: 21:     subvolumes perftest-client-0 perftest-client-1
gprfs018: 22: end-volume
gprfs018: 23:
gprfs018: 24: volume perftest-client-2
gprfs018: 25:     type protocol/client
gprfs018: 31: end-volume
gprfs018: 32:
gprfs018: 33: volume perftest-client-3
gprfs018: 34:     type protocol/client
gprfs018: 40: end-volume
gprfs018: 41:
gprfs018: 42: volume perftest-replicate-1
gprfs018: 43:     type cluster/replicate
gprfs018: 44:     subvolumes perftest-client-2 perftest-client-3
gprfs018: 45: end-volume
gprfs018: 46:
gprfs018: 47: volume perftest-dht
gprfs018: 48:     type cluster/distribute
gprfs018: 49:     subvolumes perftest-replicate-0 perftest-replicate-1
gprfs018: 50: end-volume
gprfs018: 51:
gprfs018: 52: volume perftest
gprfs018: 53:     type debug/io-stats
gprfs018: 56:     subvolumes perftest-dht
gprfs018: 57: end-volume
gprfs018: 58:
gprfs018: 59: volume nfs-server
gprfs018: 60:     type nfs/server
gprfs018: 74:     subvolumes perftest
gprfs018: 75: end-volume
gprfs018: 76:

These are the RPMs we're using; I believe Santosh built them:

[root@gprfs019 ~]# rpm -q --info glusterfs
Name        : glusterfs                    Relocations: (not relocatable)
Version     : 3.4rhs                            Vendor: Red Hat, Inc.
Release     : 1.el6rhs                      Build Date: Tue 22 Oct 2013 09:08:47 AM UTC
Install Date: Tue 22 Oct 2013 09:20:12 AM UTC      Build Host: x86-028.build.eng.bos.redhat.com
Group       : System Environment/Base       Source RPM: glusterfs-3.4rhs-1.el6rhs.src.rpm
Size        : 7316109                          License: GPLv2 or LGPLv3+
Signature   : (none)
Packager    : gluster-users
URL         : http://www.gluster.org/docs/index.php/GlusterFS
Summary     : Cluster File System

Comment 17 Ben England 2013-10-25 00:54:09 UTC
ideas on why this happens?

[root@gprfs018 ~]# gluster volume set perftest nfs.outstanding-rpc-limit 8
volume set: failed: Staging failed on 172.17.40.17. Error: Required op_version (3) is not supported

[root@gprfs018 ~]# gluster volume set perftest server.outstanding-rpc-limit 256
volume set: failed: Staging failed on 172.17.40.17. Error: Required op_version (3) is not supported
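
One way to sanity-check that all peers are running matching bits and op-versions, as a sketch (the rpm query plus the operating-version recorded in glusterd's standard working directory):

for h in gprfs017 gprfs018 gprfs019 gprfs020; do
    echo "== $h =="
    ssh $h 'rpm -q glusterfs; grep operating-version /var/lib/glusterd/glusterd.info'
done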

Comment 18 Ben England 2013-10-25 03:29:43 UTC
Maybe NFS doesn't handle the case where Gluster is backpressuring it. With 1 NFS client/server, it never hung. Manoj was able to make the old implementation hang with 4 clients/server.

So I used the write-behind translator to force NFS not to have as much outstanding RPC data as the Gluster layer under it. Since I can't currently set the throttle parameters (see the previous comment), I used the NFS mount option wsize=32768. With the write-behind translator, this should result in about 4 NFS RPCs per Gluster RPC, right? So even with 2 NFS clients per server, Gluster should not run out of outstanding RPCs this way. It worked for 16-GB and 64-GB files with 2 threads/client and 2 clients/server (1/2 TB of write data).

It did NOT work with 4 clients mounting 1 server gprfs018-10ge.  Maybe I needed wsize=16384 on client mounts for that.
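
For reference, a mount invocation with the reduced write size would look roughly like this (a sketch based on the mount command in comment 16; the server name is just one of the servers in this setup):

mount -v -t nfs -o noacl,nfsvers=3,tcp,wsize=32768 gprfs018-10ge:/perftest /mnt/glnfs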

Do we need Gluster bricks to be able to successfully backpressure NFS?

Here's the test that worked with 2 clients/server as shown above.

[root@ben-test-driver2-10ge gluster_test]# iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187  -w -c -e -i 0 -+n -r 64k -s 64g  -C -t 4
        Run began: Thu Oct 24 22:14:26 2013
        Test running:
        Children see throughput for  4 initial writers  =  446187.80 KB/sec
        Min throughput per process                      =  110588.89 KB/sec 
        Max throughput per process                      =  112947.85 KB/sec
        Avg throughput per process                      =  111546.95 KB/sec
        Min xfer                                        = 63893888.00 KB
        Child[0] xfer count = 67108864.00 KB, Throughput =  112947.85 KB/sec
        Child[1] xfer count = 64317568.00 KB, Throughput =  110773.44 KB/sec
        Child[2] xfer count = 64369536.00 KB, Throughput =  111877.62 KB/sec
        Child[3] xfer count = 63893888.00 KB, Throughput =  110588.89 KB/sec

gprfs018# gluster v i
Volume Name: perftest
Type: Replicate
Volume ID: f73c6470-54f8-4a6b-b427-1222774a318c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Options Reconfigured:
nfs.mem-factor: 8
performance.nfs.write-behind: on
nfs.acl: 0

Comment 19 santosh pradhan 2013-10-25 06:12:20 UTC
(In reply to Ben England from comment #17)
> ideas on why this happens?
> 
> [root@gprfs018 ~]# gluster volume set perftest nfs.outstanding-rpc-limit 8
> volume set: failed: Staging failed on 172.17.40.17. Error: Required
> op_version (3) is not supported
> 
> [root@gprfs018 ~]# gluster volume set perftest server.outstanding-rpc-limit
> 256
> volume set: failed: Staging failed on 172.17.40.17. Error: Required
> op_version (3) is not supported

I had sent an email on 22nd Oct asking to install the binaries from the following link:

https://brewweb.devel.redhat.com/taskinfo?taskID=6450281

I don't think the current binary is from the above build link.

I have downloaded the correct binaries under gprfs0{17,18,19,20}:/root/rpms

Comment 20 santosh pradhan 2013-10-25 06:31:55 UTC
(In reply to Ben England from comment #16)
> good news bad news...
> 
> At least 3 of my iozone write tests now complete successfully where before
> none did.
> 
> the memory consumption goes away if I disable the write-behind translator,
> and throughput is if anything slightly better than with writebehind!
> 
> One of the SAs at the RHS training this week said that there was a NFS
> timeout parameter that eliminated the write problem, I'm trying to track
> down the information on this.

[Santosh] When the NFS client COMMITs I/O, it expects the reply within 120 seconds, i.e. 2 minutes. If it does not get one, it dumps the stack trace to syslog. I don't know if there is any tunable to increase the timeout value. If the amount of I/O in the COMMIT is large, then Gluster NFS throttling can delay the reply by more than 120 secs, and subsequently the stack trace appears on the client :(
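
For what it's worth, the 120-second figure also matches the kernel hung-task watchdog that produced the traces in comment 16; a sketch of inspecting or raising that threshold on the clients (this only quiets the warning, it does not fix the underlying stall):

# Current hung-task warning threshold, in seconds (0 disables the warning entirely)
cat /proc/sys/kernel/hung_task_timeout_secs

# Raise it temporarily while debugging
sysctl -w kernel.hung_task_timeout_secs=600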


> 
> See run log and stats at
> http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-
> cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/13-10-24-
> 12-41-19/
> 
> bad news: with 4 servers and 8 clients I got the 7th client to hang, and saw
> this stack trace in /var/log/messages on that client multiple times today
> while running with RPMs built by Santosh containing the fix. At the same
> time, I see rpc.statd and sm-notify daemons restarting on the NFS server
> gprfs019 where gprfc077 is mounting, and I see some sort of gluster-NFS
> restart happening at this time also:
> 
> QUESTION: is this a different bz?  Am trying to repro with smaller config
> for easier debug.

[Santosh] Yes, whenever we change any option in the NFS server (basically anything that changes the NFS volfile and NFS graph), that restarts the NFS server. Gluster NFS restarts the NLM service daemons as well. This is tracked under BZ 929036.

Comment 21 Anand Avati 2013-10-28 07:33:26 UTC
COMMIT: http://review.gluster.org/6114 committed in master by Vijay Bellur (vbellur) 
------
commit a4056292528db49a666422c7f8e0c032441cc83f
Author: Anand Avati <avati>
Date:   Sun Oct 20 08:45:18 2013 -0700

    rpcsvc: implement per-client RPC throttling
    
    Implement a limit on the total number of outstanding RPC requests
    from a given client. Once the limit is reached the client socket
    is removed from POLL-IN event polling.
    
    Change-Id: I8071b8c89b78d02e830e6af5a540308199d6bdcd
    BUG: 1008301
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/6114
    Reviewed-by: Santosh Pradhan <spradhan>
    Reviewed-by: Rajesh Joseph <rjoseph>
    Reviewed-by: Harshavardhana <harsha>
    Reviewed-by: Vijay Bellur <vbellur>
    Tested-by: Gluster Build System <jenkins.com>

Comment 22 Ben England 2013-10-30 22:07:45 UTC
Next step - get a good protocol trace of what is happening at the NFS server to see why congestion control doesn't always work.

re comment 20: it's hard to see why an NFS COMMIT would take > 120 sec given the RPC throttling change! But yes, it makes sense that the stack trace results from the NFS COMMIT not getting done. I suspect instead that Gluster is not acking the COMMIT request for some reason; a protocol trace might show this.

Avati and I discussed NFS; we have consensus that we don't understand why the existing code isn't working and need a better analysis of what's going wrong (to Manoj's point, why isn't NFS congestion control working?). Is it a bug in the NFS client, or does kNFS behave differently at a protocol level so that NFS clients don't flood the server? It's not easy to go into the NFS client and analyze a crash dump, or to examine it with systemtap, etc. The easiest way may be to look at a protocol trace if we can capture one around the time of the hang.

Comment 23 Ben England 2013-11-08 19:55:06 UTC
I'm running the glusterfs-*-3.4rhs-1.el6rhs.x86_64 RPMs (built by Santosh?) that let me set the outstanding-rpc-limit parameters, thanks for that.

The good news: it does complete writes where before it did not.

Problem 1) On writes, it still slows to a crawl, but it takes longer to do it. For example, with the command to write 1 TB of data in 32-GB files:

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -C -i 0 -+n -r 64k -t 32 -s 32g

After about 1/4 TB is written, glusterfs is consuming more CPU than glusterfsd, and this ratio keeps increasing as it slows to a crawl.

It's still leaking memory because of NFS writes: the virtual address size of the glusterfs process is still 13 GB, about 1/5 of RAM, even AFTER writes have completed, fsync has been issued, and all files have been closed.

Tasks: 715 total,   2 running, 713 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49544696k total, 49211980k used,   332716k free,    88720k buffers
Swap: 24772600k total,    44616k used, 24727984k free, 38663592k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                            
15213 root      20   0 12.8g 8.3g 2984 S  0.0 17.5   4:29.97 glusterfs                                                          
 2902 vdsm      15  -5 1607m 9340 2328 S  0.3  0.0   9:06.48 vdsm                                                               
15096 root      20   0 1536m  25m 2780 S  0.0  0.1   8:06.92 glusterfsd                                                          

Swap space is still being used. If I do reads, no virtual memory is consumed. 

Workaround: set an NFS-related volume parameter; this triggers a restart of the NFS server and frees up its memory.

Another workaround: the rhs-virtualization tuned profile sets vm.swappiness to 10, which reduces swapping, but eventually swapping will still happen.
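
A sketch of applying those two workarounds on a server (the nfs.mem-factor value is simply the one already set on this volume in comment 18; per comment 20, changing any nfs.* option regenerates the NFS volfile and restarts the Gluster NFS server):

# Re-set an NFS-related volume option to force the NFS server restart and free its memory
gluster volume set perftest nfs.mem-factor 8

# Reduce swapping on the servers (the rhs-virtualization tuned profile sets this as well)
sysctl -w vm.swappiness=10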


Problem 2) self-heal: this really screws up performance, because now glustershd is reading all of these files as well as writing them. This wrecks sequential I/O performance.

[root@gprfs020 ~]# gluster volume heal perftest info
Gathering list of entries to be healed on volume perftest has been successful 

Brick gprfs017-10ge:/mnt/brick0/brick
Number of entries: 17
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.29
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.18
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.19
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.31
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.15
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.1
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.13
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.5
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.25
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.14
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.20
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.2
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.12
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.6
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.8
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.28
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.17

Brick gprfs018-10ge:/mnt/brick0/brick
Number of entries: 17
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.29
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.18
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.19
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.31
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.15
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.1
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.13
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.5
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.25
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.14
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.20
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.2
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.12
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.6
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.8
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.28
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.17

Brick gprfs019-10ge:/mnt/brick0/brick
Number of entries: 14
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.23
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.10
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.11
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.3
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.27
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.7
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.21
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.9
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.22
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.26
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.30
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.16
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.0
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.4

Brick gprfs020-10ge:/mnt/brick0/brick
Number of entries: 14
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.23
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.10
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.11
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.3
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.27
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.7
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.21
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.9
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.22
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.26
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.30
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.16
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.0
/iozone.d/13-10-24-18-22-25/iozone.DUMMY.4

Comment 24 Ben England 2013-11-08 23:51:57 UTC
If this good result holds up, can we get Corbett changed to make nfs.outstanding-rpc-limit default to 8?


So I decided to try setting just this one tunable and retest by writing 256 GB/server of NFS data with iozone, i.e. 1 TB total. It made it through 3 write passes with 4 servers, 8 clients, and 1, 2 and 4 threads/client. Self-healing is either absent or greatly reduced. The 4-server test is more stressful than the 2-server test because there is a lot more replication traffic. Basically it's getting at least 150-200 MB/s/server of write throughput, and 500 MB/s/server on reads. While the write performance isn't near hardware speed, I think we know how to improve it by adjusting the RPC size and the outstanding-rpc-limit parameters. But stability comes first.

I see this message continually logged to /var/log/glusterfs/nfs.log during writes; please fix, or it will fill up the system disk quickly.

[2013-11-08 23:50:38.155138] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-11-08 23:50:38.155587] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-11-08 23:50:38.155726] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates

Why limit=8? For 8 clients this is 64 writes in flight. The RHS bricks can accept 64 NFS RPCs at a time from each client, for a total of 512 in-flight writes of 64 KB each = 32 MB of data. On the brick side, with replication I'm dividing by 2, so we have 32 x 2 = 64 RPCs of 128 KB each = 8 MB of data, which is less than what NFS can push toward it.

Why is this important? Gluster and NFS are competing for the NIC, and if NFS pushes too hard it can interfere with Gluster getting data replicated. This is unlike a kernel NFS server writing to an XFS filesystem, for example. Before I set this limit, some files were getting written while others weren't; since some files have 1 local replica, my hypothesis is that files with a local replica on the NFS server were getting preferential treatment. When I set this limit, it appeared that the files all started growing at more or less the same rate. I need to get more data on this, but at least it's a theory. Note that as we increase the volume brick count, the probability of files having local replicas on the NFS server drops to zero.

If this tuning works, I could try increasing server.outstanding-rpc-limit and nfs.outstanding-rpc-limit by the same factor to improve write perf.
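
A sketch of that tuning, using the option names shown in comment 17 and in the volume info below:

# Cap outstanding NFS RPCs per client at 8 (the value tested above)
gluster volume set perftest nfs.outstanding-rpc-limit 8

# If that holds up, the brick-side limit could be raised by the same factor
gluster volume set perftest server.outstanding-rpc-limit 256

# Confirm the options took effect
gluster volume info perftest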

BTW, I also filed bz 1028631 on the glusterfs NFS server being bottlenecked on reads by a single thread. This prevents us from utilizing multiple NICs to overcome the bottleneck (it would also be nice to have the nfs.write-size parameter upstream). If that were fixed, we could do these things to speed up NFS:

http://www.gluster.org/community/documentation/index.php/Network_Configuration_Techniques

# ssh gprfs020 gluster v i
 
Volume Name: perftest
Type: Distributed-Replicate
Volume ID: 1b5f085a-ded1-4902-bc4f-781404876d21
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Brick3: gprfs019-10ge:/mnt/brick0/brick
Brick4: gprfs020-10ge:/mnt/brick0/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.outstanding-rpc-limit: 8

immediately after 1-TB write completed:

[root@ben-test-driver2-10ge gluster_test]# ssh gprfs020 gluster v heal perftest info
Gathering list of entries to be healed on volume perftest has been successful 

Brick gprfs017-10ge:/mnt/brick0/brick
Number of entries: 0

Brick gprfs018-10ge:/mnt/brick0/brick
Number of entries: 0

Brick gprfs019-10ge:/mnt/brick0/brick
Number of entries: 0

Brick gprfs020-10ge:/mnt/brick0/brick
Number of entries: 0

Comment 25 Ben England 2013-11-09 14:13:37 UTC
All tests finished with the config described in the previous comment; results are similar to the first pass, with 4 passes reading/writing 1 TB of data completed with 1, 2 and 4 threads/client. Logs are in

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/

13-11-08-15-32-40/
glnfs-throttle-patch-nfsrpc8.csv

Comment 26 Ben England 2013-12-03 23:30:18 UTC
I changed the status back to assigned from modified, because the memory leak problem is still not fixed and is a show-stopper.

If I go past 1 TB = 256 GB/server, I'm still seeing a memory leak. I'm going to try to document this better, but the bottom line is that even with the flow-control fix above, the memory leak eventually causes glusterfs to swap and hang. It's using 15 GB of RAM out of 48 before I start running into problems. This is not acceptable; I could see it using as much as 1-2 GB (although even that is unnecessary). The result: clients hang and you can't log into them. Example from "top" output:

mem:  49544696k total, 48732460k used,   812236k free,    14476k buffers
Swap: 24772600k total,        0k used, 24772600k free, 42414060k cached
23861 root  20   0 15.3g 4.2g 2960 S  0.0  8.9  6:47.72  glusterfs

Even if I just create and delete sets of 4-GB files, the memory leak occurs. There is NO POSSIBLE REASON for glusterfs to hang onto the memory if the files are deleted, but it does. The workload to reproduce it is:

while [ 1 ] ; do rm -fv /mnt/glnfs/iozone.d/13-12-03-15-04-05/iozone.DUMMY.* ; sleep 2 ; iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 4g -t 16 ; done

Worse yet, if I do a hard DRAC reset on all clients, when they come back up the glusterfs NFS server does not give back the memory that it used. I have to stop and start the volume. Even then, the clients do not recover and I have to reset them with the DRAC.
                                                     
However, if I use direct I/O like this:

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 16384k -I -s 1g -t 16

I get terrible throughput but no significant memory leak that I can detect.  

Sorry, but you have to run valgrind and figure out where the memory is going if you want the NFS translator to become stable.

I'll attach data showing how the memory leak progresses; it's very steady and reproducible. Try it; you can borrow the BAGL NFS config to do this if you need to.
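
A rough way to start on the valgrind run, as a sketch (the Gluster NFS server's exact command line should be copied from the running process rather than typed by hand):

# Find the Gluster NFS server process and note its full command line
ps -ef | grep '[g]lusterfs' | grep nfs

# Stop that process, then relaunch the same command line under valgrind, e.g.:
valgrind --leak-check=full --show-reachable=yes --log-file=/var/tmp/gnfs-valgrind.%p.log \
    /usr/sbin/glusterfs <arguments copied from the ps output above>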

Comment 27 santosh pradhan 2013-12-06 11:13:35 UTC
Could you just try the same test case with DRC turned off (gluster v set <volname> nfs.drc off)? Just to verify whether the DRC is contributing to the memory footprint of the NFS server.

Because we are throttling the in-flight RPC count, that shouldn't cause the memory leak.

Again, which build did you use for your testing?

Thanks,
Santosh

Comment 28 Ben England 2013-12-08 17:05:06 UTC
Santosh, I'll do as you suggest. The build was glusterfs-3.4.0.43.1u2rhs-1.el6rhs.x86_64.

Comment 29 Ben Turner 2013-12-12 01:15:11 UTC
I am pretty sure I am hitting this in fs sanity, as I recently switched iozone to run in distributed mode. Iozone slows down, and in the logs I see:

[2013-12-11 23:51:38.934671] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-12-11 23:51:38.988245] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-12-11 23:51:39.041640] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-12-11 23:51:39.097689] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2013-12-11 23:51:39.151618] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates

I will attempt to repro with:

gluster v set <volname> nfs.drc off

Comment 30 Ben England 2013-12-13 16:35:08 UTC
With nfs.drc off and nfs.outstanding-rpc-limit 8, the leak appears to go away! I'll continue to run

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 4g -t 64

over and over again, but the glusterfs process stays tiny. However, when I turn nfs.drc on, the glusterfs process immediately blows up. The memory leak definitely depends on the value of this parameter.

So I guess nfs.drc on is the culprit!
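
For reference, the combination that stayed stable in these runs, as a sketch of the volume options discussed above:

gluster volume set perftest nfs.drc off
gluster volume set perftest nfs.outstanding-rpc-limit 8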

Comment 31 Anand Avati 2014-01-14 05:56:50 UTC
REVIEW: http://review.gluster.org/6696 (rpc: Limit the default outstanding RPC limit to 16) posted (#1) for review on master by Santosh Pradhan (spradhan)

Comment 32 Anand Avati 2014-01-14 17:00:27 UTC
REVIEW: http://review.gluster.org/6696 (rpc: Limit the default outstanding RPC limit to 16) posted (#2) for review on master by Santosh Pradhan (spradhan)

Comment 33 Anand Avati 2014-01-14 17:30:18 UTC
REVIEW: http://review.gluster.org/6696 (rpc: Limit the default outstanding RPC limit to 16) posted (#3) for review on master by Santosh Pradhan (spradhan)

Comment 34 Anand Avati 2014-01-14 17:52:13 UTC
REVIEW: http://review.gluster.org/6696 (gNFS: Set default outstanding RPC limit to 16) posted (#4) for review on master by Santosh Pradhan (spradhan)

Comment 35 Anand Avati 2014-01-15 08:48:24 UTC
REVIEW: http://review.gluster.org/6696 (gNFS: Set default outstanding RPC limit to 16) posted (#5) for review on master by Santosh Pradhan (spradhan)

Comment 36 Anand Avati 2014-01-15 10:08:19 UTC
REVIEW: http://review.gluster.org/6696 (gNFS: Set default outstanding RPC limit to 16) posted (#6) for review on master by Santosh Pradhan (spradhan)

Comment 37 Anand Avati 2014-01-15 21:47:40 UTC
COMMIT: http://review.gluster.org/6696 committed in master by Anand Avati (avati) 
------
commit 260e817b8a652ddb705808098da1e07e3660f4c7
Author: Santosh Kumar Pradhan <spradhan>
Date:   Tue Jan 14 11:21:44 2014 +0530

    gNFS: Set default outstanding RPC limit to 16
    
    With 64, NFS server hangs with large I/O load (~ 64 threads writing
    to NFS server). The test results from Ben England (Performance expert)
    suggest to set it as 16 instead of 64.
    
    Change-Id: I418ff5ba0a3e9fdb14f395b8736438ee1bbd95f4
    BUG: 1008301
    Signed-off-by: Santosh Kumar Pradhan <spradhan>
    Reviewed-on: http://review.gluster.org/6696
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: ben england <bengland>

Comment 38 Justin Clift 2014-03-19 14:32:31 UTC
Are we able to get this into 3.5.0? (or 3.5.1 if not 3.5.0)

Comment 39 santosh pradhan 2014-03-20 06:54:24 UTC
The fix is already there in release-3.5 branch.

Comment 40 Justin Clift 2014-03-20 12:29:46 UTC
Thanks Santosh, somehow missed that. ;)

Comment 41 Niels de Vos 2014-04-17 11:48:00 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.0, please reopen this bug report.

glusterfs-3.5.0 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user