977492 – large NFS writes to Gluster slow down then stop

Bug 977492 - large NFS writes to Gluster slow down then stop

Summary: large NFS writes to Gluster slow down then stop

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	2.1
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 2.1.2
Assignee:	santosh pradhan
QA Contact:	Ben Turner
Docs Contact:
URL:
Whiteboard:
Depends On:	1016608
Blocks:	1008301 1073441 1073442
TreeView+	depends on / blocked

Reported:	2013-06-24 17:13 UTC by Ben England
Modified:	2018-12-04 15:30 UTC (History)
CC List:	19 users (show)
Fixed In Version:	glusterfs-3.4.0.58rhs-1.el6rhs
Doc Type:	Bug Fix
Doc Text:	Previously, When there were multi-Gigabyte NFS writes (128 GB/server) to a Red Hat Storage volume, throughput slowed and eventually stopped. With this fix, the default Red Hat Storage Volume parameter values are changed to nfs.outstanding-rpc-limit=16. As a result the performance of the Red Hat Storage NFS for multi-Gigabyte file-writes has improved.
Clone Of:
Clones:	1008301 (view as bug list)
Environment:
Last Closed:	2014-02-25 07:31:55 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
known issue (203.07 KB, application/pdf) 2013-09-13 16:51 UTC, Pavithra	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	507323	0	None	None	None	Never
Red Hat Product Errata	RHEA-2014:0208	0	normal	SHIPPED_LIVE	Red Hat Storage 2.1 enhancement and bug fix update #2	2014-02-25 12:20:30 UTC

Description Ben England 2013-06-24 17:13:03 UTC

Description of problem:

If you run a very large NFS write test (128 GB/server) to RHS volume, throughput slows and eventually stops.  It appears to be a bug related to write-behind translator.

Version-Release number of selected component (if applicable):

glusterfs-3.4.0.8 (am now trying with glusterfs-3.4.0.11)
kernel 2.6.32-358.6.2.el6.x86_64

How reproducible:

every time

Steps to Reproduce:
1.Construct Gluster volume as described below
2.run iozone as described below
3.watch what happens as we reach about half of the data written

Actual results:

iostat shows that throughput starts out very strong, but slows eventually until it effectively ceases.  Attachment will show this.

"top" shows that glusterfs NFS server process consumes all of 1 core.

"perf top" shows that all of its time is spent in wb_requests_overlap() routine in write-behind translator, see screenshots in attachment.

I disabled write-behind translator by creating a separate volfile for NFS server process and running it manually, and the test then completes.

Expected results:

No significant degradation in throughput, at least none due to Gluster.

Additional info:

how to construct Gluster volume: See attachment for details:

- 10-GbE, jumbo frames
- usual XFS bricks
- gluster volume create perftest  replica 2  gprfs045-10ge:/mnt/brick0/brick gprfs046-10ge:/mnt/brick0/brick gprfs047-10ge:/mnt/brick0/brick gprfs048-10ge:/mnt/brick0/brick
- gluster volume start perftest
- on each NFS client: mount -v -t nfs -o nfsvers=3,tcp,actimeo=1 %s:/perftest /mnt/glnfs

We spread the NFS clients evenly across the NFS servers in the volume.

how to run iozone:

change 172.17.10.32 to the IP address that results from resolving the test driver's hostname.  The file /var/tmp/iozone_cluster.config contains this:

h1 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
h2 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
...
h8 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
h1 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
h2 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone
...
h8 /mnt/glnfs/iozone.d/13-06-21-03-10-27 /usr/local/bin/iozone

So there are two threads per NFS client.

root@gprfc032-10ge gluster_test]# /usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config  -+h 172.17.10.32  -i 0 -C   -+n  -s 32g -r 128 -c -e  -t 16

I got a core dump using "kill -BUS " on the glusterfs process.  The coredump was 8 GB.  My hypothesis is that the list of uncommitted writes is so long that this accounts for the large memory size of the process, and the act of checking for overlaps in these writes takes all of our CPU time.

[root@gprfs048 ~]# ls -l /core.30528
-rw------- 1 root root 9895763968 Jun 24 15:39 /core.30528
[root@gprfs048 ~]# gdb /usr/sbin/glusterfs /core.30528
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
...
Reading symbols from /usr/sbin/glusterfs...(no debugging symbols found)...done.
[New Thread 30528]
...
(gdb) where
#0  0x00007fabc8022db7 in wb_requests_overlap () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#1  0x00007fabc8022f08 in wb_wip_has_conflict () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#2  0x00007fabc802326b in __wb_pick_winds () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#3  0x00007fabc80262d9 in wb_process_queue () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#4  0x00007fabc8026c08 in wb_writev () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so
#5  0x00007fabc393b3d7 in io_stats_writev () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/debug/io-stats.so
#6  0x00007fabc36f0c47 in nfs_fop_write () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#7  0x00007fabc36fe95b in __nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#8  0x00007fabc3706423 in nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#9  0x00007fabc370ea3b in nfs3_fh_resolve_inode_done () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#10 0x00007fabc370eb0b in nfs3_fh_resolve_inode () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#11 0x00007fabc370ebb5 in nfs3_fh_resolve_resume () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#12 0x00007fabc370ecc8 in nfs3_fh_resolve_root () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#13 0x00007fabc370cc01 in nfs3_write () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#14 0x00007fabc370cd62 in nfs3svc_write () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/nfs/server.so
#15 0x0000003863808655 in rpcsvc_handle_rpc_call () from /usr/lib64/libgfrpc.so.0
#16 0x00000038638087a3 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0
#17 0x00000038638098d8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#18 0x00007fabc92ab3a4 in socket_event_poll_in () from /usr/lib64/glusterfs/3.4.0.8rhs/rpc-transport/socket.so
#19 0x00007fabc92ab4fd in socket_event_handler () from /usr/lib64/glusterfs/3.4.0.8rhs/rpc-transport/socket.so
#20 0x000000386305a9b7 in ?? () from /usr/lib64/libglusterfs.so.0
#21 0x0000000000406776 in main ()
(gdb) up
#1  0x00007fabc8022f08 in wb_wip_has_conflict () from /usr/lib64/glusterfs/3.4.0.8rhs/xlator/performance/write-behind.so

Comment 2 Ben England 2013-06-24 17:41:17 UTC

For files documenting details of bug, such as core dump and perf top screenshots,  see 

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/bigbend/nfs-write-behind-prob.tgz

Comment 4 santosh pradhan 2013-07-04 16:34:28 UTC

In my setup, out of my 4 servers (2x2 volume), In one of the brick server, I am seeing a different problem where two of the threads are consuming a lot of CPU time with following stack trace:

I dont have debug symbols installed in the machine, ll do that to get the line no.

1. ps -L -p <NFS pid>

  PID   LWP TTY          TIME CMD
25563 25563 ?        1-05:01:17 glusterfs
25563 25565 ?        00:00:00 glusterfs
25563 25566 ?        00:00:00 glusterfs
25563 25567 ?        00:00:00 glusterfs
25563 25574 ?        1-02:19:19 glusterfs
25563 25589 ?        00:00:00 glusterfs

LWP 25563 and 25574 are consuming more time.

2. top output:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 
25563 root      20   0 4493m 2.9g 7732 R 200.1 37.5   3326:24 glusterfs 

3. stack trace:

Thread 1 (Thread 0x7f34dcc6c700 (LWP 25563)):
#0  0x00007f34d86242a0 in afr_transaction_eager_lock_init () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#1  0x00007f34d862688d in afr_transaction () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#2  0x00007f34d861f427 in afr_do_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#3  0x00007f34d861fa3c in afr_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#4  0x00007f34d83e17d9 in dht_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/distribute.so
#5  0x00007f34d81a8c22 in wb_writev_helper () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so
#6  0x00000030b6e2f3a4 in call_resume () from /usr/lib64/libglusterfs.so.0
#7  0x00007f34d81a9318 in wb_do_winds () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so
#8  0x00007f34d81aa302 in wb_process_queue () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so
#9  0x00007f34d81aac08 in wb_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/performance/write-behind.so
#10 0x00007f34d3df23d7 in io_stats_writev () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/debug/io-stats.so
#11 0x00007f34d3ba7027 in nfs_fop_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#12 0x00007f34d3bb4d3b in __nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#13 0x00007f34d3bbc913 in nfs3_write_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#14 0x00007f34d3bc4f8b in nfs3_fh_resolve_inode_done () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#15 0x00007f34d3bc505b in nfs3_fh_resolve_inode () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#16 0x00007f34d3bc5105 in nfs3_fh_resolve_resume () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#17 0x00007f34d3bc5218 in nfs3_fh_resolve_root () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#18 0x00007f34d3bc3101 in nfs3_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#19 0x00007f34d3bc32b2 in nfs3svc_write () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/nfs/server.so
#20 0x00000030b7208655 in rpcsvc_handle_rpc_call () from /usr/lib64/libgfrpc.so.0
#21 0x00000030b72087a3 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0
#22 0x00000030b72098d8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#23 0x00007f34d96c5bd6 in ?? () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/rpc-transport/socket.so
#24 0x00007f34d96c74ed in ?? () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/rpc-transport/socket.so
#25 0x00000030b6e5b387 in ?? () from /usr/lib64/libglusterfs.so.0
#26 0x00000000004067c6 in main ()


Thread 3 (Thread 0x7f34d94a4700 (LWP 25574)):
#0  0x00000035c9e0c182 in ?? () from /lib64/libpthread.so.0
#1  0x00007f34d86241d5 in afr_fd_ctx_get () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#2  0x00007f34d86253dd in afr_delayed_changelog_post_op () from /usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/cluster/replicate.so
#3  0x00000030b6e2a230 in gf_timer_proc () from /usr/lib64/libglusterfs.so.0
#4  0x00000035c9e07851 in start_thread () from /lib64/libpthread.so.0
#5  0x00000035c96e890d in clone () from /lib64/libc.so.6


Will continue looking into this.

Comment 5 Ben England 2013-07-08 16:50:16 UTC

I reproduced the problem with just 2 servers instead of 4. So you don't need a huge configuration to do this.

I'll try to cut down configuration even more than this.

It appears that the client/server ratio matters here. With 1:1 ratio (2 servers, 2 clients), the problem does not occur, but with 2:1 (2 servers, 4 clients), the problem does occur. I think that NFS clients are bursty, they may overload the server with a large set of writes, and the server can't keep up with the long write-behind list. This would explain why O_DIRECT writes do not encounter the problem.

I reproduced with /proc/sys/vm/nr_hugepages=0 on all systems, eager-lock off, and no special tuning of any kind. So there is no reason why QE can't reproduce.

I'm going to try increasing vm.min_free_kbytes to see if it's some sort of kernel memory issue.

As before, once the problem occurs, you CANNOT CLEANUP the processes on the client, they hang. YOu have to do a hard reset on the clients (I use ipmitool), reboot won't work, for example, you see:

[root@gprfc032-10ge gluster_test]# ssh gprfc089
ssh: connect to host gprfc089 port 22: Connection refused

Which means that reboot hung while trying to unmount NFS, so it just becomes a useless machine.

THe server volume can be stopped but it took 2 min for that to happen while write-behind.so worked its way through the backlog.

Be sure your clients are not all mounting same server, this may prevent problem from occurring by limiting data flow from clients.

You can get it to happen much quicker by restricting memory available to kernel as follows:

echo N > /proc/sys/vm/nr_hugepages

where N is number of 2-MB pages you want to reserve. This effectively steals memory away from the Linux buffer cache so the system enters a state where it's out of physical memory much more quickly.

Delete old files before writing new ones, overwriting existing file may not give same behavior.

Comment 6 santosh pradhan 2013-07-09 12:44:15 UTC

Thanks Ben for all your input.

I tried with following config/workload:

1. 1x2 dist-rep volume (2 node cluster: sev1 sev2: one brick in each server)

2. 4 clients (c1, c2, c3, c4: c1/c2 mounting from sev1 and c3/c4 mounts from sev2)


3. iozone workload:

When I tried with 
"iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 64g -r 128 -c -e -t 4"

I could hit the issue within 30mins or so. May be this generates more I/O than 128g per server. So I modified the command as below (for 128g per server):

iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 32g -r 128 -c -e -t 4

iozone.cfg:
===========
10.70.43.16 /mnt/nfsdir /usr/local/bin/iozone
10.70.43.17 /mnt/nfsdir /usr/local/bin/iozone
10.70.42.243 /mnt/nfsdir /usr/local/bin/iozone
10.70.42.252 /mnt/nfsdir /usr/local/bin/iozone

per thread 32g, so total 32*4 = 128g
as replica total i/o workload for gluster = 256g
as 2 node/brick cluster, workload per brick = 256/2 = 128g

Is my calculation for 128g write workload per sever correct?

The test is running, still no issue with this workload.

Please correct me if my understanding of generating the workload is wrong.

Thanks,
Santosh

Comment 7 santosh pradhan 2013-07-09 14:56:04 UTC

(In reply to Ben England from comment #5)
> I reproduced the problem with just 2 servers instead of 4.  So you don't
> need a huge configuration to do this.  
> 
> I'll try to cut down configuration even more than this.
> 
> It appears that the client/server ratio matters here.  With 1:1 ratio (2
> servers, 2 clients), the problem does not occur, but with 2:1 (2 servers, 4
> clients), the problem does occur.    I think that NFS clients are bursty,
> they may overload the server with a large set of writes, and the server
> can't keep up with the long write-behind list.  This would explain why
> O_DIRECT writes do not encounter the problem.
> 

Could you specify the iozone command line you used for (2 server : 4 client)-setup. 

I tried following command with a 1 x 2 (dist-repl) vol spread across 2 node cluster (one brick in each server) and 4 clients (mount ratio: 2 clients per server):

iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 32g -r 128 -c -e -t 4

It just worked.

But when I changed the size to 64g per thread (-s 64g), I could reproduce the issue e.g.

iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 64g -r 128 -c -e -t 4

I am not clear how to calculate the exact iozone workload (say 128g) per server in 1x2 or 2x2 dist-rep volumes.

Could you clarify this part?

Thanks,
Santosh

Comment 8 santosh pradhan 2013-07-09 14:59:36 UTC

Forgot to mention that, I turned the eager-locking OFF in my setup.

Comment 9 santosh pradhan 2013-07-09 17:29:42 UTC

Some more update:

set up: 2 node cluster with 4 clients

iozone command used:
iozone -+m iozone.cfg -+h 10.70.43.16 -i 0 -C -+n -s 64g -r 128 -c -e -t 4

case: 1
(simple distributed vol on 2 node cluster)
cluster eager-locking ON

iozone completed successfully.

case: 2
(distributed replicated 1x2 volume)
cluster eager locking ON

iozone completed successfully.

case: 3
(dist-rep 1x2 vol)
cluster eager locking OFF
Already tried once where the issue was reproduced. Will try again tomorrow.

Comment 10 santosh pradhan 2013-07-10 17:52:00 UTC

Today I tried couple of times (on 1x2 vol with eager lock OFF/ON) but could not reproduce the issue in my setup. Also we (Anand and me) had a discussion with Ben how to take it forward. I requested Ben if he can reproduce the issue in my setup so that debugging would be easier. Otherwise I may need to debug in his setup. Also we ll follow up tomorrow evening 5pm (IST).

I tried with latest build (3.4.0.12rhs.beta3).

Comment 13 Ben England 2013-07-16 09:56:54 UTC

have you tried disabling the write-behind translator?  Why do we need it for NFS?  the NFS client will aggregate write requests up to 64-KB RPC size, good enough?

Comment 14 santosh pradhan 2013-07-16 11:51:33 UTC

Memory footprint of one of the server came down from 14G to 1.6G. Other server is still running high with ~ 11G. All are in RSS.

I profiled one of the server (where memory usage of NFS process came down from 14G to 1.6G RSS)

This is perf data:

Samples: 29M of event 'cycles', Event count (approx.): 21872713449727
+  81.49%  glusterfs  write-behind.so               [.] wb_requests_overlap
+   4.46%  glusterfs  write-behind.so               [.] wb_wip_has_conflict
+   3.89%  glusterfs  libpthread-2.12.so            [.] 0x000000000000ed31
+   1.99%  glusterfs  write-behind.so               [.] wb_requests_overlap@plt
+   0.85%  glusterfs  write-behind.so               [.] __wb_pick_unwinds
+   0.81%  glusterfs  [kernel.kallsyms]             [k] copy_user_generic_string
+   0.54%  glusterfs  [kernel.kallsyms]             [k] compaction_alloc
+   0.48%  glusterfs  afr.so                        [.] afr_transaction_eager_lock_init
+   0.28%  glusterfs  write-behind.so               [.] __wb_pick_winds
+   0.22%  glusterfs  libc-2.12.so                  [.] __memset_sse2
+   0.21%  glusterfs  libc-2.12.so                  [.] _int_malloc
+   0.18%  glusterfs  write-behind.so               [.] __wb_preprocess_winds
+   0.17%  glusterfs  libgfrpc.so.0.0.0             [.] __saved_frame_copy
+   0.16%  glusterfs  libc-2.12.so                  [.] memcpy
+   0.15%  glusterfs  write-behind.so               [.] wb_fulfill_err

There are a lot of Errors/Warnings in NFS log (both the NFS servers). Needs to be looked at as well. iozone looks like hung (not finishing). 

Will keep updated.

Comment 15 santosh pradhan 2013-07-16 11:57:37 UTC

(In reply to Ben England from comment #13)
> have you tried disabling the write-behind translator? 

[Santosh] Yes, I tried disabling write-behind xlator. The behavior was really bad, the memory usage is going up very quickly (I saw upto 24+ GB before I killed them).


 Why do we need it for

[Santosh] wb-xlator really helps, though we see a problem with huge write from multiple clients. May be Vijay/Avati ll have a better answer for this question. I would defer it for Vijjay. 

> NFS?  the NFS client will aggregate write requests up to 64-KB RPC size,
> good enough?

[Santosh] I dont this it ll help. (from my experiment with write-behind off). Thats why you see huge memory footprint of the NFS server in 2.0 i.e. BZ 965924 (in fact that's not memory leak as RSS is still huge which means NFS server without write-behind is too loaded to handle the I/O burst).

Comment 16 Ben England 2013-07-16 13:25:29 UTC

Santosh, you are right, turning write-behind off does not have desired effect. Throughput was about 50% of what it was with write-behind on.  And the memory leak still happens.  What confused me is that the bottleneck appears to be in the write behind translator, but if you turn write-behind off, the bottleneck moves to somewhere else (libgfrpc saved_frame_{copy,get}).      

What strikes me as strange is that it just takes in data from NFS client without any flow control.  NFS client should be asking for a COMMIT of data to persistent storage, this should cause clients to refrain from pushing additional write data, hence there should be no massive overcommit of memory to glusterfs process.

So I captured a protocol trace on c089, and now I'm really confused.  There are no NFS headers in this trace.

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/bigbend/glnfs-c089-wr.tcpdump.gz

There are a lot of packets but none of them are NFS, they are all TCP, looks like some kind of TCP flow control problem where TCP never makes progress on the transmit window?  But I'm not really sure yet.  I also see lots of these:

[root@gprfc089 ~]# tail /var/log/messages
Jul 16 09:15:58 gprfc089 kernel: nfs: server gprfs017-10ge not responding, still trying

I'll try starting tracing before iozone run goes too far.

Comment 17 santosh pradhan 2013-07-16 14:55:50 UTC


what looks weird to me is: perf data i.e. wb_requests_overlap() routine which is just fetching the passed parameters (couple ptrs to structures of type wb_request_t), does some arithmetic operation along with a comparison.

perf annotate confuses me more:

           Disassembly of section .text:
       
         0000000000002db0 <wb_requests_overlap>:
  1.25       mov    0x4b8(%rdi),%rax
 85.37       mov    0x4c0(%rdi),%rcx   <===== This looks like loading ordering.size

  5.51       test   %rax,%rax
  0.00       jne    48

I installed the debug symbols, but still perf annotate shows the assembly instead of source lines.

Tomorrow, will continue looking into it.

Comment 18 Ben England 2013-07-16 22:25:04 UTC

My only hypothesis is that NFS server does not handle COMMIT request from NFS client properly.  Otherwise NFS server could not accumulate so many I/O requests to scan in write-behind translator code.

re .-1: I think all this perf output means is that it's executing this instruction millions of times and this instruction is just the place where the cache miss occurs perhaps.  It doesn't mean that the program is doing something wrong there.  It's a hint that your list of requests is getting excessively long.  

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0 -+n -r 16384k -o -s 32g -t 16

Does not hang or leak memory at all.  "-o" opens file with O_SYNC, so after every 16-MB record is written to file by iozone, the write will block until the data actually is written to storage.  This means the amount of write requests that are uncommitted is much smaller perhaps?

I tried capturing protocol traces on an NFS server gprfs017, while running single-threaded write to make trace analysis easier, but it did not capture the NFS write traffic except for a couple of early requests.  It did capture lots of gluster traffic.  So protocol tracing isn't going to help much here.

Comment 19 santosh pradhan 2013-07-17 03:26:34 UTC

(In reply to Ben England from comment #18)
> My only hypothesis is that NFS server does not handle COMMIT request from
> NFS client properly.  Otherwise NFS server could not accumulate so many I/O
> requests to scan in write-behind translator code.
> 
> re .-1: I think all this perf output means is that it's executing this
> instruction millions of times and this instruction is just the place where
> the cache miss occurs perhaps.  It doesn't mean that the program is doing
> something wrong there.  It's a hint that your list of requests is getting
> excessively long.


[Santosh] Yes, probably you are right. I was planning to let the perf another go with L1-cache event analysis to check that behaviour.

> 
> iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0
> -+n -r 16384k -o -s 32g -t 16
> 
> Does not hang or leak memory at all.  "-o" opens file with O_SYNC, so after
> every 16-MB record is written to file by iozone, the write will block until
> the data actually is written to storage.

[Santosh] Wow, great. At least some relief :) Thanks for the experiment.

> This means the amount of write
> requests that are uncommitted is much smaller perhaps?

[Santosh] Yes, probably. It does not accumulate too many outstanding I/O reqs.

> 
> I tried capturing protocol traces on an NFS server gprfs017, while running
> single-threaded write to make trace analysis easier, but it did not capture
> the NFS write traffic except for a couple of early requests.  It did capture
> lots of gluster traffic.  So protocol tracing isn't going to help much here.

[Santosh] Thanks again for all your effort.

I will keep looking at it today and keep you posted about the progress.

Best,
Santosh

Comment 20 Anand Subramanian 2013-07-17 05:14:51 UTC

Vijay mentioned that there is a way to get a specific xlator dump, in this case the wb xlator only dump. There is supposedly a filter for this. He couldn't recall that as its been a while he used it but we can try to see if we can use that to get all wb stats.

Regarding Ben's comment:
> re .-1: I think all this perf output means is that it's executing this
> instruction millions of times and this instruction is just the place where
> the cache miss occurs perhaps.  It doesn't mean that the program is doing
> something wrong there.  It's a hint that your list of requests is getting
> excessively long.

(gdb) p *wb_inode
$2 = {window_conf = 1048576, window_current = 92311552, transit = 673583104, all = {next = 0xd12ec60, prev = 0x190cfc6a0}, todo = {next = 0x1339a7800, prev = 0x190cfc6b0}, liability = {next = 0xd12ec80, prev = 0xd940340}, temptation = {
    next = 0xd9b7d00, prev = 0x190cfc6c0}, wip = {next = 0xd12ecb0, prev = 0x127cfe160}, gen = 17, lock = 0, this = 0x2578230}

The transit count seems to be the total size of the write fops in transit. The size from one of the runs is 673583104, when in the problem state.
Also, from the state dump, callcount was ~129000 which points to a huge backlog of writes pending (which would imply uncommitted to disk and unacknowledged to client/iozone).

Can one of you confirm if write-behind option flush_behind is on/off please?

Comment 21 Anand Subramanian 2013-07-17 05:28:03 UTC

By default flush_behind is on, just wanted to make sure it was not turned off for the last couple of runs. Thanks.

Comment 22 Anand Subramanian 2013-07-17 05:39:01 UTC

Santosh,

Am not able to attach to the Vnc. If you are running a test, could you run nfsstat -rc on one of the clients where iozone is running. If nfsstat outputs retrans > 0, it probably implies the server is not able to handle all the client requests. Just wanted to check this.

Anand

Comment 23 Anand Subramanian 2013-07-17 05:52:33 UTC

Does it make sense to tweak the defaults for the recv and write socket buffers and try a run with increased values? Defaults seem quite low:

[root@gprfs017 blr-team]# cat /proc/sys/net/core/rmem_max
229376

same for the wmem_max.

Comment 24 santosh pradhan 2013-07-17 07:11:47 UTC

I think we should try with network subsystem tuning. (in whole setup e.g.gprfs017, gprfs018, gprfc089, gprfc090, gprfc091 and gprfc092).

This is a nice link to tune the 10GE NIC for better performance:

http://fasterdata.es.net/host-tuning/linux/

I am able to attach to vnc. I am not running anything but going to run another instance of iozone. (with the gluster-debug bundles installed). So, that perf annotate would point to the source line. Currently its just mapping to assembly. 

I have not touched the flush_behind parameter of wb-xlator.

Comment 25 santosh pradhan 2013-07-17 12:12:05 UTC

1. I discussed with Raghavendra Gowdappa about the flush_behind paramater, he says setting it to OFF will make the performance worse.

2. I ran the iozone with smaller load (16g per thread with 16 threads) i.e. 

iozone -+m /var/blr-team/iozone_cluster.config -+h 10.16.159.187 -w -c -e -i 0 -+n -r 32k -s 16g -t 16

I saw something different behaviour this time. Both the server went upto ~ 15G (RSS) and then came down to 2.5G RSS but iozone did not finish. I am not sure but problem seems to be somewhere else. The NFS client might be doing another layer of write-behind for its performance (which might be reason for wb_requests_overlap() to be hit more ).

3. I am not able to map the perf data to source line this time as well. Already installed the debug symbols in gprfs017 (server).

4. In both the servers (17 and 18), in /var/log/messages: I saw something like this, not sure if this is related. 

Jul 17 08:56:52 gprfs018 GlusterFS[22756]: [2013-07-17 08:56:52.375118] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-0: cannot lookup the saved frame corresponding to xid (450331)
Jul 17 08:56:53 gprfs018 GlusterFS[22756]: [2013-07-17 08:56:53.366492] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-1: cannot lookup the saved frame corresponding to xid (463375)

5. In one of the clients i.e. gprfc092 in /var/log/messages: 

Jul 17 07:50:17 gprfc092 kernel: INFO: task iozone:3160 blocked for more than 120 seconds.
Jul 17 07:50:17 gprfc092 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 17 07:50:17 gprfc092 kernel: iozone        D 0000000000000000     0  3160      1 0x00000080
Jul 17 07:50:17 gprfc092 kernel: ffff880831b3fe08 0000000000000082 ffff880830566aa0 ffff880830566aa0
Jul 17 07:50:17 gprfc092 kernel: ffff880830566aa0 ffff88083fee6140 ffff880830566aa0 0000001081167330
Jul 17 07:50:17 gprfc092 kernel: ffff880830567058 ffff880831b3ffd8 000000000000fb88 ffff880830567058
Jul 17 07:50:17 gprfc092 kernel: Call Trace:
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8118f221>] ? path_put+0x31/0x40
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150efbe>] __mutex_lock_slowpath+0x13e/0x180
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150ee5b>] mutex_lock+0x2b/0x50
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191e39>] do_unlinkat+0xa9/0x1b0
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff810dc8f7>] ? audit_syscall_entry+0x1d7/0x200
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191f56>] sys_unlink+0x16/0x20
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jul 17 07:50:17 gprfc092 kernel: INFO: task iozone:3167 blocked for more than 120 seconds.
Jul 17 07:50:17 gprfc092 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 17 07:50:17 gprfc092 kernel: iozone        D 0000000000000000     0  3167   3166 0x00000080
Jul 17 07:50:17 gprfc092 kernel: ffff8807eb81fdd8 0000000000000086 0000000000000000 ffff8807eb81fd68
Jul 17 07:50:17 gprfc092 kernel: ffff8807eb81fd88 ffffffffa0275688 ffff8807eb81fdb8 0000000000000286
Jul 17 07:50:17 gprfc092 kernel: ffff880831d2e5f8 ffff8807eb81ffd8 000000000000fb88 ffff880831d2e5f8
Jul 17 07:50:17 gprfc092 kernel: Call Trace:
Jul 17 07:50:17 gprfc092 kernel: [<ffffffffa0275688>] ? nfs_refresh_inode_locked+0x468/0xce0 [nfs]
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150efbe>] __mutex_lock_slowpath+0x13e/0x180
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8121cd6f>] ? security_inode_permission+0x1f/0x30
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8150ee5b>] mutex_lock+0x2b/0x50
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8118fabc>] vfs_unlink+0x5c/0xe0
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8118e84a>] ? lookup_hash+0x3a/0x50
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191e85>] do_unlinkat+0xf5/0x1b0
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff810dc8f7>] ? audit_syscall_entry+0x1d7/0x200
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff81191f56>] sys_unlink+0x16/0x20
Jul 17 07:50:17 gprfc092 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jul 17 07:59:51 gprfc092 kernel: nfsacl: server gprfs018-10ge not responding, still trying


I think for this kind of workload, we need to tune:

1. In the client side:
 a) the options passed to mount command
 b) N/W tuning for 10GE and high workload
2. In the server side:
N/W tuning to handle the high workload

3. Not sure but we may need to look the windows size in wb-xlator.

Thanks,
Santosh

Comment 28 santosh pradhan 2013-07-19 14:37:07 UTC


1. Wrt RHS 2.0 Update 4, using the same testbed from Ben, we see the issue (reported for Big Bend) both with and without write-behind.

Ben : please check the testbed you gave us. The last run was with the write-behind xlator enabled for a U4 run. It would be great if you could take a few minutes to take a look and see if there is anything missing in the configuration from what you tested U4 with (speaking of the successful iozone run(s)).

2. Do you also remember the configuration (gluster conf mainly) when you ran iozone successfully. We did go through your iozone results but we are unsure if write-behind was enabled or not for those runs. Which version of RHEL was in your config?

3. Please note that in the testbed you shared with us, all of the following seem to hang:
(a) U4 with write-behind
(b) U4 without write-behind (glusterfs-nfs and glusterfsd both are OOM killed)
(c) Big Bend release 3.4.0.12rhs.beta3/

4. In our U4 runs, we noticed from iostat output that initially when the iozone workload started, the queue-len await times (not looking at svctm) are bad at times:
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     1.67    0.00    1.33     0.00    21.33    16.00     0.00    0.75   0.75   0.10
sdb               0.00     0.00    0.00  951.00     0.00 795400.00   836.38    83.86   88.18   0.77  73.67
dm-0              0.00     0.00    0.00    2.67     0.00    21.33     8.00     0.00    0.75   0.38   0.10
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00   698.67    0.00  951.00     0.00 795400.00   836.38    83.96   88.29   0.78  73.77
dm-3              0.00     0.00    0.00 1649.67     0.00 795400.00   482.16   146.38   88.73   0.45  73.77
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00


Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.67    8.67    1.00   298.67    13.33    32.28     0.05    5.24   4.79   4.63
sdb               0.00     0.00    0.00    1.33     0.00  1365.33  1024.00     0.15  737.50  43.75   5.83
dm-0              0.00     0.00    8.33    1.67   296.00    13.33    30.93     0.05    4.97   4.63   4.63
dm-1              0.00     0.00    0.33    0.00     2.67     0.00     8.00     0.00    9.00   9.00   0.30
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    1.33     0.00  1365.33  1024.00     0.15  737.50  43.75   5.83
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.30    0.00   0.00   5.83

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.67    0.00    1.33     0.00    13.33    10.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00  820.00     0.00 685968.00   836.55    69.78   85.10   0.68  55.37
dm-0              0.00     0.00    0.00    1.67     0.00    13.33     8.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00   602.67    0.00  820.00     0.00 685968.00   836.55    69.86   85.19   0.68  55.37
dm-3              0.00     0.00    0.00 1422.67     0.00 685968.00   482.17   121.43   85.30   0.39  55.37


If the latency is ~43ms, then the backpressure could build on the wb xlator as well. We are trying to look at the code and figure out the explanation you wanted wrt nfs commits.

5. It would be great if you could share your RHS-2.0-U5  run results.

Comment 29 Ben England 2013-07-19 20:12:48 UTC

I'm not happy, I just reproduced problem with Anshi glusterfs-3.3.0.7 RPMs.  I tried to run everything the same way I did before, when it clearly did not hang.

I changed my test script so it always logs "rpm -qa" and "gluster volume info", hopefully in the future this will answer questions like this.

So it appears that we always had the problem.  Are you sure that no customers have encountered this already?  

You don't need iozone to reproduce the problem.  A single dd command will do it.  This is the simplest write application that I know so every Gluster user will see this problem when writing large files.

This command hung after only 9 GB of data.  The replicas were not on the NFS server.  You only need 2 servers, 1 client to reproduce this.

ssh gprfc081 'f=/mnt/glnfs/x.dd; rm -fv $f ; dd if=/dev/zero of=$f bs=16384k count=1024'

But this command never hung and never leaked memory - glusterfs process did not grow its virtual memory at all.  This is dd utility equivalent of iozone's "-o" option to use the O_SYNC flag when opening the file.  Every write request is flushed to storage before it completes.

ssh gprfc081 'f=/mnt/glnfs/x.dd; rm -fv $f ; dd if=/dev/zero of=$f bs=16384k count=1k oflag=sync'
removed `/mnt/glnfs/x.dd'
1024+0 records in
1024+0 records out
17179869184 bytes (17 GB) copied, 137.944 s, 125 MB/s

Again this suggests NFS COMMIT processing is not working.

Comment 30 santosh pradhan 2013-07-23 10:00:04 UTC

gluster version: Big bend (3.4.0.12rhs.beta5-2)

Tuning:
1. Server: vm.min_free_kbytes = 1048576

2. client:
Mount nfs with bigger rsize/wsize:
mount -t nfs -o nfsvers=3,tcp,rsize=32768,wsize=32768 gprfs017-10ge:/perftest /mnt/glnfs

3. In iozone, set the record size smaller to avoid huge rpc retransmissions:

(-r 4k with 16 threads)

Verified in the client side (al clients), no retransmission happens. glusterfs(d) processes also came down in memory size.

With this, iozone finished succesfully, here is the output:

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 32g -t 16
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.392 $
		Compiled for 64 bit mode.
		Build: linux-AMD64 

	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
	             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
	             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
	             Ben England.

	Run began: Tue Jul 23 05:27:36 2013

	Network distribution mode enabled.
	Hostname = 10.16.159.187
	Setting no_unlink
	Include close in write timing
	Include fsync in write timing
	No retest option selected
	Record Size 4 KB
	File size set to 33554432 KB
	Command line used: iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 4k -s 32g -t 16
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Throughput test with 16 processes
	Each process writes a 33554432 Kbyte file in 4 Kbyte records

	Test running:
	Children see throughput for 16 initial writers 	=  416480.96 KB/sec
	Min throughput per process 			=   24609.73 KB/sec 
	Max throughput per process 			=   27953.86 KB/sec
	Avg throughput per process 			=   26030.06 KB/sec
	Min xfer 					= 29644520.00 KB
	Child[0] xfer count = 30135464.00 KB, Throughput =   24888.27 KB/sec
	Child[1] xfer count = 29906884.00 KB, Throughput =   24827.08 KB/sec
	Child[2] xfer count = 32807456.00 KB, Throughput =   27329.24 KB/sec
	Child[3] xfer count = 32005172.00 KB, Throughput =   26662.56 KB/sec
	Child[4] xfer count = 30035532.00 KB, Throughput =   24817.52 KB/sec
	Child[5] xfer count = 29735420.00 KB, Throughput =   24685.27 KB/sec
	Child[6] xfer count = 32017976.00 KB, Throughput =   26673.22 KB/sec
	Child[7] xfer count = 31608052.00 KB, Throughput =   26331.74 KB/sec
	Child[8] xfer count = 30064660.00 KB, Throughput =   24827.14 KB/sec
	Child[9] xfer count = 30332952.00 KB, Throughput =   25183.77 KB/sec
	Child[10] xfer count = 33040404.00 KB, Throughput =   27521.22 KB/sec
	Child[11] xfer count = 33084820.00 KB, Throughput =   27562.14 KB/sec
	Child[12] xfer count = 30635224.00 KB, Throughput =   25307.52 KB/sec
	Child[13] xfer count = 29644520.00 KB, Throughput =   24609.73 KB/sec
	Child[14] xfer count = 32774976.00 KB, Throughput =   27300.69 KB/sec
	Child[15] xfer count = 33554432.00 KB, Throughput =   27953.86 KB/sec



iozone test complete.


DRC patch is yet to be tested.


Thanks,
Santosh

Comment 31 santosh pradhan 2013-07-23 12:03:40 UTC

vm.min_free_kbytes = 131072 (Reduced to 128MB instead of 1GB)
mount -t nfs -o nfsvers=3,tcp,rsize=32768,wsize=32768
iozone record size = 16k

This time, it passed as well:

iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 16k -s 32g -t 16
    Iozone: Performance Test of File I/O
            Version $Revision: 3.392 $
        Compiled for 64 bit mode.
        Build: linux-AMD64

    Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                 Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
                 Ben England.

    Run began: Tue Jul 23 06:20:13 2013

    Network distribution mode enabled.
    Hostname = 10.16.159.187
    Setting no_unlink
    Include close in write timing
    Include fsync in write timing
    No retest option selected
    Record Size 16 KB
    File size set to 33554432 KB
    Command line used: iozone -+m /var/tmp/iozone_cluster.config -+h 10.16.159.187 -C -w -c -e -i 0 -+n -r 16k -s 32g -t 16
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
    Throughput test with 16 processes
    Each process writes a 33554432 Kbyte file in 16 Kbyte records

    Test running:
    Children see throughput for 16 initial writers     =  425322.62 KB/sec
    Min throughput per process             =   26312.52 KB/sec
    Max throughput per process             =   26887.80 KB/sec
    Avg throughput per process             =   26582.66 KB/sec
    Min xfer                     = 31964480.00 KB
    Child[0] xfer count = 32900256.00 KB, Throughput =   26468.06 KB/sec
    Child[1] xfer count = 31976400.00 KB, Throughput =   26323.77 KB/sec
    Child[2] xfer count = 33413312.00 KB, Throughput =   26750.23 KB/sec
    Child[3] xfer count = 33157936.00 KB, Throughput =   26559.38 KB/sec
    Child[4] xfer count = 33115824.00 KB, Throughput =   26636.75 KB/sec
    Child[5] xfer count = 31964480.00 KB, Throughput =   26317.89 KB/sec
    Child[6] xfer count = 33021152.00 KB, Throughput =   26450.86 KB/sec
    Child[7] xfer count = 33114976.00 KB, Throughput =   26529.80 KB/sec
    Child[8] xfer count = 33304128.00 KB, Throughput =   26887.80 KB/sec
    Child[9] xfer count = 32330704.00 KB, Throughput =   26619.55 KB/sec
    Child[10] xfer count = 33297312.00 KB, Throughput =   26662.14 KB/sec
    Child[11] xfer count = 33161376.00 KB, Throughput =   26564.73 KB/sec
    Child[12] xfer count = 32718272.00 KB, Throughput =   26312.52 KB/sec
    Child[13] xfer count = 32231040.00 KB, Throughput =   26529.39 KB/sec
    Child[14] xfer count = 33499104.00 KB, Throughput =   26827.74 KB/sec
    Child[15] xfer count = 33554432.00 KB, Throughput =   26882.01 KB/sec



iozone test complete.

Comment 32 Ben England 2013-07-23 18:00:48 UTC

sorry, I can't reproduce.  I tried all your suggestions, it hangs the same way it did before.  However, there are no stack traces, which is consistent with fact that RHEL6.4 isn't as vulnerable to memory fragmentation problem.

For example, gprfc090 "nfsstat -c" shows this, and gprfc090 was rebooted immediately before this test was run so counters were all zeroed.  
Counters stop changing when it hangs, and /var/log/messages says NFS server is not responding.  NFS server shows 7 GB in glusterfs and it's using 1 core up.

Jul 23 13:31:28 gprfc090 kernel: nfs: server gprfs018-10ge not responding, still trying

If we analyze the nfs counters, we have ~2.6 million writes x 32 KB/write = ~83 GB of writes, which makes sense -- iozone has 4 threads running on this client, each is trying to write 32 GB for a total of 128 GB.  The number of commit requests is 68 , so we have roughly 1 GB per commit request!   This may be the problem.  If we issue a COMMIT while system is busy, or perhaps multiple commits from different clients, could it timeout?  

I don't know of a way to increase the COMMIT frequency on the client side.  I thought fs.nfs.nfs_congestion_kb would do it but it doesn't work.

Client rpc stats:
calls      retrans    authrefrsh
2657537    25597      2539831 

Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 12        0% 0         0% 20        0% 20        0% 0         0% 
read         write        create       mkdir        symlink      mknod        
0         0% 2658483  99% 12        0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 4         0% 2         0% 68        0%

Comment 33 Ben England 2013-07-23 21:36:17 UTC

I now have some proof that this problem is in Gluster-NFS implementation and needs to be fixed there.

Jeff Layton helped me figure out how to get the NFS COMMIT frequency from the client to increase -- I set this on all the clients:

# sysctl -w vm.dirty_bytes=262144000 vm.dirty_background_bytes=65536000

When I do this, the excessive glusterfs memory consumption does not occur as much.

By default, they use vm.dirty_ratio and vm.dirty_background_ratio, which default to 20% and 10% of RAM respectively, so for a 64-GB system that would be 12 GB and 6 GB respectively.

So Gluster NFS translator is in fact flushing when COMMIT arrives, they just weren't arriving often enough to keep the write-behind list from getting excessively long. With this setting, I saw glusterfs grow to 2.7 GB VM size, but it still seemed to be able to function, and the test completed, even with a record size of 16384 KB. Now I'm going to remove some of the other tuning and verify that it still works, to confirm the hypothesis. Sorry that it takes a long time.

But you cannot expect NFS clients to tune vm.dirty_bytes and vm.dirty_background_bytes, this was just a workaround. What has to happen is that Gluster has to automatically flush uncommitted writes to storage if they exceed some size threshold. So instead of calling nfs_flush from just the nfs3_commit routine, you have to trigger a call to it from somewhere else (maybe during write processing) if this threshold is detected. Can we make the threshold configurable please?

Comment 34 Ben England 2013-07-24 04:08:54 UTC

Now that I've increased NFS COMMIT frequency as Jeff Layton suggested with 

vm.dirty_background_bytes=32768000
vm.dirty_bytes=65536000

on the NFS clients to help Gluster-NFS implementation flush writes to bricks sooner, I have successfully backed out tunings that were introduced earlier when we weren't sure what the cause was - no hangs are occurring anymore.  glusterfs NFS server process never exceeds 1 GB, is closer to 600 MB now.

- removing NFS mount options rsize=32k,wsize=32k does not cause hang
- Removing record size of 16 KB (back to 64 KB) does not cause hang
- removing kernel network tunings in /etc/sysctl.conf does not cause hang (TCP autotuning should be handling some of them). 
- No volume tuning parameters were required - but eager-lock may help performance
- vm.min_free_kbytes=1GB is not required (on Big Bend = RHEL6.4) - but may help performance

But as soon as I set vm.dirty_background_bytes=0 vm.dirty_bytes=0 (default values), glusterfs becomes huge and hangs again.

This bz is not fixed until NFS clients don't have to be tuned.

Comment 35 Scott Haines 2013-07-24 07:14:50 UTC

Not a regression.  Requires kernel tunables.  Now targeting Corbett.

Comment 37 santosh pradhan 2013-07-25 10:29:11 UTC

Working on the DRC patch.

Comment 38 santosh pradhan 2013-07-25 12:58:30 UTC

Update:
=======

1. I tried DRC patch (without any tuning in client/server side), same behaviour i.e iozone hangs.

My findings after all these testing:
===================================

2. Whenever iozone hangs, these are the behaviours we see in client/server side:

a. client side: nfsstat shows a lot of retransmission:

e.g.
Client rpc stats:
calls      retrans    authrefrsh
418122     31379      396229 

which is huge.

What does it tell: NFS server is not able to handle the workload. Gluster NFS is single threaded (Now it can not be changed though).

[Whenever it's seen, in case of kernel NFS, it's advised to increase the number of NFS kernel thread.]

b. server side:
"netstat -s" shows a lot of packet overrun issues in network:

e.g.

    1884 packets pruned from receive queue because of socket buffer overrun
    42 packets pruned from receive queue
    14 packets dropped from out-of-order queue because of socket buffer overrun
    1629267 packets collapsed in receive queue due to low socket buffer

It demands "Network Tuning" in server side for better NFS performance.

3. What can be tuned for better NFS performance:
a. Network (server side)
b. System e.g. vm subsystem in server
c. Client e.g. VM subsys in client

It's recommended to tune the client VM subsys for better NFS performance, especially the dirty page-cache parameter which may congest the network during the high I/O workload (high volume of write) in a system with bigger RAM (more than 8-GB).

In our case, the amount of RAM in clients (64GB) are more than that of server ( ~48GB). The default vm params in client m/c:

vm.dirty_background_ratio = 10 ( which is ~ 6.4GB)
vm.dirty_ratio = 20 ( which is ~26GB)

These are really big for a NFS client which needs tuning.

vm.dirty_background_bytes = 0
vm.dirty_bytes = 0

The tuning can be made through either of the set but only one set ll be taken into a/c, later has the preference I guess (not sure).

http://www.novell.com/support/kb/doc.php?id=7010287

Not sure if we enforce the write in gluster NFS, how much it is going to help. I am not sure but I guess NFS kernel version also does not do any such enforcement.

I would request Anand/Ben/Sayan/Vivek for their input to take it forward from here.

Thanks,
Santosh

Comment 39 santosh pradhan 2013-07-25 16:40:37 UTC

One more thing forgot to mention i.e. performance problem because of RAID: ( which might need some tuning)

The glusterfs brick is cut from a RAID disk (in both the servers), they might be imposing some performance bottleneck (as mentioned by Anand in one of the email thread). 

What is the RAID level for the bricks ?
What is the chunk size of the RAID?
What is the block size of underlying filesystem (xfs in this case)?

I dont know much about the RAID tuning for performance but I think xfs/RAID needs some tuning for better performance.

Reference:
(http://www.redwax.co.uk/blog-reader/items/linux-ext2-ext3-and-ext4-stride-stripe-calculation-on-raid.html)

(1) Ideally 10GE set-up should provide better NFS performance instead of 1GE interface. If the NFS is going to write 64KB data (GlusterFS supports NFS v3 which auto-negotiates to 64KB as rsize/wsize) then it should generate more TCP packets in case of 1GE (1500 MTU) than 10GE (MTU >= 9000). [Exception: 1GE also can support jumbo-frames if configured]. But in our case, the reverse happens. I tested with 1GE setup where iozone test passed with all record size. In fact, iozone record size should not matter, what matters is the NFS (r/w)size.

(2) In the past, Ben tested the same with RHS 2.0 U4 (as per the shared test result) where iozone test passed. But with same version of GlusterFS s/w, the test fails in this setup.

From both of the above mentioned points, it looks like the disk plays something in the performance degradation.

I will check with Anand/system admin (here in bangalore) if jumbo frame can be enabled in the switch where both of BLR-servers are connected. Both of the servers have 10GE interface. Both of the machines are connected to SATA disks (Anand, please rectify me if the disks are SAS or so).

Comment 40 Ben England 2013-07-30 20:08:39 UTC

I tried using vm.dirty_ratio=2 vm.dirty_background_ratio=1 and it hung, so no reasonable vm.dirty tuning is going to make Gluster-NFS work.  

I also saw glusterfs fail to stop when I stopped the volume.  THis makes it impossible to fix without killing the glusterfs NFS process, not a good thing.

re: .-1

These clients are not abnormally sized.  If anything you could see clients with far more memory than this in the field.

RAID level is 6
chunk size is 256 KB or 64 KB, not sure which right now.
block size is default (directory block size is 8192)
XFS file system is made according to RHS best practices document as follows by the test harness used by perf team:

mkfs -t xfs -f -i size=512 -n size=8192 -d agcount=31,su=256k,sw=10 -L RHSbrick0 /dev/vg_brick0/lv && mount -v -t xfs -o inode64,noatime -L RHSbrick0 /mnt/brick0

agcount=31 should not be needed if you have latest xfsprogs rpm in Big Bend.
noatime is irrelevant for this workload.

Typically jumbo frames can be enabled in the switch.  jumbo frame size of 7000 is almost as good as 9000 but is much easier for memory allocator to find, so if you are seeing kernel stack traces with jumbo frames try this.

as for disks, BAGL hosts use SAS 7200-RPM drives with PERC H700 (MegaRAID) with writeback cache enabled.

Comment 41 santosh pradhan 2013-08-01 17:46:43 UTC

(In reply to Ben England from comment #40)
> I tried using vm.dirty_ratio=2 vm.dirty_background_ratio=1 and it hung, so
> no reasonable vm.dirty tuning is going to make Gluster-NFS work.  
> 

[Santosh] dirty_background_ration is 1 i.e. 1% of RAM is ~ 640MB which still looks higher. May be something around 300MB would work for our RAM size as mentioned by the Novell KB article. (yet to be tried out) 


> I also saw glusterfs fail to stop when I stopped the volume.  THis makes it
> impossible to fix without killing the glusterfs NFS process, not a good
> thing.
>

[Santosh] Agree, this does not look good. :-(
I am not actively working on this bug as this is targeted for Corbett. Thanks for all the experimentation.

Thanks,
Santosh

Comment 42 Ben England 2013-08-05 21:05:47 UTC

I have a hang now even with the kernel tuning.  Performance is bad with prior vm.dirty_bytes=65536000 so I tried multiplying by 4...  

 -- gprfc077 -- 
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 32768000
vm.dirty_ratio = 0
vm.dirty_bytes = 262144000

and with this workload 

/usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.70.252 -i 0 -C -w -+n -s 67108864 -r 128 -c -e -t 8

I got multiple of these messages on multiple servers:

Aug  5 15:09:53 gprfs017 GlusterFS[17954]: [2013-08-05 15:09:53.438281] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-0: cannot lookup the saved frame corresponding to xid (1618709)
..
Aug  5 15:45:04 gprfs019 GlusterFS[16293]: [2013-08-05 15:45:04.467323] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-perftest-client-2: cannot lookup the saved frame corresponding to xid (1130351)
...

in /var/log/glusterfs/nfs.log:

[2013-08-05 15:45:09.733087] I [client-handshake.c:450:client_set_lk_version_cbk] 0-perftest-client-2: Server lk version = 1
[2013-08-05 16:15:18.775003] E [rpc-clnt.c:207:call_bail] 0-perftest-client-2: bailing out frame type(GlusterFS 3.3) op(FXATTROP(34)) xid = 0x1138745x sent = 2013-08-05 15:45:08.199970. timeout = 1800
[2013-08-05 16:15:18.775040] W [client-rpc-fops.c:1811:client3_3_fxattrop_cbk] 0-perftest-client-2: remote operation failed: Transport endpoint is not connected
[2013-08-05 18:30:09.639496] E [rpc-clnt.c:207:call_bail] 0-perftest-client-2: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x1167741x sent = 2013-08-05 18:00:07.870139. timeout = 1800
...

my theory is that some gluster process crashes and loses state about NFS such that NFS cannot make progress because it can't get COMMITs to succeed.

Comment 43 santosh pradhan 2013-08-07 11:49:52 UTC

Ben,

I dont think gluster process crashes. They become unresponsive, mostly because of TCP/IP stack pressure. (in server side)

1. From our experimentation, one thing looks clear i.e. if we do NFS COMMIT more frequently with less data (cached in client), the test PASS.

dd /iozone with O_DIRECT/sync flag PASS because it by-passes/avoids the client cache.

No rpc retransmission and NO TCP/IP packet drops/data lost are seen.

2. So, if we increase the cache amount (vm dirty_ratio/dirty_background_ratio), then the test FAILs. 

In FAIL case, rpc retransmission goes up crazy along with TCP/IP layer pressure in the server side.

So my understanding is no single tunable is going to help for NFS performance.

Client side:
===========

1. Tuning needed in vm.dirty_ratio and vm.dirty_background_ratio (or bytes) with m/c with huge RAM. Size needs to be decided based on the experiments/various tests.

2. Increase the number of RPC connections to server i.e. /proc/sys/sunrpc/tcp_slot_table_entries, especially for 10GE interface.

[ Ref: http://researcher.watson.ibm.com/researcher/view_person_subpage.php?id=4427]

Server side:
==========

1. Tuning vm (already there in place)
2. a) Tuning tcp/ip/socket layer buffers - Needs to be done.
   b) Tune at NIC level i.e. txqueuelen
3. Tune performance.write-behind-window-size (Not sure how much would be good, need to check).
4. Tune at I/O scheduler level for Disk I/O performance:
For NFS server m/c, I guess deadline is the scheduler to go with.
It is already SET in /dev/sdb which is brick dir (if I am not wrong)
[Ref https://docspace.corp.redhat.com/docs/DOC-92040]

In our test bed, looks like the tcp/socket(core)/ip/NIC layer tuning is needed in the SERVER side and vm-cache size and RPC-tcp_slot_table_entries needs to be tuned in the CLIENT side.

Thanks,
Santosh

Comment 47 Ben England 2013-09-09 17:15:34 UTC

I disagree that the problem is single-threaded-ness of NFS. I reduced number of clients from 8 to 4, so there was only 1 NFS client per NFS server, and the same problem still occurs with sufficiently large files.  Throughput is better though until the hang occurs.

Sayan has said repeatedly that a solution which requires tuning the NFS client is not acceptable, because no one will implement it.   We need to fix the NFS server to avoid this problem.  I have not been able to tune my way out of the problem by adjusting TCP buffering kernel parameters on the server, for example.   Please do not close this bug on the basis that client tuning is a solution.

Comment 50 Pavithra 2013-09-13 16:50:57 UTC

This is added as a known issue in the 2.1 release notes.

Comment 51 Pavithra 2013-09-13 16:51:44 UTC

Created attachment 797446 [details]
known issue

Comment 52 Pavithra 2013-09-16 05:33:35 UTC

The GA link is here for the known issue:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/2.1_Release_Notes/index.html#chap-Documentation-2.1_Release_Notes-Known_Issues

Comment 53 Pavithra 2013-09-16 06:41:10 UTC

Moving it back to assigned state: 
https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/2.1_Release_Notes/index.html#chap-Documentation-2.1_Release_Notes-Known_Issues

Comment 54 Ben England 2013-09-19 16:57:12 UTC

See bz 1008301 comment 4 for workaround

Comment 55 santosh pradhan 2013-10-07 10:01:05 UTC

I tested with minimal changes (from patch posted for review) with downstream, iozone load is passing:

Volume info:
===========
Volume Name: perftest
Type: Distributed-Replicate
Volume ID: c052bfd8-fd21-4140-8059-f3d2c5993b76
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs017-10ge:/mnt/brick0/brick
Brick2: gprfs018-10ge:/mnt/brick0/brick
Brick3: gprfs019-10ge:/mnt/brick0/brick
Brick4: gprfs020-10ge:/mnt/brick0/brick

How the volume is mounted with 8 clients:
=========================================

All mount points on gprfs017.sbu.lab.eng.bos.redhat.com:
172.17.10.89:/perftest
172.17.10.94:/perftest
172.17.70.252:/perftest
All mount points on gprfs018.sbu.lab.eng.bos.redhat.com:
172.17.10.90:/perftest
172.17.10.95:/perftest
All mount points on gprfs019.sbu.lab.eng.bos.redhat.com:
172.17.10.91:/perftest
172.17.10.96:/perftest
All mount points on gprfs020.sbu.lab.eng.bos.redhat.com:
172.17.10.77:/perftest
172.17.10.92:/perftest


IOZONE workload:
=================

	Iozone: Performance Test of File I/O
	        Version $Revision: 3.392 $
		Compiled for 64 bit mode.
		Build: linux-AMD64 

	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
	             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
	             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
	             Ben England.

	Run began: Mon Oct  7 05:46:00 2013

	Network distribution mode enabled.
	Hostname = 172.17.70.252
	Setting no_unlink
	Include close in write timing
	Include fsync in write timing
	No retest option selected
	Record Size 128 KB
	File size set to 67108864 KB
	Command line used: /usr/local/bin/iozone -+m /var/tmp/iozone_cluster.config -+h 172.17.70.252 -C -w -c -e -i 0 -+n -r 128 -s 64g -t 8
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Throughput test with 8 processes
	Each process writes a 67108864 Kbyte file in 128 Kbyte records

	Test running:
	Children see throughput for  8 initial writers 	=  984939.92 KB/sec
	Min throughput per process 			=  119135.94 KB/sec 
	Max throughput per process 			=  128661.16 KB/sec
	Avg throughput per process 			=  123117.49 KB/sec
	Min xfer 					= 59963392.00 KB
	Child[0] xfer count = 59963392.00 KB, Throughput =  119135.94 KB/sec
	Child[1] xfer count = 64541440.00 KB, Throughput =  128661.16 KB/sec
	Child[2] xfer count = 67108864.00 KB, Throughput =  122749.62 KB/sec
	Child[3] xfer count = 60181632.00 KB, Throughput =  123384.07 KB/sec
	Child[4] xfer count = 60026624.00 KB, Throughput =  119214.86 KB/sec
	Child[5] xfer count = 60088448.00 KB, Throughput =  126651.59 KB/sec
	Child[6] xfer count = 66225664.00 KB, Throughput =  121826.70 KB/sec
	Child[7] xfer count = 60401152.00 KB, Throughput =  123315.98 KB/sec



iozone test complete.

Comment 57 Ben England 2013-10-09 14:06:30 UTC

manoj says some of his tests still fail with the patch.    I think we need to understand whether the original problem is solved.  In the original configuration we used 8 clients, 4 servers, and 1/2/4 threads/server, writing a total of 128 GB data per server in the test.  I'd love to think this solves the problem but until I see the original test case pass I'm unconvinced.  The reason that the original test case matters is that with 4 servers, you have 50% chance that any file will replicas not on the NFS servers, this induces more network traffic and latency.

Comment 58 santosh pradhan 2013-10-09 14:28:26 UTC

(In reply to Ben England from comment #57)
> manoj says some of his tests still fail with the patch.    I think we need

[Santosh] I dont know what is the workload he is trying with. In he was present while I ran the iozone I/O workload (with the 1x2 vol, 4 clients, 2 clients mounted/server) which PASSED. :-) 

> to understand whether the original problem is solved. In the original
> configuration we used 8 clients, 4 servers, and 1/2/4 threads/server,
> writing a total of 128 GB data per server in the test.  I'd love to think
> this solves the problem but until I see the original test case pass I'm
> unconvinced.  The reason that the original test case matters is that with 4
> servers, you have 50% chance that any file will replicas not on the NFS
> servers, this induces more network traffic and latency.

[Santosh] This is exactly what was being tested in comment # 55.

Comment 60 Ben England 2013-10-09 18:55:12 UTC

avati suggestions in discussion today 
- turn off write-behind, we don't need it, particularly with NFS client aggregating I/O.
- have enough RPC slots to not starve clients (above patch right?)
- implement fairness in RPC socket polling level (round-robin scheduling) so 1 client cannot monopolize resources on the NFS server (glusterfs). This last change has not been coded yet.

Comment 62 santosh pradhan 2013-10-14 16:46:46 UTC

With so many experiments, it looks like the number of RPC in-flight requests are not the real issue. The real issue is networking subsystem tuning (in driver level). While the clients are sending/bursting huge I/O, then because of auto-tuning nature of TCP buffers, they gets tuned to higher value to receive the I/O packets (NFS I/O gets broken into TCP frame size packets i.e. 1mb NFS I/O size ll be broken into 9000 bytes jumbo-frame to send across). If the Gluster NFS process (through Kernel) is not able to process the packets fast enough because of some reason e.g. especially because of slowness of disk bandwidth (high avgqu-size and await time), then back pressure builds upto NFS and its size grows. Then the real problem builds up i.e. the TCP/IP stack receive queue i.e. netdev_max_backlog gets FULL and starts dropping packets. Eventually the connection gets RESET. It really matters. If the connection between NFS server and brick-server gets reset, the NFS server memory footprint blows up and finally OOM killed or so. Or if the connection between NFS client and server gets reset, then NFS server finishes up with pending I/O to its disk and other bricks (clean up its outstanding/backlog) and its memory footprint comes down to normal. But it ll stop getting the I/O packets from client. Hence the problem.

Solution:

1. Faster disk subsystem i.e. less avgqu-size and await time.
2. Tune the txqueuelen (for Tx queue size) and netdev_max_backlog (Rx queue size).

There is no standard value but as per the requirement. Till no drop in netstat -in output.

I am going to modify the FIX which was up for review. May abandon or may push with smaller change to keep the NFS fopsize in sync with RPC lib.

Need Ben/manoj's view on this.

Comment 63 Manoj Pillai 2013-10-16 18:03:18 UTC

I followed up on a suggestion that came up from Sayan in the RHS-perf call, which was to look at kernel NFS server behavior under the same workload, on the same setup. The kernel NFS server exported the same XFS filesystem that served as the brick FS in the gluster NFS runs. gprfs017 served as the NFS server in these runs.

All kernel NFS runs completed smoothly.

Variations tried:
Iozone run with 4 clients: -s 16g, -t 4.
Iozone run with 8 clients: -s 16g, -t 8.
Iozone run with 8 clients; 1 NFS server thread instead of default 8: -s16g,-t8.

[For comparison, gluster NFS iozone runs were failing with -s8g,-t4 (4 clients), single-head mount of an RHS replicated volume]

I/O completion times reported by iostat (await) are fairly high for the kernel NFS runs as well (typically 70+ ms), but neither that nor single-threadedness is resulting in any serious problems.

Comment 64 Ben England 2013-10-16 19:11:10 UTC

Note that this is with the same NFS client, same network and same type of storage that Gluster's NFS implementation is using.  The replication traffic is not there of course.  This result is consistent with hypothesis I have been pursuing, that there is a bug in the NFS translator preventing a) fairness and b) limiting size of write request list.  Kernel NFS does not have this problem, I think this is so because vm.dirty_ratio in the kNFS server will eventually force it to flush pages in buffer cache even if client does not issue a COMMIT RPC.  Gluster does not have a similar mechanism AFAIK, but it should.  Just issuing sync if list exceeds a few hundred MB would be reasonable because this would keep the list small enough.  

if you write the same amount of data with each thread writing a series of 1-GB files with fsync after each file, instead of 1 big 16-GB file, I bet it doesn't hang.   When file is closed and fsync'ed, you can see glusterfs process giving back the memory.

Avati suggested this, but it's not clear to me why Gluster even needs a write behind translator in the NFS stack, since the NFS client is itself performing the same function (i.e. it aggregates writes before issuing the RPC).  Now that you added the RPC size tuning, NFS client can aggregate much more data before issuing the RPC, up to 1 MB right?  That's same as write-behind translator's default tuning.

Comment 67 Manoj Pillai 2013-10-21 15:34:31 UTC

(In reply to Ben England from comment #64)

> 
> if you write the same amount of data with each thread writing a series of
> 1-GB files with fsync after each file, instead of 1 big 16-GB file, I bet it
> doesn't hang.   When file is closed and fsync'ed, you can see glusterfs
> process giving back the memory.
> 

Here are the runs for this case, with the smallfile benchmark. The configuration is 4 clients, single-head NFS mount of a replica volume.

# 1st run; each client writes 8 files, each 1GB; fsync after each file.
# file-size is in KB
for op in cleanup create ; do 
 ./smallfile_cli.py --top /mnt/glnfs/smallfile --threads 1 --host-set gprfc089,gprfc090,gprfc091,gprfc092 --file-size 1048576 --record-size 256 --files 8 --fsync y --operation $op
done

This run completes successfully.



# 2nd run; each client writes 1 8GB file, then fsync
for op in cleanup create ; do 
 ./smallfile_cli.py --top /mnt/glnfs/smallfile --threads 1 --host-set gprfc089,gprfc090,gprfc091,gprfc092 --file-size 8388608 --record-size 256 --files 1 --fsync y --operation $op
done

This run failed.

Comment 68 santosh pradhan 2013-10-28 09:14:32 UTC

Patch up for review:

https://code.engineering.redhat.com/gerrit/#/c/14685/

Comment 69 santosh pradhan 2013-10-31 09:13:44 UTC

Merged the FIX.

Comment 70 Ben England 2014-01-02 12:18:04 UTC

see bz 1008301, which is about the same thing right?  I don't understand why we  clone bzs.

To close this bug, you need to fix nfs.drc=on so it does not cause massive memory leak, and you need to change default for nfs.outstanding-rpc-limit to 8 from 64.

Comment 71 santosh pradhan 2014-01-02 15:13:17 UTC

(In reply to Ben England from comment #70)
> see bz 1008301, which is about the same thing right?  I don't understand why
> we  clone bzs.

[Santosh] We clone bugs because we maintain 2 different code base (one downstream/RHS and one upstream/community). 

> 
> To close this bug, you need to fix nfs.drc=on 

[Santosh] Now in downstream/RHS, nfs.drc is going to be OFF by default. See https://bugzilla.redhat.com/show_bug.cgi?id=1047782

so it does not cause massive
> memory leak, and you need to change default for nfs.outstanding-rpc-limit to
> 8 from 64.

[Santosh] I dont think this change is needed. Could you please confirm that the performance goes down with value 64?

Comment 72 Gowrishankar Rajaiyan 2014-01-10 11:48:44 UTC

(In reply to santosh pradhan from comment #71)
[snip]
> > memory leak, and you need to change default for nfs.outstanding-rpc-limit to
> > 8 from 64.
> 
> [Santosh] I dont think this change is needed. Could you please confirm that
> the performance goes down with value 64?

Per comment #71, needinfo on Ben England.

Comment 73 Ben England 2014-01-13 18:38:49 UTC

Performance doesn't go down -- because the NFS clients hang!  The whole point of Avati's patch was to allow us to prevent the scenario where NFS clients are writing more data than Gluster can accept.  With this tuning, the numbers are good.  I published numbers for RHS 2.1 U2 nightly build a while back, see the NFS tab in this:

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/bigbend/rhs21u2/rhs21u2-stats.ods

In storage bootcamp TOC section on performance.

measured with:

RPM glusterfs-3.4.0.43.1u2rhs-1.el6rhs.x86_64, from  http://download.lab.bos.redhat.com/nightly/RHSS-2.1u2-20131124.n.0/2.1u2/RHS/x86_64/os/
with tuning nfs.outstanding-rpc-limit: 16

raw data at http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/
  glnfs-rhs21u2-nfsorl16.csv - samples
  glnfs-rhs21u2-nfsorl16-4svr-13-12-01-15-47-33 - run with 4 servers
  glnfs-rhs21u2-nfsorl16-2svr-1repl-13-12-02-08-18-09 - run with 2 servers

When I ran tests without nfs.outstanding-rpc-limit=16, the NFS clients would hang, I'd have to reboot them all to get them back.  

Do you have any evidence that this tuning will cause problems for anyone?  Has anyone else tried to reproduce this problem and observed whether or not this fixes it?  If not, why not just change it and be done with it?

Comment 74 santosh pradhan 2014-01-14 06:08:38 UTC

The default value of outstanding rpc limit is changed to 16 as per the suggestion from Ben. Code is posted for review.

https://code.engineering.redhat.com/gerrit/#/c/18326/

Comment 75 santosh pradhan 2014-01-14 06:09:52 UTC

(In reply to Ben England from comment #73)
> Performance doesn't go down -- because the NFS clients hang!  The whole
> point of Avati's patch was to allow us to prevent the scenario where NFS
> clients are writing more data than Gluster can accept.  With this tuning,
> the numbers are good.  I published numbers for RHS 2.1 U2 nightly build a
> while back, see the NFS tab in this:
> 
> http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/bigbend/rhs21u2/
> rhs21u2-stats.ods
> 
> In storage bootcamp TOC section on performance.
> 
> measured with:
> 
> RPM glusterfs-3.4.0.43.1u2rhs-1.el6rhs.x86_64, from 
> http://download.lab.bos.redhat.com/nightly/RHSS-2.1u2-20131124.n.0/2.1u2/RHS/
> x86_64/os/
> with tuning nfs.outstanding-rpc-limit: 16
> 
> raw data at
> http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/scalability-
> cluster/ben-test-driver-clone2/shared/benchmarks/gluster_test/logs/
>   glnfs-rhs21u2-nfsorl16.csv - samples
>   glnfs-rhs21u2-nfsorl16-4svr-13-12-01-15-47-33 - run with 4 servers
>   glnfs-rhs21u2-nfsorl16-2svr-1repl-13-12-02-08-18-09 - run with 2 servers
> 
> When I ran tests without nfs.outstanding-rpc-limit=16, the NFS clients would
> hang, I'd have to reboot them all to get them back.  
> 
> Do you have any evidence that this tuning will cause problems for anyone? 
> Has anyone else tried to reproduce this problem and observed whether or not
> this fixes it?  If not, why not just change it and be done with it?

I did the change and posted for review. With this, I think we are done and the bug can be closed.

Comment 76 ssamanta 2014-01-15 11:21:28 UTC

Since the fix is abandoned so marking this bug to "Assigned". Please move in to ON_QA with correct fixed-in version.

Comment 77 santosh pradhan 2014-01-16 03:20:40 UTC

https://code.engineering.redhat.com/gerrit/#/c/18388/

Comment 78 Ben England 2014-01-19 14:30:11 UTC

what does comment 77 mean?  Is it in RHS 2.1 Update 2 or is it in Corbett?  

If you need doctext: " Default Gluster volume parameter values changed from nfs.outstanding-rpc-limit=64 and nfs.drc=on to nfs.outstanding-rpc-limit=16 and nfs.drc=off.  These changes improve performance and stability of Gluster NFS implementation for multi-GB file writes".

Comment 79 Vivek Agarwal 2014-01-20 04:35:18 UTC

Update 2 and Corbett are same. Thanks for the doc text.

Comment 80 Pavithra 2014-01-20 05:58:22 UTC

Vivek,

Can you please review the doc text for technical accuracy and sign off?

Comment 81 Ben Turner 2014-02-05 19:15:07 UTC

Verified on glusterfs-3.4.0.58rhs-1.el6rhs.x86_64.  I tested several different configurations, this one seemed to be a good repro from earlier comments so I am documenting it:

[root@gqas003 ~]# gluster volume status
Status of volume: testvol
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 192.168.123.4:/bricks/brick1			49152	Y	4369
Brick 192.168.123.2:/bricks/brick1			49152	Y	4373
Brick 192.168.123.4:/bricks/brick2			49153	Y	4380
Brick 192.168.123.2:/bricks/brick2			49153	Y	4384
Brick 192.168.123.4:/bricks/brick3			49154	Y	4391
Brick 192.168.123.2:/bricks/brick3			49154	Y	4395
Brick 192.168.123.4:/bricks/brick4			49155	Y	4402
Brick 192.168.123.2:/bricks/brick4			49155	Y	4406
NFS Server on localhost					2049	Y	4416
Self-heal Daemon on localhost				N/A	Y	4421
NFS Server on 192.168.123.2				2049	Y	4420
Self-heal Daemon on 192.168.123.2			N/A	Y	4424
 
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@gqas011 ~]# cat /clients.ioz 
192.168.123.7 /gluster-mount /usr/bin/iozone
192.168.123.6 /gluster-mount /usr/bin/iozone
192.168.123.3 /gluster-mount /usr/bin/iozone
192.168.123.5 /gluster-mount /usr/bin/iozone

[root@gqas015 ~]# /opt/iozone/bin/iozone -+m /clients.ioz -+h 192.168.123.7 -i 0 -C -+n -s 64g -r 128 -c -e -t 4
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.420 $
		Compiled for 32 bit mode.
		Build: linux 

	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
	             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
	             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
	             Vangel Bojaxhi, Ben England, Vikentsi Lapa.

	Run began: Wed Feb  5 13:58:26 2014

	Network distribution mode enabled.
	Hostname = 192.168.123.7
	No retest option selected
	File size set to 67108864 KB
	Record Size 128 KB
	Include close in write timing
	Include fsync in write timing
	Command line used: /opt/iozone/bin/iozone -+m /clients.ioz -+h 192.168.123.7 -i 0 -C -+n -s 64g -r 128 -c -e -t 4
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Throughput test with 4 processes
	Each process writes a 67108864 Kbyte file in 128 Kbyte records

	Test running:
	Children see throughput for  4 initial writers 	=  471090.27 KB/sec
	Min throughput per process 			=   78883.48 KB/sec 
	Max throughput per process 			=  166030.09 KB/sec
	Avg throughput per process 			=  117772.57 KB/sec
	Min xfer 					= 32556032.00 KB
	Child[0] xfer count = 32556032.00 KB, Throughput =   78883.48 KB/sec
	Child[1] xfer count = 40002944.00 KB, Throughput =   95576.45 KB/sec
	Child[2] xfer count = 53048960.00 KB, Throughput =  130600.24 KB/sec
	Child[3] xfer count = 67108864.00 KB, Throughput =  166030.09 KB/sec

	Test cleanup:


iozone test complete.

Comment 83 errata-xmlrpc 2014-02-25 07:31:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html

Note You need to log in before you can comment on or make changes to this bug.