Bug 728508 - Huge performance regression in NFS client
Summary: Huge performance regression in NFS client
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jeff Layton
QA Contact: Jian Li
URL:
Whiteboard:
Duplicates: 745648
Depends On:
Blocks: 730686
 
Reported: 2011-08-05 11:55 UTC by Henry Geay de Montenon
Modified: 2018-12-03 17:20 UTC
CC List: 34 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A previously introduced patch forced the ->flush and ->fsync operations to wait for all WRITE and COMMIT remote procedure calls (RPCs) to complete before returning from fsync() or close(). As a consequence, all WRITEs issued by nfs_flush_list were serialized, causing a performance regression on NFS clients. This update changes nfs_flush_one and nfs_flush_multi so that, when the FLUSH_SYNC parameter is set, they no longer wait for each WRITE individually; the WRITEs are issued in parallel and waited on collectively, resolving the performance regression on NFS clients.
Clone Of:
Environment:
Last Closed: 2012-02-21 03:51:13 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
RHEL 5.5 to 5.7 NFSv3 iozone results (35.29 KB, application/x-compressed-tar)
2011-08-05 11:55 UTC, Henry Geay de Montenon
Details on one regression spot (6.35 KB, text/plain)
2011-08-05 14:34 UTC, Henry Geay de Montenon
mountstats details on one of the regression spots (13.84 KB, text/plain)
2011-08-08 08:40 UTC, Henry Geay de Montenon
Same test on a Netapp NFS export (13.33 KB, text/plain)
2011-08-08 09:14 UTC, Henry Geay de Montenon
/proc/sys/fs/nfs/nfs_congestion_kb content (365 bytes, text/plain)
2011-08-08 11:11 UTC, Henry Geay de Montenon
patch -- revert nfs: ->flush and ->fsync should use FLUSH_SYNC (1.55 KB, patch)
2011-08-08 13:19 UTC, Jeff Layton
patch results (5.38 KB, text/plain)
2011-08-08 15:47 UTC, Henry Geay de Montenon
patch -- have nfs_flush_list issue FLUSH_SYNC writes in parallel (4.60 KB, patch)
2011-08-08 19:03 UTC, Jeff Layton
patch -- have nfs_flush_list issue FLUSH_SYNC writes in parallel (-274.el5) (4.67 KB, patch)
2011-08-09 11:15 UTC, Jeff Layton
patch results (5.38 KB, text/plain)
2011-08-09 14:50 UTC, Henry Geay de Montenon
patch -- have nfs_flush_list issue FLUSH_SYNC writes in parallel (4.62 KB, patch)
2011-08-09 15:21 UTC, Jeff Layton
iozone result (45.93 KB, application/vnd.ms-excel)
2011-10-24 12:01 UTC, Jian Li
iozone test result (33.29 KB, application/x-tar)
2011-10-24 12:02 UTC, Jian Li


Links
Red Hat Product Errata RHSA-2012:0150 (SHIPPED_LIVE): Moderate: Red Hat Enterprise Linux 5.8 kernel update (last updated 2012-02-21 07:35:24 UTC)

Description Henry Geay de Montenon 2011-08-05 11:55:29 UTC
Created attachment 516884 [details]
RHEL 5.5 to 5.7 NFSv3 iozone results

Description of problem:
NFS (mostly v3; didn't try v4 yet) performance has dropped by roughly 50% between kernel 2.6.18-238.19.1.el5 and 2.6.18-274.el5 (between RHEL 5.6 and RHEL 5.7).

Version-Release number of selected component (if applicable):

kernel-2.6.18-274.el5, nfsv3

How reproducible:

Always

Steps to Reproduce:
1. mount an NFSv3 share on a RHEL 5.6 server with -o rsize=32767,wsize=32767 options
2. test it with iozone
3. do the same on a RHEL 5.7 server
  
Actual results:

see attached iozone results (roughly a 50% decrease in write performance)

Expected results:

performance should be at least equal between the two kernels with the same test parameters ...

Additional info:
Tested with HNAS NFS exports; I still have to check NFSv4 performance too.

Comment 1 Jeff Layton 2011-08-05 13:22:30 UTC
What is a "HNAS NFS export" ?

Comment 2 Jeff Layton 2011-08-05 14:01:47 UTC
FWIW, I find this a bit odd since it directly contradicts my experience in testing 5.7, which showed much *better* write performance. The difference, however, may lie in the test setup and in particular the server details...

If this is a RHEL5.6 server, what is the underlying filesystem being exported?

I suggest that we focus on NFSv3 for now since that's the default in RHEL5. NFSv4 will probably be similar anyway...

What may be best is to start with one of the tests that has regressed here (say Fwrite performance). Choose one "spot" in the set (single file size and I/O size), and verify whether the performance has consistently regressed when you run just that single test.

Assuming it does, a good first pass would be the nfsstat -c output before and after each test run. That would allow us to see if there are much larger numbers of any particular type of RPC in the regressed case.

Comment 3 Henry Geay de Montenon 2011-08-05 14:19:18 UTC
HNAS NFS export refers to an NFS export from a Hitachi "High performance NAS platform" (HNAS).
I will redo the write test on one regression spot and post the results of both the test and nfsstat -c ASAP.

Comment 4 Jeff Layton 2011-08-05 14:34:17 UTC
(In reply to comment #3)
> HNAS NFS export refers to an NFS export from an Hitachi "High performance NAS
> platform" (HNAS).

OK, I'm a little confused -- the initial report said this was a RHEL5.6 server. Is Hitachi's NAS platform built on RHEL? If so, can you provide some details as to what they're using? Hardware, underlying filesystem type, etc...

Comment 5 Henry Geay de Montenon 2011-08-05 14:34:22 UTC
Created attachment 516904 [details]
Details on one regression spot

Here are the results from running iozone again on just one of the regression spots from the full test, along with the nfsstat output taken before and after the test on both the RHEL 5.6 and 5.7 kernels.

Comment 6 Henry Geay de Montenon 2011-08-05 14:36:46 UTC
Sorry, my bad: the problem is a regression that happened between RHEL 5.6 and RHEL 5.7 on the client side. I didn't try with a RHEL NFS server, only a third-party one (Hitachi).
This bug report is all about RHEL's NFS CLIENT, not the server component.

Comment 7 Jeff Layton 2011-08-05 16:59:04 UTC
Ok, we'll need some details about their server. One oddity in the data:

5.6:
writes = 4096
creates = 2

5.7:
writes = 14340
creates = 4

...so we see almost 4 times the number of write calls in this data, along with twice as many files being created. The increase in create calls is particularly suspicious...

Was this test being run in isolation, or was there something else going on on the 5.7 machine at the same time?

What may be better than nfsstat here is the output from /usr/bin/mountstats on the mountpoint in question. That would actually give us more detailed info.

Please make sure that there's no other NFS activity on the machine at the time as well and preferably do it after a reboot. That should help ensure that we have a clean test.

Comment 8 Henry Geay de Montenon 2011-08-08 08:40:08 UTC
Created attachment 517131 [details]
mountstats details on one of the regression spots

Here is the result from a fresh test, with a completely isolated environment and freshly rebooted servers.

Comment 9 Henry Geay de Montenon 2011-08-08 09:14:01 UTC
Created attachment 517145 [details]
Same test on a Netapp NFS export

Just ran a test with a NetApp NFS export to check whether the problem is within the HNAS server, and the performance drop is even worse!
Also ran a quick test with RHEL 5.6 as the NFS server and RHEL 5.7 as the client, and I got the exact same write performance as with the HNAS server (~50 MB/s max). (I'll post detailed results from this one too if needed.)

Comment 10 Jeff Layton 2011-08-08 10:43:20 UTC
How much RAM is in this machine? I wonder whether you're hitting some sort of problem with the congestion control...

Can you also give me the output from:

# cat /proc/sys/fs/nfs/nfs_congestion_kb

Comment 11 Henry Geay de Montenon 2011-08-08 11:11:45 UTC
Created attachment 517171 [details]
/proc/sys/fs/nfs/nfs_congestion_kb content

Both machines are DL380 G5 servers with 16 GB of RAM.
Here is the output of cat /proc/sys/fs/nfs/nfs_congestion_kb on both (RHEL 5.6 and 5.7) servers. The key doesn't seem to exist by default on the 5.6 server, though (nor on our other RHEL 5.5 servers).

Comment 12 Jeff Layton 2011-08-08 13:19:57 UTC
Created attachment 517210 [details]
patch -- revert nfs: ->flush and ->fsync should use FLUSH_SYNC

Would you be able to test out this patch and let me know if it helps?

Note that this is just to verify whether this is the change that causes the performance to regress in your case. Reverting this patch may cause the data-integrity issue that it fixes to regress.

Comment 13 Henry Geay de Montenon 2011-08-08 14:25:11 UTC
Currently rebuilding the kernel with this patch; I will post the results ASAP.

Comment 14 Henry Geay de Montenon 2011-08-08 15:47:41 UTC
Created attachment 517249 [details]
patch results

The patch seems to fix the performance issue. I also built a debug kernel in case you need me to run some tests for the data-integrity issues that may have come back.

Comment 15 Jeff Layton 2011-08-08 15:54:20 UTC
Thanks for confirming it. Unfortunately, we can't take that patch as-is, as it would reintroduce a data integrity problem where close() returns before it should. I'll need to look over the code more closely to see what can be done.

Comment 16 Jeff Layton 2011-08-08 18:14:10 UTC
I suspect I know what the problem is, but fixing it won't be trivial...

The reason we use FLUSH_SYNC here is to prevent a data-integrity race that can cause close() to return more quickly than it should. nfs_flush_one will do this:

        if (how & FLUSH_SYNC)
                rpc_wait_for_completion_task(&data->task);

Unfortunately, that has the effect of serializing all of the writes being issued by nfs_flush_list. We could parallelize these writes -- have nfs_flush_one/nfs_flush_multi return before the call completes, but then we'd need to collect a list of rpc_task pointers, and make sure we wait on them to complete before exiting.

That's doable but complicated and will make the code diverge further from what went in upstream. Still, the time to worry about that has really passed with RHEL5, so I'll probably at least hack out a proof-of-concept patch that we can use for testing.
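
A minimal sketch of the issue-then-wait approach described here, not the actual attached patch; issue_one_write() and the tk_flush_list member of struct rpc_task are hypothetical names used only for illustration:

#include <linux/err.h>
#include <linux/list.h>
#include <linux/nfs_fs.h>
#include <linux/sunrpc/sched.h>

static int nfs_flush_list_parallel_sketch(struct list_head *pages, int how)
{
        LIST_HEAD(issued);              /* WRITE RPC tasks already started */
        struct rpc_task *task, *next;
        int error = 0;

        /* Issue all WRITE RPCs without waiting on any individual task. */
        while (!list_empty(pages)) {
                task = issue_one_write(pages, how);     /* hypothetical helper */
                if (IS_ERR(task)) {
                        error = PTR_ERR(task);
                        break;
                }
                if (how & FLUSH_SYNC)
                        list_add_tail(&task->tk_flush_list, &issued);
                else
                        rpc_put_task(task);
        }

        /* Now wait for all queued tasks in one pass, so the RPCs overlap. */
        list_for_each_entry_safe(task, next, &issued, tk_flush_list) {
                list_del(&task->tk_flush_list);
                rpc_wait_for_completion_task(task);
                rpc_put_task(task);     /* may free the task, hence _safe */
        }
        return error;
}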

Comment 17 Jeff Layton 2011-08-08 19:03:16 UTC
Created attachment 517284 [details]
patch -- have nfs_flush_list issue FLUSH_SYNC writes in parallel

Here's a potential patch. This changes nfs_flush_one/multi to not wait for the RPC task to complete after issuing it in the FLUSH_SYNC case. Instead, tasks are placed on a private list, and then it waits for them to complete after issuing them all. This allows more of them to run in parallel.

Cursory testing shows a pretty substantial performance boost. If you can test this patch and report back with the results, then that would be great.

Comment 18 Henry Geay de Montenon 2011-08-09 08:55:42 UTC
Hi, the patch doesn't seem to apply to the 2.6.18-274-src kernel package. Are you making it against a git kernel? (I get errors on include/linux/sunrpc/sched.h, and my source file doesn't seem to look like yours, as I can't even find the hook manually.)

Comment 19 Jeff Layton 2011-08-09 11:15:40 UTC
Created attachment 517388 [details]
patch -- have nfs_flush_list issue FLUSH_SYNC writes in parallel (-274.el5)

Yes, that patch is from the top of the git tree. Here's one based on 274.el5. We just need to add a field to the end of struct rpc_task, but a patch that went in since -274 has done the same thing.

Comment 20 Henry Geay de Montenon 2011-08-09 14:50:45 UTC
Created attachment 517423 [details]
patch results

Here are the results of your latest patch. It seems to fix the performance issue. Please let me know if you need me to run any more tests with it. (I'm going to run a full iozone test anyway and will report if I spot more problems there.)

Comment 21 Jeff Layton 2011-08-09 15:21:35 UTC
Created attachment 517427 [details]
patch -- have nfs_flush_list issue FLUSH_SYNC writes in parallel

Ok, thanks for testing it.

There is at least one bug in the above patch. It uses list_for_each on the list of tasks and then calls rpc_put_task on each one. If the "put" ends up freeing the task, the loop will access freed memory when it advances to the next entry. So we need to use list_for_each_entry_safe there.

In most cases, the freed memory still has the right contents, but it could oops, so you may want to do further tests with this one.
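
A small sketch of the fix described here, using the same hypothetical tk_flush_list field as in the sketch under comment 16; list_for_each_entry_safe() caches the next pointer before the loop body runs, so freeing the current task inside the loop is safe:

#include <linux/list.h>
#include <linux/sunrpc/sched.h>

/*
 * BROKEN (sketch): list_for_each_entry() reads task->tk_flush_list.next to
 * advance, but rpc_put_task() may already have freed *task:
 *
 *      list_for_each_entry(task, issued, tk_flush_list) {
 *              rpc_wait_for_completion_task(task);
 *              rpc_put_task(task);     // use-after-free on the next step
 *      }
 */
static void wait_and_put_tasks_sketch(struct list_head *issued)
{
        struct rpc_task *task, *next;   /* "next" caches the successor */

        list_for_each_entry_safe(task, next, issued, tk_flush_list) {
                list_del(&task->tk_flush_list);
                rpc_wait_for_completion_task(task);
                rpc_put_task(task);     /* safe: iterator no longer reads *task */
        }
}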

Comment 23 RHEL Program Management 2011-08-10 14:00:38 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 28 Jarod Wilson 2011-08-23 14:06:44 UTC
Patch(es) available in kernel-2.6.18-282.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 31 Jonathan Peatfield 2011-09-15 18:00:20 UTC
Sorry if this should be obvious, but is this fix likely to appear in one of the EL5.7 updates? If not, will it make it into EL5.8, or has that window already closed?

Comment 32 Jeff Layton 2011-09-15 18:25:16 UTC
Yes, it's coming to a 5.7.z kernel. See bug 730686.

Comment 33 Jonathan Peatfield 2011-09-15 18:31:46 UTC
(In reply to comment #32)
> Yes, it's coming to a 5.7.z kernel. See bug 730686.

Many thanks.  I hadn't spotted that one.

 -- Jon

Comment 37 Jian Li 2011-10-24 12:00:19 UTC
According to comment 20, the bug has been verified. In my test, although 2.6.18-274 performance is not much worse than 2.6.18-293/238, the performance of the latter two is equal. The NFS server runs RHEL 6.2. Results are attached.

Comment 38 Jian Li 2011-10-24 12:01:09 UTC
Created attachment 529845 [details]
iozone result

Comment 39 Jian Li 2011-10-24 12:02:25 UTC
Created attachment 529847 [details]
iozone test result

Comment 40 Jeff Layton 2011-10-25 10:20:43 UTC
*** Bug 745648 has been marked as a duplicate of this bug. ***

Comment 41 Martin Prpič 2011-10-27 09:19:50 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A previously introduced patch forced the ->flush and ->fsync operations to wait for all WRITE and COMMIT remote procedure calls (RPCs) to complete before returning from fsync() or close(). As a consequence, all WRITEs issued by nfs_flush_list were serialized, causing a performance regression on NFS clients. This update changes nfs_flush_one and nfs_flush_multi so that, when the FLUSH_SYNC parameter is set, they no longer wait for each WRITE individually; the WRITEs are issued in parallel and waited on collectively, resolving the performance regression on NFS clients.

Comment 42 errata-xmlrpc 2012-02-21 03:51:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html

