Created attachment 516884 [details]
RHEL 5.5 to 5.7 NFSv3 iozone results
Description of problem:
NFS (mostly v3; haven't tried v4 yet) performance has dropped by roughly 50% between kernel 2.6.18-238.19.1.el5 and 2.6.18-274.el5 (i.e. between RHEL 5.6 and RHEL 5.7).
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Mount an NFSv3 share on a RHEL 5.6 server with the -o rsize=32767,wsize=32767 options
2. Test it with iozone
3. Do the same on a RHEL 5.7 server
Actual results:
See attached iozone results (roughly a 50% decrease in write performance).
Expected results:
Performance should be at least equal between the two kernels with the same test parameters.
Additional info:
Tested with HNAS NFS exports; still have to check NFSv4 performance too.
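For reference, a reproduction along these lines might look like the following; the server name, export path, and iozone flags below are placeholders for illustration, not values taken from this report:

```shell
# Hypothetical reproduction sketch -- hostname, export path and iozone
# record/file sizes are illustrative assumptions, not from the report.
mount -t nfs -o vers=3,rsize=32767,wsize=32767 nas01:/export /mnt/nfstest
iozone -a -i 0 -i 1 -f /mnt/nfstest/iozone.tmp   # write/rewrite, read/reread
umount /mnt/nfstest
```

Running the same commands against the same export from a 5.6 and a 5.7 client makes the write-throughput numbers directly comparable.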
What is an "HNAS NFS export"?
FWIW, I find this a bit odd since it directly contradicts my experience testing 5.7, which showed much *better* write performance. The difference may be in the test setup, however, and in particular in the server details...
If this is a RHEL5.6 server, what is the underlying filesystem being exported?
I suggest that we focus on NFSv3 for now since that's the default in RHEL5. NFSv4 will probably be similar anyway...
What may be best is to start with one of the tests that has regressed here (say Fwrite performance). Choose one "spot" in the set (single file size and I/O size), and verify whether the performance has consistently regressed when you run just that single test.
Assuming it does, a good first pass would be to capture the nfsstat -c output before and after each test run. That would let us see whether there are much larger numbers of any particular type of RPC in the regressed case.
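A capture procedure along the lines suggested might look like this (the iozone flags and paths are assumptions chosen for illustration):

```shell
# Snapshot client-side RPC counters around a single regressed test case,
# then compare the RPC mixes. Paths and iozone parameters are examples.
nfsstat -c > /tmp/nfsstat.before
iozone -i 0 -r 32k -s 512m -f /mnt/nfstest/iozone.tmp
nfsstat -c > /tmp/nfsstat.after
diff -u /tmp/nfsstat.before /tmp/nfsstat.after
```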
HNAS NFS export refers to an NFS export from a Hitachi "High performance NAS platform" (HNAS).
I will redo the write test on one regression spot and post the results of both the test and nfsstat -c ASAP.
(In reply to comment #3)
> HNAS NFS export refers to an NFS export from an Hitachi "High performance NAS
> platform" (HNAS).
OK, I'm a little confused -- the initial report said this was a RHEL5.6 server. Is Hitachi's NAS platform built on RHEL? If so, can you provide some details as to what they're using? Hardware, underlying filesystem type, etc...
Created attachment 516904 [details]
Details on one regression spot
Attached are the results from re-running iozone on just one of the regression spots from the full test, along with the nfsstat -c output taken before and after the test on both the RHEL 5.6 and 5.7 kernels.
Sorry, my mistake: the problem is a regression that happened between RHEL 5.6 and RHEL 5.7 on the client side. I didn't test against a RHEL NFS server, only a third-party one (Hitachi). This bug report is all about RHEL's NFS client, not the server component.
Ok, we'll need some details about their server. One oddity in the data:
5.6:
    writes = 4096
    creates = 2
5.7:
    writes = 14340
    creates = 4
...so we see almost 4 times the number of write calls in this data, along with twice as many files being created. The increase in create calls is particularly suspicious...
Was this test being run in isolation, or was there something else going on on the 5.7 machine at the same time?
What may be better than nfsstat here is the output from /usr/bin/mountstats on the mountpoint in question. That would actually give us more detailed info.
Please make sure that there's no other NFS activity on the machine at the time as well and preferably do it after a reboot. That should help ensure that we have a clean test.
Created attachment 517131 [details]
mountstats details on one of the regression spots
Here is the result from a fresh test, with a completely isolated environment and freshly rebooted servers.
Created attachment 517145 [details]
Same test on a Netapp NFS export
Just ran a test with a NetApp NFS export to check whether the problem is in the HNAS server, and the performance drop is even worse!
Also ran a quick test with RHEL 5.6 as the NFS server and RHEL 5.7 as the client, and I got exactly the same write performance as with the HNAS server (~50MB/s max). (I'll post detailed results from this one too if needed.)
How much RAM is in this machine? I wonder whether you're hitting some sort of problem with the congestion control...
Can you also give me the output from:
# cat /proc/sys/fs/nfs/nfs_congestion_kb
Created attachment 517171 [details]
Both servers are DL380 G5 servers with 16GB of RAM.
Here is the output of cat /proc/sys/fs/nfs/nfs_congestion_kb on both (RHEL 5.6 and 5.7) servers. The key doesn't seem to exist by default on the 5.6 server though (nor on our other RHEL 5.5 servers).
Created attachment 517210 [details]
patch -- revert nfs: ->flush and ->fsync should use FLUSH_SYNC
Would you be able to test out this patch and let me know if it helps?
Note that this is just to verify whether this is the change that causes the performance to regress in your case. Reverting this patch may cause the data-integrity issue that it fixes to regress.
Currently rebuilding the kernel with this patch; I will post the results ASAP.
Created attachment 517249 [details]
The patch seems to fix the performance issue. I also built a debug kernel in case you need me to run some tests on the data-integrity issues that may have come back.
Thanks for confirming it. Unfortunately, we can't take that patch as-is, since the change it reverts fixes a data-integrity problem where close() can return before it should. I'll need to look over the code more closely to see what can be done.
I suspect I know what the problem is, but fixing it won't be trivial...
The reason we use FLUSH_SYNC here is to prevent a data-integrity race that can cause close() to return more quickly than it should. nfs_flush_one will do this:
if (how & FLUSH_SYNC)
Unfortunately, that has the effect of serializing all of the writes being issued by nfs_flush_list. We could parallelize these writes -- have nfs_flush_one/nfs_flush_multi return before the call completes -- but then we'd need to collect a list of rpc_task pointers and make sure we wait on them all to complete before exiting.
That's doable but complicated and will make the code diverge further from what went in upstream. Still, the time to worry about that has really passed with RHEL5, so I'll probably at least hack out a proof-of-concept patch that we can use for testing.
Created attachment 517284 [details]
patch -- have nfs_flush_list issue FLUSH_SYNC writes in parallel
Here's a potential patch. This changes nfs_flush_one/multi to not wait for the RPC task to complete after issuing it in the FLUSH_SYNC case. Instead, tasks are placed on a private list, and then it waits for them to complete after issuing them all. This allows more of them to run in parallel.
Cursory testing shows a pretty substantial performance boost. If you can test this patch and report back with the results, then that would be great.
Hi, the patch doesn't seem to apply to the 2.6.18-274 src kernel package. Are you making it against a git kernel? (I get errors on include/linux/sunrpc/sched.h, and my source file doesn't seem to match yours, as I can't even find the hook manually.)
Created attachment 517388 [details]
patch -- have nfs_flush_list issue FLUSH_SYNC writes in parallel (-274.el5)
Yes, that patch is from the top of the git tree. Here's one based on 274.el5. We just need to add a field to the end of struct rpc_task, but a patch that went in since -274 has done the same thing.
Created attachment 517423 [details]
Here are the results of your latest patch. This seems to fix the performance issue. Please let me know if you need me to run any more tests with it (I'm going to run a full iozone test anyway, and will report if I spot more problems there).
Created attachment 517427 [details]
patch -- have nfs_flush_list issue FLUSH_SYNC writes in parallel
Ok, thanks for testing it.
There is at least one bug in the above patch. It uses list_for_each on the list of tasks and then calls rpc_put_task on each one. If the "put" ends up freeing the task, the iterator will be accessing freed memory, so we need to use list_for_each_entry_safe there.
In most cases the freed memory will still have the right contents, but it could oops, so you may want to do further tests with this one.
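The difference between the two iteration patterns can be shown with a userspace sketch; the struct and helpers below are illustrative placeholders, not the kernel's rpc_task code:

```c
/* Userspace illustration of why a _safe list walk is needed: when the
 * loop body may free the current entry, the next pointer must be saved
 * before freeing. The types here are made-up placeholders. */
#include <stdlib.h>

struct fake_task {
    struct fake_task *next;
};

/* Broken pattern, analogous to list_for_each + rpc_put_task:
 *
 *     for (t = head; t != NULL; t = t->next)
 *         free(t);                 // t->next is read after free(t)
 *
 * Safe pattern, analogous to list_for_each_entry_safe: */
static int put_all_tasks(struct fake_task *head)
{
    struct fake_task *t, *n;
    int freed = 0;

    for (t = head; t != NULL; t = n) {
        n = t->next;                /* saved before the entry is freed */
        free(t);
        freed++;
    }
    return freed;
}

/* Build a small list for demonstration purposes. */
static struct fake_task *make_tasks(int count)
{
    struct fake_task *head = NULL;
    int i;

    for (i = 0; i < count; i++) {
        struct fake_task *t = malloc(sizeof(*t));
        t->next = head;
        head = t;
    }
    return head;
}
```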
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Patch(es) available in kernel-2.6.18-282.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
Sorry if this should be obvious, but is this fix likely to appear in one of the EL5.7 updates? If not, will it make it into EL5.8, or has that window already closed?
Yes, it's coming to a 5.7.z kernel. See bug 730686.
(In reply to comment #32)
> Yes, it's coming to a 5.7.z kernel. See bug 730686.
Many thanks. I hadn't spotted that one.
According to comment 20, the bug has been verified. In my test, although 2.6.18-274's performance is not much worse than 2.6.18-293/238, the performance of the latter two is equal. The NFS server used RHEL 6.2. Results are attached.
Created attachment 529845 [details]
Created attachment 529847 [details]
iozone test result
*** Bug 745648 has been marked as a duplicate of this bug. ***
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
A previously introduced patch forced the ->flush and ->fsync operations to wait on all WRITE and COMMIT remote procedure calls (RPC) to complete to ensure that those RPCs were completed before returning from fsync() or close(). As a consequence, all WRITEs issued by nfs_flush_list were serialized and caused a performance regression on NFS clients. This update changes nfs_flush_one and nfs_flush_multi to not wait for WRITEs issued when the FLUSH_SYNC parameter is set, resolving performance issues on NFS clients.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.