Bug 739113

Summary: NFS: avoid COMMIT message when we combine direct I/O writes
Product: Red Hat Enterprise Linux 6
Reporter: IBM Bug Proxy <bugproxy>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED CURRENTRELEASE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 6.3
CC: bfields, dhowells, eguan, jjarvis, jlayton, nobody+PNT0273897, rwheeler, sbest, sprabhu, steved
Target Milestone: beta
Keywords: FutureFeature
Target Release: 6.3
Hardware: All
OS: All
Doc Type: Enhancement
Last Closed: 2011-12-13 15:33:38 UTC
Bug Blocks: 704634, 705085, 767187
Attachments:
  patch -- only do COMMIT for DIO range written rather than whole file (flags: none)
  NFS Vector Support Performance - Sync options (flags: none)
  Latest performance data for NFS vector support (flags: none)

Description IBM Bug Proxy 2011-09-16 15:04:51 UTC
1. Feature Overview:
Feature Id: [74900]
a. Name of Feature: NFS: avoid COMMIT message when we combine direct I/O writes
b. Feature Description

The current NFS vector I/O patch (for direct I/O) sends an NFS WRITE with the ASYNC flag followed by a COMMIT message. Since we immediately send the COMMIT message, we should instead send the NFS WRITE with the FILE_SYNC flag and avoid sending the COMMIT message altogether. This improves performance in our environment and should not negatively impact other environments.

2. Feature Details:
Sponsor: LTC
Architectures: x86_64

Arch Specificity: ---
Affects Kernel Modules: Yes
Delivery Mechanism: MaintWeb
Category: kernel
Request Type: Kernel - Enhancement from Upstream
d. Upstream Acceptance: Not evaluated
Sponsor Priority P3
f. Severity: high
IBM Confidential: No
Code Contribution: unsure
g. Component Version Target: ---

3. Business Case
Many customers run virtual machines over NFS (e.g., with SONAS as the NFS server). The fact that they get better performance on older distros is effectively a regression!

4. Primary contact at Red Hat:
John Jarvis, jjarvis

5. Primary contacts at Partner:
Project Management Contact:
Stephanie A. Glass, sglass.com

Technical contact(s):
Malahal R. Naineni, naineni.com

Comment 2 Jeff Layton 2011-09-16 17:10:20 UTC
(In reply to comment #0)
> The current NFS vector I/O patch (for direct I/O) sends NFS WRITE with ASYNC
> flag followed by COMMIT message. Since we immediately send COMMIT message, we
> should send the NFS WRITE with FILE_SYNC flag and avoid sending COMMIT message
> altogether. This improves performance in our environment and should not impact
> negatively under other environments.
> 

I have to be a bit skeptical of this, especially on that last claim...

The current code has this:

        if (dreq->commit_data == NULL || count <= wsize)
                sync = NFS_FILE_SYNC;

...which basically means that we use NFS_FILE_SYNC when we either couldn't allocate commit data (the rare -ENOMEM case) or the amount of data to be written is less than or equal to the wsize.

So we only use unstable writes when we'll need multiple WRITEs to satisfy the DIO request. It doesn't seem plausible to me that FILE_SYNC writes would be more efficient in that situation.

Typically, it's better to let the server batch up multiple writes and fsync them out to disk all at once. Perhaps this is different with your underlying filesystem (GPFS?), but it seems like this change would negatively impact other environments.

So, I think we'll need a bit more than just the above assertion that this is necessary and harmless on other filesystems:

1) We need to understand why this improves performance in your situation.

2) We need to ensure that this will not regress performance in other cases.

One thing that clearly is sub-optimal in the NFS code is that the follow-on commit in the DIO case is done over the entire file. We could consider changing the code just to commit the range that was written. Would this help your situation at all? If you're not sure, then I can give you a patch to test...

Comment 3 IBM Bug Proxy 2011-09-16 20:30:34 UTC
------- Comment From malahal.com 2011-09-16 16:27 EDT-------
(In reply to comment #5)
> (In reply to comment #0)

> One thing that clearly is sub-optimal in the NFS code is that the follow-on
> commit in the DIO case is done over the entire file. We could consider changing
> the code just to commit the range that was written. Would this help your
> situation at all? If you're not sure, then I can give you a patch to test...

Jeff, I will gladly take the patch and test it. That may work out for us.

Comment 4 Jeff Layton 2011-09-16 23:00:18 UTC
Created attachment 523663 [details]
patch -- only do COMMIT for DIO range written rather than whole file

Ok, here's the patch. I've not tested it other than for compilation, but I believe it'll do the right thing. Let me know how it goes. I'll plan to give it better testing when I get the chance next week.

Comment 5 Jeff Layton 2011-09-27 14:41:24 UTC
Any word on the above patch? Did it help?

Comment 9 IBM Bug Proxy 2011-11-07 15:50:30 UTC
------- Comment From sglass.com 2011-11-07 10:42 EDT-------
From Khoa Huynh, who is having problems accessing bugzilla:

He just wanted to note on the bug that he is still working on evaluating the patch. He will post results soon.

Comment 10 IBM Bug Proxy 2011-11-11 05:20:47 UTC
------- Comment From khoa.com 2011-11-11 00:11 EDT-------
I hope to post results tomorrow (Friday 11/11) or Monday next week.
Thanks.

Comment 11 IBM Bug Proxy 2011-11-14 22:50:51 UTC
------- Comment From khoa.com 2011-11-14 17:45 EDT-------
I'd like to update this bug. Evaluating this patch has been more difficult than I thought. It's been several months since I last looked at the vector support in NFS, and my test setup has gone through many upgrades and software revisions since then. I have tried to get back to the exact environment in which I last noticed this problem, but that has proven impossible, so I think it's best to revert to the current environment and recollect everything (with the current setup, which is different from the old one).

First I need to re-collect the data with the linearized QEMU for the baseline, then with the NFS vector support without NFS_FILE_SYNC, and see if it makes any difference in the new environment (with a newer version of GPFS). If so, I will move on to Jeff's latest patch, which limits the scope of the commits, and see if that helps. It will take me another few days to get all of this done. I'll post data as soon as possible, hopefully in the next few days. Thanks for your continued patience!

-Khoa

Comment 12 IBM Bug Proxy 2011-11-29 16:00:41 UTC
------- Comment From khoa.com 2011-11-29 10:59 EDT-------
(In reply to comment #7)
> Created an attachment (id=64700) [details]
> patch -- only do COMMIT for DIO range written rather than whole file
> ------- Comment on attachment From jlayton 2011-09-16 19:00:18
> EDT-------
> Ok, here's the patch. I've not tested it other than for compilation, but I
> believe it'll do the right thing. Let me know how it goes. I'll plan to give it
> better testing when I get the chance next week.

I can confirm that this patch is GOOD.  I was able to compile and test it on my setup with RHEL6.1 (on the KVM guest, KVM host, and NFS server) and GPFS 3.4.0-2.  In this environment (which has the same hardware as the original test environment but newer software), this patch (which limits the scope of the COMMIT) has brought the performance of the NFS vector support very close to that of the linearized QEMU (a temporary hack).  In fact, the difference is so small that it is essentially insignificant.

As a result, I think we should include the NFS vector support code plus this patch in the next RHEL update and remove the linearized QEMU hack (provided that all other testing of this NFS vector code is good).

Jeff - do you have any plans to submit the NFS vector code plus this patch upstream?

Thanks,
-Khoa

Comment 13 Jeff Layton 2011-11-29 16:44:44 UTC
Sure. Most filesystems don't seem to care much one way or the other whether the commit is done over the whole file. In the case of a clustered filesystem, though, I could see that it might be different. I'll plan to submit this upstream sometime this week and ask Trond to include it in 3.3. I don't expect it to be very controversial.

Do you have any numbers you can share that show the performance delta with and without this patch? It might help make for a more convincing case...

Comment 14 Jeff Layton 2011-11-29 16:50:09 UTC
Oh, and I have no intention of submitting any sort of fix for the vectorized DIO code in NFS upstream. I believe Badari did that a while back, and Trond said NAK.

I think he'll be more receptive to this change however.

Comment 15 Jeff Layton 2011-11-30 15:39:41 UTC
Ok, patch sent upstream, and I cc'ed Khoa; perhaps he can provide some numbers to make the patch more compelling for Trond (though I don't expect he'll have much of an issue with it).

Comment 16 IBM Bug Proxy 2011-11-30 21:13:17 UTC
------- Comment From khoa.com 2011-11-30 14:56 EDT-------
(In reply to comment #17)
> Ok, patch sent upstream, and I cc'ed Koah perhaps he can provide some numbers
> to make the patch more compelling for Trond (though I don't expect he'll have
> much of an issue with it).

Hi Jeff, yes, I saw the patch being sent out.  I'll add some data shortly. Thanks.
-Khoa

Comment 17 IBM Bug Proxy 2011-12-02 17:20:26 UTC
------- Comment From khoa.com 2011-12-02 12:18 EDT-------
I just discovered that I mistakenly applied the NFS_FILE_SYNC patch to the kernel that I built with Jeff's range sync patch.  As a result, I erroneously reported that Jeff's range sync patch had the same performance as the NFS_FILE_SYNC patch.  I fixed the error yesterday, and the latest results show that Jeff's range sync patch actually has no performance impact in my testing.

I'll post the latest data today.  I apologize for this confusion.
-Khoa

Comment 18 IBM Bug Proxy 2011-12-05 16:32:00 UTC
Created attachment 541006 [details]
NFS Vector Support Performance - Sync options


------- Comment on attachment From khoa.com 2011-12-05 11:24 EDT-------


This chart (in PDF format) shows the write performance data (collected on my setup with RHEL6.1 and GPFS 3.4.0-2) for:

1) NFS vector support (base code)
2) NFS vector support with range sync (Jeff Layton's latest patch where we perform sync on the range of data being written instead of entire file)
3) NFS vector support with NFS_FILE_SYNC (Badari Pulavarty's patch where we issue sync for each vector I/O)
4) Linearized QEMU hack (which is currently in RHEL6.1 errata and RHEL6.2)

The data shows that Jeff Layton's range sync patch had no performance impact on the scenarios that I tested on my setup.  The "linearized" QEMU hack had the best performance, which can be matched by Badari's patch (which issues NFS_FILE_SYNC on every vector I/O).
Thanks.
-Khoa

Comment 19 Jeff Layton 2011-12-05 16:54:02 UTC
Then we're back at square one, and I'll need you to answer the questions I posed in comment #2.

Comment 20 IBM Bug Proxy 2011-12-05 17:21:08 UTC
------- Comment From malahal.com 2011-12-05 12:14 EDT-------
(In reply to comment #21)
> Created an attachment (id=66878) [details]
> NFS Vector Support Performance - Sync options
> This chart (in PDF format) shows the write performance data (collected on my
> setup with RHEL6.1 and GPFS 3.4.0-2) for:
>
> 1) NFS vector support (base code)
> 2) NFS vector support with range sync (Jeff Layton's latest patch where we
> perform sync on the range of data being written instead of entire file)
> 3) NFS vector support with NFS_FILE_SYNC (Badari Pulavarty's patch where we
> issue sync for each vector I/O)
> 4) Linearized QEMU hack (which is currently in RHEL6.1 errata and RHEL6.2)
>
> The data shows that Jeff Layton's range sync patch had no performance impact on
> the scenarios that I tested on my setup.  The "linearized" QEMU hack had the
> best performance, which can be matched by Badari's patch (which issues
> NFS_FILE_SYNC on every vector I/O).
> Thanks.
> -Khoa

Khoa, what is the wsize in your case? If it is 256K or more, you should not see any separate COMMIT messages, based on what Jeff says in comment #2.

Comment 21 IBM Bug Proxy 2011-12-05 22:11:21 UTC
------- Comment From khoa.com 2011-12-05 17:01 EDT-------
(In reply to comment #23)
> Khoa, what is the wsize in your case? If it is 256K or more, you should not see
> any separate COMMIT messages based on what Jeff says in #2 comment.

The wsize was set to 1MB (auto-negotiated).  If I understand the code correctly, for a direct write we only issue a stable write (without a commit) if we could not allocate commit data or if it was a single write.  If it was part of a vector write, we would not issue a stable write; instead we would batch up these vector writes (as long as each is less than wsize) as unstable writes and then follow with a commit.  On GPFS, a commit message is very expensive, so this negatively impacts performance.

What Badari tried to do was make sure that each vector write is a stable write, doing away with commits altogether when we go through the vector direct write path.

Jeff - in the code that I have (with all the patches applied), I only see this in nfs_direct_write():

if (dreq->commit_data == NULL)

In your earlier comment in this bug report, you wrote that the current code had this:

sync = NFS_FILE_SYNC;

Did I miss the (count <= wsize) condition by mistake?  Or was it removed from the patches for some reason?

I think generating a stable write (NFS_FILE_SYNC) when count <= wsize would solve the problem in my testing.  What do you think?

-Khoa

Comment 22 IBM Bug Proxy 2011-12-07 00:30:27 UTC
------- Comment From malahal.com 2011-12-06 19:20 EDT-------
> In your earlier comment in this bug report, you wrote that the current had
> this:
>
> if (dreq->commit_data == NULL || count <= wsize)
> sync = NFS_FILE_SYNC;
>
> Did I miss the (count <= wsize) condition by mistake ?  Or was it removed from
> the patches for some reason ?

Jeff rearranged Badari's patches a bit. The current RHEL kernel release mentioned below has that extra check against wsize.

> I think generating stable write (NFS_FILE_SYNC) if count <= wsize would solve
> the problem in my testing.  What do you think ?

It should solve our problem for sure! Please test with the latest RHEL kernel release from here:

http://rhn.redhat.com/errata/RHSA-2011-1530.html

Comment 23 Jeff Layton 2011-12-08 19:09:21 UTC
Yes, we took Badari's patches as an interim hack until the DIO code is overhauled upstream. I think you'll need to ensure you're testing with those patches in place.

What may be interesting is whether this patch on top of a 6.2 kernel helps performance any. If you have some numbers that demonstrate that then we can send those to Trond and ask him to include them. If not, then I suggest we drop this case and close it WONTFIX.

Let me know either way...

Comment 24 IBM Bug Proxy 2011-12-08 19:30:33 UTC
------- Comment From malahal.com 2011-12-08 14:23 EDT-------
(In reply to comment #26)
> Yes, we took Badari's patches as an interim hack until the DIO code is
> overhauled upstream. I think you'll need to ensure you're testing with those
> patches in place.
>
> What may be interesting is whether this patch on top of a 6.2 kernel helps
> performance any.

Based on Khoa's block sizes and NFS write sizes, the range patch should not change any performance numbers on 6.2. In fact, he should not be seeing any commits on RHEL6.2. We will close it as soon as we get it verified by Khoa.

Comment 25 IBM Bug Proxy 2011-12-11 19:20:44 UTC
Created attachment 545397 [details]
Latest performance data for NFS vector support


------- Comment on attachment From khoa.com 2011-12-11 14:13 EDT-------


This latest chart shows the performance data of the following:

1) Early NFS vector support patches (Badari Pulavarty's original vector patch + Jeff Layton's context fix)

2) Early NFS vector support patches + Jeff Layton's range sync patch

3) Early NFS vector support patches + Badari Pulavarty's NFS_FILE_SYNC patch for vector code 

4) RHEL6.2 with Linearized QEMU hack DISABLED to show impact of the existing NFS vector support code

5) RHEL6.2 with Linearized QEMU hack ENABLED 

From the data in the chart, we can conclude the following for the tested workloads (FFSB, direct I/O, sequential and random writes at 256KB and 8KB block sizes):

a) Jeff Layton's range sync patch has no noticeable performance impact

b) Badari's NFS_FILE_SYNC patch was needed for the early NFS vector support patches, but for the existing NFS vector code in RHEL6.2, this patch is NO LONGER needed

c) On RHEL6.2, the existing NFS vector code has the same performance as the linearized QEMU hack.  As a result, it is now safe to remove the linearized QEMU hack from upcoming RHEL releases.

Regarding this bug report, I think we have two choices:

1) Close this bug report, as the problem it raised has been addressed in the NFS vector code in RHEL6.2.

2) Use this bug report to track the removal of the linearized QEMU hack.

My personal opinion is that we should close this bug report, and if we decide to remove the linearized QEMU hack in the upcoming RHEL versions, we should open another bug report for that.

Thanks,
-Khoa

Comment 26 Jeff Layton 2011-12-13 15:33:38 UTC
Ok, closing this with a resolution of CURRENTRELEASE.