Bug 1416127 - Regression in NFS DIO write sizes
Summary: Regression in NFS DIO write sizes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Jeff Layton
QA Contact: Yongcheng Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-24 16:27 UTC by Jeff Layton
Modified: 2018-11-09 13:33 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-06 12:38:58 UTC
Target Upstream Version:


Attachments
testcase (814 bytes, text/plain)
2017-01-24 16:27 UTC, Jeff Layton

Description Jeff Layton 2017-01-24 16:27:02 UTC
Created attachment 1243987 [details]
testcase

While working on a ceph-related patch, I noticed a regression in how the kernel issues DIO writes. It used to be that when you passed down a large DIO write request, the kernel would (usually) break it up into wsize-sized WRITE requests on the wire.

More recent kernels, however, issue the writes in page-sized chunks: the 1 MiB write in the test below goes out as 256 4k WRITEs instead of a single WRITE. They are still sent in parallel with a COMMIT at the end, so it's not as bad as it could be, but it's still slower than it should be.

Scott Mayhew did a quick test and saw:

[root@rhel7 ~]# uname -r
3.10.0-123.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t| grep -A3 WRITE
WRITE:
        1 ops (3%)      0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 1048760  avg bytes received per op: 68
        backlog wait: 5.000000  RTT: 35.000000  total execute time: 40.000000 (milliseconds)

...vs...


[root@rhel7 ~]# uname -r
3.10.0-229.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        256 ops (90%)   0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 4280     avg bytes received per op: 68
        backlog wait: 5.304688  RTT: 3.367188   total execute time: 8.695312 (milliseconds)
[root@rhel7 ~]# 


Note that this is a problem in mainline kernels as well, and I've started discussing it there. The attached testcase is how I'm testing it.
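For reference, below is a minimal sketch of the kind of reproducer described here. It is not the attached diotest2 itself; the buffer size, open flags, and error handling are illustrative assumptions. The essential point is one large, page-aligned O_DIRECT write from a buffer whose pages are never touched beforehand.

/* Hypothetical sketch of the reproducer described above (not the
 * attached testcase): a single 1 MiB O_DIRECT write from a buffer
 * that is allocated but never dirtied.
 * Build with: gcc -O2 -o diotest-sketch diotest-sketch.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define WRITE_SIZE (1024 * 1024)

int main(int argc, char **argv)
{
        void *buf;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        /* Page-aligned buffer, as O_DIRECT requires; its pages are
         * deliberately left untouched before the write. */
        if (posix_memalign(&buf, 4096, WRITE_SIZE))
                return 1;

        fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (write(fd, buf, WRITE_SIZE) < 0)
                perror("write");

        close(fd);
        free(buf);
        return 0;
}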

Comment 1 Scott Mayhew 2017-01-24 17:16:57 UTC
Looks like the regression occurred between 3.10.0-171.el7 and 3.10.0-172.el7:

[root@rhel7 ~]# uname -r
3.10.0-171.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        1 ops (3%)      0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 1048760  avg bytes received per op: 68
        backlog wait: 4.000000  RTT: 28.000000  total execute time: 32.000000 (milliseconds)

[root@rhel7 ~]# uname -r                                                                                                                                                                                           
3.10.0-172.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file                                                                                                                                                                             
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        256 ops (90%)   0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 4280     avg bytes received per op: 68
        backlog wait: 4.761719  RTT: 3.441406   total execute time: 8.230469 (milliseconds)

Comment 2 Jeff Layton 2017-01-24 19:46:36 UTC
Ahh, I think I might get it now, and it's not as bad as I had originally feared...

If you dirty all of the pages before writing, it seems to coalesce them correctly. The reproducer allocates the buffer, but doesn't actually dirty it before writing it out. Apparently the allocator sets up the mapping such that every page-sized offset in the allocation points to the same page; I imagine that page is then set up for CoW.
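As a sketch (assuming the reproducer sketch above; the memset below is not in the attached testcase), dirtying the buffer first is the whole difference:

        /* Touching every page should give each one its own backing page
         * rather than leaving them all aliased to a single CoW page, and
         * the DIO write then coalesces into wsize-sized WRITEs again. */
        memset(buf, 0xaa, WRITE_SIZE);
        if (write(fd, buf, WRITE_SIZE) < 0)
                perror("write");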

So with this testcase we end up in this check in nfs_can_coalesce_requests and hit the return false, since every sub-request points at that same page but their wb_pgbase offsets aren't contiguous within it:

                if (req->wb_page == prev->wb_page) {
                        if (req->wb_pgbase != prev->wb_pgbase + prev->wb_bytes)
                                return false;

I think that's in place to handle sub-page write requests, but maybe we should consider doing that a different way for DIO?

Comment 3 Jeff Layton 2017-01-25 12:32:14 UTC
I'm debating going ahead and closing this as WONTFIX. I'm having a hard time coming up with a valid use-case where this would matter. I'll leave it open for a few days to see if I think of anything, but for now I'm leaning toward just closing it.

