Bug 1416127

Summary: Regression in NFS DIO write sizes
Product: Red Hat Enterprise Linux 7
Reporter: Jeff Layton <jlayton>
Component: kernel
Assignee: Jeff Layton <jlayton>
Kernel sub component: NFS
QA Contact: Yongcheng Yang <yoyang>
Status: CLOSED WONTFIX
Severity: unspecified
Priority: unspecified
CC: chuck.lever, eguan, jiyin, smayhew, yoyang, zlang
Version: 7.4
Keywords: Reproducer
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-03-06 12:38:58 UTC
Type: Bug
Attachments: testcase

Description Jeff Layton 2017-01-24 16:27:02 UTC
Created attachment 1243987 [details]
testcase

While working on a ceph-related patch, I noticed a regression in how the kernel issues DIO writes. It used to be that when you passed down a large DIO write request, the kernel would (usually) break it up into wsize-sized write requests on the wire.

More recent kernels, however, have started issuing them in page-sized chunks. They are still sent in parallel with a commit at the end, so it's not as bad as it could be, but it's still slower than it should be.

Scott Mayhew did a quick test and saw:

[root@rhel7 ~]# uname -r
3.10.0-123.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t| grep -A3 WRITE
WRITE:
        1 ops (3%)      0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 1048760  avg bytes received per op: 68
        backlog wait: 5.000000  RTT: 35.000000  total execute time: 40.000000 (milliseconds)

...vs...


[root@rhel7 ~]# uname -r
3.10.0-229.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        256 ops (90%)   0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 4280     avg bytes received per op: 68
        backlog wait: 5.304688  RTT: 3.367188   total execute time: 8.695312 (milliseconds)
[root@rhel7 ~]# 


Note that this is a problem in mainline kernels as well, and I've started discussing it there. The attached testcase is how I'm testing it.

Comment 1 Scott Mayhew 2017-01-24 17:16:57 UTC
Looks like the regression occurred between 3.10.0-171.el7 and 3.10.0-172.el7:

[root@rhel7 ~]# uname -r
3.10.0-171.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        1 ops (3%)      0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 1048760  avg bytes received per op: 68
        backlog wait: 4.000000  RTT: 28.000000  total execute time: 32.000000 (milliseconds)

[root@rhel7 ~]# uname -r                                                                                                                                                                                           
3.10.0-172.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file                                                                                                                                                                             
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        256 ops (90%)   0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 4280     avg bytes received per op: 68
        backlog wait: 4.761719  RTT: 3.441406   total execute time: 8.230469 (milliseconds)

Comment 2 Jeff Layton 2017-01-24 19:46:36 UTC
Ahh, I think I might get it now and it's not as bad as I had originally feared...

If you dirty all of the pages before writing, it seems to coalesce them correctly. The reproducer allocates pages, but doesn't actually dirty them before writing them. Apparently the allocator sets up the mapping such that every page in the allocation points to the same (zero) page, presumably marked for CoW on first write.

So we end up in this test in nfs_can_coalesce_requests and hit the return false:

                if (req->wb_page == prev->wb_page) {
                        if (req->wb_pgbase != prev->wb_pgbase + prev->wb_bytes)
                                return false;

I think that's in place to handle sub-page write requests, but maybe we should consider doing that a different way for DIO?

Comment 3 Jeff Layton 2017-01-25 12:32:14 UTC
I'm debating going ahead and closing this as WONTFIX. I'm having a hard time coming up with a valid use-case where this would matter. I'll leave it open for a few days to see if I think of anything, but for now I'm leaning toward just closing it.