
Bug 1416127

Summary: Regression in NFS DIO write sizes
Product: Red Hat Enterprise Linux 7 Reporter: Jeff Layton <jlayton>
Component: kernel Assignee: Jeff Layton <jlayton>
kernel sub component: NFS QA Contact: Yongcheng Yang <yoyang>
Status: CLOSED WONTFIX Docs Contact:
Severity: unspecified    
Priority: unspecified CC: chuck.lever, eguan, jiyin, smayhew, yoyang, zlang
Version: 7.4 Keywords: Reproducer
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-06 12:38:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
testcase none

Description Jeff Layton 2017-01-24 16:27:02 UTC
Created attachment 1243987
testcase

While working on a ceph-related patch, I noticed a regression in how the kernel issues DIO writes. It used to be that when you passed down a large DIO write request, the kernel would (usually) try to break it up into wsize-sized write requests on the wire.

More recent kernels, however, have started issuing them in page-sized chunks. They do get sent in parallel, with a commit at the end, so it's not as bad as it could be, but it's still slower than it should be.

Scott Mayhew did a quick test and saw:

[root@rhel7 ~]# uname -r
3.10.0-123.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t| grep -A3 WRITE
WRITE:
        1 ops (3%)      0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 1048760  avg bytes received per op: 68
        backlog wait: 5.000000  RTT: 35.000000  total execute time: 40.000000 (milliseconds)

...vs...


[root@rhel7 ~]# uname -r
3.10.0-229.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        256 ops (90%)   0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 4280     avg bytes received per op: 68
        backlog wait: 5.304688  RTT: 3.367188   total execute time: 8.695312 (milliseconds)


Note that this is a problem in mainline kernels as well, and I've started discussing it there. The attached testcase is how I'm testing it.

Comment 1 Scott Mayhew 2017-01-24 17:16:57 UTC
Looks like the regression occurred between 3.10.0-171.el7 and 3.10.0-172.el7:

[root@rhel7 ~]# uname -r
3.10.0-171.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        1 ops (3%)      0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 1048760  avg bytes received per op: 68
        backlog wait: 4.000000  RTT: 28.000000  total execute time: 32.000000 (milliseconds)

[root@rhel7 ~]# uname -r
3.10.0-172.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        256 ops (90%)   0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 4280     avg bytes received per op: 68
        backlog wait: 4.761719  RTT: 3.441406   total execute time: 8.230469 (milliseconds)

Comment 2 Jeff Layton 2017-01-24 19:46:36 UTC
Ahh, I think I might get it now and it's not as bad as I had originally feared...

If you dirty all of the pages before writing, it seems to coalesce them correctly. The reproducer allocates pages but doesn't actually dirty them before writing them. Apparently the allocator sets up the mapping so that the address at each page offset in the allocation points to the same page. I imagine it then sets that page up for CoW.

So we end up in this test in nfs_can_coalesce_requests() and hit the "return false":

                if (req->wb_page == prev->wb_page) {
                        if (req->wb_pgbase != prev->wb_pgbase + prev->wb_bytes)
                                return false;

I think that's in place to handle sub-page write requests, but maybe we should consider doing that a different way for DIO?

Comment 3 Jeff Layton 2017-01-25 12:32:14 UTC
I'm debating going ahead and closing this as WONTFIX. I'm having a hard time coming up with a valid use-case where this would matter. I'll leave it open for a few days to see if I think of anything, but for now I'm leaning toward just closing it.