Created attachment 1243987 [details]
testcase
While working on a ceph-related patch, I noticed a regression in how the kernel issues DIO writes. Previously, when you passed down a large DIO write request, the kernel would (usually) break it up into wsize-sized write requests on the wire.
More recent kernels, however, have started issuing them in page-sized chunks. They do get sent in parallel with a commit at the end, so it's not as bad as it could be, but it's still slower than it should be.
Scott Mayhew did a quick test and saw:
[root@rhel7 ~]# uname -r
3.10.0-123.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t| grep -A3 WRITE
WRITE:
1 ops (3%) 0 retrans (0%) 0 major timeouts
avg bytes sent per op: 1048760 avg bytes received per op: 68
backlog wait: 5.000000 RTT: 35.000000 total execute time: 40.000000 (milliseconds)
...vs...
[root@rhel7 ~]# uname -r
3.10.0-229.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
256 ops (90%) 0 retrans (0%) 0 major timeouts
avg bytes sent per op: 4280 avg bytes received per op: 68
backlog wait: 5.304688 RTT: 3.367188 total execute time: 8.695312 (milliseconds)
[root@rhel7 ~]#
Note that this is a problem in mainline kernels as well, and I've started discussing it there. The attached testcase is how I'm testing it.
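For comparison, the two mountstats results above can be reduced to a per-op overhead and a total byte count. The sketch below is only arithmetic on the reported averages; it assumes the testcase writes exactly 1 MiB in total (which the numbers suggest but the report does not state), and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical container for two fields of the mountstats WRITE line. */
struct sketch_write_stats {
    unsigned long ops;            /* number of WRITE ops */
    unsigned long avg_bytes_sent; /* "avg bytes sent per op" */
};

/* Total bytes put on the wire for WRITE ops. */
unsigned long sketch_total_sent(struct sketch_write_stats s)
{
    return s.ops * s.avg_bytes_sent;
}

/*
 * Bytes sent per op beyond the data payload, assuming (not confirmed by
 * the report) that payload_total bytes of data were split evenly across
 * the ops.
 */
unsigned long sketch_overhead_per_op(struct sketch_write_stats s,
                                     unsigned long payload_total)
{
    assert(s.ops > 0);
    return s.avg_bytes_sent - payload_total / s.ops;
}
```

Under that 1 MiB assumption, both kernels show the same 184 bytes of non-payload data per WRITE, so the newer kernel sends 255 * 184 = 46920 more bytes in total for the same data, in addition to the extra round trips.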
Ahh, I think I might get it now and it's not as bad as I had originally feared...
If you dirty all of the pages before writing, the requests do seem to get coalesced correctly. The reproducer allocates pages, but doesn't actually dirty them before writing them. Apparently the mapping is set up such that every page-sized offset in the allocation points to the same physical page. I imagine that page is then set up for CoW.
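A minimal sketch of the two cases, to make the difference concrete. This is not the attached testcase (its source isn't shown here); the function names, the 1 MiB size, and the use of mmap() are assumptions for illustration. The only difference between the two variants is whether the buffer is touched before the O_DIRECT write.

```c
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BUF_SIZE (1024 * 1024)   /* assumed: 256 pages of 4 KiB */

/*
 * Allocate a page-aligned anonymous buffer. With dirty == false the pages
 * are never written, so they may all be backed by one shared page until
 * the first store. With dirty == true every page is touched and gets its
 * own physical page.
 */
void *sketch_alloc(size_t len, bool dirty)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    if (dirty)
        memset(buf, 0xab, len);
    return buf;
}

/*
 * Issue one O_DIRECT write of the whole buffer. Returns what write()
 * returned, or -1 if the open failed (for example on a filesystem
 * without O_DIRECT support). Short writes are not retried.
 */
ssize_t sketch_dio_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return n;
}
```

Calling sketch_dio_write("/mnt/t/file", sketch_alloc(SKETCH_BUF_SIZE, false), SKETCH_BUF_SIZE) would correspond to the undirtied case described above, and passing true to the dirtied one; comparing the WRITE op counts from mountstats after each run should show whether the requests were coalesced.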
So we end up at this check in nfs_can_coalesce_requests() and hit the return false:
	if (req->wb_page == prev->wb_page) {
		if (req->wb_pgbase != prev->wb_pgbase + prev->wb_bytes)
			return false;
	}
I think that's in place to handle sub-page write requests, but maybe we should consider doing that a different way for DIO?
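To illustrate why that check turns one large write into 256 small ones, here is a userspace model of the coalescing decision. It is a sketch, not the kernel code: the struct, the function names, and the 4096-byte page size are hypothetical, the different-page rule is an assumption added so the model is complete, and wsize limits are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */
#define SKETCH_MAX_REQS  256u

/* Hypothetical, simplified stand-in for the fields used by the check. */
struct sketch_req {
    unsigned long page;   /* identity of the backing page (like wb_page) */
    unsigned int pgbase;  /* offset of the data in that page (like wb_pgbase) */
    unsigned int bytes;   /* length of the request (like wb_bytes) */
};

bool sketch_can_coalesce(const struct sketch_req *prev,
                         const struct sketch_req *req)
{
    if (req->page == prev->page) {
        /* Same page: only merge if req starts where prev ended. */
        if (req->pgbase != prev->pgbase + prev->bytes)
            return false;
    } else {
        /* Assumption: across pages, prev must end on a page boundary
         * and req must start at offset 0. */
        if (req->pgbase != 0 ||
            prev->pgbase + prev->bytes != SKETCH_PAGE_SIZE)
            return false;
    }
    return true;
}

/* Count how many separate writes a sequence of requests turns into. */
size_t sketch_count_writes(const struct sketch_req *reqs, size_t n)
{
    size_t ops = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || !sketch_can_coalesce(&reqs[i - 1], &reqs[i]))
            ops++;
    }
    return ops;
}

/*
 * Build full-page requests for a buffer of npages pages. If the buffer
 * was dirtied, each request has its own page; if not, every request
 * refers to the same shared page.
 */
size_t sketch_ops_for_buffer(size_t npages, bool dirtied)
{
    struct sketch_req reqs[SKETCH_MAX_REQS];

    assert(npages <= SKETCH_MAX_REQS);
    for (size_t i = 0; i < npages; i++) {
        reqs[i].page = dirtied ? i + 1 : 1;
        reqs[i].pgbase = 0;
        reqs[i].bytes = SKETCH_PAGE_SIZE;
    }
    return sketch_count_writes(reqs, npages);
}
```

In this model, 256 full-page requests that all refer to the same page never satisfy the same-page contiguity test (each starts at offset 0 while the previous one ended at 4096), so they go out as 256 writes; 256 requests on distinct pages merge into one. That matches the 256 ops versus 1 op seen in the mountstats output.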
I'm debating going ahead and closing this as WONTFIX. I'm having a hard time coming up with a valid use-case where this would matter. I'll leave it open for a few days to see if I think of anything, but for now I'm leaning toward just closing it.