Bug 1416127
Summary: | Regression in NFS DIO write sizes | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Jeff Layton <jlayton>
Component: | kernel | Assignee: | Jeff Layton <jlayton>
kernel sub component: | NFS | QA Contact: | Yongcheng Yang <yoyang>
Status: | CLOSED WONTFIX | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | chuck.lever, eguan, jiyin, smayhew, yoyang, zlang
Version: | 7.4 | Keywords: | Reproducer
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-03-06 12:38:58 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Looks like regression occurred between 3.10.0-171.el7 and 3.10.0-172.el7:

[root@rhel7 ~]# uname -r
3.10.0-171.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        1 ops (3%)      0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 1048760  avg bytes received per op: 68
        backlog wait: 4.000000  RTT: 28.000000  total execute time: 32.000000 (milliseconds)

[root@rhel7 ~]# uname -r
3.10.0-172.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        256 ops (90%)   0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 4280     avg bytes received per op: 68
        backlog wait: 4.761719  RTT: 3.441406   total execute time: 8.230469 (milliseconds)

Ahh, I think I might get it now and it's not as bad as I had originally feared...

If you dirty all of the pages before writing, it seems to coalesce them correctly. The reproducer allocates pages, but doesn't actually dirty them before writing them. Apparently the allocator is setting up the mapping such that each page offset address in the allocation points to the same page. I imagine it's then setting up that page for CoW.

So we end up in this test in nfs_can_coalesce_requests and hit the return false:

        if (req->wb_page == prev->wb_page) {
                if (req->wb_pgbase != prev->wb_pgbase + prev->wb_bytes)
                        return false;

I think that's in place to handle sub-page write requests, but maybe we should consider doing that a different way for DIO?

I'm debating going ahead and closing this as WONTFIX. I'm having a hard time coming up with a valid use-case where this would matter. I'll leave it open for a few days to see if I think of anything, but for now I'm leaning toward just closing it.
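A rough user-space sketch of why the check quoted above splits the write into per-page ops. This is only a model, not kernel code: `struct req`, `can_coalesce` and `count_write_ops` are hypothetical names, `page_id` stands in for the `wb_page` pointer, and only the same-page branch quoted in the comment is modeled (wsize limits and every other condition in nfs_can_coalesce_requests are ignored). It assumes 4096-byte pages and a 256-page write, which is what the 256 ops in the mountstats output suggest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct nfs_page; only the
 * fields used by the quoted check are modeled. */
struct req {
    unsigned long page_id; /* stands in for the wb_page pointer   */
    unsigned int  pgbase;  /* offset of the data within the page  */
    unsigned int  bytes;   /* number of bytes covered by the req  */
};

/* Models ONLY the same-page branch quoted in the comment: two requests
 * on the same page may merge only if the second starts exactly where
 * the first ends. Requests on different pages are always allowed here,
 * which is a simplification of the real function. */
static int can_coalesce(const struct req *prev, const struct req *req)
{
    if (req->page_id == prev->page_id) {
        if (req->pgbase != prev->pgbase + prev->bytes)
            return 0;
    }
    return 1;
}

/* Count how many separate WRITE ops result when each request is
 * either merged with its predecessor or starts a new op. */
static size_t count_write_ops(const struct req *reqs, size_t n)
{
    size_t ops = 0;

    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || !can_coalesce(&reqs[i - 1], &reqs[i]))
            ops++;
    }
    return ops;
}
```

With 256 requests that all reference the same page at pgbase 0 (the undirtied-buffer case), every request fails the contiguity test against its predecessor, so `count_write_ops` returns 256. With 256 distinct pages (the dirtied-buffer case) the modeled branch never rejects a merge and it returns 1.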
Created attachment 1243987 [details]
testcase

While working on a ceph-related patch, I noticed a regression in how the kernel is issuing DIO writes now. It used to be that when you passed down a large DIO write request, the kernel would (usually) try to break it up into wsize write requests on the wire.

More recent kernels have started issuing them in page-sized chunks however. They do get sent in parallel with a commit at the end, so it's not as bad as it could be, but it's still slower than it should be. Scott Mayhew did a quick test and saw:

[root@rhel7 ~]# uname -r
3.10.0-123.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t| grep -A3 WRITE
WRITE:
        1 ops (3%)      0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 1048760  avg bytes received per op: 68
        backlog wait: 5.000000  RTT: 35.000000  total execute time: 40.000000 (milliseconds)

...vs...

[root@rhel7 ~]# uname -r
3.10.0-229.el7.x86_64
[root@rhel7 ~]# mount -o sec=sys nfs.smayhew.test:/export /mnt/t
[root@rhel7 ~]# ./diotest2 /mnt/t/file
[root@rhel7 ~]# mountstats --rpc /mnt/t | grep -A3 WRITE
WRITE:
        256 ops (90%)   0 retrans (0%)  0 major timeouts
        avg bytes sent per op: 4280     avg bytes received per op: 68
        backlog wait: 5.304688  RTT: 3.367188   total execute time: 8.695312 (milliseconds)
[root@rhel7 ~]#

Note that this is a problem in mainline kernels as well, and I've started discussing it there. The attached testcase is how I'm testing it.
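The op counts in the two mountstats runs above are consistent with simple arithmetic, sketched here. The helper names `write_ops` and `per_op_overhead` are hypothetical. The 1 MiB write size and 4096-byte page size are assumptions inferred from the reported numbers (1048760 bytes in one op, 256 ops of 4280 bytes), not values stated in the report.

```c
#include <assert.h>
#include <stddef.h>

/* Number of WRITE ops needed to cover `len` bytes when each op carries
 * at most `chunk` bytes of payload (ceiling division). */
static size_t write_ops(size_t len, size_t chunk)
{
    assert(chunk > 0);
    return (len + chunk - 1) / chunk;
}

/* Non-payload bytes per op, derived from the "avg bytes sent per op"
 * figure in mountstats and the assumed payload size of that op. */
static size_t per_op_overhead(size_t avg_sent, size_t payload)
{
    assert(avg_sent >= payload);
    return avg_sent - payload;
}
```

Under those assumptions, `write_ops(1048576, 1048576)` is 1 and `write_ops(1048576, 4096)` is 256, matching the two runs. `per_op_overhead(1048760, 1048576)` and `per_op_overhead(4280, 4096)` both come out to 184 bytes, so the page-sized case pays that fixed per-op cost 256 times instead of once, on top of the extra round trips.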