Bug 2166658 - slow NFSD performance
Summary: slow NFSD performance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: 9.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jeff Layton
QA Contact: Yongcheng Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-02 15:00 UTC by Pierguido Lambri
Modified: 2023-05-09 10:12 UTC
CC List: 12 users

Fixed In Version: kernel-5.14.0-284.3.1.el9_2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-09 08:13:27 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
c8 kernel flamegraph (111.72 KB, image/svg+xml), 2023-02-03 16:01 UTC, Jeff Layton
c9 kernel flamegraph (212.41 KB, image/svg+xml), 2023-02-03 16:02 UTC, Jeff Layton
test program attempt (1.15 KB, text/plain), 2023-02-06 13:58 UTC, Jeff Layton


Links
Gitlab redhat/rhel/src/kernel rhel-9 merge_requests 561, last updated 2023-03-03 18:58:00 UTC
Red Hat Issue Tracker RHELPLAN-147465, last updated 2023-02-02 15:03:45 UTC
Red Hat Knowledge Base (Solution) 6996579, last updated 2023-02-13 15:17:30 UTC
Red Hat Product Errata RHSA-2023:2458, last updated 2023-05-09 08:13:41 UTC

Description Pierguido Lambri 2023-02-02 15:00:43 UTC
Description of problem:
When mounting NFSv4 shares from a RHEL9 NFS server, workloads with writes seem to be slower than on RHEL8.

Version-Release number of selected component (if applicable):
kernel-5.14.0-162.16.1.el9_1

How reproducible:
Always

Steps to Reproduce:
1. Export some NFS shares
2. Mount them with NFSv4
3. Do a dd with either oflag=direct or conv=fsync

Actual results:
On RHEL9 it's much slower than on RHEL8

Expected results:
I'd expect similar results

Additional info:
NFSv3 doesn't seem to be impacted

Comment 1 Pierguido Lambri 2023-02-02 15:13:23 UTC
A few tests I ran against RHEL9 and RHEL8 servers.
I did the tests first on VMs using the same storage backend and then tried on physical machines too.
These are the results from the latter (the awk averages the MB/s figure that dd prints on stderr over the five runs):

RHEL9 NFS server:
# uname -r
5.14.0-162.16.1.el9_1.x86_64

RHEL8 NFS server:
# uname -r
4.18.0-425.12.1.el8_7.x86_64


# for distro in rhel8 rhel9; do for fs in ext4 xfs; do for i in {1..5}; do strace -fttTvxyo strace.$distro.$fs.$i dd if=/dev/zero of=/mnt/$distro/$fs/zeroes.direct.v4 bs=1M count=4000 conv=fsync; sleep 1; sync; done 2>&1 | awk -v RHEL="$distro" -v files="$fs" '/bytes/ {total+=$(NF-1); n++}END{print RHEL" "files" avg: " total/n}'; done; done
rhel8 ext4 avg: 783
rhel8 xfs avg: 777.8
rhel9 ext4 avg: 171.8
rhel9 xfs avg: 174


There's a bit of variability in the end results depending on whether I run trace-cmd on the NFS servers. I guess this is expected because of the tracing overhead.
The poorer RHEL9 results don't vary, though.
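(For reference, function-graph traces like the ones below can be captured with something along these lines; the exact invocation used here isn't recorded in this report, nfsd4_close is just the function of interest in this case.)

# trace nfsd4_close and everything it calls while the dd test runs, then stop with Ctrl-C
trace-cmd record -p function_graph -g nfsd4_close
# this produces trace.dat in the current directory
trace-cmd report | less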

RHEL9 close:

            nfsd-2719  [002] 42353.202103: funcgraph_exit:       $ 15058325 us |                    } /* ext4_writepages */ (9)
            nfsd-2719  [002] 42353.202103: funcgraph_exit:       $ 15058325 us |                  } /* do_writepages */ (8)
            nfsd-2719  [002] 42353.202103: funcgraph_exit:       $ 15058327 us |                } /* __filemap_fdatawrite_range */ (7)
            nfsd-2719  [002] 42353.202104: funcgraph_exit:       $ 15058327 us |              } /* filemap_flush */ (6)
            nfsd-2719  [002] 42353.202108: funcgraph_exit:       $ 15058332 us |            } /* nfsd_file_put */ (5)
            nfsd-2719  [002] 42353.202108: funcgraph_exit:       $ 15058333 us |          } /* nfs4_file_put_access */ (4)
            nfsd-2719  [002] 42353.202108: funcgraph_exit:       $ 15058334 us |        } /* release_all_access */ (3)
            nfsd-2719  [002] 42353.202114: funcgraph_exit:       $ 15058340 us |      } /* nfs4_free_ol_stateid */ (2)
            nfsd-2719  [002] 42353.202116: funcgraph_exit:       $ 15058345 us |    } /* nfs4_put_stid */ (1)
            nfsd-2719  [002] 42353.202116: funcgraph_exit:       $ 15058369 us |  } /* nfsd4_close */ (0)


            nfsd-2719  [000] 42529.619401: funcgraph_exit:       $ 20029259 us |                    } /* xfs_vm_writepages */ (9)
            nfsd-2719  [000] 42529.619401: funcgraph_exit:       $ 20029259 us |                  } /* do_writepages */ (8)
            nfsd-2719  [000] 42529.619401: funcgraph_entry:      + 12.355 us  |                  wbc_detach_inode(); (8)
            nfsd-2719  [000] 42529.619414: funcgraph_exit:       $ 20029294 us |                } /* __filemap_fdatawrite_range */ (7)
            nfsd-2719  [000] 42529.619414: funcgraph_exit:       $ 20029294 us |              } /* filemap_flush */ (6)
            nfsd-2719  [000] 42529.619415: funcgraph_exit:       $ 20029296 us |            } /* nfsd_file_put */ (5)
            nfsd-2719  [000] 42529.619416: funcgraph_exit:       $ 20029297 us |          } /* nfs4_file_put_access */ (4)
            nfsd-2719  [000] 42529.619416: funcgraph_exit:       $ 20029299 us |        } /* release_all_access */ (3)
            nfsd-2719  [000] 42529.619426: funcgraph_exit:       $ 20029309 us |      } /* nfs4_free_ol_stateid */ (2)
            nfsd-2719  [000] 42529.619429: funcgraph_exit:       $ 20029315 us |    } /* nfs4_put_stid */ (1)
            nfsd-2719  [000] 42529.619430: funcgraph_exit:       $ 20029326 us |  } /* nfsd4_close */ (0)

RHEL8 close:

            nfsd-2908  [001] 42244.367854: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [001] 42244.367894: funcgraph_exit:       + 21.363 us  |  } /* nfsd4_close */ (0)
            nfsd-2908  [002] 42261.579662: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [002] 42261.579689: funcgraph_exit:         5.987 us   |  } /* nfsd4_close */ (0)
            nfsd-2908  [002] 42278.668547: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [002] 42278.668621: funcgraph_exit:       + 73.898 us  |  } /* nfsd4_close */ (0)
            nfsd-2908  [002] 42295.954218: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [002] 42295.954257: funcgraph_exit:       + 38.599 us  |  } /* nfsd4_close */ (0)
            nfsd-2908  [003] 42313.147938: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [003] 42313.147992: funcgraph_exit:         6.258 us   |  } /* nfsd4_close */ (0)
            nfsd-2908  [001] 42323.718212: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [001] 42323.718287: funcgraph_exit:       + 73.044 us  |  } /* nfsd4_close */ (0)
            nfsd-2908  [003] 42334.471958: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [003] 42334.472004: funcgraph_exit:       + 45.033 us  |  } /* nfsd4_close */ (0)
            nfsd-2908  [002] 42346.084628: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [002] 42346.084714: funcgraph_exit:       + 85.616 us  |  } /* nfsd4_close */ (0)
            nfsd-2908  [001] 42361.609642: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [001] 42361.609695: funcgraph_exit:       + 52.735 us  |  } /* nfsd4_close */ (0)
            nfsd-2908  [001] 42372.525113: funcgraph_entry:                   |  nfsd4_close() { (0)
            nfsd-2908  [001] 42372.525182: funcgraph_exit:       + 67.963 us  |  } /* nfsd4_close */ (0)


oflag=direct doesn't change much:

# for distro in rhel8 rhel9; do for fs in ext4 xfs; do for i in {1..5}; do strace -fttTvxyo strace.$distro.$fs.$i dd if=/dev/zero of=/mnt/$distro/$fs/zeroes.direct.v4 bs=1M count=4000 oflag=direct; sleep 1; sync; done 2>&1 | awk -v RHEL="$distro" -v files="$fs" '/bytes/ {total+=$(NF-1); n++}END{print RHEL" "files" avg: " total/n}'; done; done
rhel8 ext4 avg: 507.6
rhel8 xfs avg: 624.2
rhel9 ext4 avg: 159.6
rhel9 xfs avg: 161.6

            nfsd-2719  [002] 44745.965495: funcgraph_exit:       # 9453.283 us |                      } /* mpage_prepare_extent_to_map */ (10)
            nfsd-2719  [002] 44745.965503: funcgraph_exit:       $ 14600115 us |                    } /* ext4_writepages */ (9)
            nfsd-2719  [002] 44745.965504: funcgraph_exit:       $ 14600117 us |                  } /* do_writepages */ (8)
            nfsd-2719  [002] 44745.965505: funcgraph_exit:       $ 14600119 us |                } /* __filemap_fdatawrite_range */ (7)
            nfsd-2719  [002] 44745.965505: funcgraph_exit:       $ 14600119 us |              } /* filemap_flush */ (6)
            nfsd-2719  [002] 44745.965509: funcgraph_exit:       $ 14600124 us |            } /* nfsd_file_put */ (5)
            nfsd-2719  [002] 44745.965510: funcgraph_exit:       $ 14600125 us |          } /* nfs4_file_put_access */ (4)
            nfsd-2719  [002] 44745.965510: funcgraph_exit:       $ 14600126 us |        } /* release_all_access */ (3)
            nfsd-2719  [002] 44745.965514: funcgraph_exit:       $ 14600131 us |      } /* nfs4_free_ol_stateid */ (2)
            nfsd-2719  [002] 44745.965517: funcgraph_exit:       $ 14600137 us |    } /* nfs4_put_stid */ (1)
            nfsd-2719  [002] 44745.965517: funcgraph_exit:       $ 14600146 us |  } /* nfsd4_close */ (0)

            nfsd-2719  [003] 44979.767891: funcgraph_exit:       $ 16751800 us |                    } /* xfs_vm_writepages */ (9)
            nfsd-2719  [003] 44979.767891: funcgraph_exit:       $ 16751800 us |                  } /* do_writepages */ (8)
            nfsd-2719  [003] 44979.767892: funcgraph_exit:       $ 16751807 us |                } /* __filemap_fdatawrite_range */ (7)
            nfsd-2719  [003] 44979.767892: funcgraph_exit:       $ 16751808 us |              } /* filemap_flush */ (6)
            nfsd-2719  [003] 44979.767893: funcgraph_exit:       $ 16751809 us |            } /* nfsd_file_put */ (5)
            nfsd-2719  [003] 44979.767893: funcgraph_exit:       $ 16751828 us |          } /* nfs4_file_put_access */ (4)
            nfsd-2719  [003] 44979.767894: funcgraph_exit:       $ 16751829 us |        } /* release_all_access */ (3)
            nfsd-2719  [003] 44979.767910: funcgraph_exit:       + 16.269 us  |        } /* nfs4_put_stateowner */ (3)
            nfsd-2719  [003] 44979.767911: funcgraph_exit:       $ 16751847 us |      } /* nfs4_free_ol_stateid */ (2)
            nfsd-2719  [003] 44979.767913: funcgraph_exit:       $ 16751866 us |    } /* nfs4_put_stid */ (1)
            nfsd-2719  [003] 44979.767914: funcgraph_exit:       $ 16751875 us |  } /* nfsd4_close */ (0)



RHEL9 NFS server:
# uname -r
5.14.0-249.jtl.plambri.1.el9.x86_64


This kernel has the following patch applied:

From: Jeffrey Layton <jlayton>

Signed-off-by: Jeffrey Layton <jlayton>
---
 fs/nfsd/filecache.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 605dad958670..fe64b9865420 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -396,7 +396,8 @@ nfsd_file_free(struct nfsd_file *nf)
         * that writeback happens before the entry is freed, and that any
         * errors reported result in the write verifier changing.
         */
-       nfsd_file_fsync(nf);
+//     nfsd_file_fsync(nf);
+       filemap_fdatawrite(nf->nf_file->f_mapping);

        if (nf->nf_mark)
                nfsd_file_mark_put(nf->nf_mark);


Not much changes with this test kernel:

direct:

# for distro in rhel8 rhel9; do for fs in ext4 xfs; do for i in {1..5}; do strace -fttTvxyo strace.$distro.$fs.$i dd if=/dev/zero of=/mnt/$distro/$fs/zeroes.direct.v4 bs=1M count=4000 oflag=direct; sleep 1; sync; done 2>&1 | awk -v RHEL="$distro" -v files="$fs" '/bytes/ {total+=$(NF-1); n++}END{print RHEL" "files" avg: " total/n}'; done; done
rhel8 ext4 avg: 509.6
rhel8 xfs avg: 630.6
rhel9 ext4 avg: 158
rhel9 xfs avg: 161.6

fsync:

# for distro in rhel8 rhel9; do for fs in ext4 xfs; do for i in {1..5}; do strace -fttTvxyo strace.$distro.$fs.$i dd if=/dev/zero of=/mnt/$distro/$fs/zeroes.direct.v4 bs=1M count=4000 conv=fsync; sleep 1; sync; done 2>&1 | awk -v RHEL="$distro" -v files="$fs" '/bytes/ {total+=$(NF-1); n++}END{print RHEL" "files" avg: " total/n}'; done; done
rhel8 ext4 avg: 785.6
rhel8 xfs avg: 777
rhel9 ext4 avg: 168.8
rhel9 xfs avg: 168


NFSv3 on the other hand works better:

192.168.100.5:/exports/ext4 /mnt/rhel8/ext4 nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.100.5,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.100.5 0 0
192.168.100.5:/exports/xfs /mnt/rhel8/xfs nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.100.5,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.100.5 0 0
192.168.100.4:/exports/ext4 /mnt/rhel9/ext4 nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.100.4,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.100.4 0 0
192.168.100.4:/exports/xfs /mnt/rhel9/xfs nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.100.4,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.100.4 0 0

# for distro in rhel8 rhel9; do for fs in ext4 xfs; do for i in {1..5}; do strace -fttTvxyo strace.$distro.$fs.$i dd if=/dev/zero of=/mnt/$distro/$fs/zeroes.direct.v4 bs=1M count=4000 oflag=direct; sleep 1; sync; done 2>&1 | awk -v RHEL="$distro" -v files="$fs" '/bytes/ {total+=$(NF-1); n++}END{print RHEL"v3 "files" avg: " total/n}'; done; done
rhel8v3 ext4 avg: 515.2
rhel8v3 xfs avg: 630.6
rhel9v3 ext4 avg: 528
rhel9v3 xfs avg: 631

However, the next iteration (run under trace-cmd) showed:

# for distro in rhel8 rhel9; do for fs in ext4 xfs; do for i in {1..5}; do strace -fttTvxyo strace.$distro.$fs.$i dd if=/dev/zero of=/mnt/$distro/$fs/zeroes.direct.v4 bs=1M count=4000 oflag=direct; sleep 1; sync; done 2>&1 | awk -v RHEL="$distro" -v files="$fs" '/bytes/ {total+=$(NF-1); n++}END{print RHEL"v3 "files" avg: " total/n}'; done; done
rhel8v3 ext4 avg: 513.6
rhel8v3 xfs avg: 633.2
rhel9v3 ext4 avg: 152.2
rhel9v3 xfs avg: 227

Running without trace-cmd seems to show normal behaviour (i.e. high throughput on all the mounts).

Comment 2 Jeff Layton 2023-02-02 16:40:03 UTC
Yeah, buffered vs. direct won't make a lot of difference. That really just governs client-side behavior and doesn't end up translating into any real difference in the WRITE/COMMIT/CLOSE RPCs. Your results are interesting, as I get very different ones on my (virtualized) test rig:

centos9:/export on /mnt/centos9 type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.3,local_lock=none,addr=192.168.1.137)
centos8:/export on /mnt/centos8 type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.3,local_lock=none,addr=192.168.1.117)

They're both using the same options for the rootfs (which is where I'm writing to for this quick test):

    /dev/vda1 / xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

...here, centos8 kernels are a bit slower on the simple dd test:

[jlayton@tleilax ~]$ time dd if=/dev/zero of=/mnt/centos8/dd.out bs=1M count=4000
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 13.2458 s, 317 MB/s

real	0m13.909s
user	0m0.002s
sys	0m3.792s
[jlayton@tleilax ~]$ time dd if=/dev/zero of=/mnt/centos8/dd.out bs=1M count=4000 conv=fsync
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 14.4547 s, 290 MB/s

real	0m15.184s
user	0m0.003s
sys	0m3.848s

[jlayton@tleilax ~]$ time dd if=/dev/zero of=/mnt/centos9/dd.out bs=1M count=4000
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 13.476 s, 311 MB/s

real	0m14.124s
user	0m0.002s
sys	0m3.873s

[jlayton@tleilax ~]$ time dd if=/dev/zero of=/mnt/centos9/dd.out bs=1M count=4000 conv=fsync
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 13.3116 s, 315 MB/s

real	0m13.930s
user	0m0.002s
sys	0m3.727s

This test is not super-scientific, but I do wonder if this might be something hw-specific.

If you export a tmpfs filesystem, does that make any difference in the performance? That might help us tease out whether this is an nfsd/rpc problem or a performance delta in the exported fs. I'll give that a go next.
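For reference, a minimal sketch of setting up such a test export (hostnames, paths, size and export options here are just illustrative):

# on the server: create and export a scratch tmpfs
mkdir -p /export/tmpfs
mount -t tmpfs -o size=8g tmpfs /export/tmpfs
echo '/export/tmpfs *(rw,no_root_squash)' >> /etc/exports
exportfs -ra

# on the client
mount -t nfs -o vers=4.2 server:/export/tmpfs /mnt/tmpfs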

Comment 3 Jeff Layton 2023-02-02 17:07:28 UTC
I do see some difference with tmpfs as the underlying export though:

[jlayton@tleilax linux]$ dd if=/dev/zero of=/mnt/centos8/testfile bs=1M count=4000
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 4.79932 s, 874 MB/s

[jlayton@tleilax linux]$ dd if=/dev/zero of=/mnt/centos9/testfile bs=1M count=4000
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 4.94213 s, 849 MB/s


...so here the RHEL9 server is a bit slower. Looking at the mountstats, the slowdown seems to be mostly in the WRITE calls; they're almost twice as slow on RHEL9. The OPEN is a bit faster there, though. The other calls took only negligible time.

RHEL8 server:
WRITE:
	4000 ops (99%) 
	avg bytes sent per op: 1048820	avg bytes received per op: 184
	backlog wait: 544.552000 	RTT: 7.500000 	total execute time: 552.121500 (milliseconds)
OPEN:
	1 ops (0%) 
	avg bytes sent per op: 320	avg bytes received per op: 356
	backlog wait: 0.000000 	RTT: 375.000000 	total execute time: 375.000000 (milliseconds)


RHEL9 server:
WRITE:
	4000 ops (99%) 
	avg bytes sent per op: 1048820	avg bytes received per op: 184
	backlog wait: 760.859250 	RTT: 16.673500 	total execute time: 777.565750 (milliseconds)
OPEN:
	1 ops (0%) 
	avg bytes sent per op: 320	avg bytes received per op: 356
	backlog wait: 0.000000 	RTT: 328.000000 	total execute time: 329.000000 (milliseconds)

This may or may not be the issue you're seeing, though.
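(The per-op figures above come from the client-side RPC statistics. A sketch of how to pull them after a run, assuming the mountstats tool from nfs-utils is installed:)

# summarise per-op counts, RTT and execute times for one mount
mountstats /mnt/centos9
# or inspect the raw counters directly
grep -A 3 'WRITE:' /proc/self/mountstats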

Comment 4 Pierguido Lambri 2023-02-02 17:41:03 UTC
Thanks Jeff.
I did try with tmpfs as well:

# for distro in rhel8 rhel9; do for fs in ext4 xfs tmpfs; do for i in {1..5}; do strace -fttTvxyo strace.$distro.$fs.$i dd if=/dev/zero of=/mnt/$distro/$fs/zeroes.direct.v4 bs=1M count=4000; done 2>&1 | awk -v RHEL="$distro" -v files="$fs" '/bytes/ {total+=$(NF-1); n++}END{print RHEL" "files" avg: " total/n}'; done; done
rhel8 ext4 avg: 789.6
rhel8 xfs avg: 776.8
rhel8 tmpfs avg: 786.6
rhel9 ext4 avg: 169.2
rhel9 xfs avg: 169.6
rhel9 tmpfs avg: 790

With tmpfs we can (probably) reach the NICs' limit (these are 10Gb NICs) because tmpfs is much faster.
The customer is using Ceph as the backend, and when RHEL9 writes back on close, it gets very poor performance.
Also, I'm using these dirty-page settings on the NFS servers:

vm.dirty_background_ratio = 10
vm.dirty_ratio = 40

Comment 5 Jeff Layton 2023-02-02 18:54:24 UTC
Yes, tmpfs doesn't do any writeback, so it's a good way to take that out of the equation for basic testing. We may have other bottlenecks in the writeback codepath too, but this is a reasonable place to start. I played around a bit with tracepoints in the WRITE codepath, and it looks like the I/O just takes longer on RHEL9. Note that I didn't do any special tuning here on either box. Grabbing two entries at random (that seem to be fairly representative):

[root@centos8 tracing]# grep xid=0x423465ef trace
            nfsd-1425  [001] ....   515.451020: nfsd_write_start: xid=0x423465ef fh_hash=0x49de26d1 offset=4165992448 len=1048576
            nfsd-1425  [001] ....   515.451024: nfsd_write_opened: xid=0x423465ef fh_hash=0x49de26d1 offset=4165992448 len=1048576
            nfsd-1425  [001] ....   515.451449: nfsd_write_io_done: xid=0x423465ef fh_hash=0x49de26d1 offset=4165992448 len=1048576
            nfsd-1425  [001] ....   515.451449: nfsd_write_done: xid=0x423465ef fh_hash=0x49de26d1 offset=4165992448 len=1048576

515.451449-515.451020 = 0.000429000000053747

[root@centos9 tracing]# grep xid=0xee90bb65 trace
            nfsd-1096    [005] .....   365.169907: nfsd_write_start: xid=0xee90bb65 fh_hash=0x1f9353aa offset=3723493376 len=1048576
            nfsd-1096    [005] .....   365.169909: nfsd_write_opened: xid=0xee90bb65 fh_hash=0x1f9353aa offset=3723493376 len=1048576
            nfsd-1096    [005] .....   365.172473: nfsd_write_io_done: xid=0xee90bb65 fh_hash=0x1f9353aa offset=3723493376 len=1048576
            nfsd-1096    [005] .....   365.172473: nfsd_write_done: xid=0xee90bb65 fh_hash=0x1f9353aa offset=3723493376 len=1048576

365.172473-365.169907 = 0.00256600000000162

    0.00256600000000162/0.000429000000053747 = 5.98135198060639

Almost 6 times slower! Where RHEL8 takes about 0.5ms to do the write, RHEL9 takes about 2.5ms. That function has changed a bit between the two releases.

This time delta represents the time spent in nfsd_vfs_write, which is the point where we call vfs_iter_write() to write into the filesystem. We do some extra error processing on RHEL9, but I'm not sure whether that accounts for the difference.
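(A sketch of how these per-WRITE timings can be collected, using the nfsd tracepoints that produced the snippets above:)

# record just the nfsd WRITE tracepoints on the server during a dd run
trace-cmd record -e nfsd:nfsd_write_start -e nfsd:nfsd_write_io_done -e nfsd:nfsd_write_done
# then, per xid, subtract the nfsd_write_start timestamp from nfsd_write_done
trace-cmd report | grep xid=0x423465ef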

Comment 6 Jeff Layton 2023-02-02 20:45:03 UTC
I went back and added some slower storage here to hopefully get closer to what you're using with xfs over rbd. On the box hosting the centos8 and 9 VMs, I added a new lvm volume on a SATA hdd that I used to create a btrfs fs. I then created raw files in that fs that I added as extra disks to the VMs. Then I created an xfs volume on top of that and exported it, and ran dd tests against it from the host:

Ouch! RHEL9 is less than half the speed with the slower storage:

[jlayton@tleilax ~]$ dd if=/dev/zero of=/mnt/centos8/testfile bs=1M count=4000
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 13.3525 s, 314 MB/s
[jlayton@tleilax ~]$ dd if=/dev/zero of=/mnt/centos9/testfile bs=1M count=4000
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 31.1737 s, 135 MB/s

The CLOSE takes almost no time in both tests, but the COMMIT takes >3 times longer on RHEL9:

RHEL8:	1 ops (0%) 
	avg bytes sent per op: 208	avg bytes received per op: 104
	backlog wait: 0.000000 	RTT: 8016.000000 	total execute time: 8016.000000 (milliseconds)

RHEL9:	1 ops (0%) 
	avg bytes sent per op: 220	avg bytes received per op: 104
	backlog wait: 0.000000 	RTT: 25202.000000 	total execute time: 25202.000000 (milliseconds)

I'll have a look at some profiling next.

Comment 7 Jeff Layton 2023-02-03 16:01:01 UTC
Created attachment 1942079 [details]
c8 kernel flamegraph

Flamegraph of centos 8 kernel nfsd threads during test.

Comment 8 Jeff Layton 2023-02-03 16:02:27 UTC
Created attachment 1942080 [details]
c9 kernel flamegraph

Flamegraph of nfsd threads on centos9 kernel during dd test

Comment 9 Jeff Layton 2023-02-03 16:47:51 UTC
Two flamegraphs. You have to drill down into the nfsd_commit stack to see the main difference here. The nfsd commit code hasn't changed substantially between the two releases. The big slowdown seems to be occurring at or below the iomap layer. See these URLs for the nfsd_commit stacks:

    https://bugzilla-attachments.redhat.com/attachment.cgi?id=1942080&x=13.2&y=485
    https://bugzilla-attachments.redhat.com/attachment.cgi?id=1942079&x=1036.7&y=533

xfs_vm_writepages accounts for 454M samples in the RHEL8 kernel, but for 680M samples in the RHEL9 one. The interesting bit is that if I run the same dd test locally on the servers, they don't exhibit the same problem:
 
    $ dd if=/dev/zero of=/mnt/testfile bs=1M count=4000 conv=fsync

NFS's I/O pattern is different though. The WRITEs can come at any time via different nfsd threads, and they may overlap. So, that may be a factor. I'll see if I can figure out a way to generate the same sort of I/O pattern with fio. Maybe we can reproduce this w/o NFS in the mix.
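Something along these lines might be a starting point for that fio job (an untested sketch; the filename is a placeholder, and the stride/size arithmetic for rw=write with a skip suffix would need double-checking):

# 64 threads writing the same file: each writes 1M, then skips ahead so writes
# from the same thread land 64M apart (1m write + 63m skip); each job fsyncs
# its file when it finishes
fio --name=nfsd-like-writes --filename=/slowfs/fio.testfile \
    --thread --numjobs=64 --rw=write:63m --bs=1m \
    --offset_increment=1m --number_ios=64 --size=4g --end_fsync=1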

Comment 10 Jeff Layton 2023-02-06 13:58:07 UTC
Created attachment 1942539 [details]
test program attempt

Small test program I wrote to attempt to reproduce this.

It opens a file, spawns 64 threads using pthread_create, and has each of them write 1M to the file and then skip 64M ahead and write again, 64 times. The main program then joins the threads and calls fsync and exits.

The fsync times between centos8 and centos9 kernels seem to be pretty much the same (about 25s on my test box), so this doesn't reproduce the slowdown.

Comment 11 Jeff Layton 2023-02-06 15:09:37 UTC
On both hosts, when running the test program on the slow disk, the fsync takes about 25s (regardless of kernel version). When writing 4G via NFS though, I get times of 30-35s on RHEL9 kernels, but when writing to a RHEL8 server, I see times of about 15s!

So, writing over NFS to the RHEL8 server is faster than the (local) test program that attempts to replicate this behavior. I'm not sure what to make of that.

Comment 12 Jeff Layton 2023-02-06 15:51:41 UTC
I'm stumped at this point. Here's basically what I'm seeing. If I run a simple dd test against the RHEL9 NFS server vs. running it locally on that server, both finish in a little under 30s:

Local run on RHEL9:

[jlayton@centos9 ~]$ time dd if=/dev/zero of=/slowfs/testfile bs=1M count=4000 conv=fsync
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 29.7422 s, 141 MB/s

real	0m30.033s
user	0m0.003s
sys	0m2.499s


Run over NFS (RHEL9 server):

[jlayton@tleilax ~]$ time dd if=/dev/zero of=/mnt/centos9/testfile bs=1M count=4000 conv=fsync
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 29.7274 s, 141 MB/s

real	0m29.997s
user	0m0.001s
sys	0m3.506s


Local run on RHEL8 server:

[jlayton@centos8 ~]$ time dd if=/dev/zero of=/slowfs/testfile bs=1M count=4000 conv=fsync
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 27.6617 s, 152 MB/s

real	0m27.664s
user	0m0.004s
sys	0m2.086s

Run over NFS to RHEL8 server:

[jlayton@tleilax ~]$ time dd if=/dev/zero of=/mnt/centos8/testfile bs=1M count=4000 conv=fsync
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 15.7771 s, 266 MB/s

real	0m15.785s
user	0m0.001s
sys	0m3.360s

Why would this be faster when the I/Os are coming via NFS?

In particular, the COMMIT (aka fsync) is much (15s!) faster on RHEL8 when writing 4g via NFS:

10:46:49.400512 fsync(1)                = 0 <10.713106>

...vs. on RHEL9 NFS server:

10:47:21.053436 fsync(1)                = 0 <25.867356>

I'm stumped as to what could account for this.

bfoster, dchinner: do you guys have any thoughts here? Could it be that writeback is getting started sooner on the RHEL8 NFS server in this situation? How could we confirm that?
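One low-tech way to check would be to watch the global writeback counters on each server while the dd runs, or to trace when writeback work actually kicks off (a sketch, assuming the standard writeback tracepoints are available):

# watch dirty/writeback pages during the run
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
# or record when writeback work starts and completes
trace-cmd record -e writeback:writeback_start -e writeback:writeback_written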

Comment 13 Jeff Layton 2023-02-06 19:29:08 UTC
Ok, I finally took a closer look at Pierguido's test rig and see this:

    /exports/xfs	*(rw,async,no_root_squash)

The async export option is particularly dangerous, and I think that explains the symptoms you're seeing with your test rig. That option turns COMMIT calls into no-ops, so adding conv=fsync to the dd command line has little effect on this setup. Setting that to "sync" on your test servers brings the timing of a run with conv=fsync much more closely into line. Probably we ought to make exportfs throw a warning when someone tries to use -o async these days.
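(For anyone reproducing this on the test rig, the change that brings the conv=fsync numbers back into line is just switching the export to sync and re-exporting; a sketch:)

# /etc/exports on the test server: use sync (the default) instead of async
/exports/xfs    *(rw,sync,no_root_squash)

# then re-export and re-run the dd tests
exportfs -ra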

RHEL9 does an implicit fsync on the last close of a file, but RHEL8 never did that. I also made a mistake with the test kernel and had it use filemap_fdatawrite, but that's still pretty slow with slow storage. I'm posting a new kernel that doesn't flush the file at all on close, which will probably make RHEL9 behave much more like RHEL8 with the -o async export option. If that works, we may want to make the fsync in nfsd_file_free conditional on the export being EX_ISSYNC.

But...we may be able to avoid the fsync on last close as well, which would be better for performance overall. I may just try to make that work instead.

Comment 14 Dave Chinner 2023-02-07 04:58:51 UTC
Yup, the async export option will cause this sort of thing - the OTW COMMIT becomes a no-op at the server side, so it'll just end up being a normal buffered write into the page cache.

As for the nfsd doing a flush on last close: I'd suggest that it isn't necessary.

Filesystems like XFS already flush on last close if it is necessary to avoid data loss from unsafe overwrite operations via:

__fput()
  ->release()
    xfs_file_release()
      xfs_release()
        filemap_flush()

ext4 does a flush for the same reasons via ext4_release_file() -> ext4_alloc_da_blocks() -> filemap_flush(). 

btrfs has a BTRFS_INODE_FLUSH_ON_CLOSE flag that triggers filemap_flush() from btrfs_release_file(). That flag gets set on truncate down, so it's also doing exactly the same thing XFS and ext4 are doing for the exact same reasons.

So unless there's something else that the nfsd requires a flush on last close for, I don't see why it would be necessary.

-Dave.

Comment 15 Chuck Lever 2023-02-07 13:57:37 UTC
> So unless there's something else that the nfsd requires a flush on last close for, I don't see why it would be necessary.

Fwiw, the full comment in nfsd_file_free() reads:

 293         /*
 294          * We call fsync here in order to catch writeback errors. It's not
 295          * strictly required by the protocol, but an nfsd_file could get
 296          * evicted from the cache before a COMMIT comes in. If another
 297          * task were to open that file in the interim and scrape the error,
 298          * then the client may never see it. By calling fsync here, we ensure
 299          * that writeback happens before the entry is freed, and that any
 300          * errors reported result in the write verifier changing.
 301          */
 302         nfsd_file_fsync(nf);

Comment 16 Jeff Layton 2023-02-07 15:19:42 UTC
Yep, and I think that's probably overzealous. We don't need to fsync, really. We just need to know if there have been writeback errors. Patch posted to linux-nfs list:

https://lore.kernel.org/linux-nfs/9137413986ba9c2e83c030513fa9ae3358f30a85.camel@kernel.org/T/#mcb88f091263d07d8b9c13e6cc5ce0a0413d3f761

Comment 17 Jeff Layton 2023-02-08 11:28:17 UTC
(In reply to Dave Chinner from comment #14)
> Yup, the async export option will cause this sort of thing - the OTW COMMIT
> becomes an no-op at the server side, so it'll just end up being a normal
> buffered write into the page cache.
> 

Except that on the test rig where I got those results, I wasn't using the async option. I think this is a separate problem from what Pierguido first reported. I'll plan to reconfirm the results I saw and open a new bug for that one.

Comment 20 Jeff Layton 2023-02-08 11:36:38 UTC
For QA, the easiest way to test this is to have an export on slow storage (an HDD is fine) with the "async" option enabled. Then do a simple sequential write test on it and see whether it's slower than on RHEL8.
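A sketch of such a setup (hostnames and paths are just illustrative):

# server: export HDD-backed storage with the async option
echo '/slowfs *(rw,async,no_root_squash)' >> /etc/exports
exportfs -ra

# client: mount and time a simple sequential write; compare RHEL8 vs RHEL9 servers
mount -t nfs -o vers=4.2 server:/slowfs /mnt/slowfs
time dd if=/dev/zero of=/mnt/slowfs/testfile bs=1M count=4000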

Comment 43 errata-xmlrpc 2023-05-09 08:13:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2458

