Description of problem:

We see a significant performance drop on the mmapmany test (a subtest of the stress-ng benchmark). We see the same drop on the corresponding subtest of the Libmicro benchmark. The drop is up to 80% and shows up on many different systems, both AMD and Intel.

amd-epyc2-rome-7702-2s.lab.eng.brq2.redhat.com:
  kernel-6.0.0-54.eln121.x86_64        1 thread result 184587 bogo_ops
  kernel-6.1.0-0.rc1.15.eln122.x86_64  1 thread result  67901 bogo_ops

amd-epyc3-milan-7313-2s.tpb.lab.eng.brq.redhat.com:
  kernel-6.0.0-54.eln121.x86_64        1 thread result 340082 bogo_ops
  kernel-6.1.0-0.rc1.15.eln122.x86_64  1 thread result  94601 bogo_ops

We checked the perf record outputs and see differences in the kernel call profiles, which possibly explain the problem.

==> 6.0.0-54.eln121.x86_64_threads_1_mmapmany_003_perfrecord.perf.txt <==
# Overhead  Command    Shared Object        Symbol
# ........  .........  ...................  ...........................................
#
    22.01%  stress-ng  libc.so.6            [.] __munmap
    16.43%  stress-ng  [kernel.kallsyms]    [k] syscall_enter_from_user_mode
     5.78%  stress-ng  [kernel.kallsyms]    [k] anon_vma_interval_tree_insert
     5.63%  stress-ng  libc.so.6            [.] __mmap
     4.49%  stress-ng  [kernel.kallsyms]    [k] find_vma
     3.62%  stress-ng  [kernel.kallsyms]    [k] __vma_adjust
     2.87%  stress-ng  [kernel.kallsyms]    [k] unmapped_area_topdown
     2.51%  stress-ng  [kernel.kallsyms]    [k] kmem_cache_free
     2.32%  stress-ng  [kernel.kallsyms]    [k] vm_area_dup

==> 6.1.0-0.rc6.46.eln123.x86_64_threads_1_mmapmany_003_perfrecord.perf.txt <==
# Overhead  Command    Shared Object        Symbol
# ........  .........  ...................  ...........................................
#
     9.80%  stress-ng  libc.so.6            [.] __munmap
     7.33%  stress-ng  [kernel.kallsyms]    [k] syscall_enter_from_user_mode
     6.70%  stress-ng  [kernel.kallsyms]    [k] memset_erms
     4.56%  stress-ng  [kernel.kallsyms]    [k] kmem_cache_alloc_bulk
     4.34%  stress-ng  [kernel.kallsyms]    [k] build_detached_freelist
     3.60%  stress-ng  [kernel.kallsyms]    [k] mas_walk
     3.53%  stress-ng  [kernel.kallsyms]    [k] ___slab_alloc
     3.34%  stress-ng  [kernel.kallsyms]    [k] anon_vma_interval_tree_insert
     3.34%  stress-ng  [kernel.kallsyms]    [k] mas_wr_walk

Version-Release number of selected component (if applicable):
kernel-6.1.0-0.rc1.15.eln122

How reproducible:
1. Reserve any system
2. Install the kernels you need to compare; in my example:
   repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Oct-07_10h36m54s_kernel-6.0.0-54.eln121.repo
   dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck list available
   dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
3. Run rhts-reboot to boot this kernel
4. Install the Red Hat certificate:
   curl --insecure --output /etc/pki/ca-trust/source/anchors/RH-IT-Root-CA.crt https://password.corp.redhat.com/RH-IT-Root-CA.crt
   update-ca-trust extract
5. Clone the perf team tools:
   git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
6. cd scheduler-benchmarks/Stress_ng-test/
7. Run ./manual_test.sh --perf-record mmapmany
8. Install the affected kernel:
   repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Oct-21_10h19m36s_kernel-6.1.0-0.rc1.15.eln122.repo
   dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck list available
   dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
9. Run rhts-reboot to boot this kernel
10. Run ./manual_test.sh --perf-record mmapmany
11. Compare the results (bogo-ops)
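For context on what the benchmark exercises: the stress-ng mmapmany stressor essentially builds up a large number of separate anonymous mappings, tears them down, and repeats, so each bogo-op is dominated by mmap/munmap and the kernel-side VMA bookkeeping. Below is a simplified C sketch of that syscall pattern, illustrative only (not stress-ng's actual code; MAX_MAPPINGS and ROUNDS are made-up values):

===================================================================================
/* Simplified sketch of the syscall pattern the mmapmany stressor exercises:
 * create many separate anonymous mappings (punching a hole in each 3-page
 * mapping so adjacent mappings cannot be merged into a single VMA), then
 * tear them all down and repeat.  Illustrative only - not stress-ng's code.
 */
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define MAX_MAPPINGS 65536
#define ROUNDS       100

int main(void)
{
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    static char *maps[MAX_MAPPINGS];
    size_t n, i;

    for (int round = 0; round < ROUNDS; round++) {
        for (n = 0; n < MAX_MAPPINGS; n++) {
            /* 3 pages, then unmap the middle one: this leaves roughly two
             * distinct VMAs per iteration, so the kernel's VMA tree
             * (RB tree before 6.1, maple tree from 6.1 on) grows large. */
            char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                break;
            munmap(p + page, page);
            maps[n] = p;
        }
        for (i = 0; i < n; i++) {
            /* unmap the two remaining pages of each mapping */
            munmap(maps[i], page);
            munmap(maps[i] + 2 * page, page);
        }
    }
    return 0;
}
===================================================================================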
Changing the product to Fedora/rawhide with the keyword 'Regression'.
Hi Rafael,

thanks a lot for sharing this! Here is the comparison of perf record results for mmapmany between kernels 6.1 (on the right) and 6.0 (on the left):
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7262-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.0.0/2022-10-07T08:57:36.582552vs2022-11-23T08:31:54.400000/c69f2e16-ca8f-5c31-8654-3fd1943aa075/Perf/mmapmany.html

These calls stand out on 6.1:
========================================
    17.56%  stress-ng  [kernel.kallsyms]  [k] memset
     7.00%  stress-ng  [kernel.kallsyms]  [k] build_detached_freelist
     4.78%  stress-ng  [kernel.kallsyms]  [k] kmem_cache_alloc_bulk
========================================

The call graph is:

--24.57%--mas_preallocate
          |
          --24.57%--mas_alloc_nodes
                    |
                    |--22.07%--kmem_cache_alloc_bulk
                    |          |
                    |          |--15.70%--memset

I will upload the complete call graphs for kernels 6.0 and 6.1 for you to check.

I should also mention that we see big gains for other system calls. Check the overview here:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7262-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.0.0/2022-10-07T08:57:36.582552vs2022-11-23T08:31:54.400000/c69f2e16-ca8f-5c31-8654-3fd1943aa075/index.html#overview

The performance drop for mmapmany is the exception. So I think the only action needed is to double-check whether we can mitigate the slowdown for mmapmany. If not, we can probably close this BZ because
(a) there is a trade-off: page faults get faster, calls to mmap get slower
(b) we don't see any slowdown in real applications, just in the synthetic benchmarks.

Thanks a lot
Jirka
(In reply to Jiri Hladky from comment #11)
> Created attachment 1934125 [details]
> Proposal for the patch to improve the mmap performance

Jirka,

Here's an ELN scratch build with the attached patch:
https://koji.fedoraproject.org/koji/taskinfo?taskID=95620541

-- Rafael
Rafael, thanks a lot!

I have created a permanent repo here:
http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Dec-23_10h58m11s_kernel-6.1.0-0.rc6.46.test.eln124.repo
and I have scheduled Beaker jobs. I will post the results here.

Thanks
Jirka
Hi Rafael, happy new year!

The patch helps: the mmap performance has improved by 20%.
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7542-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.1.0/2022-11-23T08:31:54.400000vs2022-12-23T10:49:01.300000/6c9b653d-5ea6-54c1-a549-8ddcb0e8d3f2/index.html#overview

I will inform upstream. Let's try to get the patch into the mainline kernel.

Jirka
Hi Rafael,

is there any news on the patch?

We have compared 6.3.0-63.eln126 vs 6.1.0-65.eln124, and mmapmany performance has degraded by another 30%-50%. The trend worries me - 6.1 already regressed compared to 6.0, and it's worsening again.

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/intel-icelake-gold-6330-2s.lab.eng.brq2.redhat.com/RHEL-9.2.0-20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:13:31.164572/8b23d6ac-6ac2-5fdc-b78f-544ec059f5a5/index.html#mmapmany_section

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc4-genoa-9354p-1s.lab.eng.brq2.redhat.com/RHEL-9.2.0-20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:13:31.164572/94429302-618e-5a8e-a6ee-33a689d50464/index.html#mmapmany_section

Thanks a lot!
Jirka
(In reply to Jiri Hladky from comment #15)
> Hi Rafael,
>
> are there any news on the patch?
>
> We have compared 6.3.0-63.eln126 vs 6.1.0-65.eln124 and mmapmany performance
> has degraded by another 30% - 50%. The trend worries me - 6.1 saw
> degradation compared to the 6.0 version, and it's worsening again.

Jirka, the particular patch we discussed back in December was merged upstream in v6.3-rc1:

commit 541e06b772c1aaffb3b6a245ccface36d7107af2
Author: Liam Howlett <liam.howlett>
Date:   Thu Jan 5 16:05:34 2023 +0000

    maple_tree: remove GFP_ZERO from kmem_cache_alloc() and kmem_cache_alloc_bulk()

so it is already part of the 6.3.0-63.eln126 kernel you tested. Whatever is causing the difference has to be something else among the ~32k patches integrated upstream between v6.1 and v6.3...

> http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/
> stress-ng/intel-icelake-gold-6330-2s.lab.eng.brq2.redhat.com/RHEL-9.2.0-
> 20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:
> 13:31.164572/8b23d6ac-6ac2-5fdc-b78f-544ec059f5a5/index.html#mmapmany_section
>
> http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/
> stress-ng/amd-epyc4-genoa-9354p-1s.lab.eng.brq2.redhat.com/RHEL-9.2.0-
> 20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:
> 13:31.164572/94429302-618e-5a8e-a6ee-33a689d50464/index.html#mmapmany_section
>
>
> Thanks a lot!
> Jirka
Hi Rafael, thanks for the confirmation that the patch was merged. We will keep an eye on this particular benchmark as we test the upstream kernels. Jirka
Hi Jirka, Are you still seeing this in kernel-ark (rawhide/eln) or can this be closed? Scott
Hi Scott,

yes, we still see the performance degradation with the latest tested kernel, 6.4.0-59.eln127.

Compared to RHEL-9.2 (5.14.0-284.11.1.el9_2), the performance drop is over 60%:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/p1-gen2.tpb.lab.eng.brq.redhat.com/RHEL-9.2.0vsRHEL-9.3.0-20230521.45/2023-05-26T15:25:50.300000vs2023-07-03T08:18:44.655306/f08eb9ff-66fa-5d5f-bff6-844676cdce99/index.html#mmapmany_section

Here is the perf record/report diff for that particular test:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/p1-gen2.tpb.lab.eng.brq.redhat.com/RHEL-9.2.0vsRHEL-9.3.0-20230521.45/2023-05-26T15:25:50.300000vs2023-07-03T08:18:44.655306/f08eb9ff-66fa-5d5f-bff6-844676cdce99/Perf/mmapmany.html

In the 6.4 kernel, these calls appear at the top and seem to be the main source of the performance degradation:

     4.93%  stress-ng-mmapm  [kernel.kallsyms]  [k] build_detached_freelist
     4.85%  stress-ng-mmapm  [kernel.kallsyms]  [k] __slab_free
     4.48%  stress-ng-mmapm  [kernel.kallsyms]  [k] ___slab_alloc
     3.56%  stress-ng-mmapm  [kernel.kallsyms]  [k] __kmem_cache_alloc_bulk
     3.13%  stress-ng-mmapm  stress-ng          [.] stress_mmapmany_child
     3.03%  stress-ng-mmapm  [kernel.kallsyms]  [k] mtree_range_walk

Could you look into this? It can be easily replicated - see the description for the details. To install the latest ELN kernel, use this repo:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo
and compare the results against the RHEL-9.2.0 vanilla numbers. The test runs for ~23 seconds, and the problem can be reproduced in a VM - no need for bare metal.

Thanks a lot
Jirka
I forgot to mention that you need RHEL-9.3.0-20230521.45 or later to install kernel-6.4.0-59.eln127.

Here are the results from a 1minutetip container - it took me ~15 minutes in total to reproduce the problem.

1minutetip -n 1MT-RHEL-9.3.0-20230521.45

Commands executed:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
cd scheduler-benchmarks/Stress_ng-test/
./manual_test.sh --perf-record mmapmany
reboot
cd scheduler-benchmarks/Stress_ng-test/
./manual_test.sh --perf-record mmapmany

Results:

[root@ci-vm-10-0-136-39 Stress_ng-test]# tail -v -n+1 *summary
==> 5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record.summary <==
5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record: stress-ng bogo-ops per second summary of results for different stressors
mmapmany 111361.403087

==> 6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record.summary <==
6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record: stress-ng bogo-ops per second summary of results for different stressors
mmapmany 49583.785148

=> performance drop of 65%

[root@ci-vm-10-0-136-39 Stress_ng-test]# head -v -n 20 *perfrecord.perf.txt
==> 5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record_mmapmany_001_perfrecord.perf.txt <==
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 91K of event 'cpu-clock:pppH'
# Event count (approx.): 22949750000
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...........................................
#
     8.16%  stress-ng-mmapm  [kernel.kallsyms]  [k] do_user_addr_fault
     6.05%  stress-ng-mmapm  stress-ng          [.] stress_mmapmany_child
     6.02%  stress-ng-mmapm  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
     3.81%  stress-ng-mmapm  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     3.62%  stress-ng-mmapm  [kernel.kallsyms]  [k] find_vma
     3.38%  stress-ng-mmapm  [kernel.kallsyms]  [k] __rcu_read_unlock
     3.07%  stress-ng-mmapm  [kernel.kallsyms]  [k] __vma_adjust
     3.01%  stress-ng-mmapm  [kernel.kallsyms]  [k] __rcu_read_lock
     2.75%  stress-ng-mmapm  [kernel.kallsyms]  [k] unmapped_area_topdown

==> 6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record_mmapmany_001_perfrecord.perf.txt <==
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 87K of event 'cpu-clock:pppH'
# Event count (approx.): 21787500000
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...........................................
#
     7.43%  stress-ng-mmapm  [kernel.kallsyms]  [k] ___slab_alloc
     4.84%  stress-ng-mmapm  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     4.05%  stress-ng-mmapm  [kernel.kallsyms]  [k] __kmem_cache_alloc_bulk
     3.92%  stress-ng-mmapm  [kernel.kallsyms]  [k] do_user_addr_fault
     3.90%  stress-ng-mmapm  [kernel.kallsyms]  [k] __slab_free
     3.60%  stress-ng-mmapm  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
     3.46%  stress-ng-mmapm  [kernel.kallsyms]  [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.kallsyms]  [k] mtree_range_walk
     3.10%  stress-ng-mmapm  stress-ng          [.] stress_mmapmany_child

=> The slowdown is caused by these kernel functions, which appear at the top in the 6.4 kernel:

     7.43%  stress-ng-mmapm  [kernel.kallsyms]  [k] ___slab_alloc
     4.84%  stress-ng-mmapm  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     4.05%  stress-ng-mmapm  [kernel.kallsyms]  [k] __kmem_cache_alloc_bulk
     3.92%  stress-ng-mmapm  [kernel.kallsyms]  [k] do_user_addr_fault
     3.90%  stress-ng-mmapm  [kernel.kallsyms]  [k] __slab_free
     3.60%  stress-ng-mmapm  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
     3.46%  stress-ng-mmapm  [kernel.kallsyms]  [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.kallsyms]  [k] mtree_range_walk

Thanks
Jirka
Hi Rafael, Would you have a little time to spare to look into this? Thank you! Scott
(In reply to Jiri Hladky from comment #20)
> I forgot to mention that you need RHEL-9.3.0-20230521.45 or later to install
> kernel-6.4.0-59.eln127
>
> Here are the results from 1minutetip container - it took me ~15 minutes in
> total to reproduce the problem.
>
> 1minutetip -n 1MT-RHEL-9.3.0-20230521.45
>
> Command executed:
> repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo
> dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
> git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
> cd scheduler-benchmarks/Stress_ng-test/
> ./manual_test.sh --perf-record mmapmany
> reboot
> cd scheduler-benchmarks/Stress_ng-test/
> ./manual_test.sh --perf-record mmapmany

After instrumenting the maple tree code in the latest upstream kernel to track how it uses slab, I got the following maple_node allocation and freeing stats after running

# ./manual_test.sh --perf-record mmapmany

kmem_cache_alloc()       - 12,384,365
kmem_cache_alloc_bulk()
  size 1:              1
  size 2:              0
  size 3:            634
  size 4:             21
  size 5-9:       60,693
  size 10-19:  6,343,240
  size 20-29:          0
  size >= 30:          1

kmem_cache_free()        - 29,195,121
kmem_cache_free_bulk()
  size 1:            511
  size 2:          1,098
  size 3:          1,494
  size 4:          7,734
  size 5-9:      843,483
  size 10-19:  5,550,772
  size 20-29:          0
  size >= 30:          0

It can be seen that there are quite a lot of slab allocations and frees, especially bulk operations of size 10-19. My test results were:

#
     4.67%  stress-ng-mmapm  [kernel.vmlinux]  [k] ___slab_alloc
     4.45%  stress-ng-mmapm  [kernel.vmlinux]  [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.vmlinux]  [k] __kmem_cache_alloc_bulk
     3.16%  stress-ng-mmapm  stress-ng         [.] stress_mmapmany_child
     2.83%  stress-ng-mmapm  [kernel.vmlinux]  [k] __slab_free
     2.65%  stress-ng-mmapm  [kernel.vmlinux]  [k] anon_vma_interval_tree_insert
     2.55%  stress-ng-mmapm  [kernel.vmlinux]  [k] perf_iterate_ctx
     2.10%  stress-ng-mmapm  [kernel.vmlinux]  [k] kmem_cache_free_bulk.part.0

mmapmany: 59441.085073 bogo-ops-per-second

The maple_node slab is 8k in size and has 32 objects per slab. I tried the simple experiment of doubling the slab size to 16k and got the following results:

#
     3.38%  stress-ng-mmapm  [kernel.vmlinux]  [k] build_detached_freelist
     3.33%  stress-ng-mmapm  stress-ng         [.] stress_mmapmany_child
     3.03%  stress-ng-mmapm  [kernel.vmlinux]  [k] __kmem_cache_alloc_bulk
     2.78%  stress-ng-mmapm  [kernel.vmlinux]  [k] ___slab_alloc
     2.58%  stress-ng-mmapm  [kernel.vmlinux]  [k] perf_iterate_ctx
     2.56%  stress-ng-mmapm  [kernel.vmlinux]  [k] get_mem_cgroup_from_mm
     2.53%  stress-ng-mmapm  [kernel.vmlinux]  [k] anon_vma_interval_tree_insert
     2.46%  stress-ng-mmapm  [kernel.vmlinux]  [k] __slab_free

mmapmany: 65249.766589 bogo-ops-per-second

That is an almost 10% performance improvement just from doubling the slab size. For further gains, we may need to do more optimization in the interaction between the maple tree and SLUB.
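For anyone who wants to confirm the maple_node slab geometry mentioned above on their own kernel: SLUB exposes per-cache attributes under /sys/kernel/slab/ (assuming SLUB with sysfs support; the entry may be a merged-cache alias on some configs). A small C sketch, illustrative only, that prints the relevant attributes:

===================================================================================
/* Print maple_node SLUB cache geometry from sysfs.
 * Assumes SLUB with sysfs support; paths may not exist on every config.
 */
#include <stdio.h>

static void print_attr(const char *name)
{
    char path[256], buf[64];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/kernel/slab/maple_node/%s", name);
    f = fopen(path, "r");
    if (!f) {
        printf("%-14s <not available>\n", name);
        return;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("%-14s %s", name, buf);
    fclose(f);
}

int main(void)
{
    /* object_size:   bytes per maple_node object
     * objs_per_slab: objects packed into one slab
     * order:         slab size as a power-of-two number of pages */
    print_attr("object_size");
    print_attr("objs_per_slab");
    print_attr("order");
    return 0;
}
===================================================================================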
Update from Liam R. Howlett <Liam.Howlett>
===========================================================================
An email on the maple tree mailing list reports a performance gain on mmapmany:
http://lists.infradead.org/pipermail/maple-tree/2023-September/002799.html

************************************************************************************************
kernel test robot noticed a 21.4% improvement of stress-ng.mmapmany.ops_per_sec on:

commit: 17983dc617837a588a52848ab4034d8efa6c1fa6 ("maple_tree: refine mas_preallocate() node calculations")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
************************************************************************************************

There is also future work on the allocator being developed in an attempt to speed things up further:
https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@suse.cz/
===========================================================================

The upstream results align with our testing - see comment #28. The great news is that the community is actively working on this, and we expect additional improvements soon.

Thanks
Jirka
Hi,

today I have fresh results from kernel-6.6.0-0.rc2.20.eln130, and the regression on mmapmany is still there.

Libmicro example result:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/Libmicro-test/amd-epyc3-milan-7713-2s.tpb.lab.eng.brq.redhat.com/RHEL-9.3.0-20230718.0vsRHEL-9.3.0-20230718.0/2023-07-20T14:03:11.500000vs2023-09-19T14:11:51.596947/bd6232d0-ab0a-5273-b2a2-5c14de9b100b/index.html

Stress-ng example result:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7262-2s-02.lab.eng.brq2.redhat.com/RHEL-9.3.0-20230718.0vsRHEL-9.3.0-20230718.0/2023-07-20T14:03:11.500000vs2023-09-19T14:11:51.596947/4daee582-14fd-59a5-9b2f-02ec5d38625e/index.html

Thanks
Kamil
*** Bug 2239808 has been marked as a duplicate of this bug. ***
Libmicro shows a regression of up to 2x for the mmap, munmap, and mprot tests.

I'm going to attach a minimal reproducer in C that shows the performance degradation for munmap:

gcc -Wall -Wextra -O1 mmap_munmap.c -o mmap_munmap
./run_mmap_munmap.sh

The results show that the performance drop started in kernel 6.1. In 6.6 there was an improvement, but the results are still 2x slower than on the 6.0 kernel:

$ tail -v -n+1 *mmap_munmap.log
==> 5.14.0-339.el9.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 2851688 K-cycles. Avg: 2.71958 K-cycles/call

==> 6.0.0-54.eln121.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 2855807 K-cycles. Avg: 2.72351 K-cycles/call

==> 6.1.0-0.rc1.15.eln122.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 5366007 K-cycles. Avg: 5.11742 K-cycles/call

==> 6.5.0-57.eln130.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 5757203 K-cycles. Avg: 5.4905 K-cycles/call

==> 6.6.0-0.rc2.20.eln130.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 5044454 K-cycles. Avg: 4.81077 K-cycles/call

Here is the perf record/report result for kernel 6.6:
======================================================================================
$ head -30 6.6.0-0.rc2.20.eln130.x86_64_mmap_munmap.perf
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 21K of event 'cycles'
# Event count (approx.): 18470612451
#
# Overhead  Command      Shared Object      Symbol
# ........  ...........  .................  ...........................................
#
     7.39%  mmap_munmap  mmap_munmap        [.] main
     3.89%  mmap_munmap  [kernel.kallsyms]  [k] sync_regs
     3.60%  mmap_munmap  [kernel.kallsyms]  [k] perf_iterate_ctx
     3.41%  mmap_munmap  [kernel.kallsyms]  [k] __folio_throttle_swaprate
     3.22%  mmap_munmap  [kernel.kallsyms]  [k] native_irq_return_iret
     3.17%  mmap_munmap  [kernel.kallsyms]  [k] native_flush_tlb_one_user
     2.75%  mmap_munmap  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
     2.44%  mmap_munmap  [kernel.kallsyms]  [k] __slab_free
     2.06%  mmap_munmap  [kernel.kallsyms]  [k] mas_wr_node_store
     1.99%  mmap_munmap  [kernel.kallsyms]  [k] charge_memcg
     1.82%  mmap_munmap  [kernel.kallsyms]  [k] try_charge_memcg
     1.57%  mmap_munmap  [kernel.kallsyms]  [k] kmem_cache_alloc
     1.39%  mmap_munmap  [kernel.kallsyms]  [k] kmem_cache_free
     1.21%  mmap_munmap  [kernel.kallsyms]  [k] __rcu_read_lock
     1.11%  mmap_munmap  [kernel.kallsyms]  [k] mtree_range_walk
     1.06%  mmap_munmap  [kernel.kallsyms]  [k] __handle_mm_fault
     1.04%  mmap_munmap  [kernel.kallsyms]  [k] __rcu_read_unlock
     1.04%  mmap_munmap  [kernel.kallsyms]  [k] up_write
     1.01%  mmap_munmap  [kernel.kallsyms]  [k] mas_rev_awalk
======================================================================================

Please note the maple-tree-related functions (mtree_*, mas_*) - see also
https://docs.kernel.org/core-api/maple_tree.html#:~:text=The%20Maple%20Tree%20is%20a,a%20user%20written%20search%20method.
Created attachment 1991090 [details]
Simple reproducer for munmap written in C

This is a minimal reproducer in C that shows the performance degradation for munmap.

gcc -Wall -Wextra -O1 mmap_munmap.c -o mmap_munmap
./run_mmap_munmap.sh

It includes results from an Intel Icelake Platinum 8351N server with the RHEL-9.3 Beta kernel (5.14.0-339.el9) and upstream kernels 6.0, 6.1 rc1, 6.5, and 6.6 rc2.
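The actual mmap_munmap.c is in the attachment. For readers without access to it, a minimal sketch of this kind of reproducer (an rdtsc-timed mmap/munmap loop over 8 kiB anonymous mappings; constants and output format are illustrative, not the attached source) could look like:

===================================================================================
/* Minimal sketch of an rdtsc-timed mmap/munmap micro-benchmark.
 * Illustrative only - not the attached mmap_munmap.c.
 * Build: gcc -Wall -Wextra -O1 mmap_munmap_sketch.c -o mmap_munmap_sketch
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define MAP_LEN    (8 * 1024)     /* 8 kiB per mapping, as in the logs */
#define ITERATIONS 1048576UL      /* number of mmap+munmap pairs       */

int main(void)
{
    uint64_t start, stop;
    unsigned long i;

    start = __rdtsc();
    for (i = 0; i < ITERATIONS; i++) {
        void *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }
        *(volatile char *)p = 1;          /* touch the page to fault it in */
        if (munmap(p, MAP_LEN) != 0) {
            perror("munmap");
            return EXIT_FAILURE;
        }
    }
    stop = __rdtsc();

    printf("TSC for %lu mmap+munmap pairs with len of 8kiB: %lu K-cycles. "
           "Avg: %g K-cycles/pair\n",
           ITERATIONS, (unsigned long)((stop - start) / 1000),
           (double)(stop - start) / 1000.0 / ITERATIONS);
    return 0;
}
===================================================================================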
On the Maple Tree mailing list, a non-synthetic performance report was posted:
http://lists.infradead.org/pipermail/maple-tree/2023-September/002908.html

The "wordcount" test from Phoenix-2.0
https://github.com/kozyraki/phoenix/blob/master/phoenix-2.0/tests/word_count/word_count.c
shows a significant performance drop with maple trees. Phoenix is a shared-memory implementation of Google's MapReduce model for data-intensive processing tasks.

Reproducer:
===================================================================================
git clone https://github.com/kozyraki/phoenix.git
cd phoenix/phoenix-2.0
make
wget http://csl.stanford.edu/~christos/data/word_count.tar.gz
tar xvf word_count.tar.gz
time phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
===================================================================================

The "wordcount" test shows the most significant performance degradation with the maple tree compared to the RB tree when run with many threads. Below are the results from an Intel Icelake server with a 36-core Platinum 8351N CPU.

72 threads:

==> 6.0.0-54.eln121.x86_64_allthreads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
  Time (mean ± σ):     797.4 ms ±  47.0 ms    [User: 4868.8 ms, System: 10793.5 ms]
  Range (min … max):   707.0 ms … 849.8 ms    10 runs

==> 6.5.0-57.eln130.x86_64_allthreads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
  Time (mean ± σ):      1.673 s ±  0.265 s    [User: 9.574 s, System: 54.777 s]
  Range (min … max):    1.282 s …  2.240 s    10 runs

==> 6.6.0-0.rc2.20.eln130.x86_64_allthreads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
  Time (mean ± σ):      1.223 s ±  0.038 s    [User: 13.107 s, System: 20.346 s]
  Range (min … max):    1.153 s …  1.273 s    10 runs

System time is 5x longer on the 6.5 kernel compared to 6.0. In 6.6 there is a significant improvement, but performance is still 2x slower than on the 6.0 kernel.

When I repeat the test on the same server but with just four threads, I get much better results:

==> 6.0.0-54.eln121.x86_64_4threads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
  Time (mean ± σ):     689.4 ms ±   3.3 ms    [User: 2350.9 ms, System: 191.9 ms]
  Range (min … max):   684.5 ms … 693.8 ms    10 runs

==> 6.5.0-57.eln130.x86_64_4threads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
  Time (mean ± σ):     699.3 ms ±   2.4 ms    [User: 2371.8 ms, System: 205.8 ms]
  Range (min … max):   695.4 ms … 702.6 ms    10 runs

==> 6.6.0-0.rc2.20.eln130.x86_64_4threads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
  Time (mean ± σ):     696.5 ms ±   3.0 ms    [User: 2370.1 ms, System: 194.3 ms]
  Range (min … max):   691.2 ms … 700.2 ms    10 runs

In this scenario, there is barely any difference between kernels 6.0, 6.5, and 6.6.
Test "wordcount" from Phonix-2.0 is heavily using mprotect system call. It's worth noting that mprotect syscall with Maple Trees is up to 1.5x slower than with RB trees as shown with Libmicro - look for "mprot" in this Libmicro result: http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/Libmicro-test/intel-icelake-platinum-8351n-1s.lab.eng.brq2.redhat.com/RHEL-9.3.0-20230718.0vsRHEL-9.3.0-20230718.0/2023-07-20T14:03:11.500000vs2023-09-13T08:52:32.238798/13242074-ed62-50b0-b211-15b64f64c1c1/index.html I have created a simple .c reproducer. It times the mprotect syscall for a large mmaped region where we alternate NO_PROT and PROT_READ | PROT_WRITE for 128kiB large areas. You can run it like this: gcc -Wall -Wextra -O1 mmap_mprotect.c -o mmap_mprotect ./mmap_mprotect 128 32768 It will allocate 128kiB * 32768 = 4 GiB of memory Here is the central part where we measure the performance: // Time the mprotect calls -alternate protection between PROT_NONE and PROT_READ | PROT_WRITE // u_int64_t start_rdtsc = start_clock(); for (i = 0; i < iterations; i++) { if (i % 2 == 0 ) { prot = PROT_NONE; } else { prot = PROT_READ | PROT_WRITE; } ret =mprotect((void *)ts_map + i*mmap_len, mmap_len, prot); if (ret != 0) { perror("mprotect"); printf(" mmap error at iteration %d from %ld\n", i, iterations); } } u_int64_t stop_rdtsc = stop_clock(); Results: ==> 6.0.0-54.eln121.x86_64_mmap_mprotect.log <== TSC for 32768 mprotect calls with len of 128kiB: 202866 K-cycles. Avg: 6.191 K-cycles/call ==> 6.5.0-57.eln130.x86_64_mmap_mprotect.log <== TSC for 32768 mprotect calls with len of 128kiB: 302327 K-cycles. Avg: 9.22631 K-cycles/call ==> 6.6.0-0.rc2.20.eln130.x86_64_mmap_mprotect.log <== TSC for 32768 mprotect calls with len of 128kiB: 269886 K-cycles. Avg: 8.2363 K-cycles/call Compared to 6.0 kernel, 6.5 shows perf. slowdown by a factor of 1.48x. 6.6 performs better, and mprotect is slower by 1.32x compared to the 6.0 kernel.