Description of problem:

We see a significant performance drop on the mmapmany test (a subtest of the stress-ng benchmark), and the same drop on the corresponding subtest of the Libmicro benchmark. The drop is up to 80% and shows up on many different systems, both AMD and Intel.

amd-epyc2-rome-7702-2s.lab.eng.brq2.redhat.com:
kernel-6.0.0-54.eln121.x86_64        1 thread result 184587 bogo_ops
kernel-6.1.0-0.rc1.15.eln122.x86_64  1 thread result  67901 bogo_ops

amd-epyc3-milan-7313-2s.tpb.lab.eng.brq.redhat.com:
kernel-6.0.0-54.eln121.x86_64        1 thread result 340082 bogo_ops
kernel-6.1.0-0.rc1.15.eln122.x86_64  1 thread result  94601 bogo_ops

We checked the perf record outputs and see differences in the kernel call profiles, which likely explain the problem.

==> 6.0.0-54.eln121.x86_64_threads_1_mmapmany_003_perfrecord.perf.txt <==
# Overhead  Command    Shared Object      Symbol
# ........  .........  .................  ...........................................
    22.01%  stress-ng  libc.so.6          [.] __munmap
    16.43%  stress-ng  [kernel.kallsyms]  [k] syscall_enter_from_user_mode
     5.78%  stress-ng  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
     5.63%  stress-ng  libc.so.6          [.] __mmap
     4.49%  stress-ng  [kernel.kallsyms]  [k] find_vma
     3.62%  stress-ng  [kernel.kallsyms]  [k] __vma_adjust
     2.87%  stress-ng  [kernel.kallsyms]  [k] unmapped_area_topdown
     2.51%  stress-ng  [kernel.kallsyms]  [k] kmem_cache_free
     2.32%  stress-ng  [kernel.kallsyms]  [k] vm_area_dup

==> 6.1.0-0.rc6.46.eln123.x86_64_threads_1_mmapmany_003_perfrecord.perf.txt <==
#
# Overhead  Command    Shared Object      Symbol
# ........  .........  .................  ...........................................
     9.80%  stress-ng  libc.so.6          [.] __munmap
     7.33%  stress-ng  [kernel.kallsyms]  [k] syscall_enter_from_user_mode
     6.70%  stress-ng  [kernel.kallsyms]  [k] memset_erms
     4.56%  stress-ng  [kernel.kallsyms]  [k] kmem_cache_alloc_bulk
     4.34%  stress-ng  [kernel.kallsyms]  [k] build_detached_freelist
     3.60%  stress-ng  [kernel.kallsyms]  [k] mas_walk
     3.53%  stress-ng  [kernel.kallsyms]  [k] ___slab_alloc
     3.34%  stress-ng  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
     3.34%  stress-ng  [kernel.kallsyms]  [k] mas_wr_walk

Version-Release number of selected component (if applicable):
kernel-6.1.0-0.rc1.15.eln122

How reproducible:
1. Reserve any system.
2. Install the kernels you want to compare. In my example:
   repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Oct-07_10h36m54s_kernel-6.0.0-54.eln121.repo
   dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck list available
   dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
3. Run rhts-reboot to boot this kernel.
4. Install the Red Hat certificate:
   curl --insecure --output /etc/pki/ca-trust/source/anchors/RH-IT-Root-CA.crt https://password.corp.redhat.com/RH-IT-Root-CA.crt
   update-ca-trust extract
5. Clone the perf team tools:
   git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
6. cd scheduler-benchmarks/Stress_ng-test/
7. Run ./manual_test.sh --perf-record mmapmany
8. Install the affected kernel:
   repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Oct-21_10h19m36s_kernel-6.1.0-0.rc1.15.eln122.repo
   dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck list available
   dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
9. Run rhts-reboot to boot this kernel.
10. Run ./manual_test.sh --perf-record mmapmany
11. Compare the results (bogo-ops).
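For a quick sanity check outside the wrapper script, the stressor can also be run directly (a minimal sketch, assuming stress-ng is installed; the wrapper presumably passes similar options):

stress-ng --mmapmany 1 --timeout 23s --metrics-brief

The bogo-ops figure it prints is the same metric compared above.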
Changing the product to Fedora/rawhide with the keyword 'Regression'.
Hi Rafael,

thanks a lot for sharing this! Here is the comparison of perf record results for mmapmany between kernels 6.1 (on the right) and 6.0 (on the left):

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7262-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.0.0/2022-10-07T08:57:36.582552vs2022-11-23T08:31:54.400000/c69f2e16-ca8f-5c31-8654-3fd1943aa075/Perf/mmapmany.html

These calls stand out on 6.1:
========================================
17.56%  stress-ng  [kernel.kallsyms]  [k] memset
 7.00%  stress-ng  [kernel.kallsyms]  [k] build_detached_freelist
 4.78%  stress-ng  [kernel.kallsyms]  [k] kmem_cache_alloc_bulk
========================================

The call graph is:

--24.57%--mas_preallocate
          |
          --24.57%--mas_alloc_nodes
                     |
                     |--22.07%--kmem_cache_alloc_bulk
                     |          |
                     |          |--15.70%--memset

I will upload the complete call graphs for kernels 6.0 and 6.1 for you to check.

I should also mention that we see big gains for other system calls; check the overview here:

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7262-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.0.0/2022-10-07T08:57:36.582552vs2022-11-23T08:31:54.400000/c69f2e16-ca8f-5c31-8654-3fd1943aa075/index.html#overview

The performance drop for mmapmany is the exception. So I think the only action needed is to double-check whether we can mitigate the slowdown for mmapmany. If not, we can probably close this BZ because
(a) there is a trade-off: page faults get faster, calls to mmap get slower, and
(b) we don't see any slowdown in real applications, just in the synthetic benchmarks.

Thanks a lot
Jirka
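For reference, a minimal way to capture the same call graph without the wrapper (the file name and timeout are placeholders; the wrapper script presumably records something similar):

perf record -g -o mmapmany.perf.data -- stress-ng --mmapmany 1 --timeout 20s
perf report --stdio -i mmapmany.perf.data

With -g in perf record, perf report shows the call chains, so memset can be attributed to mas_alloc_nodes/kmem_cache_alloc_bulk as above.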
(In reply to Jiri Hladky from comment #11)
> Created attachment 1934125 [details]
> Proposal for the patch to improve the mmap performance

Jirka,

Here's an ELN scratch build with the attached patch:
https://koji.fedoraproject.org/koji/taskinfo?taskID=95620541

-- Rafael
Rafael, thanks a lot!

I have created a permanent repo here:
http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Dec-23_10h58m11s_kernel-6.1.0-0.rc6.46.test.eln124.repo

and I have scheduled Beaker jobs. I will post the results here.

Thanks
Jirka
Hi Rafael, happy new year!

The patch helps: mmap performance has improved by 20%.

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7542-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.1.0/2022-11-23T08:31:54.400000vs2022-12-23T10:49:01.300000/6c9b653d-5ea6-54c1-a549-8ddcb0e8d3f2/index.html#overview

I will inform upstream. Let's try to get the patch into the mainline kernel.

Jirka
Hi Rafael,

is there any news on the patch?

We have compared 6.3.0-63.eln126 vs 6.1.0-65.eln124, and mmapmany performance has degraded by another 30% - 50%. The trend worries me: 6.1 regressed compared to 6.0, and it is now getting worse again.

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/intel-icelake-gold-6330-2s.lab.eng.brq2.redhat.com/RHEL-9.2.0-20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:13:31.164572/8b23d6ac-6ac2-5fdc-b78f-544ec059f5a5/index.html#mmapmany_section

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc4-genoa-9354p-1s.lab.eng.brq2.redhat.com/RHEL-9.2.0-20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:13:31.164572/94429302-618e-5a8e-a6ee-33a689d50464/index.html#mmapmany_section

Thanks a lot!
Jirka
(In reply to Jiri Hladky from comment #15)
> Hi Rafael,
>
> are there any news on the patch?
>
> We have compared 6.3.0-63.eln126 vs 6.1.0-65.eln124 and mmapmany performance
> has degraded by another 30% - 50%. The trend worries me - 6.1 saw
> degradation compared to the 6.0 version, and it's worsening again.

Jirka, the particular patch we discussed back in December was merged upstream in v6.3-rc1:

commit 541e06b772c1aaffb3b6a245ccface36d7107af2
Author: Liam Howlett <liam.howlett>
Date:   Thu Jan 5 16:05:34 2023 +0000

    maple_tree: remove GFP_ZERO from kmem_cache_alloc() and kmem_cache_alloc_bulk()

so it is part of the tested 6.3.0-63.eln126 kernel. Whatever is causing the difference has to be something else among the ~32k patches integrated upstream between v6.1 and v6.3 ...

> http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/intel-icelake-gold-6330-2s.lab.eng.brq2.redhat.com/RHEL-9.2.0-20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:13:31.164572/8b23d6ac-6ac2-5fdc-b78f-544ec059f5a5/index.html#mmapmany_section
>
> http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc4-genoa-9354p-1s.lab.eng.brq2.redhat.com/RHEL-9.2.0-20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:13:31.164572/94429302-618e-5a8e-a6ee-33a689d50464/index.html#mmapmany_section
>
> Thanks a lot!
> Jirka
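For anyone re-checking whether a given source tree contains the fix, two quick queries against a clone of the upstream kernel tree (a sketch; lib/maple_tree.c is where the maple tree lives upstream):

git merge-base --is-ancestor 541e06b772c1aaffb3b6a245ccface36d7107af2 v6.3-rc1 && echo "fix is in v6.3-rc1"
git log --oneline v6.1..v6.3-rc1 -- lib/maple_tree.c

The second command lists every maple_tree change in the suspect v6.1..v6.3 range, which may help narrow down the remaining regression.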
Hi Rafael,

thanks for the confirmation that the patch was merged. We will keep an eye on this particular benchmark as we test the upstream kernels.

Jirka
Hi Jirka,

Are you still seeing this in kernel-ark (rawhide/eln), or can this be closed?

Scott
Hi Scott,

yes, we still see the performance degradation with the latest tested kernel, 6.4.0-59.eln127.

Compared to RHEL-9.2 (5.14.0-284.11.1.el9_2), the performance drop is over 60%:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/p1-gen2.tpb.lab.eng.brq.redhat.com/RHEL-9.2.0vsRHEL-9.3.0-20230521.45/2023-05-26T15:25:50.300000vs2023-07-03T08:18:44.655306/f08eb9ff-66fa-5d5f-bff6-844676cdce99/index.html#mmapmany_section

Here is the perf record/report diff for that particular test:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/p1-gen2.tpb.lab.eng.brq.redhat.com/RHEL-9.2.0vsRHEL-9.3.0-20230521.45/2023-05-26T15:25:50.300000vs2023-07-03T08:18:44.655306/f08eb9ff-66fa-5d5f-bff6-844676cdce99/Perf/mmapmany.html

In the 6.4 kernel, these functions appear at the top and seem to be the main source of the performance degradation:

 4.93%  stress-ng-mmapm  [kernel.kallsyms]  [k] build_detached_freelist
 4.85%  stress-ng-mmapm  [kernel.kallsyms]  [k] __slab_free
 4.48%  stress-ng-mmapm  [kernel.kallsyms]  [k] ___slab_alloc
 3.56%  stress-ng-mmapm  [kernel.kallsyms]  [k] __kmem_cache_alloc_bulk
 3.13%  stress-ng-mmapm  stress-ng          [.] stress_mmapmany_child
 3.03%  stress-ng-mmapm  [kernel.kallsyms]  [k] mtree_range_walk

Could you look into this? It can be easily reproduced - see the description for the details. To install the latest ELN kernel, use this repo:

repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo

and compare the results against the RHEL-9.2.0 vanilla numbers. The test runs for ~23 seconds, and the problem can be reproduced in a VM - no need for bare metal.

Thanks a lot
Jirka
I forgot to mention that you need RHEL-9.3.0-20230521.45 or later to install kernel-6.4.0-59.eln127.

Here are the results from a 1minutetip machine - it took me ~15 minutes in total to reproduce the problem.

1minutetip -n 1MT-RHEL-9.3.0-20230521.45

Commands executed:

repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
cd scheduler-benchmarks/Stress_ng-test/
./manual_test.sh --perf-record mmapmany
reboot
cd scheduler-benchmarks/Stress_ng-test/
./manual_test.sh --perf-record mmapmany

Results:

[root@ci-vm-10-0-136-39 Stress_ng-test]# tail -v -n+1 *summary
==> 5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record.summary <==
5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record: stress-ng bogo-ops per second summary of results for different stressors
mmapmany 111361.403087

==> 6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record.summary <==
6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record: stress-ng bogo-ops per second summary of results for different stressors
mmapmany 49583.785148

=> performance drop of ~55% (49583.8 vs 111361.4 bogo-ops per second)

[root@ci-vm-10-0-136-39 Stress_ng-test]# head -v -n 20 *perfrecord.perf.txt
==> 5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record_mmapmany_001_perfrecord.perf.txt <==
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 91K of event 'cpu-clock:pppH'
# Event count (approx.): 22949750000
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...........................................
#
     8.16%  stress-ng-mmapm  [kernel.kallsyms]  [k] do_user_addr_fault
     6.05%  stress-ng-mmapm  stress-ng          [.] stress_mmapmany_child
     6.02%  stress-ng-mmapm  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
     3.81%  stress-ng-mmapm  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     3.62%  stress-ng-mmapm  [kernel.kallsyms]  [k] find_vma
     3.38%  stress-ng-mmapm  [kernel.kallsyms]  [k] __rcu_read_unlock
     3.07%  stress-ng-mmapm  [kernel.kallsyms]  [k] __vma_adjust
     3.01%  stress-ng-mmapm  [kernel.kallsyms]  [k] __rcu_read_lock
     2.75%  stress-ng-mmapm  [kernel.kallsyms]  [k] unmapped_area_topdown

==> 6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record_mmapmany_001_perfrecord.perf.txt <==
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 87K of event 'cpu-clock:pppH'
# Event count (approx.): 21787500000
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...........................................
#
     7.43%  stress-ng-mmapm  [kernel.kallsyms]  [k] ___slab_alloc
     4.84%  stress-ng-mmapm  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     4.05%  stress-ng-mmapm  [kernel.kallsyms]  [k] __kmem_cache_alloc_bulk
     3.92%  stress-ng-mmapm  [kernel.kallsyms]  [k] do_user_addr_fault
     3.90%  stress-ng-mmapm  [kernel.kallsyms]  [k] __slab_free
     3.60%  stress-ng-mmapm  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
     3.46%  stress-ng-mmapm  [kernel.kallsyms]  [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.kallsyms]  [k] mtree_range_walk
     3.10%  stress-ng-mmapm  stress-ng          [.] stress_mmapmany_child
=> The slowdown is caused by these kernel functions, which appear at the top of the 6.4 profile:

 7.43%  stress-ng-mmapm  [kernel.kallsyms]  [k] ___slab_alloc
 4.84%  stress-ng-mmapm  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 4.05%  stress-ng-mmapm  [kernel.kallsyms]  [k] __kmem_cache_alloc_bulk
 3.92%  stress-ng-mmapm  [kernel.kallsyms]  [k] do_user_addr_fault
 3.90%  stress-ng-mmapm  [kernel.kallsyms]  [k] __slab_free
 3.60%  stress-ng-mmapm  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
 3.46%  stress-ng-mmapm  [kernel.kallsyms]  [k] build_detached_freelist
 3.16%  stress-ng-mmapm  [kernel.kallsyms]  [k] mtree_range_walk

Thanks
Jirka
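If the raw perf.data files from both runs are kept, perf can also quantify the per-symbol shift directly; a sketch (the file names are placeholders for whatever the wrapper script writes out):

perf diff baseline-5.14.perf.data regressed-6.4.perf.data

perf diff prints the overhead delta for each symbol, which makes the slab/maple-tree entries above easy to spot.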
Hi Rafael,

Would you have a little time to spare to look into this?

Thank you!
Scott
(In reply to Jiri Hladky from comment #20)
> I forgot to mention that you need RHEL-9.3.0-20230521.45 or later to install
> kernel-6.4.0-59.eln127
>
> Here are the results from 1minutetip container - it took me ~15 minutes in
> total to reproduce the problem.
>
> 1minutetip -n 1MT-RHEL-9.3.0-20230521.45
>
> Command executed:
> repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo
> dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
> git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
> cd scheduler-benchmarks/Stress_ng-test/
> ./manual_test.sh --perf-record mmapmany
> reboot
> cd scheduler-benchmarks/Stress_ng-test/
> ./manual_test.sh --perf-record mmapmany

After instrumenting the maple tree code in the latest upstream kernel to see how it uses the slab allocator, I got the following maple_node allocation and freeing stats after running

  # ./manual_test.sh --perf-record mmapmany

kmem_cache_alloc()      - 12,384,365
kmem_cache_alloc_bulk()
  size 1:            1
  size 2:            0
  size 3:          634
  size 4:           21
  size 5-9:     60,693
  size 10-19: 6,343,240
  size 20-29:        0
  size >= 30:        1

kmem_cache_free()       - 29,195,121
kmem_cache_free_bulk()
  size 1:          511
  size 2:        1,098
  size 3:        1,494
  size 4:        7,734
  size 5-9:    843,483
  size 10-19: 5,550,772
  size 20-29:        0
  size >= 30:        0

It can be seen that there is a lot of slab allocation and freeing, especially bulk operations of size 10-19.

My test results were:

#
 4.67%  stress-ng-mmapm  [kernel.vmlinux]  [k] ___slab_alloc
 4.45%  stress-ng-mmapm  [kernel.vmlinux]  [k] build_detached_freelist
 3.16%  stress-ng-mmapm  [kernel.vmlinux]  [k] __kmem_cache_alloc_bulk
 3.16%  stress-ng-mmapm  stress-ng         [.] stress_mmapmany_child
 2.83%  stress-ng-mmapm  [kernel.vmlinux]  [k] __slab_free
 2.65%  stress-ng-mmapm  [kernel.vmlinux]  [k] anon_vma_interval_tree_insert
 2.55%  stress-ng-mmapm  [kernel.vmlinux]  [k] perf_iterate_ctx
 2.10%  stress-ng-mmapm  [kernel.vmlinux]  [k] kmem_cache_free_bulk.part.0

mmapmany: 59441.085073 bogo-ops-per-second

The maple_node slab is 8k in size and has 32 objects per slab. I tried the simple experiment of doubling the slab size to 16k and got the following results:

#
 3.38%  stress-ng-mmapm  [kernel.vmlinux]  [k] build_detached_freelist
 3.33%  stress-ng-mmapm  stress-ng         [.] stress_mmapmany_child
 3.03%  stress-ng-mmapm  [kernel.vmlinux]  [k] __kmem_cache_alloc_bulk
 2.78%  stress-ng-mmapm  [kernel.vmlinux]  [k] ___slab_alloc
 2.58%  stress-ng-mmapm  [kernel.vmlinux]  [k] perf_iterate_ctx
 2.56%  stress-ng-mmapm  [kernel.vmlinux]  [k] get_mem_cgroup_from_mm
 2.53%  stress-ng-mmapm  [kernel.vmlinux]  [k] anon_vma_interval_tree_insert
 2.46%  stress-ng-mmapm  [kernel.vmlinux]  [k] __slab_free

mmapmany: 65249.766589 bogo-ops-per-second

That is an almost 10% performance improvement just from doubling the slab size. For further improvement, we may need to do more optimization in the interaction between the maple tree and SLUB.
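For anyone who wants to approximate the doubled-slab experiment without patching the kernel, one coarse option is raising the minimum SLUB slab order at boot and checking the result in sysfs. This is only a sketch: slub_min_order applies to every slab cache, not just maple_node, so treat it as a rough check rather than a targeted fix (paths assume SLUB sysfs is available).

# add to the kernel command line and reboot (order 2 = 16 KiB slabs with 4 KiB pages)
slub_min_order=2

# verify that the maple_node cache picked up the larger slab
cat /sys/kernel/slab/maple_node/order           # expect 2
cat /sys/kernel/slab/maple_node/objs_per_slab   # expect 64 for a 256-byte maple_node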