Bug 2149636 - Performance drop visible on mmapmany test starting from kernel-6.1.0-0.rc1.15.eln122
Summary: Performance drop visible on mmapmany test starting from kernel-6.1.0-0.rc1.15.eln122
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 2239808
Depends On:
Blocks:
 
Reported: 2022-11-30 13:43 UTC by Kamil Kolakowski
Modified: 2023-09-30 04:31 UTC
CC List: 18 users

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:



Description Kamil Kolakowski 2022-11-30 13:43:59 UTC
Description of problem:
We see a significant performance drop in the mmapmany test (a subtest of the stress-ng benchmark). The same drop also shows up in the corresponding subtests of the Libmicro benchmark.

The drop is up to 80%, and we see it on many different systems, both AMD and Intel.

amd-epyc2-rome-7702-2s.lab.eng.brq2.redhat.com: 
kernel-6.0.0-54.eln121.x86_64 1 thread result 184587 bogo_ops
kernel-6.1.0-0.rc1.15.eln122.x86_64 1 thread result 67901 bogo_ops

amd-epyc3-milan-7313-2s.tpb.lab.eng.brq.redhat.com
kernel-6.0.0-54.eln121.x86_64 1 thread result 340082 bogo_ops
kernel-6.1.0-0.rc1.15.eln122.x86_64 1 thread result 94601 bogo_ops

We checked the perf record outputs and see differences in the hot kernel symbols, which possibly explains the problem.

==> 6.0.0-54.eln121.x86_64_threads_1_mmapmany_003_perfrecord.perf.txt <==

# Overhead  Command    Shared Object         Symbol                                      
# ........  .........  ....................  ...........................................
#
   22.01%  stress-ng  libc.so.6             [.] __munmap
   16.43%  stress-ng  [kernel.kallsyms]     [k] syscall_enter_from_user_mode
    5.78%  stress-ng  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
    5.63%  stress-ng  libc.so.6             [.] __mmap
    4.49%  stress-ng  [kernel.kallsyms]     [k] find_vma
    3.62%  stress-ng  [kernel.kallsyms]     [k] __vma_adjust
    2.87%  stress-ng  [kernel.kallsyms]     [k] unmapped_area_topdown
    2.51%  stress-ng  [kernel.kallsyms]     [k] kmem_cache_free
    2.32%  stress-ng  [kernel.kallsyms]     [k] vm_area_dup

==> 6.1.0-0.rc6.46.eln123.x86_64_threads_1_mmapmany_003_perfrecord.perf.txt <==

#
# Overhead  Command    Shared Object         Symbol                                      
# ........  .........  ....................  ...........................................
#
    9.80%  stress-ng  libc.so.6             [.] __munmap
    7.33%  stress-ng  [kernel.kallsyms]     [k] syscall_enter_from_user_mode
    6.70%  stress-ng  [kernel.kallsyms]     [k] memset_erms
    4.56%  stress-ng  [kernel.kallsyms]     [k] kmem_cache_alloc_bulk
    4.34%  stress-ng  [kernel.kallsyms]     [k] build_detached_freelist
    3.60%  stress-ng  [kernel.kallsyms]     [k] mas_walk
    3.53%  stress-ng  [kernel.kallsyms]     [k] ___slab_alloc
    3.34%  stress-ng  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
    3.34%  stress-ng  [kernel.kallsyms]     [k] mas_wr_walk


Version-Release number of selected component (if applicable):
kernel-6.1.0-0.rc1.15.eln122

How reproducible:
1. Reserve any system
2. Install the kernels you need to compare; in this example:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Oct-07_10h36m54s_kernel-6.0.0-54.eln121.repo
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck list available
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
3. Do rhts-reboot to boot this kernel
4. Install the Red Hat IT root certificate:
curl --insecure --output /etc/pki/ca-trust/source/anchors/RH-IT-Root-CA.crt  https://password.corp.redhat.com/RH-IT-Root-CA.crt
update-ca-trust extract
5. Clone the perf team tools:
git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
6. Go to the test directory:
cd scheduler-benchmarks/Stress_ng-test/
7. Run ./manual_test.sh --perf-record mmapmany
8. Install the affected kernel with:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Oct-21_10h19m36s_kernel-6.1.0-0.rc1.15.eln122.repo
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck list available
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
9. Do rhts-reboot to boot this kernel
10. run ./manual_test.sh --perf-record mmapmany
11. Compare the results (bogo-ops numbers)

Comment 1 Jiri Hladky 2022-11-30 23:19:58 UTC
Changing the product to Fedora/rawhide with the keyword 'Regression'.

Comment 3 Jiri Hladky 2022-12-02 13:33:48 UTC
Hi Rafael,

thanks a lot for sharing this! 

Here is the comparison of perf record results for mmapmany between kernels 6.1 (on the right) and 6.0 (on the left). 

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7262-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.0.0/2022-10-07T08:57:36.582552vs2022-11-23T08:31:54.400000/c69f2e16-ca8f-5c31-8654-3fd1943aa075/Perf/mmapmany.html

These calls stand out on 6.1:

========================================
    17.56%  stress-ng  [kernel.kallsyms]  [k] memset
     7.00%  stress-ng  [kernel.kallsyms]  [k] build_detached_freelist
     4.78%  stress-ng  [kernel.kallsyms]  [k] kmem_cache_alloc_bulk
========================================

The call graph is:
--24.57%--mas_preallocate
          |
           --24.57%--mas_alloc_nodes
                     |
                     |--22.07%--kmem_cache_alloc_bulk
                     |          |
                     |          |--15.70%--memset


I will upload the complete call graph for kernels 6.0 and 6.1 for you to check. 

I should also mention that we see big gains for other system calls. Check the overview here. 
 
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7262-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.0.0/2022-10-07T08:57:36.582552vs2022-11-23T08:31:54.400000/c69f2e16-ca8f-5c31-8654-3fd1943aa075/index.html#overview

The performance drop for mmapmany is the exception. 

So I think the only action needed is to double-check whether we can mitigate the slowdown for mmapmany. If not, we can probably close this BZ because (a) there is a trade-off: page faults get faster while calls to mmap get slower, and (b) we don't see any slowdown in real applications, just in the synthetic benchmarks. 

Thanks a lot
Jirka

Comment 12 Rafael Aquini 2022-12-23 05:29:53 UTC
(In reply to Jiri Hladky from comment #11)
> Created attachment 1934125 [details]
> Proposal for the patch to improve the mmap performance

Jirka,

Here's an ELN scratch build with the attached patch:

https://koji.fedoraproject.org/koji/taskinfo?taskID=95620541

-- Rafael

Comment 13 Jiri Hladky 2022-12-23 10:39:22 UTC
Rafael,

thanks a lot! I have created a permanent repo here:
http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Dec-23_10h58m11s_kernel-6.1.0-0.rc6.46.test.eln124.repo

and I have scheduled Beaker jobs. 

I will post the results here. 

Thanks
Jirka

Comment 14 Jiri Hladky 2023-01-03 10:34:53 UTC
Hi Rafael,

happy new year! 

The patch helps: mmap performance has improved by 20%. 

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7542-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.1.0/2022-11-23T08:31:54.400000vs2022-12-23T10:49:01.300000/6c9b653d-5ea6-54c1-a549-8ddcb0e8d3f2/index.html#overview

I will inform upstream. Let's try to get the patch into the mainline kernel. 

Jirka

Comment 16 Rafael Aquini 2023-06-08 19:37:32 UTC
(In reply to Jiri Hladky from comment #15)
> Hi Rafael,
> 
> are there any news on the patch? 
> 
> We have compared 6.3.0-63.eln126 vs 6.1.0-65.eln124 and mmapmany performance
> has degraded by another 30% - 50%. The trend worries me - 6.1 saw
> degradation compared to the 6.0 version, and it's worsening again. 
> 
Jirka, the particular patch we discussed back in December is merged in
upstream v6.3-rc1:

commit 541e06b772c1aaffb3b6a245ccface36d7107af2
Author: Liam Howlett <liam.howlett>
Date:   Thu Jan 5 16:05:34 2023 +0000

    maple_tree: remove GFP_ZERO from kmem_cache_alloc() and kmem_cache_alloc_bulk()


So, it is part of the tested 6.3.0-63.eln126 target kernel.

Whatever is causing the diff has to be something else among the 32k patches 
integrated upstream between v6.1 and v6.3 ...
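
For context, the commit above removes GFP_ZERO from the maple tree's bulk node
preallocation. A simplified, illustrative sketch of that allocation pattern
(kernel code; not the actual lib/maple_tree.c implementation, and the function
name here is made up):

========================================
#include <linux/slab.h>

/* Illustration only: bulk-allocate 'requested' maple nodes into 'nodes'.
 * kmem_cache_alloc_bulk() returns the number of objects it allocated.
 * Before the commit above, the maple tree passed __GFP_ZERO here, which
 * forced a memset of every node - the memset hot spot in the 6.1 profiles. */
static int alloc_nodes_bulk(struct kmem_cache *maple_node_cache,
                            void **nodes, size_t requested)
{
        int got = kmem_cache_alloc_bulk(maple_node_cache, GFP_KERNEL,
                                        requested, nodes);

        return (size_t)got == requested ? 0 : -ENOMEM;
}
========================================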


> http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/
> stress-ng/intel-icelake-gold-6330-2s.lab.eng.brq2.redhat.com/RHEL-9.2.0-
> 20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:
> 13:31.164572/8b23d6ac-6ac2-5fdc-b78f-544ec059f5a5/index.html#mmapmany_section
> 
> http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/
> stress-ng/amd-epyc4-genoa-9354p-1s.lab.eng.brq2.redhat.com/RHEL-9.2.0-
> 20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:
> 13:31.164572/94429302-618e-5a8e-a6ee-33a689d50464/index.html#mmapmany_section
>
> 
> Thanks a lot!
> Jirka

Comment 17 Jiri Hladky 2023-06-08 22:45:44 UTC
Hi Rafael,

thanks for the confirmation that the patch was merged. 

We will keep an eye on this particular benchmark as we test the upstream kernels. 

Jirka

Comment 18 Scott Weaver 2023-07-03 17:51:12 UTC
Hi Jirka,

Are you still seeing this in kernel-ark (rawhide/eln) or can this be closed?

Scott

Comment 19 Jiri Hladky 2023-07-03 18:28:32 UTC
Hi Scott,

yes, we still see the performance degradation with the latest tested kernel, 6.4.0-59.eln127.

Compared to RHEL-9.2 (5.14.0-284.11.1.el9_2), the performance drop is over 60%:

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/p1-gen2.tpb.lab.eng.brq.redhat.com/RHEL-9.2.0vsRHEL-9.3.0-20230521.45/2023-05-26T15:25:50.300000vs2023-07-03T08:18:44.655306/f08eb9ff-66fa-5d5f-bff6-844676cdce99/index.html#mmapmany_section

Here is the diff of the perf record/report for that particular test:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/p1-gen2.tpb.lab.eng.brq.redhat.com/RHEL-9.2.0vsRHEL-9.3.0-20230521.45/2023-05-26T15:25:50.300000vs2023-07-03T08:18:44.655306/f08eb9ff-66fa-5d5f-bff6-844676cdce99/Perf/mmapmany.html

In the 6.4 kernel, these kernel functions appear at the top of the profile and seem to be the main source of the performance degradation:

4.93%  stress-ng-mmapm  [kernel.kallsyms]  [k] build_detached_freelist
4.85%  stress-ng-mmapm  [kernel.kallsyms]  [k] __slab_free
4.48%  stress-ng-mmapm  [kernel.kallsyms]  [k] ___slab_alloc
3.56%  stress-ng-mmapm  [kernel.kallsyms]  [k] __kmem_cache_alloc_bulk
3.13%  stress-ng-mmapm  stress-ng          [.] stress_mmapmany_child
3.03%  stress-ng-mmapm  [kernel.kallsyms]  [k] mtree_range_walk

Could you look into this? 

It can be easily replicated - see the description for the details. To install the latest ELN kernel, use this repo:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo

and compare the results against the RHEL-9.2.0 vanilla numbers. The test runs for ~23 seconds, and the problem can be reproduced in a VM - no need for bare metal. 


Thanks a lot
Jirka

Comment 20 Jiri Hladky 2023-07-03 18:59:34 UTC
I forgot to mention that you need RHEL-9.3.0-20230521.45 or later to install kernel-6.4.0-59.eln127.

Here are the results from a 1minutetip container - it took me ~15 minutes in total to reproduce the problem. 

1minutetip -n 1MT-RHEL-9.3.0-20230521.45

Commands executed:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
cd scheduler-benchmarks/Stress_ng-test/
./manual_test.sh --perf-record mmapmany
reboot
cd scheduler-benchmarks/Stress_ng-test/
./manual_test.sh --perf-record mmapmany


Results:
[root@ci-vm-10-0-136-39 Stress_ng-test]# tail -v -n+1 *summary
==> 5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record.summary <==
5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record: stress-ng bogo-ops per second summary of results for different stressors
mmapmany        111361.403087 

==> 6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record.summary <==
6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record: stress-ng bogo-ops per second summary of results for different stressors
mmapmany        49583.785148 


=> performance drop of ~55% (49584 vs. 111361 bogo-ops per second)

[root@ci-vm-10-0-136-39 Stress_ng-test]# head -v -n 20 *perfrecord.perf.txt
==> 5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record_mmapmany_001_perfrecord.perf.txt <==
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 91K of event 'cpu-clock:pppH'
# Event count (approx.): 22949750000
#
# Overhead  Command          Shared Object         Symbol                                     
# ........  ...............  ....................  ...........................................
#
     8.16%  stress-ng-mmapm  [kernel.kallsyms]     [k] do_user_addr_fault
     6.05%  stress-ng-mmapm  stress-ng             [.] stress_mmapmany_child
     6.02%  stress-ng-mmapm  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
     3.81%  stress-ng-mmapm  [kernel.kallsyms]     [k] _raw_spin_unlock_irqrestore
     3.62%  stress-ng-mmapm  [kernel.kallsyms]     [k] find_vma
     3.38%  stress-ng-mmapm  [kernel.kallsyms]     [k] __rcu_read_unlock
     3.07%  stress-ng-mmapm  [kernel.kallsyms]     [k] __vma_adjust
     3.01%  stress-ng-mmapm  [kernel.kallsyms]     [k] __rcu_read_lock
     2.75%  stress-ng-mmapm  [kernel.kallsyms]     [k] unmapped_area_topdown

==> 6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record_mmapmany_001_perfrecord.perf.txt <==
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 87K of event 'cpu-clock:pppH'
# Event count (approx.): 21787500000
#
# Overhead  Command          Shared Object         Symbol                                     
# ........  ...............  ....................  ...........................................
#
     7.43%  stress-ng-mmapm  [kernel.kallsyms]     [k] ___slab_alloc
     4.84%  stress-ng-mmapm  [kernel.kallsyms]     [k] _raw_spin_unlock_irqrestore
     4.05%  stress-ng-mmapm  [kernel.kallsyms]     [k] __kmem_cache_alloc_bulk
     3.92%  stress-ng-mmapm  [kernel.kallsyms]     [k] do_user_addr_fault
     3.90%  stress-ng-mmapm  [kernel.kallsyms]     [k] __slab_free
     3.60%  stress-ng-mmapm  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
     3.46%  stress-ng-mmapm  [kernel.kallsyms]     [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.kallsyms]     [k] mtree_range_walk
     3.10%  stress-ng-mmapm  stress-ng             [.] stress_mmapmany_child


=> the slowdown is caused by these kernel functions, which appear at the top of the profile on the 6.4 kernel:

     7.43%  stress-ng-mmapm  [kernel.kallsyms]     [k] ___slab_alloc
     4.84%  stress-ng-mmapm  [kernel.kallsyms]     [k] _raw_spin_unlock_irqrestore
     4.05%  stress-ng-mmapm  [kernel.kallsyms]     [k] __kmem_cache_alloc_bulk
     3.92%  stress-ng-mmapm  [kernel.kallsyms]     [k] do_user_addr_fault
     3.90%  stress-ng-mmapm  [kernel.kallsyms]     [k] __slab_free
     3.60%  stress-ng-mmapm  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
     3.46%  stress-ng-mmapm  [kernel.kallsyms]     [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.kallsyms]     [k] mtree_range_walk


Thanks
Jirka

Comment 21 Scott Weaver 2023-07-07 17:15:36 UTC
Hi Rafael,

Would you have a little time to spare to look into this?
Thank you!
Scott

Comment 22 Waiman Long 2023-07-09 00:28:55 UTC
(In reply to Jiri Hladky from comment #20)
> I forgot to mention that you need RHEL-9.3.0-20230521.45 or later to install
> kernel-6.4.0-59.eln127
> 
> Here are the results from 1minutetip container - it took me ~15 minutes in
> total to reproduce the problem. 
> 
> 1minutetip -n 1MT-RHEL-9.3.0-20230521.45
> 
> Command executed:
> repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-
> 03_09h54m52s_kernel-6.4.0-59.eln127.repo
> dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp"
> --nogpgcheck update kernel
> git -c http.sslVerify=false clone
> https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.
> git
> cd scheduler-benchmarks/Stress_ng-test/
> ./manual_test.sh --perf-record mmapmany
> reboot
> cd scheduler-benchmarks/Stress_ng-test/
> ./manual_test.sh --perf-record mmapmany

After instrumenting the maple tree code in the latest upstream kernel
to see how it uses the slab allocator, I got the following maple_node
allocation and freeing stats after running

# ./manual_test.sh --perf-record mmapmany

kmem_cache_alloc() - 12,384,365
kmem_cache_alloc_bulk()
 size 1: 1
 size 2: 0
 size 3: 634
 size 4: 21
 size 5-9: 60,693
 size 10-19: 6,343,240
 size 20-29: 0
 size >= 30: 1
 
kmem_cache_free() - 29,195,121
kmem_cache_free_bulk()
 size 1: 511
 size 2: 1,098
 size 3: 1,494
 size 4: 7,734
 size 5-9: 843,483
 size 10-19: 5,550,772
 size 20-29: 0
 size >= 30: 0
 
It can be seen that there are quite a lot of slab allocations and frees,
especially bulk ones of size 10-19.

My test results were

#
     4.67%  stress-ng-mmapm  [kernel.vmlinux]  [k] ___slab_alloc
     4.45%  stress-ng-mmapm  [kernel.vmlinux]  [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.vmlinux]  [k] __kmem_cache_alloc_bulk
     3.16%  stress-ng-mmapm  stress-ng         [.] stress_mmapmany_child
     2.83%  stress-ng-mmapm  [kernel.vmlinux]  [k] __slab_free
     2.65%  stress-ng-mmapm  [kernel.vmlinux]  [k] anon_vma_interval_tree_insert
     2.55%  stress-ng-mmapm  [kernel.vmlinux]  [k] perf_iterate_ctx
     2.10%  stress-ng-mmapm  [kernel.vmlinux]  [k] kmem_cache_free_bulk.part.0
mmapmany: 59441.085073 bogo-ops-per-second

The maple_node slab is 8k in size and has 32 objects per slab. I tried the simple experiment of doubling the slab size to 16k and got the following results:

#
     3.38%  stress-ng-mmapm  [kernel.vmlinux]  [k] build_detached_freelist
     3.33%  stress-ng-mmapm  stress-ng         [.] stress_mmapmany_child
     3.03%  stress-ng-mmapm  [kernel.vmlinux]  [k] __kmem_cache_alloc_bulk
     2.78%  stress-ng-mmapm  [kernel.vmlinux]  [k] ___slab_alloc
     2.58%  stress-ng-mmapm  [kernel.vmlinux]  [k] perf_iterate_ctx
     2.56%  stress-ng-mmapm  [kernel.vmlinux]  [k] get_mem_cgroup_from_mm
     2.53%  stress-ng-mmapm  [kernel.vmlinux]  [k] anon_vma_interval_tree_insert
     2.46%  stress-ng-mmapm  [kernel.vmlinux]  [k] __slab_free
mmapmany: 65249.766589 bogo-ops-per-second

Doubling the slab size gave an almost 10% performance improvement. For
further gains, we may need to do more optimization in the interaction
between the maple tree and SLUB.

Comment 29 Jiri Hladky 2023-09-12 13:03:56 UTC
Update from Liam R. Howlett <Liam.Howlett>
===========================================================================
An email on the maple tree mailing list about a performance gain on mmapmany:

http://lists.infradead.org/pipermail/maple-tree/2023-September/002799.html

************************************************************************************************
kernel test robot noticed a 21.4% improvement of stress-ng.mmapmany.ops_per_sec on:


commit: 17983dc617837a588a52848ab4034d8efa6c1fa6 ("maple_tree: refine mas_preallocate() node calculations")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
************************************************************************************************


There is also future work with the allocator that's being developed in
an attempt to further speed things up:
https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@suse.cz/
===========================================================================

The upstream results align with our testing - see comment #28. The great news is that the community is actively working on this, and we expect additional improvements soon. 

Thanks
Jirka

Comment 32 Jiri Hladky 2023-09-29 13:23:21 UTC
*** Bug 2239808 has been marked as a duplicate of this bug. ***

Comment 33 Jiri Hladky 2023-09-29 13:27:55 UTC
Libmicro shows a regression of up to a factor of 2x for the mmap, munmap, and mprot tests. 

I'm going to attach a minimal reproducer in C to show the performance degradation for munmap.

gcc -Wall -Wextra -O1 mmap_munmap.c -o mmap_munmap
./run_mmap_munmap.sh
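
The attached mmap_munmap.c is the authoritative source; a rough, self-contained
sketch of the kind of loop it times (assumptions: 8 kiB anonymous mappings and
the TSC read via __rdtsc()) could look like this:

========================================
/* Sketch only - the attached mmap_munmap.c is the authoritative reproducer. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <x86intrin.h>                  /* __rdtsc() */

int main(void)
{
    const size_t len = 8 * 1024;        /* 8 kiB per mapping */
    const long iterations = 1048576;    /* number of munmap calls */
    unsigned long long cycles = 0;

    for (long i = 0; i < iterations; i++) {
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }
        p[0] = 1;                       /* fault the first page in */

        unsigned long long t0 = __rdtsc();
        if (munmap(p, len) != 0) {
            perror("munmap");
            return EXIT_FAILURE;
        }
        cycles += __rdtsc() - t0;       /* count only the munmap cost */
    }

    printf("TSC for %ld munmap calls with len of 8kiB: %llu K-cycles.  Avg: %g K-cycles/call\n",
           iterations, cycles / 1000, (double)cycles / 1000.0 / iterations);
    return 0;
}
========================================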

Results show that the performance drop started in kernel 6.1. In 6.6, there was an improvement, but results are still 2x slower than in the 6.0 kernel:

$ tail -v -n+1 *mmap_munmap.log
==> 5.14.0-339.el9.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 2851688 K-cycles.  Avg: 2.71958 K-cycles/call

==> 6.0.0-54.eln121.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 2855807 K-cycles.  Avg: 2.72351 K-cycles/call

==> 6.1.0-0.rc1.15.eln122.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 5366007 K-cycles.  Avg: 5.11742 K-cycles/call

==> 6.5.0-57.eln130.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 5757203 K-cycles.  Avg: 5.4905 K-cycles/call

==> 6.6.0-0.rc2.20.eln130.x86_64_mmap_munmap.log <==
TSC for 1048576 munmap calls with len of 8kiB: 5044454 K-cycles.  Avg: 4.81077 K-cycles/call

Here is the perf record/report result for kernel 6.6:
======================================================================================
$ head -30 6.6.0-0.rc2.20.eln130.x86_64_mmap_munmap.perf
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 21K of event 'cycles'
# Event count (approx.): 18470612451
#
# Overhead  Command      Shared Object         Symbol                                     
# ........  ...........  ....................  ...........................................
#
     7.39%  mmap_munmap  mmap_munmap           [.] main
     3.89%  mmap_munmap  [kernel.kallsyms]     [k] sync_regs
     3.60%  mmap_munmap  [kernel.kallsyms]     [k] perf_iterate_ctx
     3.41%  mmap_munmap  [kernel.kallsyms]     [k] __folio_throttle_swaprate
     3.22%  mmap_munmap  [kernel.kallsyms]     [k] native_irq_return_iret
     3.17%  mmap_munmap  [kernel.kallsyms]     [k] native_flush_tlb_one_user
     2.75%  mmap_munmap  [kernel.kallsyms]     [k] get_mem_cgroup_from_mm
     2.44%  mmap_munmap  [kernel.kallsyms]     [k] __slab_free
     2.06%  mmap_munmap  [kernel.kallsyms]     [k] mas_wr_node_store
     1.99%  mmap_munmap  [kernel.kallsyms]     [k] charge_memcg
     1.82%  mmap_munmap  [kernel.kallsyms]     [k] try_charge_memcg
     1.57%  mmap_munmap  [kernel.kallsyms]     [k] kmem_cache_alloc
     1.39%  mmap_munmap  [kernel.kallsyms]     [k] kmem_cache_free
     1.21%  mmap_munmap  [kernel.kallsyms]     [k] __rcu_read_lock
     1.11%  mmap_munmap  [kernel.kallsyms]     [k] mtree_range_walk
     1.06%  mmap_munmap  [kernel.kallsyms]     [k] __handle_mm_fault
     1.04%  mmap_munmap  [kernel.kallsyms]     [k] __rcu_read_unlock
     1.04%  mmap_munmap  [kernel.kallsyms]     [k] up_write
     1.01%  mmap_munmap  [kernel.kallsyms]     [k] mas_rev_awalk
======================================================================================
Please note the maple-tree-related functions (mtree_*, mas_*) - see also https://docs.kernel.org/core-api/maple_tree.html
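
For readers unfamiliar with those functions: the maple tree is the data
structure that replaced the rbtree and VMA linked list for tracking VMAs as of
kernel 6.1. A minimal, simplified kernel-side usage sketch of the documented
mtree_* API (unrelated to the reproducer, just to show what the profiled
functions belong to):

========================================
#include <linux/maple_tree.h>

static DEFINE_MTREE(example_mt);        /* an empty maple tree */

static int maple_example(void *ptr)
{
        int ret;

        /* Store 'ptr' for the index range [10, 20]; the nodes come from
         * the maple_node slab cache discussed earlier in this BZ. */
        ret = mtree_store_range(&example_mt, 10, 20, ptr, GFP_KERNEL);
        if (ret)
                return ret;

        /* A lookup walks the tree (cf. mtree_range_walk/mas_walk above). */
        return mtree_load(&example_mt, 15) == ptr ? 0 : -EINVAL;
}
========================================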

Comment 34 Jiri Hladky 2023-09-29 13:30:50 UTC
Created attachment 1991090 [details]
Simple reproducer for munmap written in .c

This is a minimal reproducer in C to show the performance degradation for munmap. 

gcc -Wall -Wextra -O1 mmap_munmap.c -o mmap_munmap
./run_mmap_munmap.sh

It includes results from an Intel Icelake Platinum 8351N server with the RHEL-9.3 Beta kernel (5.14.0-339.el9) and upstream kernels 6.0, 6.1 rc1, 6.5, and 6.6 rc2.

Comment 35 Jiri Hladky 2023-09-30 04:18:12 UTC
On the Maple Tree mailing list, a non-synthetic performance report was posted:
http://lists.infradead.org/pipermail/maple-tree/2023-September/002908.html

Test "wordcount" from Phoenix-2.0 https://github.com/kozyraki/phoenix/blob/master/phoenix-2.0/tests/word_count/word_count.c shows a significant performance drop with Maple Trees. Phoenix is a shared-memory implementation of Google's MapReduce model for data-intensive processing tasks.

Reproducer:
===================================================================================
git clone https://github.com/kozyraki/phoenix.git
cd phoenix/phoenix-2.0
make
wget http://csl.stanford.edu/~christos/data/word_count.tar.gz
tar xvf word_count.tar.gz
time phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
===================================================================================

Test "wordcount" shows the most significant performance degradation with the Maple Tree compared to RB trees when run with many threads. Below are the results from an Intel Icelake server with a 36-core Platinum 8351N CPU:

72 threads:
==> 6.0.0-54.eln121.x86_64_allthreads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
 Time (mean ± σ):     797.4 ms ±  47.0 ms    [User: 4868.8 ms, System: 10793.5 ms]
 Range (min … max):   707.0 ms … 849.8 ms    10 runs
 

==> 6.5.0-57.eln130.x86_64_allthreads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
 Time (mean ± σ):      1.673 s ±  0.265 s    [User: 9.574 s, System: 54.777 s]
 Range (min … max):    1.282 s …  2.240 s    10 runs
 

==> 6.6.0-0.rc2.20.eln130.x86_64_allthreads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
 Time (mean ± σ):      1.223 s ±  0.038 s    [User: 13.107 s, System: 20.346 s]
 Range (min … max):    1.153 s …  1.273 s    10 runs

System time is 5x longer with the 6.5 kernel compared to 6.0. In 6.6, there is a significant improvement, but performance is still ~2x slower than on the 6.0 kernel. 

When I repeat the test on the same server but just with four threads, I'm getting much better results:

==> 6.0.0-54.eln121.x86_64_4threads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
 Time (mean ± σ):     689.4 ms ±   3.3 ms    [User: 2350.9 ms, System: 191.9 ms]
 Range (min … max):   684.5 ms … 693.8 ms    10 runs
 

==> 6.5.0-57.eln130.x86_64_4threads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
 Time (mean ± σ):     699.3 ms ±   2.4 ms    [User: 2371.8 ms, System: 205.8 ms]
 Range (min … max):   695.4 ms … 702.6 ms    10 runs
 

==> 6.6.0-0.rc2.20.eln130.x86_64_4threads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
 Time (mean ± σ):     696.5 ms ±   3.0 ms    [User: 2370.1 ms, System: 194.3 ms]
 Range (min … max):   691.2 ms … 700.2 ms    10 runs

In this scenario, there is barely any difference between kernels 6.0, 6.5, and 6.6.

Comment 36 Jiri Hladky 2023-09-30 04:29:43 UTC
Test "wordcount" from Phoenix-2.0 heavily uses the mprotect system call. It's worth noting that the mprotect syscall with Maple Trees is up to 1.5x slower than with RB trees, as shown by Libmicro - look for "mprot" in this Libmicro result:

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/Libmicro-test/intel-icelake-platinum-8351n-1s.lab.eng.brq2.redhat.com/RHEL-9.3.0-20230718.0vsRHEL-9.3.0-20230718.0/2023-07-20T14:03:11.500000vs2023-09-13T08:52:32.238798/13242074-ed62-50b0-b211-15b64f64c1c1/index.html

I have created a simple C reproducer. It times the mprotect syscall over a large mmapped region where we alternate PROT_NONE and PROT_READ | PROT_WRITE for 128 kiB areas. You can run it like this:
gcc -Wall -Wextra -O1 mmap_mprotect.c -o mmap_mprotect
./mmap_mprotect 128 32768

It will allocate 128kiB * 32768 = 4 GiB of memory.

Here is the central part where we measure the performance: 

     // Time the mprotect calls - alternate protection between PROT_NONE and PROT_READ | PROT_WRITE
     u_int64_t start_rdtsc = start_clock();

     for (i = 0; i < iterations; i++) {
       if (i % 2 == 0) {
         prot = PROT_NONE;
       } else {
         prot = PROT_READ | PROT_WRITE;
       }
       ret = mprotect((void *)ts_map + i * mmap_len, mmap_len, prot);
       if (ret != 0) {
         perror("mprotect");
         printf("mprotect error at iteration %d of %ld\n", i, iterations);
       }
     }

     u_int64_t stop_rdtsc = stop_clock();
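
Since the excerpt above omits the setup and the start_clock()/stop_clock()
helpers, here is a self-contained approximation of the whole reproducer (a
sketch only, under the assumptions above; the attached mmap_mprotect.c is
authoritative and may differ in details):

========================================
/* Sketch approximating the attached mmap_mprotect.c - not the actual source. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <x86intrin.h>                          /* __rdtsc() */

static u_int64_t start_clock(void) { return __rdtsc(); }
static u_int64_t stop_clock(void)  { return __rdtsc(); }

int main(int argc, char **argv)
{
    /* ./mmap_mprotect 128 32768 -> 128 kiB areas, 32768 mprotect calls */
    size_t mmap_len   = (argc > 1 ? strtoul(argv[1], NULL, 10) : 128) * 1024;
    long   iterations = (argc > 2 ? strtol(argv[2], NULL, 10) : 32768);

    /* One large anonymous mapping covering all areas (4 GiB by default) */
    char *ts_map = mmap(NULL, (size_t)iterations * mmap_len,
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ts_map == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Time the mprotect calls - alternate PROT_NONE and PROT_READ | PROT_WRITE */
    u_int64_t start_rdtsc = start_clock();
    for (long i = 0; i < iterations; i++) {
        int prot = (i % 2 == 0) ? PROT_NONE : (PROT_READ | PROT_WRITE);
        if (mprotect(ts_map + i * mmap_len, mmap_len, prot) != 0) {
            perror("mprotect");
            printf("mprotect error at iteration %ld of %ld\n", i, iterations);
        }
    }
    u_int64_t stop_rdtsc = stop_clock();

    printf("TSC for %ld mprotect calls with len of %zukiB: %llu K-cycles.  Avg: %g K-cycles/call\n",
           iterations, mmap_len / 1024,
           (unsigned long long)(stop_rdtsc - start_rdtsc) / 1000,
           (double)(stop_rdtsc - start_rdtsc) / 1000.0 / iterations);

    munmap(ts_map, (size_t)iterations * mmap_len);
    return 0;
}
========================================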


Results:
==> 6.0.0-54.eln121.x86_64_mmap_mprotect.log <==
TSC for 32768 mprotect calls with len of 128kiB: 202866 K-cycles.  Avg: 6.191 K-cycles/call

==> 6.5.0-57.eln130.x86_64_mmap_mprotect.log <==
TSC for 32768 mprotect calls with len of 128kiB: 302327 K-cycles.  Avg: 9.22631 K-cycles/call

==> 6.6.0-0.rc2.20.eln130.x86_64_mmap_mprotect.log <==
TSC for 32768 mprotect calls with len of 128kiB: 269886 K-cycles.  Avg: 8.2363 K-cycles/call


Compared to the 6.0 kernel, 6.5 shows a performance slowdown by a factor of 1.48x. 6.6 performs better, but mprotect is still 1.32x slower than on the 6.0 kernel.

