Bug 2149636 - Performance drop visible on mmapmany test starting from kernel-6.1.0-0.rc1.15.eln122
Summary: Performance drop visible on mmapmany test starting from kernel-6.1.0-0.rc1.15.eln122
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-11-30 13:43 UTC by Kamil Kolakowski
Modified: 2023-07-26 12:46 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments

Description Kamil Kolakowski 2022-11-30 13:43:59 UTC
Description of problem:
We see a significant performance drop on the mmapmany test (a subtest of the stress-ng benchmark). We also see this drop on the corresponding subtest of the Libmicro benchmark.

The drop we see is up to 80%, and it appears on many different systems, both AMD and Intel.
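
For reference, the stressor can also be run directly with stress-ng (a rough sketch of an equivalent single-thread run; the exact options used by manual_test.sh may differ):
stress-ng --mmapmany 1 --timeout 23s --metrics-brief
The bogo-ops-per-second figure for the mmapmany stressor is printed in the metrics summary.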

amd-epyc2-rome-7702-2s.lab.eng.brq2.redhat.com: 
kernel-6.0.0-54.eln121.x86_64 1 thread result 184587 bogo_ops
kernel-6.1.0-0.rc1.15.eln122.x86_64 1 thread result 67901 bogo_ops

amd-epyc3-milan-7313-2s.tpb.lab.eng.brq.redhat.com
kernel-6.0.0-54.eln121.x86_64 1 thread result 340082 bogo_ops
kernel-6.1.0-0.rc1.15.eln122.x86_64 1 thread result 94601 bogo_ops

In the perf record outputs we see differences in the hottest kernel functions, which may explain the problem.

==> 6.0.0-54.eln121.x86_64_threads_1_mmapmany_003_perfrecord.perf.txt <==

# Overhead  Command    Shared Object         Symbol                                      
# ........  .........  ....................  ...........................................
#
   22.01%  stress-ng  libc.so.6             [.] __munmap
   16.43%  stress-ng  [kernel.kallsyms]     [k] syscall_enter_from_user_mode
    5.78%  stress-ng  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
    5.63%  stress-ng  libc.so.6             [.] __mmap
    4.49%  stress-ng  [kernel.kallsyms]     [k] find_vma
    3.62%  stress-ng  [kernel.kallsyms]     [k] __vma_adjust
    2.87%  stress-ng  [kernel.kallsyms]     [k] unmapped_area_topdown
    2.51%  stress-ng  [kernel.kallsyms]     [k] kmem_cache_free
    2.32%  stress-ng  [kernel.kallsyms]     [k] vm_area_dup

==> 6.1.0-0.rc6.46.eln123.x86_64_threads_1_mmapmany_003_perfrecord.perf.txt <==

#
# Overhead  Command    Shared Object         Symbol                                      
# ........  .........  ....................  ...........................................
#
    9.80%  stress-ng  libc.so.6             [.] __munmap
    7.33%  stress-ng  [kernel.kallsyms]     [k] syscall_enter_from_user_mode
    6.70%  stress-ng  [kernel.kallsyms]     [k] memset_erms
    4.56%  stress-ng  [kernel.kallsyms]     [k] kmem_cache_alloc_bulk
    4.34%  stress-ng  [kernel.kallsyms]     [k] build_detached_freelist
    3.60%  stress-ng  [kernel.kallsyms]     [k] mas_walk
    3.53%  stress-ng  [kernel.kallsyms]     [k] ___slab_alloc
    3.34%  stress-ng  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
    3.34%  stress-ng  [kernel.kallsyms]     [k] mas_wr_walk
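
If the raw perf.data files from both runs are kept, the two profiles can also be compared directly with perf diff (a sketch; the file names below are placeholders):
perf diff perf.data.6.0.0-54.eln121 perf.data.6.1.0-0.rc1.15.eln122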


Version-Release number of selected component (if applicable):
kernel-6.1.0-0.rc1.15.eln122

How reproducible:
1. Reserve any system
2. Install the kernels you want to compare; in this example:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Oct-07_10h36m54s_kernel-6.0.0-54.eln121.repo
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck list available
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
3. Run rhts-reboot to boot this kernel.
4. Install the Red Hat certificate:
curl --insecure --output /etc/pki/ca-trust/source/anchors/RH-IT-Root-CA.crt  https://password.corp.redhat.com/RH-IT-Root-CA.crt
update-ca-trust extract
5. Clone the perf team tools:
git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
6. Go to the test directory:
cd scheduler-benchmarks/Stress_ng-test/
7. Run ./manual_test.sh --perf-record mmapmany
8. Install the affected kernel:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Oct-21_10h19m36s_kernel-6.1.0-0.rc1.15.eln122.repo
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck list available
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
9. Run rhts-reboot to boot this kernel.
10. run ./manual_test.sh --perf-record mmapmany
11. Compare the results (bogo-ops); a quick way to do this is sketched below.
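
A quick way to do the comparison once both runs have finished (a sketch, assuming the script writes per-kernel *summary files into the test directory):
uname -r                 # confirm which kernel is booted before each run
tail -v -n+1 *summary    # bogo-ops per second for each tested kernel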

Comment 1 Jiri Hladky 2022-11-30 23:19:58 UTC
Changing the product to Fedora/rawhide with the keyword 'Regression'.

Comment 3 Jiri Hladky 2022-12-02 13:33:48 UTC
Hi Rafael,

thanks a lot for sharing this! 

Here is the comparison of perf record results for mmapmany between kernels 6.1 (on the right) and 6.0 (on the left). 

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7262-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.0.0/2022-10-07T08:57:36.582552vs2022-11-23T08:31:54.400000/c69f2e16-ca8f-5c31-8654-3fd1943aa075/Perf/mmapmany.html

These kernel functions stand out on 6.1:

========================================
    17.56%  stress-ng  [kernel.kallsyms]  [k] memset
     7.00%  stress-ng  [kernel.kallsyms]  [k] build_detached_freelist
     4.78%  stress-ng  [kernel.kallsyms]  [k] kmem_cache_alloc_bulk
========================================

The call graph is:
--24.57%--mas_preallocate
          |
           --24.57%--mas_alloc_nodes
                     |
                     |--22.07%--kmem_cache_alloc_bulk
                     |          |
                     |          |--15.70%--memset
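
For reference, a call graph like the one above can be produced directly with perf (a sketch; manual_test.sh --perf-record collects the equivalent data as part of the test run):
perf record -g -- stress-ng --mmapmany 1 --timeout 23s
perf report --stdio -g graph,0.5,caller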


I will upload the complete call graph for kernels 6.0 and 6.1 for you to check. 

I should also mention that we see big gains for other system calls. Check the overview here. 
 
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7262-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.0.0/2022-10-07T08:57:36.582552vs2022-11-23T08:31:54.400000/c69f2e16-ca8f-5c31-8654-3fd1943aa075/index.html#overview

Performance drop for mmapmany is the exception. 

So I think the only action needed is to double-check whether we can mitigate the slowdown for mmapmany. If not, we can probably close this BZ because (a) there is a trade-off: page faults get faster while mmap calls get slower, and (b) we don't see any slowdown in real applications, only in the synthetic benchmarks.

Thanks a lot
Jirka

Comment 12 Rafael Aquini 2022-12-23 05:29:53 UTC
(In reply to Jiri Hladky from comment #11)
> Created attachment 1934125 [details]
> Proposal for the patch to improve the mmap performance

Jirka,

Here's an ELN scratch build with the attached patch:

https://koji.fedoraproject.org/koji/taskinfo?taskID=95620541

-- Rafael

Comment 13 Jiri Hladky 2022-12-23 10:39:22 UTC
Rafael,

thanks a lot! I have created a permanent repo here:
http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2022-Dec-23_10h58m11s_kernel-6.1.0-0.rc6.46.test.eln124.repo

and I have scheduled Beaker jobs. 

I will post the results here. 

Thanks
Jirka

Comment 14 Jiri Hladky 2023-01-03 10:34:53 UTC
Hi Rafael,

happy new year! 

The patch helps: mmap performance has improved by 20%.

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/amd-epyc2-rome-7542-2s.lab.eng.brq2.redhat.com/RHEL-9.0.0vsRHEL-9.1.0/2022-11-23T08:31:54.400000vs2022-12-23T10:49:01.300000/6c9b653d-5ea6-54c1-a549-8ddcb0e8d3f2/index.html#overview

I will inform upstream. Let's try to get the patch into the mainline kernel.

Jirka

Comment 16 Rafael Aquini 2023-06-08 19:37:32 UTC
(In reply to Jiri Hladky from comment #15)
> Hi Rafael,
> 
> are there any news on the patch? 
> 
> We have compared 6.3.0-63.eln126 vs 6.1.0-65.eln124 and mmapmany performance
> has degraded by another 30% - 50%. The trend worries me - 6.1 saw
> degradation compared to the 6.0 version, and it's worsening again. 
> 
Jirka, the particular patch we discussed back in December is merged in
upstream v6.3-rc1:

commit 541e06b772c1aaffb3b6a245ccface36d7107af2
Author: Liam Howlett <liam.howlett>
Date:   Thu Jan 5 16:05:34 2023 +0000

    maple_tree: remove GFP_ZERO from kmem_cache_alloc() and kmem_cache_alloc_bulk()


So it is part of the 6.3.0-63.eln126 kernel that was tested.

Whatever is causing the regression must be something else among the ~32k
patches integrated upstream between v6.1 and v6.3.
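
For reference, a rough way to confirm the fix is present and to narrow down the remaining regression (a sketch, assuming a mainline git checkout and a machine that can boot custom builds; not part of the original reproduction steps):
git tag --contains 541e06b772c1aaffb3b6a245ccface36d7107af2   # should list v6.3-rc1 and later
git bisect start v6.3 v6.1
# build, install and boot the candidate kernel, then:
cd scheduler-benchmarks/Stress_ng-test/ && ./manual_test.sh --perf-record mmapmany
git bisect good   # or 'git bisect bad', depending on the bogo-ops result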


> http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/
> stress-ng/intel-icelake-gold-6330-2s.lab.eng.brq2.redhat.com/RHEL-9.2.0-
> 20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:
> 13:31.164572/8b23d6ac-6ac2-5fdc-b78f-544ec059f5a5/index.html#mmapmany_section
> 
> http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/
> stress-ng/amd-epyc4-genoa-9354p-1s.lab.eng.brq2.redhat.com/RHEL-9.2.0-
> 20230115.7vsRHEL-9.2.0-20230115.7/2023-05-18T09:27:31.000000vs2023-05-03T10:
> 13:31.164572/94429302-618e-5a8e-a6ee-33a689d50464/index.html#mmapmany_section
>
> 
> Thanks a lot!
> Jirka

Comment 17 Jiri Hladky 2023-06-08 22:45:44 UTC
Hi Rafael,

thanks for the confirmation that the patch was merged. 

We will keep an eye on this particular benchmark as we test the upstream kernels. 

Jirka

Comment 18 Scott Weaver 2023-07-03 17:51:12 UTC
Hi Jirka,

Are you still seeing this in kernel-ark (rawhide/eln) or can this be closed?

Scott

Comment 19 Jiri Hladky 2023-07-03 18:28:32 UTC
Hi Scott,

yes, we still see the performance degradation with the latest tested kernel 6.4.0-59.eln127

Compared to RHEL-9.2 (5.14.0-284.11.1.el9_2), the performance drop is over 60%:

http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/p1-gen2.tpb.lab.eng.brq.redhat.com/RHEL-9.2.0vsRHEL-9.3.0-20230521.45/2023-05-26T15:25:50.300000vs2023-07-03T08:18:44.655306/f08eb9ff-66fa-5d5f-bff6-844676cdce99/index.html#mmapmany_section

Here is the perf record/report diff for that particular test:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/stress-ng/p1-gen2.tpb.lab.eng.brq.redhat.com/RHEL-9.2.0vsRHEL-9.3.0-20230521.45/2023-05-26T15:25:50.300000vs2023-07-03T08:18:44.655306/f08eb9ff-66fa-5d5f-bff6-844676cdce99/Perf/mmapmany.html

In the 6.4 kernel, these kernel functions appear at the top and seem to be the main source of the performance degradation:

4.93%  stress-ng-mmapm  [kernel.kallsyms]  [k] build_detached_freelist
4.85%  stress-ng-mmapm  [kernel.kallsyms]  [k] __slab_free
4.48%  stress-ng-mmapm  [kernel.kallsyms]  [k] ___slab_alloc
3.56%  stress-ng-mmapm  [kernel.kallsyms]  [k] __kmem_cache_alloc_bulk
3.13%  stress-ng-mmapm  stress-ng          [.] stress_mmapmany_child
3.03%  stress-ng-mmapm  [kernel.kallsyms]  [k] mtree_range_walk

Could you look into this? 

It can easily be replicated - see the description for details. To install the latest ELN kernel, use this repo:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo

and compare the results against the RHEL-9.2.0 vanilla numbers. The test runs for ~23 seconds, and the problem can be reproduced in a VM - no need for bare metal. 


Thanks a lot
Jirka

Comment 20 Jiri Hladky 2023-07-03 18:59:34 UTC
I forgot to mention that you need RHEL-9.3.0-20230521.45 or later to install kernel-6.4.0-59.eln127

Here are the results from 1minutetip container - it took me ~15 minutes in total to reproduce the problem. 

1minutetip -n 1MT-RHEL-9.3.0-20230521.45

Command executed:
repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-03_09h54m52s_kernel-6.4.0-59.eln127.repo
dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp" --nogpgcheck update kernel
git -c http.sslVerify=false clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
cd scheduler-benchmarks/Stress_ng-test/
./manual_test.sh --perf-record mmapmany
reboot
cd scheduler-benchmarks/Stress_ng-test/
./manual_test.sh --perf-record mmapmany


Results:
[root@ci-vm-10-0-136-39 Stress_ng-test]# tail -v -n+1 *summary
==> 5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record.summary <==
5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record: stress-ng bogo-ops per second summary of results for different stressors
mmapmany        111361.403087 

==> 6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record.summary <==
6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record: stress-ng bogo-ops per second summary of results for different stressors
mmapmany        49583.785148 


=> performance drop of ~55% (49,584 vs 111,361 bogo-ops per second)

[root@ci-vm-10-0-136-39 Stress_ng-test]# head -v -n 20 *perfrecord.perf.txt
==> 5.14.0-316.el9.x86_64_threads_1_iterations_1_perf_record_mmapmany_001_perfrecord.perf.txt <==
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 91K of event 'cpu-clock:pppH'
# Event count (approx.): 22949750000
#
# Overhead  Command          Shared Object         Symbol                                     
# ........  ...............  ....................  ...........................................
#
     8.16%  stress-ng-mmapm  [kernel.kallsyms]     [k] do_user_addr_fault
     6.05%  stress-ng-mmapm  stress-ng             [.] stress_mmapmany_child
     6.02%  stress-ng-mmapm  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
     3.81%  stress-ng-mmapm  [kernel.kallsyms]     [k] _raw_spin_unlock_irqrestore
     3.62%  stress-ng-mmapm  [kernel.kallsyms]     [k] find_vma
     3.38%  stress-ng-mmapm  [kernel.kallsyms]     [k] __rcu_read_unlock
     3.07%  stress-ng-mmapm  [kernel.kallsyms]     [k] __vma_adjust
     3.01%  stress-ng-mmapm  [kernel.kallsyms]     [k] __rcu_read_lock
     2.75%  stress-ng-mmapm  [kernel.kallsyms]     [k] unmapped_area_topdown

==> 6.4.0-59.eln127.x86_64_threads_1_iterations_1_perf_record_mmapmany_001_perfrecord.perf.txt <==
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 87K of event 'cpu-clock:pppH'
# Event count (approx.): 21787500000
#
# Overhead  Command          Shared Object         Symbol                                     
# ........  ...............  ....................  ...........................................
#
     7.43%  stress-ng-mmapm  [kernel.kallsyms]     [k] ___slab_alloc
     4.84%  stress-ng-mmapm  [kernel.kallsyms]     [k] _raw_spin_unlock_irqrestore
     4.05%  stress-ng-mmapm  [kernel.kallsyms]     [k] __kmem_cache_alloc_bulk
     3.92%  stress-ng-mmapm  [kernel.kallsyms]     [k] do_user_addr_fault
     3.90%  stress-ng-mmapm  [kernel.kallsyms]     [k] __slab_free
     3.60%  stress-ng-mmapm  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
     3.46%  stress-ng-mmapm  [kernel.kallsyms]     [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.kallsyms]     [k] mtree_range_walk
     3.10%  stress-ng-mmapm  stress-ng             [.] stress_mmapmany_child


=> the slowdown is caused by these kernel functions, which appear near the top of the profile on the 6.4 kernel:

     7.43%  stress-ng-mmapm  [kernel.kallsyms]     [k] ___slab_alloc
     4.84%  stress-ng-mmapm  [kernel.kallsyms]     [k] _raw_spin_unlock_irqrestore
     4.05%  stress-ng-mmapm  [kernel.kallsyms]     [k] __kmem_cache_alloc_bulk
     3.92%  stress-ng-mmapm  [kernel.kallsyms]     [k] do_user_addr_fault
     3.90%  stress-ng-mmapm  [kernel.kallsyms]     [k] __slab_free
     3.60%  stress-ng-mmapm  [kernel.kallsyms]     [k] anon_vma_interval_tree_insert
     3.46%  stress-ng-mmapm  [kernel.kallsyms]     [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.kallsyms]     [k] mtree_range_walk


Thanks
Jirka

Comment 21 Scott Weaver 2023-07-07 17:15:36 UTC
Hi Rafael,

Would you have a little time to spare to look into this?
Thank you!
Scott

Comment 22 Waiman Long 2023-07-09 00:28:55 UTC
(In reply to Jiri Hladky from comment #20)
> I forgot to mention that you need RHEL-9.3.0-20230521.45 or later to install
> kernel-6.4.0-59.eln127
> 
> Here are the results from 1minutetip container - it took me ~15 minutes in
> total to reproduce the problem. 
> 
> 1minutetip -n 1MT-RHEL-9.3.0-20230521.45
> 
> Command executed:
> repo=http://repos.perfqe.tpb.lab.eng.brq.redhat.com/Kernel/2023-Jul-
> 03_09h54m52s_kernel-6.4.0-59.eln127.repo
> dnf --repofrompath "tmp,${repo}" --disablerepo="*" --enablerepo="tmp"
> --nogpgcheck update kernel
> git -c http.sslVerify=false clone
> https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.
> git
> cd scheduler-benchmarks/Stress_ng-test/
> ./manual_test.sh --perf-record mmapmany
> reboot
> cd scheduler-benchmarks/Stress_ng-test/
> ./manual_test.sh --perf-record mmapmany

After instrumenting the maple tree code in the latest upstream kernel
to see how it uses the slab allocator, I got the following maple_node
allocation and freeing stats after running

# ./manual_test.sh --perf-record mmapmany

kmem_cache_alloc() - 12,384,365
kmem_cache_alloc_bulk()
 size 1: 1
 size 2: 0
 size 3: 634
 size 4: 21
 size 5-9: 60,693
 size 10-19: 6,343,240
 size 20-29: 0
 size >= 30: 1
 
kmem_cache_free() - 29,195,121
kmem_cache_free_bulk()
 size 1: 511
 size 2: 1,098
 size 3: 1,494
 size 4: 7,734
 size 5-9: 843,483
 size 10-19: 5,550,772
 size 20-29: 0
 size >= 30: 0
 
It can be seen that there are quite a lot of slab allocations and frees,
especially bulk operations in the 10-19 size range.
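
Without an instrumented kernel, a rough way to count these calls is a kprobe via perf (a sketch; the functions may be inlined or renamed on a given build):
perf probe --add kmem_cache_alloc_bulk --add kmem_cache_free_bulk
perf stat -a -e probe:kmem_cache_alloc_bulk -e probe:kmem_cache_free_bulk -- ./manual_test.sh --perf-record mmapmany
perf probe --del '*'   # remove the probes afterwards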

My test results were

#
     4.67%  stress-ng-mmapm  [kernel.vmlinux]  [k] ___slab_alloc
     4.45%  stress-ng-mmapm  [kernel.vmlinux]  [k] build_detached_freelist
     3.16%  stress-ng-mmapm  [kernel.vmlinux]  [k] __kmem_cache_alloc_bulk
     3.16%  stress-ng-mmapm  stress-ng         [.] stress_mmapmany_child
     2.83%  stress-ng-mmapm  [kernel.vmlinux]  [k] __slab_free
     2.65%  stress-ng-mmapm  [kernel.vmlinux]  [k] anon_vma_interval_tree_insert
     2.55%  stress-ng-mmapm  [kernel.vmlinux]  [k] perf_iterate_ctx
     2.10%  stress-ng-mmapm  [kernel.vmlinux]  [k] kmem_cache_free_bulk.part.0
mmapmany: 59441.085073 bogo-ops-per-second

The maple_node slab is 8k in size and has 32 objects per slab. I tried the simple experiment of doubling the slab size to 16k and got the following results:

#
     3.38%  stress-ng-mmapm  [kernel.vmlinux]  [k] build_detached_freelist
     3.33%  stress-ng-mmapm  stress-ng         [.] stress_mmapmany_child
     3.03%  stress-ng-mmapm  [kernel.vmlinux]  [k] __kmem_cache_alloc_bulk
     2.78%  stress-ng-mmapm  [kernel.vmlinux]  [k] ___slab_alloc
     2.58%  stress-ng-mmapm  [kernel.vmlinux]  [k] perf_iterate_ctx
     2.56%  stress-ng-mmapm  [kernel.vmlinux]  [k] get_mem_cgroup_from_mm
     2.53%  stress-ng-mmapm  [kernel.vmlinux]  [k] anon_vma_interval_tree_insert
     2.46%  stress-ng-mmapm  [kernel.vmlinux]  [k] __slab_free
mmapmany: 65249.766589 bogo-ops-per-second

Doubling the slab size gave an almost 10% performance improvement.
For further gains, we may need more optimization of the interaction
between the maple tree and SLUB.
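
A rough way to approximate the larger-slab experiment without a rebuilt kernel could be the slub_min_order= boot parameter, though it raises the minimum slab order for every cache rather than just maple_node (a sketch, assuming SLUB and grubby):
grubby --update-kernel=ALL --args="slub_min_order=2"
reboot
cat /sys/kernel/slab/maple_node/order   # should now report 2 (16k slabs)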

