Description of problem:

This dm-crypt performance issue, especially on large NUMA systems, impacts all releases of RHEL8 because it predates RHEL8. Here is the upstream commit that details the issue and associated fix:

commit 528b16bfc3ae5f11638e71b3b63a81f9999df727
Author: Arne Welzel <arne.welzel>
Date:   Sat Aug 14 00:40:38 2021 +0200

    dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()

    On systems with many cores using dm-crypt, heavy spinlock contention in
    percpu_counter_compare() can be observed when the page allocation limit
    for a given device is reached or close to be reached. This is due to
    percpu_counter_compare() taking a spinlock to compute an exact result on
    potentially many CPUs at the same time.

    Switch to non-exact comparison of allocated and allowed pages by using
    the value returned by percpu_counter_read_positive() to avoid taking the
    percpu_counter spinlock.

    This may over/under estimate the actual number of allocated pages by at
    most (batch-1) * num_online_cpus(). Currently, batch is bounded by 32.
    The system on which this issue was first observed has 256 CPUs and 512GB
    of RAM. With a 4k page size, this change may over/under estimate by 31MB.
    With ~10G (2%) allowed dm-crypt allocations, this seems an acceptable
    error. Certainly preferred over running into the spinlock contention.

    This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
    and 192GB RAM as follows, but can be provoked on systems with less CPUs
    as well.

     * Disable swap
     * Tune vm settings to promote regular writeback
         $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
         $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
         $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
     * Create 8 dmcrypt devices based on files on a tmpfs
     * Create and mount an ext4 filesystem on each crypt devices
     * Run stress-ng --hdd 8 within one of above filesystems

    Total %system usage collected from sysstat goes to ~35%. Write throughput
    on the underlying loop device is ~2GB/s. perf profiling an individual
    kworker kcryptd thread shows the following profile, indicating spinlock
    contention in percpu_counter_compare():

        99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
          |
          --ret_from_fork
            kthread
            worker_thread
            |
            --99.92%--process_one_work
               |
               |--80.52%--kcryptd_crypt
               |    |
               |    |--62.58%--mempool_alloc
               |    |    |
               |    |     --62.24%--crypt_page_alloc
               |    |         |
               |    |          --61.51%--__percpu_counter_compare
               |    |              |
               |    |               --61.34%--__percpu_counter_sum
               |    |                   |
               |    |                   |--58.68%--_raw_spin_lock_irqsave
               |    |                   |    |
               |    |                   |     --58.30%--native_queued_spin_lock_slowpath
               |    |                   |
               |    |                    --0.69%--cpumask_next
               |    |                        |
               |    |                         --0.51%--_find_next_bit
               |    |
               |    |--10.61%--crypt_convert
               |    |    |
               |    |    |--6.05%--xts_crypt
               ...

    After applying this patch and running the same test, %system usage is
    lowered to ~7% and write throughput on the loop device increases to
    ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62% in
    the profile and not hitting the percpu_counter() spinlock anymore.

        |--8.15%--mempool_alloc
        |    |
        |    |--3.93%--crypt_page_alloc
        |    |    |
        |    |     --3.75%--__alloc_pages
        |    |         |
        |    |          --3.62%--get_page_from_freelist
        |    |              |
        |    |               --3.22%--rmqueue_bulk
        |    |                   |
        |    |                    --2.59%--_raw_spin_lock
        |    |                        |
        |    |                         --2.57%--native_queued_spin_lock_slowpath
        |    |
        |     --3.05%--_raw_spin_lock_irqsave
        |         |
        |          --2.49%--native_queued_spin_lock_slowpath

    Suggested-by: DJ Gregor <dj>
    Reviewed-by: Mikulas Patocka <mpatocka>
    Signed-off-by: Arne Welzel <arne.welzel>
    Fixes: 5059353df86e ("dm crypt: limit the number of allocated pages")
    Cc: stable.org
    Signed-off-by: Mike Snitzer <snitzer>
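For reference, below is a minimal sketch of what the change amounts to in crypt_page_alloc() (drivers/md/dm-crypt.c). This is a simplified illustration based on the commit message above, not the verbatim upstream patch or the RHEL8 backport; the names cc->n_allocated_pages and dm_crypt_pages_per_client follow the upstream dm-crypt source, and additional gfp-flag checks present in the real function are elided.

    /*
     * Simplified sketch of crypt_page_alloc(), the mempool page allocator
     * callback for a dm-crypt device. Extra conditions from the real
     * function (e.g. gfp-flag checks) are elided.
     */
    static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
    {
            struct crypt_config *cc = pool_data;
            struct page *page;

            /*
             * Before the fix the limit check was roughly:
             *
             *   if (percpu_counter_compare(&cc->n_allocated_pages,
             *                              dm_crypt_pages_per_client) >= 0)
             *           return NULL;
             *
             * percpu_counter_compare() falls back to an exact sum under the
             * counter's spinlock whenever the rough count is close to the
             * limit, so many kcryptd workers serialize on that lock.
             *
             * After the fix, percpu_counter_read_positive() returns a
             * lock-free approximate value, off by at most
             * (batch - 1) * num_online_cpus() pages.
             */
            if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >=
                         dm_crypt_pages_per_client))
                    return NULL;

            page = alloc_page(gfp_mask);
            if (likely(page != NULL))
                    percpu_counter_add(&cc->n_allocated_pages, 1);

            return page;
    }

Since the counter only enforces a soft per-device memory cap, trading exactness for a lock-free read is what removes the native_queued_spin_lock_slowpath time seen in the first profile above.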
Hi,

Could you provide more detailed test steps to reproduce the bug?
(In reply to guazhang from comment #4)
> Hi,
>
> Could you provide more detailed test steps to reproduce the bug?

This is a scalability/performance bug. The more IO you issue to a dm-crypt device, particularly on a very large SMP/NUMA machine, the more the spinlock contention will hurt performance.

So to test: throw as much IO at a fast dm-crypt device as you can on a large system with a lot of cpus. You'd do well to have the dm-crypt device layered on fast storage that handles seeks easily (e.g. ssd, nvme, ramdisk, etc). Then just increase the thread count on your synthetic IO load generator (e.g. fio). The kernel with the fix should consume less CPU time and yield better IO performance.
Hi,

I tested stress-ng in a VM and found that the test kernel's %sys utilization is higher than the default kernel's.

[root@test1 test8]# stress-ng --cpu 4 --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8
stress-ng: info:  [250380] dispatching hogs: 4 cpu, 32 io, 4 vm, 4 fork, 8 hdd
stress-ng: info:  [250380] successful run completed in 120.39s (2 mins, 0.39 secs)

[root@test1 test8]# pwd
/home/test8

[root@test1 test8]# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
loop0     7:0    0  500M  0 loop
`-test1 253:0    0  484M  0 crypt /home/test1
loop1     7:1    0  500M  0 loop
`-test2 253:1    0  484M  0 crypt /home/test2
loop2     7:2    0  500M  0 loop
`-test3 253:2    0  484M  0 crypt /home/test3
loop3     7:3    0  500M  0 loop
`-test4 253:3    0  484M  0 crypt /home/test4
loop4     7:4    0  500M  0 loop
`-test5 253:4    0  484M  0 crypt /home/test5
loop5     7:5    0  500M  0 loop
`-test6 253:5    0  484M  0 crypt /home/test6
loop6     7:6    0  500M  0 loop
`-test7 253:6    0  484M  0 crypt /home/test7
loop7     7:7    0  500M  0 loop
`-test8 253:7    0  484M  0 crypt /home/test8
sr0      11:0    1 1024M  0 rom
vda     252:0    0   27G  0 disk
|-vda1  252:1    0    1M  0 part
|-vda2  252:2    0  100M  0 part  /boot/efi
`-vda3  252:3    0 26.9G  0 part  /

default kernel (kernel-4.18.0-335.el8.x86_64):

Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  60.15   0.00  24.52     0.28  0.00   1.94    0.01    0.00    0.00  13.11
Average:    0  58.14   0.00  26.62     0.21  0.00   1.84    0.01    0.00    0.00  13.18
Average:    1  55.77   0.00  28.67     0.36  0.00   2.30    0.01    0.00    0.00  12.89
Average:    2  70.47   0.00  15.02     0.21  0.00   1.15    0.01    0.00    0.00  13.14
Average:    3  56.20   0.00  27.75     0.33  0.00   2.47    0.01    0.00    0.00  13.23

test kernel (kernel-4.18.0-334.el8.mr1218_210823_2059.x86_64):

Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  66.32   0.00  26.40     0.86  0.00   2.35    0.01    0.00    0.00   4.06
Average:    0  67.51   0.00  26.48     0.73  0.00   1.29    0.00    0.00    0.00   3.99
Average:    1  67.69   0.00  26.05     0.81  0.00   1.42    0.01    0.00    0.00   4.03
Average:    2  67.25   0.00  26.18     0.59  0.00   1.90    0.01    0.00    0.00   4.08
Average:    3  62.85   0.00  26.91     1.31  0.00   4.78    0.02    0.00    0.00   4.14
Why are you constraining the number of cpus used? The performance issue is a function of cpu count due to the spinlock contention.

Also, your test device may not be fast enough to expose the spinlock contention as a bottleneck (tmpfs files are used in comment#0, which are backed by ram).
Hi,

mkdir /mnt/ramdisk
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
dd if=/dev/zero of=/mnt/ramdisk/file bs=1M count=500
echo 'redhat redhat'|cryptsetup -q luksFormat /mnt/ramdisk/file
echo 'redhat redhat'|cryptsetup -q open /mnt/ramdisk/file test1
mkfs.ext4 /dev/mapper/test1
mkdir /home/test1
mount /dev/mapper/test1 /home/test1

lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
loop0     7:0    0  500M  0 loop
`-test1 253:3    0  484M  0 crypt /home/test1

test kernel 4.18.0-334.el8.mr1218_210823_2059.x86_64:

cd /home/test1
stress-ng --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8

sar averages, 3 runs:
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  13.01   0.00  29.84     3.69  0.52   1.41    0.00    0.00    0.00  51.53
Average:  all  13.16   0.00  31.47     3.54  0.50   1.33    0.00    0.00    0.00  50.00
Average:  all  13.01   0.00  31.95     3.51  0.51   1.35    0.00    0.00    0.00  49.67

fio --filename=fio_test_file --size=400M --direct=1 --rw=write --bs=4k --ioengine=libaio --iodepth=64 --runtime=120 --time_based --group_reporting --name=throughput-test-job --eta-newline=1

fio results, 3 runs:
write: IOPS=185k, BW=724MiB/s (759MB/s)(84.8GiB/120001msec); 0 zone resets
write: IOPS=183k, BW=716MiB/s (750MB/s)(83.9GiB/120001msec); 0 zone resets
write: IOPS=184k, BW=719MiB/s (754MB/s)(84.2GiB/120001msec); 0 zone resets

default kernel 4.18.0-335.el8.x86_64:

stress-ng --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8

sar averages, 3 runs:
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  13.95   0.00  35.25     3.25  0.64   1.65    0.00    0.00    0.00  45.26
Average:  all  12.94   0.00  34.75     2.69  0.54   1.23    0.00    0.00    0.00  47.85
Average:  all  13.13   0.00  34.47     2.94  0.56   1.26    0.00    0.00    0.00  47.65

fio --filename=fio_test_file --size=400M --direct=1 --rw=write --bs=4k --ioengine=libaio --iodepth=64 --runtime=120 --time_based --group_reporting --name=throughput-test-job --eta-newline=1

fio results, 3 runs:
write: IOPS=177k, BW=691MiB/s (725MB/s)(81.0GiB/120001msec); 0 zone resets
write: IOPS=173k, BW=677MiB/s (710MB/s)(79.3GiB/120001msec); 0 zone resets
write: IOPS=174k, BW=679MiB/s (712MB/s)(79.6GiB/120001msec); 0 zone resets

Please help check whether these results are as expected.
(In reply to guazhang from comment #10)
> Please help check whether these results are as expected.

Yes, while the workload is still somewhat limited, it does start to show the benefits of the fix with this revised testing. Thanks.
Hi,

Please update the "Fixed In Version" field.
Hi,

I did a code review on kernel-4.18.0-339.el8.x86_64.rpm. After checking, all of the modified code can be found in the source, and no new regressions were found either. Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4356