Bug 1996854

Summary: dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
Product: Red Hat Enterprise Linux 8
Reporter: Mike Snitzer <msnitzer>
Component: kernel
Assignee: Mike Snitzer <msnitzer>
kernel sub component: Crypt
QA Contact: guazhang <guazhang>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: agk, brdeoliv, guazhang, msnitzer, okozina
Version: 8.5
Keywords: Triaged
Target Milestone: rc
Target Release: 8.5
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kernel-4.18.0-339.el8
Last Closed: 2021-11-09 19:27:01 UTC
Type: Bug
Target Upstream Version: 5.15
Deadline: 2021-08-30

Description Mike Snitzer 2021-08-23 20:05:36 UTC
Description of problem:

This dm-crypt performance issue, especially on large NUMA systems, impacts all releases of RHEL8 because it predates RHEL8.

Here is the upstream commit that details the issue and associated fix:

commit 528b16bfc3ae5f11638e71b3b63a81f9999df727
Author: Arne Welzel <arne.welzel>
Date:   Sat Aug 14 00:40:38 2021 +0200

    dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()

    On systems with many cores using dm-crypt, heavy spinlock contention in
    percpu_counter_compare() can be observed when the page allocation limit
    for a given device is reached or close to being reached. This is due
    to percpu_counter_compare() taking a spinlock to compute an exact
    result on potentially many CPUs at the same time.

    Switch to non-exact comparison of allocated and allowed pages by using
    the value returned by percpu_counter_read_positive() to avoid taking
    the percpu_counter spinlock.

    This may over/under estimate the actual number of allocated pages by at
    most (batch-1) * num_online_cpus().

    Currently, batch is bounded by 32. The system on which this issue was
    first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
    change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
    allocations, this seems an acceptable error. Certainly preferred over
    running into the spinlock contention.

    This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
    and 192GB RAM as follows, but can be provoked on systems with fewer CPUs
    as well.

     * Disable swap
     * Tune vm settings to promote regular writeback
         $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
         $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
         $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes

     * Create 8 dmcrypt devices based on files on a tmpfs
     * Create and mount an ext4 filesystem on each crypt device
     * Run stress-ng --hdd 8 within one of above filesystems

    Total %system usage collected from sysstat goes to ~35%. Write throughput
    on the underlying loop device is ~2GB/s. perf profiling an individual
    kworker kcryptd thread shows the following profile, indicating spinlock
    contention in percpu_counter_compare():

        99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
          |
          --ret_from_fork
            kthread
            worker_thread
            |
             --99.92%--process_one_work
                |
                |--80.52%--kcryptd_crypt
                |    |
                |    |--62.58%--mempool_alloc
                |    |  |
                |    |   --62.24%--crypt_page_alloc
                |    |     |
                |    |      --61.51%--__percpu_counter_compare
                |    |        |
                |    |         --61.34%--__percpu_counter_sum
                |    |           |
                |    |           |--58.68%--_raw_spin_lock_irqsave
                |    |           |  |
                |    |           |   --58.30%--native_queued_spin_lock_slowpath
                |    |           |
                |    |            --0.69%--cpumask_next
                |    |                |
                |    |                 --0.51%--_find_next_bit
                |    |
                |    |--10.61%--crypt_convert
                |    |          |
                |    |          |--6.05%--xts_crypt
                ...

    After applying this patch and running the same test, %system usage is
    lowered to ~7% and write throughput on the loop device increases
    to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
    in the profile and not hitting the percpu_counter() spinlock anymore.

        |--8.15%--mempool_alloc
        |    |
        |    |--3.93%--crypt_page_alloc
        |    |    |
        |    |     --3.75%--__alloc_pages
        |    |         |
        |    |          --3.62%--get_page_from_freelist
        |    |              |
        |    |               --3.22%--rmqueue_bulk
        |    |                   |
        |    |                    --2.59%--_raw_spin_lock
        |    |                      |
        |    |                       --2.57%--native_queued_spin_lock_slowpath
        |    |
        |     --3.05%--_raw_spin_lock_irqsave
        |               |
        |                --2.49%--native_queued_spin_lock_slowpath

    Suggested-by: DJ Gregor <dj>
    Reviewed-by: Mikulas Patocka <mpatocka>
    Signed-off-by: Arne Welzel <arne.welzel>
    Fixes: 5059353df86e ("dm crypt: limit the number of allocated pages")
    Cc: stable.org
    Signed-off-by: Mike Snitzer <snitzer>
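
A rough shell sketch of the reproduction recipe from the commit message above (the device count matches the commit; the loop file sizes, tmpfs size, mount points, and passphrase are illustrative, not taken from the original report):

# Promote regular writeback and disable swap, as in the commit message
echo 50 > /proc/sys/vm/dirty_expire_centisecs
echo 25 > /proc/sys/vm/dirty_writeback_centisecs
echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
swapoff -a

# Back 8 dm-crypt devices with files on a tmpfs
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=10g tmpfs /mnt/ramdisk
for i in $(seq 1 8); do
    truncate -s 1G /mnt/ramdisk/file$i
    echo 'passphrase' | cryptsetup -q luksFormat /mnt/ramdisk/file$i
    echo 'passphrase' | cryptsetup -q open /mnt/ramdisk/file$i crypt$i
    mkfs.ext4 -q /dev/mapper/crypt$i
    mkdir -p /mnt/crypt$i
    mount /dev/mapper/crypt$i /mnt/crypt$i
done

# Generate heavy buffered write load inside one of the filesystems
cd /mnt/crypt1 && stress-ng --hdd 8 --timeout 120s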

Comment 4 guazhang@redhat.com 2021-08-24 12:35:37 UTC
Hi,

Could you provide more detailed test steps to reproduce the bug?

Comment 5 Mike Snitzer 2021-08-24 16:08:19 UTC
(In reply to guazhang from comment #4)
> Hi,
> 
> Could you provide more detailed test steps to reproduce the bug?

This is a scalability performance bug. The more IO you issue to a dm-crypt device, particularly on a very large SMP/NUMA machine, the more spinlock contention will hurt performance.

So to test: throw as much IO at a fast dm-crypt device as you can on a large system with a lot of CPUs. You'd do well to layer the dm-crypt device on fast storage that handles seeks easily (e.g. SSD, NVMe, ramdisk, etc.). Then just increase the thread count on your synthetic IO load generator (e.g. fio).

The kernel with the fix should consume less CPU time and yield better IO performance.
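
For example, an fio invocation along these lines could be used to scale the load with CPU count (the device path, block size, and runtime are illustrative, not a prescribed test):

fio --filename=/dev/mapper/crypt1 --direct=1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=64 --numjobs=$(nproc) --group_reporting \
    --time_based --runtime=120 --name=dm-crypt-scaling

While it runs, %sys can be compared between kernels with sar or "mpstat -P ALL 5" from sysstat.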

Comment 7 guazhang@redhat.com 2021-08-25 13:52:17 UTC
Hi,


I tested stress-ng in a VM and found that the test kernel's %sys utilization was higher than the default kernel's.

stress-ng --cpu 4 --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8


[root@test1 test8]# stress-ng --cpu 4 --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8
stress-ng: info:  [250380] dispatching hogs: 4 cpu, 32 io, 4 vm, 4 fork, 8 hdd
stress-ng: info:  [250380] successful run completed in 120.39s (2 mins, 0.39 secs)
[root@test1 test8]# pwd
/home/test8
[root@test1 test8]# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
loop0     7:0    0  500M  0 loop  
`-test1 253:0    0  484M  0 crypt /home/test1
loop1     7:1    0  500M  0 loop  
`-test2 253:1    0  484M  0 crypt /home/test2
loop2     7:2    0  500M  0 loop  
`-test3 253:2    0  484M  0 crypt /home/test3
loop3     7:3    0  500M  0 loop  
`-test4 253:3    0  484M  0 crypt /home/test4
loop4     7:4    0  500M  0 loop  
`-test5 253:4    0  484M  0 crypt /home/test5
loop5     7:5    0  500M  0 loop  
`-test6 253:5    0  484M  0 crypt /home/test6
loop6     7:6    0  500M  0 loop  
`-test7 253:6    0  484M  0 crypt /home/test7
loop7     7:7    0  500M  0 loop  
`-test8 253:7    0  484M  0 crypt /home/test8
sr0      11:0    1 1024M  0 rom   
vda     252:0    0   27G  0 disk  
|-vda1  252:1    0    1M  0 part  
|-vda2  252:2    0  100M  0 part  /boot/efi
`-vda3  252:3    0 26.9G  0 part  /



Default kernel: kernel-4.18.0-335.el8.x86_64

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   60.15    0.00   24.52    0.28    0.00    1.94    0.01    0.00    0.00   13.11
Average:       0   58.14    0.00   26.62    0.21    0.00    1.84    0.01    0.00    0.00   13.18
Average:       1   55.77    0.00   28.67    0.36    0.00    2.30    0.01    0.00    0.00   12.89
Average:       2   70.47    0.00   15.02    0.21    0.00    1.15    0.01    0.00    0.00   13.14
Average:       3   56.20    0.00   27.75    0.33    0.00    2.47    0.01    0.00    0.00   13.23


Test kernel (with fix): kernel-4.18.0-334.el8.mr1218_210823_2059.x86_64

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   66.32    0.00   26.40    0.86    0.00    2.35    0.01    0.00    0.00    4.06
Average:       0   67.51    0.00   26.48    0.73    0.00    1.29    0.00    0.00    0.00    3.99
Average:       1   67.69    0.00   26.05    0.81    0.00    1.42    0.01    0.00    0.00    4.03
Average:       2   67.25    0.00   26.18    0.59    0.00    1.90    0.01    0.00    0.00    4.08
Average:       3   62.85    0.00   26.91    1.31    0.00    4.78    0.02    0.00    0.00    4.14

Comment 8 Mike Snitzer 2021-08-25 14:05:38 UTC
Why are you constraining the number of CPUs used? The performance issue is a function of CPU count due to spinlock contention.
Also, your test device may not be fast enough to expose the spinlock contention as a bottleneck (comment#0 uses tmpfs files, which are backed by RAM).
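
One way to confirm whether the percpu_counter spinlock is the bottleneck on a given machine is a system-wide profile captured while the workload runs, roughly like this (the duration is illustrative):

# While stress-ng/fio is running against the dm-crypt device:
perf record -a -g -- sleep 30
perf report --stdio --no-children | grep -A3 percpu_counter

With the fix in place, __percpu_counter_sum and native_queued_spin_lock_slowpath should largely disappear from the crypt_page_alloc() call chain, as in the profiles quoted in comment#0.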

Comment 10 guazhang@redhat.com 2021-08-26 03:43:20 UTC
Hi


mkdir /mnt/ramdisk
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
dd if=/dev/zero of=/mnt/ramdisk/file bs=1M count=500
echo 'redhat redhat'|cryptsetup -q luksFormat  /mnt/ramdisk/file
echo 'redhat redhat'|cryptsetup -q open   /mnt/ramdisk/file  test1
mkfs.ext4 /dev/mapper/test1
mkdir /home/test1
mount /dev/mapper/test1 /home/test1

lsblk
NAME                        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0                         7:0    0   500M  0 loop  
`-test1                     253:3    0   484M  0 crypt /home/test1


Test kernel (with fix): 4.18.0-334.el8.mr1218_210823_2059.x86_64

cd /home/test1
stress-ng  --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   13.01    0.00   29.84    3.69    0.52    1.41    0.00    0.00    0.00   51.53

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   13.16    0.00   31.47    3.54    0.50    1.33    0.00    0.00    0.00   50.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   13.01    0.00   31.95    3.51    0.51    1.35    0.00    0.00    0.00   49.67



fio --filename=fio_test_file --size=400M --direct=1 --rw=write --bs=4k --ioengine=libaio --iodepth=64 --runtime=120  --time_based --group_reporting --name=throughput-test-job --eta-newline=1

fio test results (3 runs):
write: IOPS=185k, BW=724MiB/s (759MB/s)(84.8GiB/120001msec); 0 zone resets
write: IOPS=183k, BW=716MiB/s (750MB/s)(83.9GiB/120001msec); 0 zone resets
write: IOPS=184k, BW=719MiB/s (754MB/s)(84.2GiB/120001msec); 0 zone resets




Default kernel: 4.18.0-335.el8.x86_64
stress-ng --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   13.95    0.00   35.25    3.25    0.64    1.65    0.00    0.00    0.00   45.26

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   12.94    0.00   34.75    2.69    0.54    1.23    0.00    0.00    0.00   47.85

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   13.13    0.00   34.47    2.94    0.56    1.26    0.00    0.00    0.00   47.65

fio --filename=fio_test_file --size=400M --direct=1 --rw=write --bs=4k --ioengine=libaio --iodepth=64 --runtime=120  --time_based --group_reporting --name=throughput-test-job --eta-newline=1

write: IOPS=177k, BW=691MiB/s (725MB/s)(81.0GiB/120001msec); 0 zone resets
write: IOPS=173k, BW=677MiB/s (710MB/s)(79.3GiB/120001msec); 0 zone resets
write: IOPS=174k, BW=679MiB/s (712MB/s)(79.6GiB/120001msec); 0 zone resets


Could you check whether these results are as expected?

Comment 11 Mike Snitzer 2021-08-26 16:34:26 UTC
(In reply to guazhang from comment #10)
> 
> please help to check the result if expected ?

Yes, while the workload is still somewhat limited, it does start to show the benefits of the fix with this revised testing.
Thanks.

Comment 16 guazhang@redhat.com 2021-08-29 10:59:46 UTC
Hi,

Please update the Fixed In Version field.

Comment 17 guazhang@redhat.com 2021-08-31 01:11:42 UTC
Hi,


I did a code review of kernel-4.18.0-339.el8.x86_64.rpm. All of the modifications can be found in the source, and no new regressions were found either.

Moving to VERIFIED.

Comment 22 errata-xmlrpc 2021-11-09 19:27:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4356