Description of problem:

This dm-crypt performance issue, especially on large NUMA systems, impacts all releases of RHEL8 because it predates RHEL8. Here is the upstream commit that details the issue and associated fix:

commit 528b16bfc3ae5f11638e71b3b63a81f9999df727
Author: Arne Welzel <arne.welzel>
Date:   Sat Aug 14 00:40:38 2021 +0200

    dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()

    On systems with many cores using dm-crypt, heavy spinlock contention in
    percpu_counter_compare() can be observed when the page allocation limit
    for a given device is reached or close to be reached. This is due to
    percpu_counter_compare() taking a spinlock to compute an exact result on
    potentially many CPUs at the same time.

    Switch to non-exact comparison of allocated and allowed pages by using
    the value returned by percpu_counter_read_positive() to avoid taking the
    percpu_counter spinlock.

    This may over/under estimate the actual number of allocated pages by at
    most (batch-1) * num_online_cpus(). Currently, batch is bounded by 32.
    The system on which this issue was first observed has 256 CPUs and 512GB
    of RAM. With a 4k page size, this change may over/under estimate by 31MB.
    With ~10G (2%) allowed dm-crypt allocations, this seems an acceptable
    error. Certainly preferred over running into the spinlock contention.

    This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
    and 192GB RAM as follows, but can be provoked on systems with less CPUs
    as well.

     * Disable swap
     * Tune vm settings to promote regular writeback
         $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
         $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
         $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
     * Create 8 dmcrypt devices based on files on a tmpfs
     * Create and mount an ext4 filesystem on each crypt devices
     * Run stress-ng --hdd 8 within one of above filesystems

    Total %system usage collected from sysstat goes to ~35%. Write throughput
    on the underlying loop device is ~2GB/s. perf profiling an individual
    kworker kcryptd thread shows the following profile, indicating spinlock
    contention in percpu_counter_compare():

        99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
          |
          --ret_from_fork
            kthread
            worker_thread
            |
            --99.92%--process_one_work
               |
               |--80.52%--kcryptd_crypt
               |    |
               |    |--62.58%--mempool_alloc
               |    |    |
               |    |     --62.24%--crypt_page_alloc
               |    |         |
               |    |          --61.51%--__percpu_counter_compare
               |    |              |
               |    |               --61.34%--__percpu_counter_sum
               |    |                   |
               |    |                   |--58.68%--_raw_spin_lock_irqsave
               |    |                   |    |
               |    |                   |     --58.30%--native_queued_spin_lock_slowpath
               |    |                   |
               |    |                    --0.69%--cpumask_next
               |    |                        |
               |    |                         --0.51%--_find_next_bit
               |    |
               |    |--10.61%--crypt_convert
               |    |    |
               |    |    |--6.05%--xts_crypt
               ...

    After applying this patch and running the same test, %system usage is
    lowered to ~7% and write throughput on the loop device increases to
    ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62% in
    the profile and not hitting the percpu_counter() spinlock anymore.

        |--8.15%--mempool_alloc
        |    |
        |    |--3.93%--crypt_page_alloc
        |    |    |
        |    |     --3.75%--__alloc_pages
        |    |         |
        |    |          --3.62%--get_page_from_freelist
        |    |              |
        |    |               --3.22%--rmqueue_bulk
        |    |                   |
        |    |                    --2.59%--_raw_spin_lock
        |    |                        |
        |    |                         --2.57%--native_queued_spin_lock_slowpath
        |    |
        |     --3.05%--_raw_spin_lock_irqsave
        |         |
        |          --2.49%--native_queued_spin_lock_slowpath

    Suggested-by: DJ Gregor <dj>
    Reviewed-by: Mikulas Patocka <mpatocka>
    Signed-off-by: Arne Welzel <arne.welzel>
    Fixes: 5059353df86e ("dm crypt: limit the number of allocated pages")
    Cc: stable.org
    Signed-off-by: Mike Snitzer <snitzer>
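For reference, below is a minimal sketch of what the change amounts to in crypt_page_alloc() (drivers/md/dm-crypt.c). This is a simplified illustration based on the commit message above, not the verbatim upstream patch or the RHEL8 backport; the names cc->n_allocated_pages and dm_crypt_pages_per_client follow the upstream dm-crypt source, and additional gfp-flag checks present in the real function are elided.

    /*
     * Simplified sketch of crypt_page_alloc(), the mempool page allocator
     * callback for a dm-crypt device. Extra conditions from the real
     * function (e.g. gfp-flag checks) are elided.
     */
    static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
    {
            struct crypt_config *cc = pool_data;
            struct page *page;

            /*
             * Before the fix the limit check was roughly:
             *
             *   if (percpu_counter_compare(&cc->n_allocated_pages,
             *                              dm_crypt_pages_per_client) >= 0)
             *           return NULL;
             *
             * percpu_counter_compare() falls back to an exact sum under the
             * counter's spinlock whenever the rough count is close to the
             * limit, so many kcryptd workers serialize on that lock.
             *
             * After the fix, percpu_counter_read_positive() returns a
             * lock-free approximate value, off by at most
             * (batch - 1) * num_online_cpus() pages.
             */
            if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >=
                         dm_crypt_pages_per_client))
                    return NULL;

            page = alloc_page(gfp_mask);
            if (likely(page != NULL))
                    percpu_counter_add(&cc->n_allocated_pages, 1);

            return page;
    }

Since the counter only enforces a soft per-device memory cap, trading exactness for a lock-free read is what removes the native_queued_spin_lock_slowpath time seen in the first profile above.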
Hi,

Could you provide more detailed test steps to reproduce the bug?
(In reply to guazhang from comment #4)
> Hi,
>
> Could you provide more detailed test steps to reproduce the bug?

This is a scalability/performance bug. The more IO you issue to a dm-crypt device, particularly on a very large SMP/NUMA machine, the more the spinlock contention will hurt performance.

So to test: throw as much IO at a fast dm-crypt device as you can on a large system with a lot of cpus. You'd do well to have the dm-crypt device layered on fast storage that handles seeks easily (e.g. ssd, nvme, ramdisk, etc). Then just increase the thread count on your synthetic IO load generator (e.g. fio). The kernel with the fix should consume less CPU time and yield better IO performance.
Hi,

I tested stress-ng in a VM and found that the test kernel's %sys utilization is higher than the default kernel's.

[root@test1 test8]# stress-ng --cpu 4 --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8
stress-ng: info:  [250380] dispatching hogs: 4 cpu, 32 io, 4 vm, 4 fork, 8 hdd
stress-ng: info:  [250380] successful run completed in 120.39s (2 mins, 0.39 secs)

[root@test1 test8]# pwd
/home/test8

[root@test1 test8]# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
loop0     7:0    0  500M  0 loop
`-test1 253:0    0  484M  0 crypt /home/test1
loop1     7:1    0  500M  0 loop
`-test2 253:1    0  484M  0 crypt /home/test2
loop2     7:2    0  500M  0 loop
`-test3 253:2    0  484M  0 crypt /home/test3
loop3     7:3    0  500M  0 loop
`-test4 253:3    0  484M  0 crypt /home/test4
loop4     7:4    0  500M  0 loop
`-test5 253:4    0  484M  0 crypt /home/test5
loop5     7:5    0  500M  0 loop
`-test6 253:5    0  484M  0 crypt /home/test6
loop6     7:6    0  500M  0 loop
`-test7 253:6    0  484M  0 crypt /home/test7
loop7     7:7    0  500M  0 loop
`-test8 253:7    0  484M  0 crypt /home/test8
sr0      11:0    1 1024M  0 rom
vda     252:0    0   27G  0 disk
|-vda1  252:1    0    1M  0 part
|-vda2  252:2    0  100M  0 part  /boot/efi
`-vda3  252:3    0 26.9G  0 part  /

default kernel (kernel-4.18.0-335.el8.x86_64):

Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  60.15   0.00  24.52     0.28  0.00   1.94    0.01    0.00    0.00  13.11
Average:    0  58.14   0.00  26.62     0.21  0.00   1.84    0.01    0.00    0.00  13.18
Average:    1  55.77   0.00  28.67     0.36  0.00   2.30    0.01    0.00    0.00  12.89
Average:    2  70.47   0.00  15.02     0.21  0.00   1.15    0.01    0.00    0.00  13.14
Average:    3  56.20   0.00  27.75     0.33  0.00   2.47    0.01    0.00    0.00  13.23

test kernel (kernel-4.18.0-334.el8.mr1218_210823_2059.x86_64):

Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  66.32   0.00  26.40     0.86  0.00   2.35    0.01    0.00    0.00   4.06
Average:    0  67.51   0.00  26.48     0.73  0.00   1.29    0.00    0.00    0.00   3.99
Average:    1  67.69   0.00  26.05     0.81  0.00   1.42    0.01    0.00    0.00   4.03
Average:    2  67.25   0.00  26.18     0.59  0.00   1.90    0.01    0.00    0.00   4.08
Average:    3  62.85   0.00  26.91     1.31  0.00   4.78    0.02    0.00    0.00   4.14
Why are you constraining the number of cpus used? The performance issue is a function of cpu count due to the spinlock contention.

Also, your test device may not be fast enough to expose the spinlock contention as a bottleneck (tmpfs files are used in comment#0, which are backed by ram).
Hi,

mkdir /mnt/ramdisk
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
dd if=/dev/zero of=/mnt/ramdisk/file bs=1M count=500
echo 'redhat redhat'|cryptsetup -q luksFormat /mnt/ramdisk/file
echo 'redhat redhat'|cryptsetup -q open /mnt/ramdisk/file test1
mkfs.ext4 /dev/mapper/test1
mkdir /home/test1
mount /dev/mapper/test1 /home/test1

lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
loop0     7:0    0  500M  0 loop
`-test1 253:3    0  484M  0 crypt /home/test1

test kernel 4.18.0-334.el8.mr1218_210823_2059.x86_64:

cd /home/test1
stress-ng --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8

sar averages, 3 runs:
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  13.01   0.00  29.84     3.69  0.52   1.41    0.00    0.00    0.00  51.53
Average:  all  13.16   0.00  31.47     3.54  0.50   1.33    0.00    0.00    0.00  50.00
Average:  all  13.01   0.00  31.95     3.51  0.51   1.35    0.00    0.00    0.00  49.67

fio --filename=fio_test_file --size=400M --direct=1 --rw=write --bs=4k --ioengine=libaio --iodepth=64 --runtime=120 --time_based --group_reporting --name=throughput-test-job --eta-newline=1

fio results, 3 runs:
write: IOPS=185k, BW=724MiB/s (759MB/s)(84.8GiB/120001msec); 0 zone resets
write: IOPS=183k, BW=716MiB/s (750MB/s)(83.9GiB/120001msec); 0 zone resets
write: IOPS=184k, BW=719MiB/s (754MB/s)(84.2GiB/120001msec); 0 zone resets

default kernel 4.18.0-335.el8.x86_64:

stress-ng --io 32 --vm 4 --vm-bytes 128M --fork 4 --timeout 120s --hdd 8

sar averages, 3 runs:
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  13.95   0.00  35.25     3.25  0.64   1.65    0.00    0.00    0.00  45.26
Average:  all  12.94   0.00  34.75     2.69  0.54   1.23    0.00    0.00    0.00  47.85
Average:  all  13.13   0.00  34.47     2.94  0.56   1.26    0.00    0.00    0.00  47.65

fio --filename=fio_test_file --size=400M --direct=1 --rw=write --bs=4k --ioengine=libaio --iodepth=64 --runtime=120 --time_based --group_reporting --name=throughput-test-job --eta-newline=1

fio results, 3 runs:
write: IOPS=177k, BW=691MiB/s (725MB/s)(81.0GiB/120001msec); 0 zone resets
write: IOPS=173k, BW=677MiB/s (710MB/s)(79.3GiB/120001msec); 0 zone resets
write: IOPS=174k, BW=679MiB/s (712MB/s)(79.6GiB/120001msec); 0 zone resets

Please help check whether these results are as expected.
(In reply to guazhang from comment #10)
> Please help check whether these results are as expected.

Yes, while the workload is still somewhat limited, it does start to show the benefits of the fix with this revised testing. Thanks.
Hi,

Please update the "Fixed In Version" field.
Hi,

I did a code review on kernel-4.18.0-339.el8.x86_64.rpm. After checking, all of the modified code can be found in the source, and no new regressions were found either. Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4356