Bug 1666336 - severe performance impact using encrypted Cinder volume (QEMU luks)
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Daniel Berrangé
QA Contact: Tingting Mao
 
Reported: 2019-01-15 14:51 UTC by Ganesh Kadam
Modified: 2019-09-10 07:00 UTC (History)

Fixed In Version: qemu-kvm-rhev-2.12.0-28.el7
Last Closed: 2019-08-22 09:19:58 UTC


Attachments
iostat log (8.56 KB, application/gzip), 2019-02-22 14:26 UTC, Yihuang Yu


Links
Red Hat Product Errata RHSA-2019:2553 (last updated 2019-08-22 09:21:28 UTC)

Description Ganesh Kadam 2019-01-15 14:51:07 UTC
Description of problem:

One of our customers has deployed a pre-production RHOSP13 system and is testing encrypted Cinder volumes, following the instructions at:
~~~
 https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/manage_secrets_with_openstack_key_manager/index#encrypting_cinder_volumes
~~~

The volume type was created with:
~~~
openstack volume type create --encryption-provider nova.volume.encryptors.luks.LuksEncryptor --encryption-cipher aes-xts-plain64 --encryption-key-size 256 --encryption-control-location front-end LuksEncryptor-Template-256
~~~

The encrypted volume was created with:
~~~
openstack volume create --size 100 --type LuksEncryptor-Template-256 'enc100'
~~~

The functionality is fine, but the I/O performance using an encrypted volume is very poor compared to a plain volume using the same Ceph/RBD back-end.

- writing with "dd if=/dev/zero": plain volume 1.2GB/s; encrypted volume 91MB/s
- bonnie++ block-writes: plain volume 1.3GB/s; encrypted volume 81MB/s
- bonnie++ block-reads: plain volume 390MB/s; encrypted volume 83MB/s
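
To put the figures above in proportion, the slowdown factor for each test can be computed directly. The following is a minimal Python sketch: the table only restates the reported numbers (treating 1.2GB/s and 1.3GB/s as 1200 and 1300 MB/s, which is an assumption about the units), and `slowdown` is a helper name introduced here for illustration.

```python
# Throughput in MB/s, restated from the report above.
# Assumption: 1.2GB/s and 1.3GB/s are treated as 1200 and 1300 MB/s.
REPORTED = {
    "dd write": (1200.0, 91.0),
    "bonnie++ block-writes": (1300.0, 81.0),
    "bonnie++ block-reads": (390.0, 83.0),
}


def slowdown(plain_mb_s: float, encrypted_mb_s: float) -> float:
    """Return how many times slower the encrypted volume is than the plain one."""
    return plain_mb_s / encrypted_mb_s


if __name__ == "__main__":
    for test, (plain, encrypted) in REPORTED.items():
        print(f"{test}: {slowdown(plain, encrypted):.1f}x slower when encrypted")
```

On these numbers the write tests come out roughly 13x to 16x slower and the read test roughly 4.7x slower.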

The customer (Cu) noticed that while writing to the plain volume, "atop" in the test VM consistently showed busy<100% and avio<1ms, e.g.
~~~
DSK |           vdc  |  busy     85%  |  read     104  |  write  12618 |  MBr/s   0.04  |  MBw/s 1255.28  |  avio 0.63 ms
~~~
whereas, when writing to the encrypted volume, those figures were highly variable (busy ranged from 0% to nearly 400%; avio went over 22ms).




Version-Release number of selected component (if applicable):

[root@eta-cpu0000 ~]# rpm -qa | grep qemu
qemu-kvm-common-rhev-2.12.0-18.el7_6.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.2.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64
qemu-guest-agent-2.12.0-2.el7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-rhev-2.12.0-18.el7_6.1.x86_64

[root@eta-cpu0000 ~]# rpm -qa | grep cinder
puppet-cinder-12.4.1-0.20180628102252.el7ost.noarch
openstack-cinder-12.0.4-2.el7ost.noarch
python2-cinderclient-3.5.0-1.el7ost.noarch
python-cinder-12.0.4-2.el7ost.noarch


Actual results:

Performance Impact due to cinder-volume encryption

Expected results:

No performance impact due to cinder-volume encryption

Additional info:

Cu searched the RH Bugzilla and found https://bugzilla.redhat.com/show_bug.cgi?id=1500334, but the hypervisor has qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64, which should already incorporate the buffer size patch, so they do not believe that bug is the cause.

Cu also found https://bugzilla.redhat.com/show_bug.cgi?id=1434221 and verified that the hypervisor has tuned set to "throughput-performance" which rules out that cause.

Comment 6 Daniel Berrangé 2019-01-18 17:40:21 UTC
Late last year I did some significant performance optimization of the XTS cipher mode impl in QEMU which approx doubles the performance of XTS: https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg04336.html 

For unrelated reasons we also switched QEMU back to use gcrypt instead of nettle for crypto algorithms in qemu-kvm-rhev-2.12.0-2.el7 (https://bugzilla.redhat.com/show_bug.cgi?id=1549543). This is actually good for performance since gcrypt's optimization for AES is about 30% faster than nettle http://lists.lysator.liu.se/pipermail/nettle-bugs/2017/003294.html

This certainly wouldn't alleviate all of the perf penalty reported in this bug, but it is a major improvement over what's in qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64.

The comparative kernel LUKS performance shows there's likely still more we can gain in QEMU with further investigation & dev work.

Comment 11 CongLi 2019-01-22 11:05:41 UTC
Hi Ganesh,

QE is trying to reproduce this issue in our test environment.
Is it possible for you to provide the volume.log of openstack-cinder-volume?

Thanks.

Comment 14 CongLi 2019-01-28 08:45:14 UTC
(In reply to CongLi from comment #13)

Sorry, please ignore the comment 13 since I pasted the wrong data.

Updated data:

1. dd: could not see a big performance degradation
# dd if=/dev/zero of=raw.zero bs=4k count=1M
raw: 193 MB/s
luks: 164 MB/s

2. fio: 
With fio there is no big performance difference between luks and raw, except for sequential write, where raw is about twice as fast as luks.

2.1 sequential read: 
# fio --rw=read --bs=4k --iodepth=1 --runtime=1m --direct=1  --name=job1 --ioengine=libaio --thread --group_reporting  --time_based  --filename=raw --size=4g
raw:
READ: bw=1648KiB/s (1687kB/s), 1648KiB/s-1648KiB/s (1687kB/s-1687kB/s), io=96.5MiB (101MB), run=60002-60002msec
luks:
READ: bw=1620KiB/s (1659kB/s), 1620KiB/s-1620KiB/s (1659kB/s-1659kB/s), io=94.9MiB (99.5MB), run=60002-60002msec

2.2 sequential write:
# fio --rw=write --bs=4k --iodepth=1 --runtime=1m --direct=1  --name=job1 --ioengine=libaio --thread --group_reporting  --time_based  --filename=raw --size=4g
raw:
WRITE: bw=38.2MiB/s (40.0MB/s), 38.2MiB/s-38.2MiB/s (40.0MB/s-40.0MB/s), io=2290MiB (2401MB), run=60001-60001msec
luks:
WRITE: bw=18.9MiB/s (19.8MB/s), 18.9MiB/s-18.9MiB/s (19.8MB/s-19.8MB/s), io=1132MiB (1187MB), run=60001-60001msec


2.3 randrw
# fio --rw=randrw --bs=4k --iodepth=1 --runtime=1m --direct=1  --name=job1 --ioengine=libaio --thread --group_reporting  --time_based  --filename=raw --size=4g
raw:
READ: bw=584KiB/s (598kB/s), 584KiB/s-584KiB/s (598kB/s-598kB/s), io=34.2MiB (35.9MB), run=60002-60002msec
WRITE: bw=583KiB/s (597kB/s), 583KiB/s-583KiB/s (597kB/s-597kB/s), io=34.2MiB (35.8MB), run=60002-60002msec
luks:
READ: bw=552KiB/s (566kB/s), 552KiB/s-552KiB/s (566kB/s-566kB/s), io=32.4MiB (33.9MB), run=60001-60001msec
WRITE: bw=554KiB/s (568kB/s), 554KiB/s-554KiB/s (568kB/s-568kB/s), io=32.5MiB (34.1MB), run=60001-60001msec



Thanks.

Comment 15 CongLi 2019-01-28 08:49:25 UTC
(In reply to CongLi from comment #14)
> (In reply to CongLi from comment #13)

Tested on: qemu-kvm-rhev-2.12.0-18.el7_6.3.x86_64.
ceph volume:

raw CML:
    -drive format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap,file.driver=rbd,file.pool=rbd,file.server.0.host=10.66.144.31,file.server.0.port=6789,file.image=coli.raw,id=drive-virtio-disk0 \
    -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on \
luks CML:
    -object secret,id=sec0,data=redhat \
    -drive format=luks,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap,file.driver=rbd,file.pool=rbd,file.server.0.host=10.66.144.31,file.server.0.port=6789,file.image=coli.luks,key-secret=sec0 \
    -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on \

Comment 16 Daniel Berrangé 2019-01-28 10:07:47 UTC
Note to anyone testing performance - the drive 'cache' setting has a significant impact on the performance differential between raw & luks volumes.
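
For anyone comparing runs, it may help to spell out what each `cache=` shorthand stands for. The sketch below encodes the mapping that the QEMU documentation gives between `cache=` modes and the `cache.writeback`, `cache.direct` and `cache.no-flush` flags, and builds a `-drive` argument in the same style used elsewhere in this bug. The helper names and the `drive_image2` id are illustrative assumptions, not part of any QEMU API.

```python
# cache= shorthand -> underlying flags, as listed in the QEMU documentation.
CACHE_MODES = {
    "writeback":    {"writeback": True,  "direct": False, "no_flush": False},
    "none":         {"writeback": True,  "direct": True,  "no_flush": False},
    "writethrough": {"writeback": False, "direct": False, "no_flush": False},
    "directsync":   {"writeback": False, "direct": True,  "no_flush": False},
    "unsafe":       {"writeback": True,  "direct": False, "no_flush": True},
}


def uses_host_page_cache(mode: str) -> bool:
    """True when the mode does not request direct I/O (cache.direct=off)."""
    return not CACHE_MODES[mode]["direct"]


def drive_arg(fmt: str, filename: str, cache: str, key_secret: str = "") -> str:
    """Build a -drive argument for one benchmark run (hypothetical helper)."""
    if cache not in CACHE_MODES:
        raise ValueError(f"unknown cache mode: {cache}")
    parts = [f"format={fmt}", "if=none", "id=drive_image2",
             f"file={filename}", f"cache={cache}"]
    if key_secret:
        parts.append(f"key-secret={key_secret}")
    return "-drive " + ",".join(parts)


if __name__ == "__main__":
    for mode in ("writeback", "none"):
        print(mode, "-> host page cache:", uses_host_page_cache(mode))
        print(" ", drive_arg("luks", "data.luks", mode, key_secret="sec0"))
```

Running the same workload once per mode, with raw and luks images otherwise configured identically, keeps the cache setting as the only variable.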

Comment 17 Yanhui Ma 2019-01-29 03:25:22 UTC
@coli, for performance tests, we generally use cache=none to get more stable data.

Comment 18 CongLi 2019-01-29 03:28:27 UTC
Thanks Daniel and Yanhui.

My CML is based on the customer's from the log attached.

-object secret,id=virtio-disk0-secret0,data=BSA7MEQI1FNwr2lYL9jFwCglTGWPqfQeqSxbX83XeQw=,keyid=masterKey0,iv=N/PKy/se00ZOSyCp/uhGoQ==,format=base64 \
-drive 'file=rbd:eta-vms/acb9afb0-de47-4393-815a-b2e28a357da0_disk:id=openstack-eta:auth_supported=cephx\;none:mon_host=172.27.6.11\:6789\;172.27.6.14\:6789\;172.27.6.17\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on

I will have a try with cache=none.

Thanks.

Comment 19 CongLi 2019-01-29 08:08:42 UTC
cache=none

1. dd: could see a performance degradation of luks
# dd if=/dev/zero of=raw.zero bs=4k count=1M
raw: 282 MB/s
luks: 87.4 MB/s

2. fio: 
There is no big performance degradation between luks and raw via fio.

2.1 sequential read: 
# fio --rw=read --bs=4k --iodepth=1 --runtime=1m --direct=1  --name=job1 --ioengine=libaio --thread --group_reporting  --time_based  --filename=raw --size=4g
raw:
   READ: bw=1663KiB/s (1702kB/s), 1663KiB/s-1663KiB/s (1702kB/s-1702kB/s), io=97.4MiB (102MB), run=60002-60002msec
luks:
   READ: bw=1904KiB/s (1949kB/s), 1904KiB/s-1904KiB/s (1949kB/s-1949kB/s), io=112MiB (117MB), run=60001-60001msec

2.2 sequential write:
# fio --rw=write --bs=4k --iodepth=1 --runtime=1m --direct=1  --name=job1 --ioengine=libaio --thread --group_reporting  --time_based  --filename=raw --size=4g
raw:
  WRITE: bw=93.1KiB/s (95.3kB/s), 93.1KiB/s-93.1KiB/s (95.3kB/s-95.3kB/s), io=5588KiB (5722kB), run=60025-60025msec
luks:
  WRITE: bw=90.2KiB/s (92.4kB/s), 90.2KiB/s-90.2KiB/s (92.4kB/s-92.4kB/s), io=5416KiB (5546kB), run=60019-60019msec


2.3 randrw
# fio --rw=randrw --bs=4k --iodepth=1 --runtime=1m --direct=1  --name=job1 --ioengine=libaio --thread --group_reporting  --time_based  --filename=raw --size=4g
raw:
   READ: bw=75.4KiB/s (77.2kB/s), 75.4KiB/s-75.4KiB/s (77.2kB/s-77.2kB/s), io=4528KiB (4637kB), run=60027-60027msec
  WRITE: bw=78.7KiB/s (80.6kB/s), 78.7KiB/s-78.7KiB/s (80.6kB/s-80.6kB/s), io=4724KiB (4837kB), run=60027-60027msec
luks:
   READ: bw=76.8KiB/s (78.7kB/s), 76.8KiB/s-76.8KiB/s (78.7kB/s-78.7kB/s), io=4620KiB (4731kB), run=60135-60135msec
  WRITE: bw=79.8KiB/s (81.7kB/s), 79.8KiB/s-79.8KiB/s (81.7kB/s-81.7kB/s), io=4796KiB (4911kB), run=60135-60135msec

Comment 24 Tingting Mao 2019-02-22 02:53:58 UTC
Tried to reproduce this bug as below, but could not.


Tested packages:
qemu-kvm-rhev-2.12.0-18.el7_6.1
kernel-3.10.0-944.el7


Steps:

1. Create RAW/LUKS disks based on RBD
RAW:
qemu-img create -f raw rbd:rbd/data.img 5G

LUKS:
qemu-img create -f luks --object secret,id=sec0,data=base -o key-secret=sec0 rbd:rbd/data.luks 5G


2. Boot guest with the created disks as data disk
RAW:
-drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=raw,file=rbd:rbd/data.img \
-device virtio-blk-pci,id=virtio_blk_pci1,drive=drive_image2,bus=pci.0,addr=06 \

LUKS:
-object secret,id=sec0,data=base \
-drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=luks,file=rbd:rbd/data.luks,key-secret=sec0 \
-device virtio-blk-pci,id=virtio_blk_pci1,drive=drive_image2,bus=pci.0,addr=06 \


3. DD to the different data disks 
RAW:
dd if=/dev/zero of=/dev/vdb bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 63.5493 s, 33.8 MB/s

LUKS:
dd if=/dev/zero of=/dev/vdb bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 107.487 s, 20.0 MB/s


4. FIO for the different data disks (IOPS: READ 20 raw vs 19 luks; WRITE 19 raw vs 19 luks)
Raw:
# fio --filename=/dev/vdb --direct=1 --rw=randrw --bs=8K --name=my_test --iodepth=1 --ioengine=libaio --size=1G
my_test: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=104KiB/s,w=176KiB/s][r=13,w=22 IOPS][eta 00m:00s]
my_test: (groupid=0, jobs=1): err= 0: pid=13440: Wed Feb 20 04:57:53 2019
   read: IOPS=20, BW=161KiB/s (164kB/s)(513MiB/3274764msec)
    slat (usec): min=12, max=535, avg=21.16, stdev= 3.76
    clat (usec): min=1155, max=101838, avg=1719.13, stdev=433.90
     lat (usec): min=1179, max=101860, avg=1741.95, stdev=433.93
    clat percentiles (usec):
     |  1.00th=[ 1352],  5.00th=[ 1549], 10.00th=[ 1582], 20.00th=[ 1631],
     | 30.00th=[ 1647], 40.00th=[ 1680], 50.00th=[ 1713], 60.00th=[ 1745],
     | 70.00th=[ 1778], 80.00th=[ 1811], 90.00th=[ 1860], 95.00th=[ 1893],
     | 99.00th=[ 1975], 99.50th=[ 2008], 99.90th=[ 2057], 99.95th=[ 3097],
     | 99.99th=[ 7046]
   bw (  KiB/s): min=   15, max=  512, per=100.00%, avg=161.40, stdev=77.84, samples=6513
   iops        : min=    1, max=   64, avg=20.17, stdev= 9.73, samples=6513
  write: IOPS=19, BW=160KiB/s (163kB/s)(511MiB/3274764msec)
    slat (nsec): min=14181, max=76677, avg=21557.41, stdev=2943.98
    clat (msec): min=21, max=892, avg=48.31, stdev=37.11
     lat (msec): min=21, max=892, avg=48.34, stdev=37.11
    clat percentiles (msec):
     |  1.00th=[   32],  5.00th=[   34], 10.00th=[   34], 20.00th=[   35],
     | 30.00th=[   39], 40.00th=[   41], 50.00th=[   42], 60.00th=[   42],
     | 70.00th=[   45], 80.00th=[   51], 90.00th=[   58], 95.00th=[   78],
     | 99.00th=[  292], 99.50th=[  334], 99.90th=[  397], 99.95th=[  418],
     | 99.99th=[  464]
   bw (  KiB/s): min=   16, max=  240, per=100.00%, avg=159.68, stdev=37.37, samples=6548
   iops        : min=    2, max=   30, avg=19.96, stdev= 4.67, samples=6548
  lat (msec)   : 2=49.88%, 4=0.24%, 10=0.02%, 50=40.36%, 100=8.03%
  lat (msec)   : 250=0.76%, 500=0.72%, 1000=0.01%
  cpu          : usr=0.06%, sys=0.14%, ctx=131073, majf=0, minf=27
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=65713,65359,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=161KiB/s (164kB/s), 161KiB/s-161KiB/s (164kB/s-164kB/s), io=513MiB (538MB), run=3274764-3274764msec
  WRITE: bw=160KiB/s (163kB/s), 160KiB/s-160KiB/s (163kB/s-163kB/s), io=511MiB (535MB), run=3274764-3274764msec

Disk stats (read/write):
  vdb: ios=65747/65355, merge=0/0, ticks=112660/3157531, in_queue=3270142, util=99.92%

LUKS:
# fio --filename=/dev/vdb --direct=1 --rw=randrw --bs=8K --name=my_test --iodepth=1 --ioengine=libaio --size=1G
my_test: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=120KiB/s,w=192KiB/s][r=15,w=24 IOPS][eta 00m:00s]  
my_test: (groupid=0, jobs=1): err= 0: pid=4035: Wed Feb 20 05:55:05 2019
   read: IOPS=19, BW=159KiB/s (163kB/s)(513MiB/3302249msec)
    slat (nsec): min=13085, max=80756, avg=21246.60, stdev=2936.82
    clat (usec): min=1448, max=135264, avg=1890.08, stdev=907.99
     lat (usec): min=1470, max=135287, avg=1912.98, stdev=908.00
    clat percentiles (usec):
     |  1.00th=[ 1647],  5.00th=[ 1713], 10.00th=[ 1745], 20.00th=[ 1795],
     | 30.00th=[ 1827], 40.00th=[ 1860], 50.00th=[ 1876], 60.00th=[ 1909],
     | 70.00th=[ 1926], 80.00th=[ 1975], 90.00th=[ 2024], 95.00th=[ 2057],
     | 99.00th=[ 2114], 99.50th=[ 2147], 99.90th=[ 2212], 99.95th=[ 2573],
     | 99.99th=[58459]
   bw (  KiB/s): min=   16, max=  560, per=100.00%, avg=160.28, stdev=78.03, samples=6559
   iops        : min=    2, max=   70, avg=20.03, stdev= 9.75, samples=6559
  write: IOPS=19, BW=158KiB/s (162kB/s)(511MiB/3302249msec)
    slat (nsec): min=13732, max=69401, avg=21764.54, stdev=3022.58
    clat (msec): min=19, max=2130, avg=48.56, stdev=38.86
     lat (msec): min=19, max=2130, avg=48.59, stdev=38.86
    clat percentiles (msec):
     |  1.00th=[   32],  5.00th=[   34], 10.00th=[   34], 20.00th=[   36],
     | 30.00th=[   39], 40.00th=[   41], 50.00th=[   42], 60.00th=[   42],
     | 70.00th=[   45], 80.00th=[   51], 90.00th=[   58], 95.00th=[   78],
     | 99.00th=[  300], 99.50th=[  342], 99.90th=[  409], 99.95th=[  426],
     | 99.99th=[  523]
   bw (  KiB/s): min=   16, max=  240, per=100.00%, avg=158.43, stdev=37.76, samples=6600
   iops        : min=    2, max=   30, avg=19.80, stdev= 4.72, samples=6600
  lat (msec)   : 2=43.68%, 4=6.44%, 10=0.01%, 20=0.01%, 50=40.22%
  lat (msec)   : 100=8.18%, 250=0.73%, 500=0.74%, 750=0.01%, >=2000=0.01%
  cpu          : usr=0.06%, sys=0.14%, ctx=131074, majf=0, minf=27
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=65713,65359,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=159KiB/s (163kB/s), 159KiB/s-159KiB/s (163kB/s-163kB/s), io=513MiB (538MB), run=3302249-3302249msec
  WRITE: bw=158KiB/s (162kB/s), 158KiB/s-158KiB/s (162kB/s-162kB/s), io=511MiB (535MB), run=3302249-3302249msec

Disk stats (read/write):
  vdb: ios=65748/65356, merge=0/0, ticks=124017/3173791, in_queue=3297810, util=99.92%
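
The IOPS figures fio prints above are whole numbers, so the raw/luks gap looks larger for reads (20 vs 19) than it is. A minimal Python sketch that recomputes IOPS from the "issued rwt" counts and the run times shown in the two outputs; the `iops` helper and the `RUNS` table are names introduced here for illustration.

```python
# Issued I/O counts and run times, copied from the two fio outputs above.
RUNS = {
    "raw":  {"reads": 65713, "writes": 65359, "runtime_msec": 3274764},
    "luks": {"reads": 65713, "writes": 65359, "runtime_msec": 3302249},
}


def iops(issued_ops: int, runtime_msec: int) -> float:
    """Operations per second over the whole run."""
    return issued_ops / (runtime_msec / 1000.0)


if __name__ == "__main__":
    for name, run in RUNS.items():
        r = iops(run["reads"], run["runtime_msec"])
        w = iops(run["writes"], run["runtime_msec"])
        print(f"{name}: read {r:.2f} IOPS, write {w:.2f} IOPS")
    ratio = RUNS["luks"]["runtime_msec"] / RUNS["raw"]["runtime_msec"]
    print(f"luks run took {100 * (ratio - 1):.2f}% longer than raw")
```

Recomputed this way, the two runs differ by under 1% in total run time, which is consistent with the observation that this workload did not reproduce the problem.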

Comment 25 Yihuang Yu 2019-02-22 14:26:40 UTC
Created attachment 1537579 [details]
iostat log

I reproduced it on the ppc platform using a large luks file on the local filesystem.

raw:
qemu-img create -f raw data.raw 100G

-drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=raw,file=data.raw \
-device virtio-blk-pci,id=virtio_blk_pci1,drive=drive_image2,bus=pci.0,addr=06 \

Result:
dd if=/dev/zero of=/dev/vda bs=1M count=20480
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB) copied, 29.8509 s, 719 MB/s

luks:
qemu-img create -f luks --object secret,id=secret0,data="redhat" -o key-secret=secret0 data.luks 100G

-object secret,id=secret0,data="redhat" \
-drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=luks,file=data.luks,key-secret=secret0 \
-device virtio-blk-pci,id=virtio_blk_pci1,drive=drive_image2,bus=pci.0,addr=06 \

Result:
dd if=/dev/zero of=/dev/vda bs=1M count=20480
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB) copied, 484.657 s, 44.3 MB/s
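
When comparing many dd runs, the throughput and the raw/luks ratio can be recomputed from the summary lines rather than read off by eye. A minimal Python sketch, assuming the GNU dd summary format shown above (byte count, then elapsed seconds); `dd_throughput_mb_s` is a helper name introduced here for illustration.

```python
import re

# Matches e.g. "21474836480 bytes (21 GB) copied, 29.8509 s, 719 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes .*copied, ([\d.]+) s,")

RAW_LINE = "21474836480 bytes (21 GB) copied, 29.8509 s, 719 MB/s"
LUKS_LINE = "21474836480 bytes (21 GB) copied, 484.657 s, 44.3 MB/s"


def dd_throughput_mb_s(summary_line: str) -> float:
    """Recompute throughput in MB/s (10^6 bytes per second) from a dd summary."""
    match = DD_SUMMARY.search(summary_line)
    if match is None:
        raise ValueError(f"not a dd summary line: {summary_line!r}")
    nbytes, seconds = int(match.group(1)), float(match.group(2))
    return nbytes / seconds / 1e6


if __name__ == "__main__":
    raw = dd_throughput_mb_s(RAW_LINE)
    luks = dd_throughput_mb_s(LUKS_LINE)
    print(f"raw {raw:.1f} MB/s, luks {luks:.1f} MB/s, ratio {raw / luks:.1f}x")
```

For the two results above this gives a ratio of roughly 16x between raw and luks.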

iostat info (iostat 1 -x -m -p vda):

raw:
grep "vda" /home/iostat_raw.log | head -n 10
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00    0.05    0.00     0.00     0.00   165.05     0.00    0.53    0.53    0.00   0.53   0.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vda               0.00  5964.00    0.00  596.00     0.00   298.00  1024.00    94.72  243.88    0.00  243.88   1.24  74.00
vda               0.00  7245.00    0.00 1036.00     0.00   518.00  1024.00   128.00  260.36    0.00  260.36   0.97 100.00
vda               0.00  5817.00    0.00  836.00     0.00   418.00  1024.00   127.91  277.05    0.00  277.05   1.20 100.00
vda               0.00  6230.00    0.00  884.00     0.00   442.00  1024.00   128.00  293.28    0.00  293.28   1.13 100.00
vda               0.00  6482.00    0.00  931.00     0.00   465.50  1024.00   128.00  286.63    0.00  286.63   1.07 100.00
vda               0.00  5831.00    0.00  828.00     0.00   414.00  1024.00   128.00  294.78    0.00  294.78   1.21 100.00
vda               0.00  6307.00    0.00  903.00     0.00   451.50  1024.00   128.00  286.95    0.00  286.95   1.11 100.00

grep "vda" /home/iostat_raw.log | tail -n 10
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00  6811.00    0.00  969.00     0.00   484.50  1024.00   128.00  257.01    0.00  257.01   1.03 100.00
vda               0.00 11256.00    0.00 1608.00     0.00   804.00  1024.00   127.94  165.07    0.00  165.07   0.62 100.00
vda               0.00 15603.00    0.00 2229.00     0.00  1114.50  1024.00   128.00  112.90    0.00  112.90   0.45 100.00
vda               0.00 15547.00    0.00 2227.00     0.00  1113.50  1024.00   128.00  109.51    0.00  109.51   0.45 100.00
vda               0.00  9303.00    0.00 1382.00     0.00   677.94  1004.64   128.00  176.82    0.00  176.82   0.72 100.00
vda               0.00  6970.00    0.00 1023.00     0.00   489.88   980.71   128.00  248.55    0.00  248.55   0.98 100.00
vda               0.00  4069.00   15.00  833.00     1.25   416.19  1008.15    98.90  254.94    1.33  259.51   0.98  83.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

luks:
grep "vda" /home/iostat_luks.log | head -n 10
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00    0.25    0.00     0.02     0.00   165.05     0.00    2.11    2.11    0.00   2.11   0.05
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vda               0.00  2149.00    0.00   51.00     0.00    25.50  1024.00   105.16  713.14    0.00  713.14  16.27  83.00
vda               0.00   448.00    0.00   64.00     0.00    32.00  1024.00   128.00 1381.25    0.00 1381.25  15.62 100.00
vda               0.00   672.00    0.00   96.00     0.00    48.00  1024.00   128.00 2455.52    0.00 2455.52  10.42 100.00
vda               0.00   672.00    0.00   96.00     0.00    48.00  1024.00   128.00 2943.33    0.00 2943.33  10.42 100.00
vda               0.00   448.00    0.00   64.00     0.00    32.00  1024.00   128.00 2303.75    0.00 2303.75  15.62 100.00
vda               0.00   224.00    0.00   32.00     0.00    16.00  1024.00   128.00 4733.44    0.00 4733.44  31.25 100.00

grep "vda" /home/iostat_luks.log | tail -n 10
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00   430.00    0.00   64.00     0.00    29.81   954.00   128.00 3182.97    0.00 3182.97  15.62 100.00
vda               0.00   434.00    0.00   64.00     0.00    30.31   970.00   128.00 3297.03    0.00 3297.03  15.62 100.00
vda               0.00   627.00    0.00   97.00     0.00    47.19   996.29   128.00 3295.46    0.00 3295.46  10.31 100.00
vda               0.00   646.00    0.00   95.00     0.00    46.44  1001.09   128.00 2747.58    0.00 2747.58  10.53 100.00
vda               0.00   240.00    0.00   66.00     0.00    31.69   983.27   128.00 1908.48    0.00 1908.48  15.15 100.00
vda               0.00     0.00    0.00   97.00     0.00    45.12   952.74   128.00 3405.77    0.00 3405.77  10.31 100.00
vda               0.00     0.00   15.00  130.00     1.25    63.31   911.89    49.01 2962.34    3.33 3303.77   5.10  74.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
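
The iostat samples above can be summarised rather than compared row by row. A minimal Python sketch, assuming the `iostat -x -m` column layout shown in the header line above; it skips idle samples (w/s of 0) and averages wMB/s and w_await. The sample text is two luks rows and one idle row copied from the log, and the function names are introduced here for illustration.

```python
from statistics import mean

# Column names as they appear in the iostat header above.
HEADER = ("Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz "
          "await r_await w_await svctm %util")

LUKS_SAMPLE = """\
vda  0.00    0.00  0.00   0.00  0.00   0.00     0.00    0.00     0.00  0.00     0.00   0.00   0.00
vda  0.00  448.00  0.00  64.00  0.00  32.00  1024.00  128.00  1381.25  0.00  1381.25  15.62 100.00
vda  0.00  672.00  0.00  96.00  0.00  48.00  1024.00  128.00  2455.52  0.00  2455.52  10.42 100.00
"""


def parse_iostat(text: str) -> list:
    """Turn iostat data rows into dicts keyed by column name; skip header rows."""
    names = HEADER.split()
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(names) or fields[0] == "Device:":
            continue
        row = {"device": fields[0]}
        for name, value in zip(names[1:], fields[1:]):
            row[name] = float(value)
        rows.append(row)
    return rows


def write_summary(rows: list) -> tuple:
    """Mean wMB/s and mean w_await (ms) over samples that did any writes."""
    active = [r for r in rows if r["w/s"] > 0]
    return mean(r["wMB/s"] for r in active), mean(r["w_await"] for r in active)


if __name__ == "__main__":
    mb_s, w_await = write_summary(parse_iostat(LUKS_SAMPLE))
    print(f"mean wMB/s {mb_s:.1f}, mean w_await {w_await:.1f} ms")
```

Feeding the full raw and luks logs through the same two functions gives directly comparable averages; note that `write_summary` raises `statistics.StatisticsError` if no sample contains writes.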


Tingting,
Please try to reproduce it on x86_64.

Comment 26 Yihuang Yu 2019-02-25 08:38:16 UTC
Well, I can reproduce it from my x86 laptop.

I did some investigation. The performance of the luks format is directly related to the CPU, which has to do the AES encryption and decryption (AES-NI). It may not be related to the backend used.

Some info of my laptop:
---
cpu:
Model name:            Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz

kernel:
3.10.0-862.el7.x86_64

qemu:
qemu-kvm-rhev-2.12.0-18.el7_6.1.x86_64

loaded aesni module:
lsmod |grep aesni_intel
aesni_intel           189414  2 
lrw                    13286  1 aesni_intel
glue_helper            13990  1 aesni_intel
ablk_helper            13597  1 aesni_intel
cryptd                 20511  3 ghash_clmulni_intel,aesni_intel,ablk_helper
---

raw:
qemu-img create -f raw data.raw 10G

-drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=raw,file=data.raw \
-device virtio-blk-pci,id=virtio_blk_pci1,drive=drive_image2,bus=pci.0,addr=0x6 \

result:
dd if=/dev/zero of=/dev/vda bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 1.90697 s, 1.1 GB/s

qemu-io -c 'write 0 1G' --image-opts driver=raw,file.filename=data.raw
tcmalloc: large alloc 1073741824 bytes == 0x557d986dc000 @  0x7fe220b6935f 0x7fe220b89e90 0x557d96c62bb6 0x557d96c62bf9 0x557d96baa9c5 0x557d96bacf30 0x557d96badd32 0x557d96b9de7a 0x7fe21e8843d5 0x557d96b9e8ec
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:01.02 (998.145 MiB/sec and 0.9748 ops/sec)

luks:
qemu-img create -f luks --object secret,id=secret0,data="redhat" -o key-secret=secret0 data.luks 10G

-object secret,id=secret0,data="redhat" \
-drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=luks,file=data.luks,key-secret=secret0 \
-device virtio-blk-pci,id=virtio_blk_pci1,drive=drive_image2,bus=pci.0,addr=0x6 \

result:
dd if=/dev/zero of=/dev/vda bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 21.8749 s, 98.2 MB/s

qemu-io  --object secret,id=secret0,data="redhat" -c 'write 0 1G' --image-opts driver=luks,file.filename=data.luks,key-secret=secret0
tcmalloc: large alloc 1073741824 bytes == 0x55cf94268000 @  0x7f6ab2fa435f 0x7f6ab2fc4e90 0x55cf92639bb6 0x55cf92639bf9 0x55cf925819c5 0x55cf92583f30 0x55cf92584d32 0x55cf92574e7a 0x7f6ab0cbf3d5 0x55cf925758ec
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:11.34 (90.292 MiB/sec and 0.0882 ops/sec)

Hi Ganesh,

To gather more useful information, could you please help get some info from the customer's environment?

* The host CPU information (full info via "lscpu")
* Check aes module is loaded (lsmod | grep aes)

If possible, please help to try to reproduce it on LocalFS of that host using the above steps.

Comment 27 Daniel Berrangé 2019-02-25 10:31:28 UTC
(In reply to Yihuang Yu from comment #26)
> * The host CPU information (full info via "lscpu")
> * Check aes module is loaded (lsmod | grep aes)

The 'aes' module is only relevant to the in-kernel LUKS impl.

The QEMU userspace LUKS impl uses the userspace crypto libraries. These directly call the relevant AES x86 instructions, not the kernel's AES impl.
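
Given that, a more relevant host check than `lsmod` is whether the CPU advertises the `aes` flag at all. A minimal Python sketch, assuming an x86 Linux host where /proc/cpuinfo carries a "flags" line; the function names are introduced here for illustration. Whether the crypto library QEMU links against actually uses those instructions depends on how that library was built, which this check cannot tell.

```python
def cpu_has_aes(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line of cpuinfo text lists 'aes'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Compare whole tokens so that e.g. 'vaes' does not match.
            return "aes" in value.split()
    # No 'flags' line (e.g. on non-x86 hosts): nothing can be concluded.
    return False


def host_has_aes(path: str = "/proc/cpuinfo") -> bool:
    """Check the running host; assumes a Linux-style /proc/cpuinfo."""
    with open(path) as handle:
        return cpu_has_aes(handle.read())


if __name__ == "__main__":
    print("CPU advertises AES instructions:", host_has_aes())
```

On hosts without a "flags" line, such as the ppc machine used in comment 25, the function returns False, which should be read as "unknown" rather than "no hardware crypto support".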

Comment 41 Daniel Berrangé 2019-04-18 16:23:28 UTC
There is one set of upstream patches that significantly improves performance when using AES in XTS mode, which is the default. These patches approximately double the performance for encryption/decryption:

  https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg05389.html

These are quite straightforward to backport to QEMU in RHEL-7.

There will still be a delta vs the in-kernel performance, even with these patches, but it will be reduced.

Comment 43 Miroslav Rezanina 2019-05-13 16:00:01 UTC
Fix included in qemu-kvm-rhev-2.12.0-28.el7

Comment 45 Tingting Mao 2019-05-21 11:19:59 UTC
Tried to verify this bug as below; the 'dd' throughput improved from 97.9 MB/s to 115 MB/s.


Tested with:
# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
Stepping:              4
CPU MHz:               3400.000
BogoMIPS:              6800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              19712K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d

# free -h
              total        used        free      shared  buff/cache   available
Mem:            62G        780M         60G         10M        736M         61G
Swap:           31G          0B         31G



Steps:

In 'qemu-kvm-rhev-2.12.0-27.el7':

# dd if=/dev/zero of=/dev/vdb bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 21.9464 s, 97.9 MB/s


In 'qemu-kvm-rhev-2.12.0-29.el7':

# dd if=/dev/zero of=/dev/vdb bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 18.7088 s, 115 MB/s
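
The two dd timings above can be turned into a speedup figure, and into a rough estimate of how much of the old run time went into the cipher. The following Python sketch is a back-of-envelope model only: it ASSUMES the fix made the cipher work exactly twice as fast (per the "approximately double" statement in comment 41) and left everything else unchanged, which has not been measured here. Function names are introduced for illustration.

```python
OLD_SECONDS = 21.9464  # dd run on qemu-kvm-rhev-2.12.0-27.el7
NEW_SECONDS = 18.7088  # dd run on qemu-kvm-rhev-2.12.0-29.el7


def speedup(old_s: float, new_s: float) -> float:
    """Overall speedup of the new build for the same amount of data."""
    return old_s / new_s


def implied_cipher_share(old_s: float, new_s: float,
                         cipher_speedup: float = 2.0) -> float:
    """Fraction of the OLD run time spent in the cipher, under the model
    old = other + cipher and new = other + cipher / cipher_speedup.
    The default cipher_speedup of 2.0 is an assumption, not a measurement."""
    saved_fraction = 1.0 - new_s / old_s
    return saved_fraction / (1.0 - 1.0 / cipher_speedup)


if __name__ == "__main__":
    print(f"overall speedup: {speedup(OLD_SECONDS, NEW_SECONDS):.3f}x")
    share = implied_cipher_share(OLD_SECONDS, NEW_SECONDS)
    print(f"implied cipher share of old run time: {share:.1%}")
```

Under that assumption the cipher would account for roughly 30% of the old run time, which would fit the remark in comment 41 that a delta against the in-kernel implementation remains after the fix.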


Additional info:

Creation of the testing image
# qemu-img create -f luks --object secret,id=secret0,data="redhat" -o key-secret=secret0 data.luks 10G
Formatting 'data.luks', fmt=luks size=10737418240 key-secret=secret0

# df -Th data.luks 
Filesystem                               Type  Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dell--per740xd--01-home xfs   290G   57M  290G   1% /home

Boot scripts:
# /usr/libexec/qemu-kvm \
        -name 'gues' \
        -machine pc \
        -nodefaults \
        -vga qxl \
        -object secret,id=sec0,data=redhat \
        -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=tgt.qcow2 \
        -device virtio-blk-pci,id=virtio_blk_pci0,drive=drive_image1,bus=pci.0,addr=05,bootindex=0 \
        -drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=luks,file=data.luks,key-secret=sec0 \
        -device virtio-blk-pci,id=virtio_blk_pci1,drive=drive_image2,bus=pci.0,addr=06 \
        -vnc :0 \
        -monitor stdio \
        -m 8192 \
        -smp 8 \
        -device virtio-net-pci,mac=9a:b5:b6:b1:b2:b3,id=idMmq1jH,vectors=4,netdev=idxgXAlm,bus=pci.0,addr=0x9  \
        -netdev tap,id=idxgXAlm \
        -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/timao/monitor-qmpmonitor1-20180220-094308-h9I6hRsI,server,nowait \
        -mon chardev=qmp_id_qmpmonitor1,mode=control  \

Comment 59 errata-xmlrpc 2019-08-22 09:19:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2553

