Bug 1961299

Summary: degraded thin lv performance while using nvme backing device
Product: Red Hat Enterprise Linux 7
Reporter: Rupesh Girase <rgirase>
Component: lvm2
Assignee: Mikuláš Patočka <mpatocka>
lvm2 sub component: Thin Provisioning
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED DEFERRED
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: agk, heinzm, jbrassow, loberman, msnitzer, prajnoha, thornber, zkabelac
Version: 7.9
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-30 20:04:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Rupesh Girase 2021-05-17 16:30:46 UTC
Description of problem:

Thin LV performance degrades compared to a thick LV when using an NVMe device,
but we do not see the same performance difference between thin and thick LVs when using ramdisk multipath devices.


Version-Release number of selected component (if applicable):
RHEL 7.9

How reproducible:
always

Steps to Reproduce:
1. create thin lv backed by nvme devices
2. run fio tests
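
A minimal reproduction along these lines (device names, LV names, sizes and fio parameters are illustrative, not taken from the report; the VG name nvmevg appears later in comment 7):

  pvcreate /dev/nvme0n1 /dev/nvme1n1
  vgcreate nvmevg /dev/nvme0n1 /dev/nvme1n1
  lvcreate -L 100G -i 2 -T nvmevg/thinpool          # thin pool striped across the NVMe PVs
  lvcreate -V 100G -T nvmevg/thinpool -n thinlv     # thin LV under test
  lvcreate -L 100G -n thicklv nvmevg                # thick LV for comparison
  fio --name=thin-test --filename=/dev/nvmevg/thinlv --direct=1 --rw=randwrite \
      --bs=4k --iodepth=32 --numjobs=8 --time_based --runtime=60 --group_reporting

Running the same fio job against /dev/nvmevg/thicklv gives the thick-LV baseline.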


Actual results:
Performance drops significantly in the case of the thin LV.

Expected results:
we should not see significant performance drops

Additional info:
RHEL 8 seems good compared to RHEL 7.9.

Comment 4 Zdenek Kabelac 2021-05-18 08:03:09 UTC
When you use a bigger chunk size and need the best performance, the user has to disable zeroing:

lvcreate -Zn -T -Lpoolsize .....

When zeroing is enabled, each provisioned chunk is zeroed for all unwritten sectors.

If you need to keep zeroing, fio measurements should be taken on already provisioned blocks.

But in all cases the performance will be lower than a plain striped target; however, with chunk sizes >= 512K it should be pretty close.
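
A minimal sketch of the above, reusing the nvmevg VG from this report (pool/LV names, sizes and the chunk size are illustrative):

  # create the pool with zeroing disabled and a larger chunk size
  lvcreate -Zn -c 512k -L 100G -T nvmevg/thinpool
  lvcreate -V 100G -T nvmevg/thinpool -n thinlv

  # if zeroing must stay enabled, provision the chunks once up front
  # (writing the whole thin LV) so fio then measures already provisioned blocks
  dd if=/dev/zero of=/dev/nvmevg/thinlv bs=1M oflag=direct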

Comment 5 Zdenek Kabelac 2021-05-18 08:17:12 UTC
Note: there is maybe 'a mistake' in your script, as a 512M thin-pool chunk size likely consumes a lot of thin-pool space for written data (this can be useful only in some limited use cases).

8 stripes on NVMe, which typically comes with a 512k optimal I/O size, should give you 8 * 0.5M => a 4M chunk size to keep the best alignment and well usable discards. Although such large chunk sizes are still not very practical with snapshots...

Finding the optimal striping pattern may require several benchmarking rounds.

It is also possible to use striped 'metadata' for the highest performance - however, in this case the user has to create the data and metadata LVs separately and join them into a thin-pool volume via lvconvert. If you seek the best possible performance, you may need to tune this part as well.

Comment 6 Zdenek Kabelac 2021-05-18 08:21:10 UTC
Forgot to mention: --stripesize 512k may help you tune for the best stripe size for each of your NVMe devices.
(but as said, what gives the best throughput depends on the use case and hardware)
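
A rough sketch of the separate data/metadata approach from comment 5, using the --stripesize tuning mentioned above (stripe counts, sizes, names and the chunk size are illustrative):

  # create striped data and metadata LVs separately
  lvcreate -L 100G -i 8 --stripesize 512k -n thinpool nvmevg
  lvcreate -L 1G   -i 8 --stripesize 64k  -n poolmeta nvmevg

  # join them into a thin-pool volume; the chunk size can be set here as well
  lvconvert --type thin-pool --chunksize 4m --poolmetadata nvmevg/poolmeta nvmevg/thinpool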

Comment 7 Rupesh Girase 2021-05-21 09:53:53 UTC
Hello Zdenek,

Tried disabling zeroing after creating the thinpool, as shown in internal comment #2.

Is it the same, or does zeroing need to be disabled only at creation time?


** disabled zeroing

[root@hp-dl380g10-1 ~]# lvchange -Zn nvmevg/glide_thinpool
  Logical volume nvmevg/glide_thinpool changed.
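
One way to confirm the pool's current zeroing setting after this change (the 'zero' and 'chunk_size' reporting fields are assumed to be available in this lvm2 version):

  lvs -o lv_name,chunk_size,zero nvmevg/glide_thinpool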

Also, we do not see such a performance difference while using ramdisk devices.
We see this behavior only for a thinpool/LV backed by NVMe devices.


Thanks,
Rupesh

Comment 13 Mikuláš Patočka 2021-08-26 13:07:40 UTC
In my opinion, the user has unrealistic performance expectations here.

dm-thin takes locks - it takes pmd->pool_lock for read for each I/O and then it takes the dm-bufio lock for each btree node. If you submit massively parallel I/O to dm-thin, the cache lines containing the locks will be bouncing between the CPU cores and this will cause performance degradation.

The faster the underlying device is, the more performance degradation due to lock bouncing you will see. The ramdisk has low performance (the benchmarks here show about 360MiB/s), so you won't see much degradation on it. The 8-leg NVMe RAID-0 has high performance (3GiB/s), so the performance degradation is high.

If you want to achieve 3GiB/s, you must make sure that the I/O path is lockless.

It is not easy to avoid the locks and fix this. It could be fixed by leveraging the Intel transactional memory instructions (TSX), but that would be a lot of work and it certainly could not be done in RHEL7.
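
For reference, the kind of massively parallel fio job that exposes this lock contention looks roughly like the following (the thin LV name and parameters are illustrative):

  fio --name=parallel-randread --filename=/dev/nvmevg/thinlv --direct=1 --rw=randread \
      --bs=4k --iodepth=64 --numjobs=32 --time_based --runtime=60 --group_reporting

Reducing --numjobs, and hence the cross-CPU traffic on pmd->pool_lock and the dm-bufio lock, would be expected to narrow the gap to the thick LV.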

Comment 14 Jonathan Earl Brassow 2021-08-30 20:04:36 UTC
(In reply to Mikuláš Patočka from comment #13)

I would like to mention that there is ongoing work to improve thin-p performance and scalability, but that too would be too much to backport to RHEL7.  These changes are intended to land sometime in RHEL9.