Bug 1600156 - dm-thin pool disables discard passdown if on top of VDO volume due to 4K max_discard_sectors
Summary: dm-thin pool disables discard passdown if on top of VDO volume due to 4K max_discard_sectors
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kmod-kvdo
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: bjohnsto
QA Contact: Jakub Krysl
URL:
Whiteboard:
Depends On:
Blocks: 1612349 1657156
 
Reported: 2018-07-11 14:31 UTC by Bryan Gurney
Modified: 2021-09-03 11:51 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1612349
Environment:
Last Closed: 2019-01-24 15:23:52 UTC
Target Upstream Version:
Embargoed:



Description Bryan Gurney 2018-07-11 14:31:42 UTC
Description of problem:
If a thin-pool is created on top of a VDO volume, discard passdown will be disabled, because dm-thin expects to be able to perform a discard of at least the thin-pool block size (minimum 64 KiB; on my test it defaults to 128 KiB).  In contrast, VDO's max_discard_sectors defaults to 8 sectors (4 KiB).

(I've seen evidence of prior versions of dm-thin being able to pass down discards to VDO with the default max_discard_sectors setting of 4 KiB.)
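
A minimal way to confirm the advertised limit (a sketch, assuming the vdo1 volume from the steps below; the dm-N node name will differ per system):

# readlink -f /dev/mapper/vdo1
/dev/dm-4
# cat /sys/block/dm-4/queue/discard_max_bytes
4096

(4096 bytes = 8 sectors x 512 bytes, matching the 4 KiB default described above.)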

Version-Release number of selected component (if applicable):
kernel-3.10.0-862.6.3.el7.x86_64
lvm2-2.02.177-4.el7.x86_64
vdo-6.1.0.168-18.x86_64
kmod-kvdo-6.1.0.171-17.el7_5.x86_64


How reproducible:
100% so far

Steps to Reproduce:
1. Create a VDO volume with the default settings, and an optional logical size (I chose 300 GB, on top of a 120 GB solid state drive):
# vdo create --name=vdo1 --device=/dev/sdb --vdoLogicalSize=300G

2. Create a thin-pool on top of the VDO volume:
# pvcreate /dev/mapper/vdo1
# vgcreate TEST /dev/mapper/vdo1
# lvcreate -L 200G -T TEST/dmthin1

Actual results:
When the thin-pool is activated, it will print a kernel log message saying that discard passdown was disabled because the max discard sectors of the VDO volume is smaller than "a block":

(stdout from lvcreate command)
  Thin pool volume with chunk size 128.00 KiB can address at most 31.62 TiB of data.
  Logical volume "dmthin1" created.

(/var/log/messages)
Jul 11 09:40:39 localhost kernel: device-mapper: thin: Data device (dm-5) max discard sectors smaller than a block: Disabling discard passdown.
Jul 11 09:40:39 localhost lvm[11450]: Monitoring thin pool TEST-dmthin1.

(dmsetup table/status output)
# dmsetup status --target=thin-pool
TEST-dmthin1: 0 419430400 thin-pool 0 110/25600 0/1638400 - rw no_discard_passdown queue_if_no_space - 
# dmsetup table --target=thin-pool
TEST-dmthin1: 0 419430400 thin-pool 253:4 253:5 256 0 0

("256" indicates a dm-thin block size of 128 KiB)

Expected results:
When the thin-pool is activated, there are no error messages regarding inability to passdown discards, and the "dmsetup status" output for the thin-pool displays "discard_passdown":

(stdout from lvcreate command)
  Thin pool volume with chunk size 128.00 KiB can address at most 31.62 TiB of data.
  Logical volume "dmthin1" created.

(no error messages in /var/log/messages from "kernel: device-mapper: thin:")

(dmsetup table/status output)
# dmsetup status --target=thin-pool
TEST-dmthin1: 0 419430400 thin-pool 0 110/25600 0/1638400 - rw discard_passdown queue_if_no_space -
# dmsetup table --target=thin-pool
TEST-dmthin1: 0 419430400 thin-pool 253:4 253:5 256 0 0

Additional info:

Comment 2 Mike Snitzer 2018-07-11 16:03:18 UTC
VDO's dm target interface sets ti->split_discard_bios, so we need to understand why VDO isn't able to set a larger max_discard_sectors.

Michael Sclafani said, in email to vdo-devel and myself:
"From the VDO code it appears untenable to increase maxDiscardSector
without major performance impact - to the extent of I/O stalls."

I need to understand _why_ VDO experiences IO stalls given it is DM core that is splitting the discards that would be issued to VDO as 4K discards.

VDO should be able to advertise a max_discard_sectors equivalent to 1GB and not suffer IO stalls.

This is really a VDO bug, not an lvm2 (or dm-thinp) bug.  Changing component.

Comment 3 sclafani 2018-07-11 17:42:27 UTC
(In reply to Mike Snitzer from comment #2)
> VDO's dm target interface sets ti->split_discard_bios so we need to
> understand why VDO isn't able set a larger max_discard_sectors.
> 
> Michael Sclafani said, in email to vdo-devel and myself:
> "From the VDO code it appears untenable to increase maxDiscardSector
> without major performance impact - to the extent of I/O stalls."
> 
> I need to understand _why_ VDO experiences IO stalls given it is DM core
> that is splitting the discards that would be issued to VDO as 4K discards.
> 
> VDO should be able to advertise a max_discard_sectors equivalent to 1GB and
> not suffer IO stalls.
> 
> This is really a VDO bug.. not an lvm2 (or dm-thinp) bug.  Changing
> component.


That quote is from James Hogarth, not me.

Comment 4 James Hogarth 2018-07-11 18:39:37 UTC
And for absolute clarity that was coming from comments in the code here:

https://github.com/dm-vdo/kvdo/blob/0a646e36bc6c120096ba9e87927dcfca0fd5ca60/vdo/kernel/dmvdo.c#L50

"* The value 1024 is the largest usable value on HD systems.  A 2048 sector
 * discard on a busy HD system takes 31 seconds.  We should use a value no
 * higher than 1024, which takes 15 to 16 seconds on a busy HD system.
 *
 * But using large values results in 120 second blocked task warnings in
 * /var/log/kern.log.  In order to avoid these warnings, we choose to use the
 * smallest reasonable value.  See VDO-3062 and VDO-3087."

Comment 5 Thomas Jaskiewicz 2018-07-11 19:46:32 UTC
The comment about the 1024 value comes from when we were testing on Linux 3.2 on an HD machine, where blkdev_issue_discard would break large discards into 1024-sector pieces.  If you used the value 2048 on that system, the bio would take a full minute to process.  At the time we were trying to layer iSCSI on top of VDO, and the default iSCSI timeout on the client was 1 minute, so the value 2048 would lead to iSCSI discard requests timing out.

At some point we noticed in our filesystem testing that there were performance problems with filesystems mounted with -o discard (this was on SSD systems).  We could easily get 120-second blocked task warnings on such systems.  The warnings came from tasks waiting for the journal thread, which seemed to have done the flush and write of the journal commit record but was then in a loop calling blkdev_issue_discard to trim all the freed extents.  That loop was hitting 120-second blockages.

Our stats showed that we were getting a lot of 64K discards.  It turned out that we could make the blockages go away by setting the maximum discard size to 8 sectors (4 KB), because we then processed the sixteen 4K blocks in parallel.

And it turns out that the performance difference on SSD systems between a setting of 8 and a setting of 1024 is very small when the user wants to do a large discard (large meaning terabytes).  And on SSD systems, doing one 64K discard is much slower than doing sixteen 4K discards.

Comment 7 bjohnsto 2018-08-07 18:14:34 UTC
VDO's max_discard_sectors currently defaults to 4K for the variety of reasons described above.  So when a thin pool is created on top of VDO, it doesn't pass down any discards.

We did have a workaround for this on older kernels (pre-4.3): a sysfs field, /sys/kvdo/max_discard_sectors, that allowed setting max_discard_sectors for all VDO volumes.  This let the user create/start VDO volumes that would work with thin on top.

We didn't have that sysfs field on kernels 4.3 and greater, so there was no way to get discards passed down from thin to VDO.  Our change has put that field back for all kernels.

We are still contemplating what our default max_discard_sectors might be but the workaround should be sufficient for now.
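
A rough sketch of that workaround, assuming the sysfs field described above is present (the kvdo module must be loaded for /sys/kvdo to exist) and that the value must be set before the VDO volume is created/started, as implied above.  The value 256 sectors (128 KiB) is chosen here only to match the thin-pool chunk size from the original report; see the solution article referenced later in this bug for the supported procedure:

# modprobe kvdo
# echo 256 > /sys/kvdo/max_discard_sectors
# vdo create --name=vdo1 --device=/dev/sdb --vdoLogicalSize=300G

Then create the thin pool on /dev/mapper/vdo1 as in the original steps and check that "dmsetup status --target=thin-pool" reports discard_passdown.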

Comment 10 Mike Snitzer 2019-01-10 20:54:15 UTC
*** Bug 1665242 has been marked as a duplicate of this bug. ***

Comment 11 Andy Walsh 2019-01-17 15:14:01 UTC
I believe this bug can be closed, as there is a workaround available to change the max_discard_sectors for VDO.

https://access.redhat.com/solutions/3562021

In the RHEL-8.0 version of VDO (6.2.0.x), a user will be able to specify discard granularity per-volume.
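
A hedged example of what that per-volume setting might look like with the RHEL 8 vdo manager; the option name (--maxDiscardSize) and the 2M value are taken from later vdo(8) documentation and should be verified against the installed version:

# vdo create --name=vdo1 --device=/dev/sdb --vdoLogicalSize=300G --maxDiscardSize=2M

Any value at or above the thin-pool chunk size (128 KiB in this report) should be enough to avoid the "max discard sectors smaller than a block" check described above.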

Comment 16 Andy Walsh 2019-01-24 15:23:52 UTC
If we want to consider changing the default max_discard_sectors value, we will need to allocate some time to do sufficient performance testing to confirm whether the value we're using is still the best option for performance.  Workarounds are available, and as previously mentioned, RHEL-8.0 introduced the ability to specify per-volume max_discard_sectors values, which would directly address this issue.


Closing this as NOTABUG, since there is a workaround available for the default settings.

Comment 17 Yaniv Kaul 2021-04-27 07:07:00 UTC
It's clearly a bug (the user experience is bad enough that there is an Insights rule to alert on this and provide a workaround); it should have been closed as WONTFIX.

