Bug 1106856 - fstrim is not trimming thin volumes data as expected
Summary: fstrim is not trimming thin volumes data as expected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Mike Snitzer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-06-09 17:50 UTC by Zdenek Kabelac
Modified: 2014-06-16 23:29 UTC
CC List: 19 users

Fixed In Version: kernel-3.14.7-100.fc19
Clone Of:
Environment:
Last Closed: 2014-06-13 22:49:44 UTC
Type: Bug
Embargoed:


Attachments
block trace fstrim (20.39 KB, text/plain)
2014-06-09 17:52 UTC, Zdenek Kabelac
Thin pool metadata after mkfs (417 bytes, text/plain)
2014-06-09 17:56 UTC, Zdenek Kabelac
Metadata after fstrim (692 bytes, text/plain)
2014-06-09 17:58 UTC, Zdenek Kabelac
Block trace on linear device (2.99 KB, text/plain)
2014-06-09 18:01 UTC, Zdenek Kabelac
Proposed patch (632 bytes, patch)
2014-06-11 12:13 UTC, Lukáš Czerner

Description Zdenek Kabelac 2014-06-09 17:50:07 UTC
Description of problem:

I'm not exactly sure where the problem is - it could be wrong filesystem alignment, incorrect usage, or the content of the exported block device capabilities...

So let's describe symptoms:

When a thin volume (e.g. 10MB) is created from a 20MB thin pool
and an ext4 filesystem is made on such a volume - then after fully using
all available space in the filesystem, removing all files, and calling fstrim, not much space is actually released from the thin pool - the pool still seems to be occupied in pretty much the same way.

So here are commands to reproduce:
- create 20M thin pool + 10M thin volume

lvcreate -L20 -V10 -T vg/pool -n lv1

- create a maximally 'initialized' fs before first use:

mkfs.ext4 -E nodiscard,lazy_itable_init=0  /dev/vg/lv1

- fill fs with 'dd' (if=/dev/zero of=/mountpoint/file)

- remove file

- fstrim /mountpoint

- check usage via e.g. the 'lvs -a' command


Version-Release number of selected component (if applicable):
lvm2 2.02.106
kernel 3.15
e2fsprogs 1.42.10


Comment 1 Zdenek Kabelac 2014-06-09 17:52:56 UTC
Created attachment 905457 [details]
block trace fstrim

Here is the 'fstrim' block trace - it exposes lots of 'small' 128-sector trims.

Comment 2 Zdenek Kabelac 2014-06-09 17:56:26 UTC
Created attachment 905460 [details]
Thin pool metadata after mkfs

Here is the thin pool's metadata after the mkfs call.

Comment 3 Zdenek Kabelac 2014-06-09 17:58:07 UTC
Created attachment 905464 [details]
Metadata after fstrim

Here is the metadata after an fstrim call on the empty filesystem.

Comment 4 Zdenek Kabelac 2014-06-09 18:01:21 UTC
Created attachment 905467 [details]
Block trace on linear device

For comparison, here is the same fstrim on a 'common' linear volume (so instead of a thin volume it's a linear volume of the same size, with the same command sequence).

Just 2 discard commands in the trace.

Comment 5 Zdenek Kabelac 2014-06-09 18:08:04 UTC
FS on thin volume

tune2fs 1.42.10 (18-May-2014)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          033b011f-deaa-4c9c-91cf-7a9c48b665c1
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2560
Block count:              10240
Reserved block count:     512
Free blocks:              8795
Free inodes:              2549
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      39
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         1280
Inode blocks per group:   160
RAID stripe width:        64
Flex block group size:    16
Filesystem created:       Mon Jun  9 20:02:26 2014
Last mount time:          n/a
Last write time:          Mon Jun  9 20:02:26 2014
Mount count:              0
Maximum mount count:      -1
Last checked:             Mon Jun  9 20:02:26 2014
Check interval:           0 (<none>)
Lifetime writes:          1470 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          128
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      57a74517-9b2e-44b1-b8a3-fe609ee6437c
Journal backup:           inode blocks


queue/discard_granularity:4096
queue/discard_max_bytes:65536
queue/discard_zeroes_data:0
discard_alignment:0



FS on linear
tune2fs 1.42.10 (18-May-2014)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          2783bffd-13fd-4544-a9a6-539ff828622f
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2560
Block count:              10240
Reserved block count:     512
Free blocks:              8795
Free inodes:              2549
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      39
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         1280
Inode blocks per group:   160
Flex block group size:    16
Filesystem created:       Mon Jun  9 20:04:03 2014
Last mount time:          n/a
Last write time:          Mon Jun  9 20:04:03 2014
Mount count:              0
Maximum mount count:      -1
Last checked:             Mon Jun  9 20:04:03 2014
Check interval:           0 (<none>)
Lifetime writes:          1470 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          128
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      b2451bde-1f33-4f6b-9b64-b9916da8a0a3
Journal backup:           inode blocks

queue/discard_granularity:4096
queue/discard_max_bytes:4294966784
queue/discard_zeroes_data:1
discard_alignment:0

Comment 6 Zdenek Kabelac 2014-06-09 18:15:19 UTC
Here is the dumpe2fs output showing the free blocks:

Group 0: (Blocks 0x0001-0x2000) [ITABLE_ZEROED]
  Checksum 0x3089, unused inodes 1268
  Primary superblock at 0x0001, Group descriptors at 0x0002-0x0002
  Reserved GDT blocks at 0x0003-0x0029
  Block bitmap at 0x002a (+41), Inode bitmap at 0x003a (+57)
  Inode table at 0x004a-0x00e9 (+73)
  6789 free blocks, 1269 free inodes, 2 directories, 1268 unused inodes
  Free blocks: 0x057c-0x2000
  Free inodes: 0x000c-0x0500
Group 1: (Blocks 0x2001-0x27ff) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x159c, unused inodes 1280
  Backup superblock at 0x2001, Group descriptors at 0x2002-0x2002
  Reserved GDT blocks at 0x2003-0x2029
  Block bitmap at 0x002b (bg #0 + 42), Inode bitmap at 0x003b (bg #0 + 58)
  Inode table at 0x00ea-0x0189 (bg #0 + 233)
  2006 free blocks, 1280 free inodes, 0 directories, 1280 unused inodes
  Free blocks: 0x202a-0x27ff
  Free inodes: 0x0501-0x0a00

Comment 7 Zdenek Kabelac 2014-06-09 18:24:35 UTC
It looks like fstrim would likely be able to use much bigger chunks for TRIM; however, it uses discard_max_bytes - which is 64KB for the thin volume - and if these requests are not issued on exact provisioned block boundaries, the thin pool target does not trim a single byte.

So this raises a couple of questions - is the 'ext4' filesystem properly aligned on thinly provisioned volumes?

Is fstrim using a bad strategy when using the 'max' size for TRIM, or is the thin pool too limiting?

How is discard alignment propagated so that something actually gets discarded?

Comment 8 Zdenek Kabelac 2014-06-09 20:49:01 UTC
Another thing I've noticed: when the stripe size is bigger than the thin-pool chunk size (discard_max_bytes < ext4 stride), discards are mostly never sent.
(adding mkfs option stride=128)

Comment 9 Lukáš Czerner 2014-06-10 09:56:44 UTC
There is no special intelligence in the fstrim tool. It's just calling into the file system via the FITRIM ioctl. The file system implementation does not need to care about alignment and discard_max_bytes. This should be taken care of by blkdev_issue_discard(), which does the alignment and divides the bigger requests into smaller ones that the device can handle.
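
For reference, a minimal userspace sketch of what fstrim boils down to (a hypothetical illustration, not the actual fstrim source; the mount point is taken from argv):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>                    /* FITRIM, struct fstrim_range */

int main(int argc, char **argv)
{
	struct fstrim_range range = {
		.start  = 0,
		.len    = ~0ULL,         /* trim the whole filesystem */
		.minlen = 0,
	};
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);    /* e.g. /mountpoint */
	if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
		perror("FITRIM");
		return 1;
	}
	/* the kernel updates range.len to the number of bytes trimmed */
	printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}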

However I can see this in your report. 

discard_alignment:0

Which means that there is no information about alignment, hence the block layer does not align the discard requests. So the problem is probably in the thinp itself - it should export alignment information if that's what it needs.

Thanks!
-Lukas

Comment 10 Lukáš Czerner 2014-06-10 10:11:08 UTC
Looking at set_discard_limits() in dm-thin.c we're not setting the discard_alignment. Moreover we're setting 

	limits->max_discard_sectors = pool->sectors_per_block;

which I do not think is necessary. It will result in a flood of small requests which will hurt the performance unnecessarily.

-Lukas

Comment 11 Joe Thornber 2014-06-10 10:35:46 UTC
There are some patches working their way through that allow arbitrary size discards with thin and cache.  But I don't expect them to go upstream for a few months.

If discard_alignment needs to be set then I think the dm core should set it based on the split io size.

Comment 12 Mike Snitzer 2014-06-10 15:05:32 UTC
(In reply to Lukáš Czerner from comment #9)
> There is no special intelligence in the fstrim tool. It's just calling
> into the file system via the FITRIM ioctl. The file system implementation
> does not need to care about alignment and discard_max_bytes. This should be
> taken care of by blkdev_issue_discard(), which does the alignment and
> divides the bigger requests into smaller ones that the device can handle.
> 
> However I can see this in your report. 
> 
> discard_alignment:0
> 
> Which means that there is no information about alignment, hence the block
> layer does not align the discard requests. So the problem is probably in the
> thinp itself - it should export alignment information if that's what it
> needs.

DM devices stack limits from the underlying devices.  A discard_alignment of 0 is very much normal.  No idea why you're expecting discard_alignment to not be 0.  Please read up on discard_alignment:

What:           /sys/block/<disk>/discard_alignment
Date:           May 2011
Contact:        Martin K. Petersen <martin.petersen>
Description:
                Devices that support discard functionality may
                internally allocate space in units that are bigger than
                the exported logical block size. The discard_alignment
                parameter indicates how many bytes the beginning of the
                device is offset from the internal allocation unit's
                natural alignment.

(In reply to Lukáš Czerner from comment #10)
> Looking at set_discard_limits() in dm-thin.c we're not setting the
> discard_alignment. Moreover we're setting 
> 
> 	limits->max_discard_sectors = pool->sectors_per_block;
> 
> which I do not think is necessary. It will result in a flood of small
> requests which will hurt the performance unnecessarily.

Joe already spoke to this in comment #11, but yes it is necessary.  Current thinp cannot handle discards that span multiple thinp blocks.  This only affects discard performance (we're well aware).  It doesn't impact discard functionality.

Discard functionality is impacted if the filesystem issues a discard starting at a logical address that isn't aligned on a thinp blocksize boundary.  The block layer doesn't take care of re-aligning discards if the start of the requested discard is not aligned at a multiple of discard_max_bytes (AFAICT - looking at blkdev_issue_discard).  So all said, it looks like nr_sects is assumed to be properly aligned.  If it isn't, nothing in blkdev_issue_discard corrects it.  So the entire extent that is to be discarded can easily be split into misaligned discard_max_bytes-sized discard requests.  But really, that _should_ only result in the first, and possibly last, thinp block not being properly discarded.

Comment 13 Lukáš Czerner 2014-06-10 16:01:53 UTC
(In reply to Mike Snitzer from comment #12)
> (In reply to Lukáš Czerner from comment #9)
> > There is no special intelligence in the fstrim tool. It's just calling
> > into the file system via the FITRIM ioctl. The file system implementation
> > does not need to care about alignment and discard_max_bytes. This should
> > be taken care of by blkdev_issue_discard(), which does the alignment and
> > divides the bigger requests into smaller ones that the device can handle.
> > 
> > However I can see this in your report. 
> > 
> > discard_alignment:0
> > 
> > Which means that there is no information about alignment, hence the block
> > layer does not align the discard requests. So the problem is probably in the
> > thinp itself - it should export alignment information if that's what it
> > needs.
> 
> DM device stack limits from underlying devices.  discard_alignment of 0 is
> very much normal.  No idea why you're expecting discard_alignment to not be
> 0.  Please read up on discard_alignment:
> 
> What:           /sys/block/<disk>/discard_alignment
> Date:           May 2011
> Contact:        Martin K. Petersen <martin.petersen>
> Description:
>                 Devices that support discard functionality may
>                 internally allocate space in units that are bigger than
>                 the exported logical block size. The discard_alignment
>                 parameter indicates how many bytes the beginning of the
>                 device is offset from the internal allocation unit's
>                 natural alignment.

Fair enough. I assumed that it means something else. In that case, this should be ok.

> 
> (In reply to Lukáš Czerner from comment #10)
> > Looking at set_discard_limits() in dm-thin.c we're not setting the
> > discard_alignment. Moreover we're setting 
> > 
> > 	limits->max_discard_sectors = pool->sectors_per_block;
> > 
> > which I do not think is necessary. It will result in a flood of small
> > requests which will hurt the performance unnecessarily.
> 
> Joe already spoke to this in comment #11, but yes it is necessary.  Current
> thinp cannot handle discards that span multiple thinp blocks.  This only
> affects discard performance (we're well aware).  It doesn't impact discard
> functionality.
> 
> Discard functionality is impacted if the filesystem issues a discard
> starting at a logical address that isn't aligned on a thinp blocksize
> boundary.  The block layer doesn't take care of re-aligning discards if
> the start of the requested discard is not aligned at a multiple of
> discard_max_bytes (AFAICT - looking at blkdev_issue_discard).  So all said,
> it looks like nr_sects is assumed to be properly aligned.  If it isn't,
> nothing in blkdev_issue_discard corrects it.  So the entire extent that is
> to be discarded can easily be split into misaligned discard_max_bytes-sized
> discard requests.  But really, that _should_ only result in the first, and
> possibly last, thinp block not being properly discarded.

Well, if we allowed discard requests bigger than the chunk size, then this would possibly be true. However, as it stands now, blkdev_issue_discard() will divide the request into smaller (max_discard_sectors) requests, and if the first one is not chunk-size aligned, then none of them are aligned - or do you take care of merging and aligning the discard requests internally?
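
To illustrate with a standalone simulation (hypothetical numbers: the 64 KiB chunk matches discard_max_bytes from comment 5, while the 4 KiB misaligned start is an assumed example):

#include <stdio.h>

/* Simulates blkdev_issue_discard() splitting one big discard into
 * max_discard_sectors-sized pieces, starting from a misaligned offset. */
int main(void)
{
	unsigned long long chunk = 64 * 1024;    /* thinp block == discard_max_bytes */
	unsigned long long start = 4 * 1024;     /* misaligned start of the extent */
	unsigned long long len   = 1024 * 1024;  /* 1 MiB extent to discard */
	unsigned long long off, whole = 0;

	for (off = start; off < start + len; off += chunk)
		/* thinp deprovisions a block only when a single request
		 * covers it entirely; a piece that starts off a chunk
		 * boundary always straddles two chunks and covers neither */
		if (off % chunk == 0)
			whole++;

	/* prints "whole chunks discarded: 0 of 16" */
	printf("whole chunks discarded: %llu of %llu\n", whole, len / chunk);
	return 0;
}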

So I think that getting rid of this limitation:

limits->max_discard_sectors = pool->sectors_per_block;

will help in this case. As you mentioned, the first and last chunk (thinp block) might not get properly discarded, but there is not much we can do about that anyway.

-Lukas

Comment 14 Mike Snitzer 2014-06-10 16:39:47 UTC
A new piece of insight on this issue (relative to the potential for ext4/XFS to issue discards that aren't aligned on a max_discard_bytes boundary):

dm-thinp does _not_ account for partial block discards.  dm-thinp will pass these partial block discards down to the underlying data device, but again it will _not_ account for them.  Assuming my theory on the FS sending slightly misaligned discards is true, dm-thinp's inability to account for partial block discards can easily explain why no space is being reclaimed in dm-thinp itself.

Comment 15 Lukáš Czerner 2014-06-11 11:36:53 UTC
Discard requests are not supposed to be aligned to a max_discard_bytes boundary at all. The inability of thinp to discard partial "chunks" is ok, even though it might become a problem when the chunk size grows bigger (but that's expected).

The real issue is that thinp does not seem to be exporting discard_granularity properly (judging from comment 5, where it is set to 4096 while the chunk size should be 64k). If both are fixed, then the requests will be divided into smaller requests that are properly aligned to the chunk size (discard_granularity), with the exception of the first and the last one, which might still be misaligned.

So in Comment 9 I confused discard_alignment with discard_granularity. It will still be useful to accept bigger discard ranges, but this particular issue should be fixed by properly exporting discard_granularity, which in the thinp case should really be the chunk size.


Looking at set_discard_limits() the problem is here:

	if (pt->adjusted_pf.discard_passdown) {
		data_limits = &bdev_get_queue(pt->data_dev->bdev)->limits;
		limits->discard_granularity = data_limits->discard_granularity;

we're blindly exporting the discard_granularity of the underlying device. However, in the ideal case, when the bigger one is a multiple of the smaller one, we should set it to the bigger. If it's not a multiple, then I think we should probably disable discard_passdown, because being able to reclaim thinp blocks seems more important to me ... but that's up for discussion.

-Lukas

Comment 16 Lukáš Czerner 2014-06-11 11:40:04 UTC
Zdenek, can you confirm that creating thinp on something which is not discard capable, or disabling discard_passdown, helps? And that discard_granularity is properly set in that case?

Thanks!
-Lukas

Comment 17 Lukáš Czerner 2014-06-11 12:13:00 UTC
Created attachment 907614 [details]
Proposed patch

Turns out that thinp is already checking whether the discard_granularity of the device is a factor of the chunk size. So the only thing that remains is to pick the bigger of the two.

Strictly speaking, currently there is no real reason to pick the bigger one, because the chunk size will always be >= the device discard_granularity - otherwise passdown would be disabled anyway. However, I believe that limitation is there only because thinp does not allow requests larger than the chunk size, which is hopefully going to change in the future, so this makes it future proof.
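
A condensed sketch of the change to set_discard_limits() (kernel-fragment style, reusing the names from the excerpt in comment 15; the attachment has the exact patch):

	if (pt->adjusted_pf.discard_passdown) {
		data_limits = &bdev_get_queue(pt->data_dev->bdev)->limits;
		/* export the bigger of the two granularities; passdown is
		 * only enabled when the data device's discard_granularity
		 * is a factor of the pool chunk size, so this stays sane */
		limits->discard_granularity =
			max(data_limits->discard_granularity,
			    pool->sectors_per_block << SECTOR_SHIFT);
	} else
		limits->discard_granularity =
			pool->sectors_per_block << SECTOR_SHIFT;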

-Lukas

Comment 18 Zdenek Kabelac 2014-06-11 12:43:35 UTC
I can confirm that the patch proposed in comment 17 makes fstrim usable on thin volumes; space is properly released from the pool in my testing setup.

Comment 19 Mike Snitzer 2014-06-11 16:33:50 UTC
Pushed Lukas' fix to linux-next (for 3.16 inclusion), see:

https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=4adc9ea8915ff9904c3279b1b36478482d1c0825

Comment 20 Josh Boyer 2014-06-11 20:38:22 UTC
Patch applied in Fedora git after talking with Mike.  Thanks for the pointer!

Comment 21 Heinz Mauelshagen 2014-06-12 10:39:05 UTC
Once we have arbitrary sized discards supported (as referred to in comment #11), Lukas' patch becomes needless.

Comment 22 Lukáš Czerner 2014-06-12 11:27:43 UTC
(In reply to Heinz Mauelshagen from comment #21)
> Once we have arbitrary sized discards supported (as referred to in comment
> #11), Lukas' patch becomes needless.

Does it mean that dm-thin is going to be able to handle sub chunk size discard requests ?

-Lukas

Comment 23 Fedora Update System 2014-06-12 12:13:51 UTC
kernel-3.14.7-200.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.14.7-200.fc20

Comment 24 Fedora Update System 2014-06-12 12:15:47 UTC
kernel-3.14.7-100.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/kernel-3.14.7-100.fc19

Comment 25 Heinz Mauelshagen 2014-06-12 12:49:40 UTC
(In reply to Lukáš Czerner from comment #22)
> (In reply to Heinz Mauelshagen from comment #21)
> > Once we have arbitrary sized discards supported (as referred to in comment
> > #11), Lukas' patch becomes needless.
> 
> Does it mean that dm-thin is going to be able to handle sub chunk size
> discard requests ?
> 
> -Lukas

Yes, it's going to be able to pass them down to the pool device.

Obviously it won't be able to deprovision partial blocks, because
those are our atomic allocation unit.

Comment 26 Lukáš Czerner 2014-06-12 13:53:42 UTC
(In reply to Heinz Mauelshagen from comment #25)
> (In reply to Lukáš Czerner from comment #22)
> > (In reply to Heinz Mauelshagen from comment #21)
> > > Once we have arbitrary sized discards supported (as referred to in comment
> > > #11), Lukas' patch becomes needless.
> > 
> > Does it mean that dm-thin is going to be able to handle sub chunk size
> > discard requests ?
> > 
> > -Lukas
> 
> Yes, it's going to be able to pass them down to the pool device.
> 
> Obviously it won't be able to deprovision partial blocks, because
> those are our atomic allocation unit.

In that case my patch is still going to be needed because you still need to export proper discard_granularity for blkdev_issue_discard() to properly align the requests.

-Lukas

Comment 27 Mike Snitzer 2014-06-12 14:26:51 UTC
(In reply to Lukáš Czerner from comment #26)
> (In reply to Heinz Mauelshagen from comment #25)

> > Yes, it's going to be able to pass them down to the pool device.

We already pass partial block discards down, we just don't record them in dm-thin-pool.  And Heinz's improvements won't change that.

> > Obviously it won't be able to deprovision partial blocks, because
> > those are our atomic allocation unit.
> 
> In that case my patch is still going to be needed because you still need to
> export proper discard_granularity for blkdev_issue_discard() to properly
> align the requests.

Exactly.

Comment 28 Fedora Update System 2014-06-13 05:26:50 UTC
Package kernel-3.14.7-200.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.14.7-200.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-7313/kernel-3.14.7-200.fc20
then log in and leave karma (feedback).

Comment 29 Fedora Update System 2014-06-13 22:49:44 UTC
kernel-3.14.7-200.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 30 Fedora Update System 2014-06-16 23:29:01 UTC
kernel-3.14.7-100.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

