Description of problem:

I'm not exactly sure where the problem is - it could be wrong filesystem
alignment, incorrect usage, or the content of the exported block device
capabilities... So let's describe the symptoms:

When a thin volume (e.g. 10MB) is created from a 20MB thin pool and an ext4
filesystem is made on that volume, then after fully using all available space
in the filesystem, removing all files and calling fstrim, not much space is
actually released from the thin pool - the pool still seems to be occupied in
pretty much the same way.

Commands to reproduce:

- create a 20M thin pool + a 10M thin volume:
  lvcreate -L20 -V10 -T vg/pool -n lv1
- create a fully 'initialized' fs before first use:
  mkfs.ext4 -E nodiscard,lazy_itable_init=0 /dev/vg/lv1
- fill the fs with 'dd' (if=/dev/zero of=/mountpoint/file)
- remove the file
- fstrim /mountpoint
- check usage via e.g. the 'lvs -a' command

Version-Release number of selected component (if applicable):
lvm2 2.02.106
kernel 3.15
e2fsprogs 1.42.10

How reproducible:

Steps to Reproduce:
1. lvcreate -L20 -V10 -T vg/pool -n lv1
2. mkfs.ext4 -E nodiscard,lazy_itable_init=0 /dev/vg/lv1
3. fill the fs with 'dd', remove the file, run fstrim, check usage with 'lvs -a'

Actual results:
Almost no space is released from the thin pool after fstrim.

Expected results:
Most of the space used by the removed file is returned to the thin pool.

Additional info:
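A consolidated sketch of the steps above. The VG name 'vg' is taken from the
lvcreate call in this report; the mountpoint /mnt/thin is an assumption:

  lvcreate -L20 -V10 -T vg/pool -n lv1                    # 20M pool + 10M thin LV
  mkfs.ext4 -E nodiscard,lazy_itable_init=0 /dev/vg/lv1   # fully initialized fs, no discard at mkfs time
  mkdir -p /mnt/thin
  mount /dev/vg/lv1 /mnt/thin
  dd if=/dev/zero of=/mnt/thin/file bs=1M || true         # fill the fs until ENOSPC
  rm /mnt/thin/file
  fstrim /mnt/thin
  lvs -a vg                                               # Data% of the pool barely drops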
Created attachment 905457 [details]
block trace fstrim

Here is the blktrace of 'fstrim' - it exposes lots of 'small' 128-sector trims.
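For anyone wanting to capture a similar trace: something like the following
should show the discard traffic while fstrim runs (a sketch only - the exact
invocation used for this attachment is not recorded here, and the dm device
name is an assumption):

  blktrace -a discard -d /dev/mapper/vg-lv1 -o - | blkparse -i -   # in one terminal
  fstrim /mountpoint                                               # in another terminal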
Created attachment 905460 [details]
Thin pool metadata after mkfs

Here is the thin pool's metadata after the mkfs call.
Created attachment 905464 [details]
Metadata after fstrim

Here is the metadata after the fstrim call on the empty filesystem.
Created attachment 905467 [details]
Block trace on linear device

For comparison, the same fstrim on a 'common' linear volume (i.e. a linear
volume of the same size instead of the thin volume, with the same command
sequence). There are just 2 discard commands in the trace.
FS on thin volume

tune2fs 1.42.10 (18-May-2014)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          033b011f-deaa-4c9c-91cf-7a9c48b665c1
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2560
Block count:              10240
Reserved block count:     512
Free blocks:              8795
Free inodes:              2549
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      39
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         1280
Inode blocks per group:   160
RAID stripe width:        64
Flex block group size:    16
Filesystem created:       Mon Jun 9 20:02:26 2014
Last mount time:          n/a
Last write time:          Mon Jun 9 20:02:26 2014
Mount count:              0
Maximum mount count:      -1
Last checked:             Mon Jun 9 20:02:26 2014
Check interval:           0 (<none>)
Lifetime writes:          1470 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      57a74517-9b2e-44b1-b8a3-fe609ee6437c
Journal backup:           inode blocks

queue/discard_granularity:4096
queue/discard_max_bytes:65536
queue/discard_zeroes_data:0
discard_alignment:0

FS on linear

tune2fs 1.42.10 (18-May-2014)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          2783bffd-13fd-4544-a9a6-539ff828622f
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2560
Block count:              10240
Reserved block count:     512
Free blocks:              8795
Free inodes:              2549
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      39
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         1280
Inode blocks per group:   160
Flex block group size:    16
Filesystem created:       Mon Jun 9 20:04:03 2014
Last mount time:          n/a
Last write time:          Mon Jun 9 20:04:03 2014
Mount count:              0
Maximum mount count:      -1
Last checked:             Mon Jun 9 20:04:03 2014
Check interval:           0 (<none>)
Lifetime writes:          1470 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      b2451bde-1f33-4f6b-9b64-b9916da8a0a3
Journal backup:           inode blocks

queue/discard_granularity:4096
queue/discard_max_bytes:4294966784
queue/discard_zeroes_data:1
discard_alignment:0
Here is dumpe2fs to see the free blocks:

Group 0: (Blocks 0x0001-0x2000) [ITABLE_ZEROED]
  Checksum 0x3089, unused inodes 1268
  Primary superblock at 0x0001, Group descriptors at 0x0002-0x0002
  Reserved GDT blocks at 0x0003-0x0029
  Block bitmap at 0x002a (+41), Inode bitmap at 0x003a (+57)
  Inode table at 0x004a-0x00e9 (+73)
  6789 free blocks, 1269 free inodes, 2 directories, 1268 unused inodes
  Free blocks: 0x057c-0x2000
  Free inodes: 0x000c-0x0500
Group 1: (Blocks 0x2001-0x27ff) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x159c, unused inodes 1280
  Backup superblock at 0x2001, Group descriptors at 0x2002-0x2002
  Reserved GDT blocks at 0x2003-0x2029
  Block bitmap at 0x002b (bg #0 + 42), Inode bitmap at 0x003b (bg #0 + 58)
  Inode table at 0x00ea-0x0189 (bg #0 + 233)
  2006 free blocks, 1280 free inodes, 0 directories, 1280 unused inodes
  Free blocks: 0x202a-0x27ff
  Free inodes: 0x0501-0x0a00
It looks like fstrim is probably able to use much bigger ranges for TRIM, but
it is limited by discard_max_bytes - which is 64KB for the thin volume - and
when those requests do not land on exact provisioned-block boundaries, the
thin pool target does not trim a single byte.

So this raises a couple of questions:
- Is the ext4 filesystem properly aligned on thinly provisioned volumes?
- Is fstrim using a bad strategy when it uses the 'max' size for TRIM, or is
  the thin pool too limiting?
- How is the discard alignment supposed to be propagated so that something
  actually gets discarded?
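To make the suspected misalignment concrete, here is a small arithmetic
sketch using the numbers from this report (1 KiB fs blocks, first free block
0x57c from the dumpe2fs output in comment 6, 64 KiB thin chunks matching
discard_max_bytes). It is only an illustration, not output from the actual
run:

  chunk=65536                # thin-pool chunk size == discard_max_bytes on the thin LV
  start=$((0x57c * 1024))    # first free fs block (0x57c) * 1 KiB block size
  echo $start                # 1437696
  echo $((start % chunk))    # 61440 -> the trimmed range starts in the middle of a chunk

blkdev_issue_discard() splits the range into pieces of at most 64 KiB, but
with discard_granularity=4096 it only keeps them 4 KiB aligned, so every
piece straddles a 64 KiB chunk boundary and thinp never sees a fully covered
chunk it could deprovision. With discard_granularity=65536 the first piece
should stop at the next chunk boundary and the remaining pieces would be
whole chunks.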
Another thing I've noticed: when the stripe size is bigger than the thin-pool
chunk size (discard_max_bytes < ext4 stride), the discard is mostly never
sent (e.g. when adding the mkfs option stride=128).
There is no special intelligence in the fstrim tool. It's just calling into
the file system via the FITRIM ioctl. The file system implementation does
not need to care about alignment and discard_max_bytes. This should be
taken care of by blkdev_issue_discard(), which does the alignment and
divides bigger requests into smaller ones that the device can handle.

However I can see this in your report:

discard_alignment:0

Which means that there is no information about alignment, hence the block
layer does not align the discard requests. So the problem is probably in
thinp itself - it should export alignment information if that's what it
needs.

Thanks!
-Lukas
Looking at set_discard_limits() in dm-thin.c we're not setting the
discard_alignment. Moreover we're setting

	limits->max_discard_sectors = pool->sectors_per_block;

which I do not think is necessary. It will result in a flood of small
requests which will hurt performance unnecessarily.

-Lukas
There are some patches working their way through that allow arbitrary size discards with thin and cache. But I don't expect them to go upstream for a few months. If discard_alignment needs to be set then I think the dm core should set it based on the split io size.
(In reply to Lukáš Czerner from comment #9)
> There is no special intelligence in the fstrim tool. It's just calling into
> the file system via the FITRIM ioctl. The file system implementation does
> not need to care about alignment and discard_max_bytes. This should be
> taken care of by blkdev_issue_discard(), which does the alignment and
> divides bigger requests into smaller ones that the device can handle.
>
> However I can see this in your report:
>
> discard_alignment:0
>
> Which means that there is no information about alignment, hence the block
> layer does not align the discard requests. So the problem is probably in
> thinp itself - it should export alignment information if that's what it
> needs.

DM devices stack limits from underlying devices. A discard_alignment of 0 is
entirely normal. No idea why you're expecting discard_alignment to not be 0.
Please read up on discard_alignment:

What:		/sys/block/<disk>/discard_alignment
Date:		May 2011
Contact:	Martin K. Petersen <martin.petersen>
Description:
		Devices that support discard functionality may
		internally allocate space in units that are bigger than
		the exported logical block size. The discard_alignment
		parameter indicates how many bytes the beginning of the
		device is offset from the internal allocation unit's
		natural alignment.

(In reply to Lukáš Czerner from comment #10)
> Looking at set_discard_limits() in dm-thin.c we're not setting the
> discard_alignment. Moreover we're setting
>
> 	limits->max_discard_sectors = pool->sectors_per_block;
>
> which I do not think is necessary. It will result in a flood of small
> requests which will hurt performance unnecessarily.

Joe already spoke to this in comment #11, but yes, it is necessary. Current
thinp cannot handle discards that span multiple thinp blocks. This only
affects discard performance (we're well aware); it doesn't impact discard
functionality.

Discard functionality is impacted if the filesystem issues a discard starting
at a logical address that isn't aligned on a thinp blocksize boundary. The
block layer doesn't take care of re-aligning discards when the start of the
requested discard is not aligned to a multiple of discard_max_bytes (AFAICT,
looking at blkdev_issue_discard). So, all said, it looks like nr_sects is
assumed to be properly aligned; if it isn't, nothing in blkdev_issue_discard
corrects it. The entire extent to be discarded can therefore easily be split
into misaligned discard_max_bytes-sized requests. But really, that _should_
only result in the first, and possibly last, thinp block not being properly
discarded.
(In reply to Mike Snitzer from comment #12)
> (In reply to Lukáš Czerner from comment #9)
> > There is no special intelligence in the fstrim tool. It's just calling into
> > the file system via the FITRIM ioctl. The file system implementation does
> > not need to care about alignment and discard_max_bytes. This should be
> > taken care of by blkdev_issue_discard(), which does the alignment and
> > divides bigger requests into smaller ones that the device can handle.
> >
> > However I can see this in your report:
> >
> > discard_alignment:0
> >
> > Which means that there is no information about alignment, hence the block
> > layer does not align the discard requests. So the problem is probably in
> > thinp itself - it should export alignment information if that's what it
> > needs.
>
> DM devices stack limits from underlying devices. A discard_alignment of 0 is
> entirely normal. No idea why you're expecting discard_alignment to not be 0.
> Please read up on discard_alignment:
>
> What:		/sys/block/<disk>/discard_alignment
> Date:		May 2011
> Contact:	Martin K. Petersen <martin.petersen>
> Description:
> 		Devices that support discard functionality may
> 		internally allocate space in units that are bigger than
> 		the exported logical block size. The discard_alignment
> 		parameter indicates how many bytes the beginning of the
> 		device is offset from the internal allocation unit's
> 		natural alignment.

Fair enough. I assumed that it means something else. In that case, this
should be ok.

> (In reply to Lukáš Czerner from comment #10)
> > Looking at set_discard_limits() in dm-thin.c we're not setting the
> > discard_alignment. Moreover we're setting
> >
> > 	limits->max_discard_sectors = pool->sectors_per_block;
> >
> > which I do not think is necessary. It will result in a flood of small
> > requests which will hurt performance unnecessarily.
>
> Joe already spoke to this in comment #11, but yes, it is necessary. Current
> thinp cannot handle discards that span multiple thinp blocks. This only
> affects discard performance (we're well aware); it doesn't impact discard
> functionality.
>
> Discard functionality is impacted if the filesystem issues a discard starting
> at a logical address that isn't aligned on a thinp blocksize boundary. The
> block layer doesn't take care of re-aligning discards when the start of the
> requested discard is not aligned to a multiple of discard_max_bytes (AFAICT,
> looking at blkdev_issue_discard). So, all said, it looks like nr_sects is
> assumed to be properly aligned; if it isn't, nothing in blkdev_issue_discard
> corrects it. The entire extent to be discarded can therefore easily be split
> into misaligned discard_max_bytes-sized requests. But really, that _should_
> only result in the first, and possibly last, thinp block not being properly
> discarded.

Well, if we allowed discard requests bigger than the chunk size, this would
possibly be true. However, as it stands now, blkdev_issue_discard() will
divide the request into smaller (max_discard_sectors) requests, and if the
first one is not chunk-size aligned, then none of them are aligned - or do
you take care of merging and aligning the discard requests internally?

So I think that getting rid of this limitation:

	limits->max_discard_sectors = pool->sectors_per_block;

will help in this case. As you mentioned, the first and last chunk (thinp
block) might not get properly discarded, but there is not much we can do
about that anyway.

-Lukas
A new piece of insight on this issue (relative to the potential for ext4/XFS
to issue discards that aren't aligned on a discard_max_bytes boundary):
dm-thinp does _not_ account for partial block discards. dm-thinp will pass
these partial block discards down to the underlying data device, but again it
will _not_ account for them.

Assuming my theory about the FS sending slightly misaligned discards is true:
dm-thinp's inability to account for partial block discards can easily explain
why no space is being reclaimed in dm-thinp itself.
Discard requests are not supposed to be aligned to a discard_max_bytes
boundary at all. The inability of thinp to discard partial "chunks" is ok,
even though it might become a problem when the chunk size grows bigger (but
that's expected).

The real issue is that thinp does not seem to be exporting
discard_granularity properly (judging from comment 5, where it is set to 4096
but the chunk size should be 64k). If both are fixed, the requests will be
divided into smaller requests that are properly aligned to the chunk size
(discard_granularity), with the exception of the first and the last one,
which might still be misaligned. So in comment 9 I confused discard_alignment
with discard_granularity.

It would still be useful to accept bigger discard ranges, but this particular
issue should be fixed by properly exporting discard_granularity, which in the
thinp case should really be the chunk size. Looking at set_discard_limits()
the problem is here:

	if (pt->adjusted_pf.discard_passdown) {
		data_limits = &bdev_get_queue(pt->data_dev->bdev)->limits;
		limits->discard_granularity = data_limits->discard_granularity;

we're blindly exporting the discard_granularity of the underlying device.
However, in the ideal case, when the bigger one is a multiple of the smaller
one, we should set it to the bigger. If it's not a multiple, then I think we
should probably disable discard_passdown, because being able to reclaim thinp
blocks seems more important to me ... but that's up for discussion.

-Lukas
Zdenek, can you confirm whether creating the thin pool on something which is
not discard capable, or disabling discard_passdown, helps? And whether
discard_granularity is properly set in that case?

Thanks!
-Lukas
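A sketch of how this could be checked, reusing the device names from the
reproducer above; the lvchange/dmsetup invocations and the expected values
are assumptions based on the discussion here, not output from an actual run:

  # disable discard passdown on the pool (may require deactivating the pool first)
  lvchange --discards nopassdown vg/pool
  dmsetup status vg-pool-tpool        # should now report 'no_discard_passdown'

  # discard_granularity as seen on the thin LV; with passdown disabled thinp
  # is expected to export its own chunk size (65536) here instead of the 4096
  # inherited from the underlying device
  dm=$(basename "$(readlink -f /dev/vg/lv1)")
  cat /sys/block/$dm/queue/discard_granularity

  mount /dev/vg/lv1 /mnt/thin
  fstrim /mnt/thin
  lvs -a vg                           # check whether Data% of the pool drops now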
Created attachment 907614 [details]
Proposed patch

It turns out that thinp is already checking whether the discard_granularity
of the device is a factor of the chunk size, so the only thing that remains
is to pick the bigger of the two.

Strictly speaking, there is currently no real reason to pick the bigger one,
because the chunk size will always be >= the device discard_granularity -
otherwise passdown would be disabled anyway. However, I believe that
limitation exists only because thinp does not allow requests larger than the
chunk size, which is hopefully going to change in the future, so this makes
it future-proof.

-Lukas
I can confirm that the patch proposed in comment 17 makes fstrim usable on
thin volumes, and space is properly released from the pool in my test setup.
Pushed Lukas' fix to linux-next (for 3.16 inclusion), see: https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=4adc9ea8915ff9904c3279b1b36478482d1c0825
Patch applied in Fedora git after talking with Mike. Thanks for the pointer!
Once we have arbitrary sized discards supported (as referred to in comment #11), Lukas' patch becomes needless.
(In reply to Heinz Mauelshagen from comment #21)
> Once we have arbitrary sized discards supported (as referred to in comment
> #11), Lukas' patch becomes needless.

Does it mean that dm-thin is going to be able to handle sub chunk size
discard requests?

-Lukas
kernel-3.14.7-200.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/kernel-3.14.7-200.fc20
kernel-3.14.7-100.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/kernel-3.14.7-100.fc19
(In reply to Lukáš Czerner from comment #22)
> (In reply to Heinz Mauelshagen from comment #21)
> > Once we have arbitrary sized discards supported (as referred to in comment
> > #11), Lukas' patch becomes needless.
>
> Does it mean that dm-thin is going to be able to handle sub chunk size
> discard requests?
>
> -Lukas

Yes, it's going to be able to pass them down to the pool device.

Obviously it won't be able to deprovision partial blocks, because
those are our atomic allocation unit.
(In reply to Heinz Mauelshagen from comment #25)
> (In reply to Lukáš Czerner from comment #22)
> > (In reply to Heinz Mauelshagen from comment #21)
> > > Once we have arbitrary sized discards supported (as referred to in comment
> > > #11), Lukas' patch becomes needless.
> >
> > Does it mean that dm-thin is going to be able to handle sub chunk size
> > discard requests?
> >
> > -Lukas
>
> Yes, it's going to be able to pass them down to the pool device.
>
> Obviously it won't be able to deprovision partial blocks, because
> those are our atomic allocation unit.

In that case my patch is still going to be needed, because you still need to
export proper discard_granularity for blkdev_issue_discard() to properly
align the requests.

-Lukas
(In reply to Lukáš Czerner from comment #26)
> (In reply to Heinz Mauelshagen from comment #25)
> > Yes, it's going to be able to pass them down to the pool device.

We already pass partial block discards down, we just don't record them in
dm-thin-pool. And Heinz's improvements won't change that.

> > Obviously it won't be able to deprovision partial blocks, because
> > those are our atomic allocation unit.
>
> In that case my patch is still going to be needed, because you still need to
> export proper discard_granularity for blkdev_issue_discard() to properly
> align the requests.

Exactly.
Package kernel-3.14.7-200.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.

Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.14.7-200.fc20'
as soon as you are able to, then reboot.

Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-7313/kernel-3.14.7-200.fc20
then log in and leave karma (feedback).
kernel-3.14.7-200.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.
kernel-3.14.7-100.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.