Created attachment 1522331 [details]
strace output of change of cache to writethrough

Description of problem:
An LVM cache in writeback mode cannot be changed to writethrough, nor can it be flushed to the data_corig LV (endless loop). Setting the cache policy to 'cleaner' does not force the cache to be flushed. lvremove also fails because of the endless flush loop.

Version-Release number of selected component (if applicable):
libblockdev-lvm-2.18-3.el7.x86_64
lvm2-2.02.180-10.el7_6.2.x86_64
lvm2-libs-2.02.180-10.el7_6.2.x86_64
udisks2-lvm2-2.7.3-8.el7.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Follow the official documentation for creating an LVM cache, using '--cachemode writeback' (a minimal command sketch is included at the end of this report)
2. Write some data - you will notice that cache_dirty_blocks never gets flushed
3. Set the cache policy to 'cleaner'
4. Try to change the cache mode from 'writeback' to 'writethrough': lvchange --cachemode writethrough VG/LV
5. Try to uncache: lvconvert --uncache VG/LV
6. Try to remove the LV: lvremove VG/LV

Actual results:
cache_dirty_blocks are not flushed to the data_corig LV; an endless loop of 'Flushing xxxx blocks for cache VG/LV.' is reported, but the count never drops and the operation never finishes. The cleaner policy fails to flush the cache. lvremove cannot remove the LV.

Expected results:
All dirty blocks are flushed during cache mode changes, uncache operations and LV removal.

Additional info:
Other OS reports:
https://www.reddit.com/r/archlinux/comments/9owc15/lvm_cache_not_flushing_after_unclean_shutdown/
https://marc.info/?l=linux-lvm&m=152948734523317&w=2
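A minimal reproduction sketch (hedged; all device, VG and LV names below are placeholders, not the actual configuration in this report):

# sdb = slow data disk, sdc = SSD
pvcreate /dev/sdb /dev/sdc
vgcreate vg /dev/sdb /dev/sdc
lvcreate -L 500G -n lv vg /dev/sdb
lvcreate --type cache-pool -L 100G -n lv_cache vg /dev/sdc
lvconvert --type cache --cachemode writeback --cachepool vg/lv_cache vg/lv
# ... write some data, then:
lvchange --cachepolicy cleaner vg/lv         # dirty block count never drops
lvchange --cachemode writethrough vg/lv      # loops on "Flushing N blocks for cache vg/lv."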
Created attachment 1522332 [details]
strace of lvconvert uncache
Please provide basic info about the cache chunksize (lvs -o+chunksize).

The assumption is: the cache chunksize is >= 1MiB and no bigger migration_threshold was specified, so it remained at the default value of 2048 sectors (1MiB). This prevents the kernel from flushing blocks. There are several bugs about this.

The quick workaround is to set a higher threshold:

lvchange --cachesettings migration_threshold=16384 vg/cacheLV

There are already upstream patches where lvm2 now guards this setting and ensures the threshold is at least 8 chunks big:

https://www.redhat.com/archives/lvm-devel/2019-January/msg00032.html
https://www.redhat.com/archives/lvm-devel/2019-January/msg00031.html
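A rough way to compute a sufficient threshold by hand (a hedged sketch; migration_threshold is given in 512-byte sectors, and the 9.5 MiB chunk size is only an illustrative assumption):

# one 9.5 MiB (9728 KiB) chunk = 9728 * 2 = 19456 sectors
# 8 chunks                     = 19456 * 8 = 155648 sectors
# so for that chunk size something like this would be safe:
lvchange --cachesettings migration_threshold=155648 vg/cacheLV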
Thanks for your fast reply.

The chunk size was quite large:

[root@ovirt2 ~]# lvs -o name,cache_policy,cache_settings,chunk_size,cache_used_blocks,cache_dirty_blocks /dev/gluster_vg_md0/gluster_lv_data
  LV              CachePolicy CacheSettings             Chunk  CacheUsedBlocks CacheDirtyBlocks
  gluster_lv_data cleaner     migration_threshold=16384 <9.41m 2332            2332

Once I set the migration_threshold to 32768, the cleaner policy immediately started flushing the buffer:

[root@ovirt2 ~]# lvs -o name,cache_policy,cache_settings,chunk_size,cache_used_blocks,cache_dirty_blocks /dev/gluster_vg_md0/gluster_lv_data
  LV              CachePolicy CacheSettings             Chunk  CacheUsedBlocks CacheDirtyBlocks
  gluster_lv_data cleaner     migration_threshold=32768 <9.41m 2332            2304

[root@ovirt2 ~]# lvs -o name,cache_policy,cache_settings,chunk_size,cache_used_blocks,cache_dirty_blocks /dev/gluster_vg_md0/gluster_lv_data
  LV              CachePolicy CacheSettings             Chunk  CacheUsedBlocks CacheDirtyBlocks
  gluster_lv_data cleaner     migration_threshold=32768 <9.41m 2332            2271

It's nice to know that this is a known problem and that it's being worked on. Maybe it's a good candidate for a Red Hat Solution?

Thanks for the hint about migration_threshold - I didn't know it affects caches.

Best Regards,
Strahil Nikolov
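The numbers line up with the threshold/chunk-size relation described above (a hedged back-of-the-envelope check; 1 sector = 512 bytes):

# 16384 sectors * 512 B = 8 MiB   < one <9.41 MiB chunk -> not even a single chunk can migrate
# 32768 sectors * 512 B = 16 MiB  > one <9.41 MiB chunk -> flushing proceeds, one chunk at a time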
Hi

If it wouldn't be a big problem: for now we are trying upstream to select the default chunksize as a power-of-2 - for performance reasons it is likely better to stick with a power-of-2 size even if the chunk itself ends up bigger, i.e. 8M chunks -> 16M chunks:

https://www.redhat.com/archives/lvm-devel/2019-January/msg00053.html

It would be nice to have some testing trials to see whether this makes sense - or whether we should rather focus on 1MiB or 512KiB boundaries. Would it be a big problem to perform some tests to find out what gives the best results? I.e. is a chunksize of 9.5MiB better, or 10MiB, or 16MiB?

You can probably easily try this yourself without applying any patch - just set the chunksize accordingly when creating the cache (see the sketch below):

-c 9728K
-c 10240K
-c 16384K

What are the states of the underlying caching device in use, i.e.:

grep "" /sys/block/sda/queue/*
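A minimal sketch of one test iteration (hedged; it reuses the VG/LV names from this report, assumes the SSD is /dev/sda, and the pool name and size are only placeholders):

lvcreate --type cache-pool -L 100G -c 10240K -n gluster_lv_data_cache gluster_vg_md0 /dev/sda
lvconvert --type cache --cachemode writeback --cachepool gluster_vg_md0/gluster_lv_data_cache gluster_vg_md0/gluster_lv_data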
Created attachment 1525028 [details]
bonnie++ results with different chunk sizes
I have done some testing with bonnie++ using a SATA III SSD.

The device in use:

[root@ovirt1 ~]# grep "" /sys/block/sda/queue/*
/sys/block/sda/queue/add_random:0
/sys/block/sda/queue/discard_granularity:512
/sys/block/sda/queue/discard_max_bytes:2147450880
/sys/block/sda/queue/discard_zeroes_data:1
/sys/block/sda/queue/hw_sector_size:512
grep: /sys/block/sda/queue/iosched: Is a directory
/sys/block/sda/queue/iostats:1
/sys/block/sda/queue/logical_block_size:512
/sys/block/sda/queue/max_hw_sectors_kb:32767
/sys/block/sda/queue/max_integrity_segments:0
/sys/block/sda/queue/max_sectors_kb:512
/sys/block/sda/queue/max_segments:168
/sys/block/sda/queue/max_segment_size:65536
/sys/block/sda/queue/minimum_io_size:4096
/sys/block/sda/queue/nomerges:0
/sys/block/sda/queue/nr_requests:128
/sys/block/sda/queue/optimal_io_size:0
/sys/block/sda/queue/physical_block_size:4096
/sys/block/sda/queue/read_ahead_kb:128
/sys/block/sda/queue/rotational:0
/sys/block/sda/queue/rq_affinity:1
/sys/block/sda/queue/scheduler:[noop] deadline cfq
/sys/block/sda/queue/unpriv_sgio:0
/sys/block/sda/queue/write_same_max_bytes:0

Note: The SSD was removed from the VG and "blkdiscard"-ed before each test. Sadly I didn't have enough time, so only a single run was performed per chunk size.

It seems that the maximum number of chunks for a cache pool is 1,000,000 - so with larger devices a larger chunk size will be needed.
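The reset between runs mentioned above would look roughly like this (a hedged sketch; it assumes the SSD is /dev/sda and the cached LV is gluster_vg_md0/gluster_lv_data):

lvconvert --uncache gluster_vg_md0/gluster_lv_data
vgreduce gluster_vg_md0 /dev/sda
blkdiscard /dev/sda
vgextend gluster_vg_md0 /dev/sda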
Hmm, interesting - I'm unsure how consistent bonnie++ is in those results - but it is quite interesting that 2MiB somehow 'breaks' the line, and it looks like cache chunks above 512K are not really all that useful - they possibly do not fully use the capabilities of the caching drive.

The limit of 1,000,000 chunks is ATM pretty strongly enforced - its origins come from the age before the V2 cache format, when cache start and shutdown could take a significant amount of time. The other reason is that a cache maintaining more chunks eats quite a portion of RAM, although typically, if people can afford TiBs for caching, RAM is almost never an issue either.

So ATM a user CAN raise this limit on their own via lvm.conf allocation/cache_pool_max_chunks (see the sketch below). It needs some examination of which boundaries are more costly than others. It looks like 16M chunks aren't productive for some workloads. I don't have good answers to these questions.

ATM I'd probably recommend in your case to try going with just a 512G cache with 512K chunks - and see if it doesn't actually work better/faster that way than with a bigger cache but bigger chunks. It could also be (if you have a lot of RAM) that raising the limit to 2 or maybe even 4 million chunks per cache still results in acceptable memory occupancy and lets you use a bigger cache pool with a 512K chunk size.

It would be nice if you could share some results on usability and performance after a couple of days of use (since the hotspot cache takes a while to 'optimize' for your device's hotspots).
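A hedged sketch of what raising that limit looks like in /etc/lvm/lvm.conf (the value 2000000 is only an example; the option name is the one mentioned above):

allocation {
    # default limit is 1000000 chunks per cache pool;
    # more chunks means more kernel memory used for cache metadata
    cache_pool_max_chunks = 2000000
}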
I forgot to mention which flags were used:

bonnie++ -d /mnt -s 50G -n 0 -m TEST-chunk_size_k_chunk-writeback -f -b -u root

In my case the systems will be a hyperconverged oVirt with GlusterFS lab, and the cache available will be around 150G. As gluster will be limited by the network (sadly the workstations have only 1 Gbit/s interfaces), I cannot expect writes above ~123 MiB/s, but I have noticed that the reads are quite good.

Do you think that a chunk size of 4M can lead to better reads (generally speaking), having in mind that most SSDs report 4096 bytes as optimal? This one is from my SSD:

Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
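For the chunk-count limit discussed above, a ~150G cache fits comfortably either way (a hedged back-of-the-envelope check):

# 150 GiB / 512 KiB per chunk = ~307200 chunks  (well under the 1,000,000 limit)
# 150 GiB /   4 MiB per chunk =  ~38400 chunks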
The chunk size should ideally match the best IO size for the SSD and also not be disruptive for your 'origin' LV - i.e. if it is a RAID array, the chunk should cover a whole stripe so that no additional read-modify-write operations are involved - but this is tricky to get right, since many drives report rather random values for minimal and optimal IO sizes (a hedged alignment example follows below).

So it would be good to see a comparative table - i.e. if you raise the max-chunks limit, how the memory use changes and whether the possibly better performance pays off - but this surely needs a longer-term evaluation (as the hot-spot cache takes its time to tune well).

It's IMHO probably better to go with 'smaller' chunks - as small as 512K or 1M. From your charts, many workloads seemed to be degraded already at a 4M size.

Also - maybe don't use all the space for the cache, so it's actually not that big, but still has those 1 million 512K chunks - that can give you the same performance, and you can use the remaining space for something else.
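A hedged example of that alignment reasoning (the RAID geometry is purely illustrative, and /sys/block/md0 is only an assumption about the device behind gluster_vg_md0):

# e.g. RAID5 over 5 disks (4 data disks) with a 128 KiB stripe unit:
#   full stripe = 4 * 128 KiB = 512 KiB -> a 512K cache chunk avoids read-modify-write
# the kernel-reported hints can be cross-checked with:
grep "" /sys/block/md0/queue/minimum_io_size /sys/block/md0/queue/optimal_io_size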
I've used the whole size, but set the chunk size to 1M.
This has already been fixed in lvm2 version >= 2.02.184: lvm2 now ensures the migration_threshold is always at least as big as 8 chunks.
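A hedged way to check this on an updated system (field names as used earlier in this report):

lvm version                                                            # expect LVM version >= 2.02.184
lvs -o name,chunk_size,cache_settings gluster_vg_md0/gluster_lv_data   # threshold should now be at least 8 chunks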