Bug 1668163 - LVM cache cannot flush buffer, change cache type, or lvremove LV (CachePolicy 'cleaner' also doesn't work) [NEEDINFO]
Summary: LVM cache cannot flush buffer,change cache type or lvremove LV (CachePolicy '...
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: unspecified
Target Milestone: rc
: ---
Assignee: Zdenek Kabelac
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-22 05:46 UTC by Strahil Nikolov
Modified: 2019-08-20 17:09 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
zkabelac: needinfo? (hunter86_bg)


Attachments (Terms of Use)
strace output of change of cache to writethrough (437.02 KB, text/plain)
2019-01-22 05:46 UTC, Strahil Nikolov
strace of lvconvert uncache (437.05 KB, text/plain)
2019-01-22 05:47 UTC, Strahil Nikolov
bonnie++ results with different chunk sizes (7.40 KB, application/zip)
2019-01-30 12:52 UTC, Strahil Nikolov


Links
System ID Priority Status Summary Last Updated
CentOS 0015729 None None None 2019-01-22 05:46:40 UTC

Description Strahil Nikolov 2019-01-22 05:46:40 UTC
Created attachment 1522331 [details]
strace output of change of cache to writethrough

Description of problem:
LVM cache in writeback mode cannot be changed to writethrough, nor can it be flushed to the data_corig LV (endless loop).
Setting the CachePolicy to 'cleaner' doesn't force the LVM cache to be flushed.
lvremove also fails, due to the endless buffer flush.

Version-Release number of selected component (if applicable):
libblockdev-lvm-2.18-3.el7.x86_64
lvm2-2.02.180-10.el7_6.2.x86_64
lvm2-libs-2.02.180-10.el7_6.2.x86_64
udisks2-lvm2-2.7.3-8.el7.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Follow official documentation for creation of lvm cache (use '--cachemode writeback')
2. Fill in some data - you will notice that cache_dirty_blocks won't get flushed
3. Set CachePolicy to 'cleaner'
4. Try to change the cache mode from 'writeback' to 'writethrough':
lvchange --cachemode writethrough VG/LV
5. Try to uncache:
lvconvert --uncache VG/LV
6. Try to remove LV:
lvremove VG/LV

Actual results:
cache_dirty_blocks are not flushed to the data_corig LV; an endless loop of 'Flushing xxxx blocks for cache VG/LV.' is reported, but the count neither drops nor ever finishes.
The cleaner policy fails to flush the cache.
lvremove cannot remove the LV.

Expected results:
All dirty blocks should be flushed during cache mode change / uncache operations and during LV removal.

Additional info:
Other OS reports:
https://www.reddit.com/r/archlinux/comments/9owc15/lvm_cache_not_flushing_after_unclean_shutdown/ 
https://marc.info/?l=linux-lvm&m=152948734523317&w=2

Comment 2 Strahil Nikolov 2019-01-22 05:47:24 UTC
Created attachment 1522332 [details]
strace of lvconvert uncache

Comment 3 Zdenek Kabelac 2019-01-22 09:04:31 UTC
Please provide basic info about the cache chunksize   (lvs -o+chunksize)

The assumption would be: the cache chunksize is >= 1MiB, no bigger migration_threshold was specified, and it remained at the default value of 2048 sectors (1MiB).

This prevents the kernel from flushing blocks.

There are several bugs about this.
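The arithmetic behind the stall can be sketched as follows (a minimal sketch; the ~9.41MiB chunk size is taken from the reporter's later output, expressed here as 9728 KiB, and migration_threshold is counted in 512-byte sectors):

```shell
# migration_threshold defaults to 2048 sectors, i.e. 1 MiB
default_threshold_sectors=2048
# one ~9.41 MiB chunk, as 9728 KiB = 19456 sectors
chunk_sectors=$((9728 * 2))
# whole chunks the kernel may migrate per pass under the default threshold
chunks_per_pass=$((default_threshold_sectors / chunk_sectors))
echo "$chunks_per_pass"   # 0 - not even one chunk fits, so nothing ever flushes
```

With zero chunks migratable per pass, the dirty count reported by 'Flushing xxxx blocks' can never drop.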


The quick workaround solution is to set higher threshold:

lvchange --cachesettings migration_threshold=16384 vg/cacheLV



There are already upstream patches where lvm2 now guards this setting and ensures the threshold is at least 8 chunks big:

https://www.redhat.com/archives/lvm-devel/2019-January/msg00032.html
https://www.redhat.com/archives/lvm-devel/2019-January/msg00031.html
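Under the 8-chunk rule from those patches, the minimum safe threshold for a given chunk size can be estimated like this (a sketch; 9728K is one of the candidate chunk sizes discussed in this report, and migration_threshold is in 512-byte sectors):

```shell
chunk_kib=9728                        # candidate chunk size in KiB
chunk_sectors=$((chunk_kib * 2))      # 512-byte sectors per chunk
min_threshold=$((chunk_sectors * 8))  # the patches enforce >= 8 chunks
echo "$min_threshold"                 # 155648 sectors (76 MiB)
```

This is why the reporter's later value of 32768 sectors (16MiB, just over one chunk) unblocks flushing but is still below what the upstream guard would pick.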

Comment 4 Strahil Nikolov 2019-01-22 14:29:14 UTC
Thanks for your fast reply.

The chunk size was quite large:
[root@ovirt2 ~]# lvs -o name,cache_policy,cache_settings,chunk_size,cache_used_blocks,cache_dirty_blocks /dev/gluster_vg_md0/gluster_lv_data 
  LV              CachePolicy CacheSettings             Chunk  CacheUsedBlocks  CacheDirtyBlocks
  gluster_lv_data cleaner     migration_threshold=16384 <9.41m             2332             2332

Once I set the migration_threshold to 32768, the cleaner policy immediately started flushing the buffer:

[root@ovirt2 ~]# lvs -o name,cache_policy,cache_settings,chunk_size,cache_used_blocks,cache_dirty_blocks /dev/gluster_vg_md0/gluster_lv_data 
  LV              CachePolicy CacheSettings             Chunk  CacheUsedBlocks  CacheDirtyBlocks
  gluster_lv_data cleaner     migration_threshold=32768 <9.41m             2332             2304
[root@ovirt2 ~]# lvs -o name,cache_policy,cache_settings,chunk_size,cache_used_blocks,cache_dirty_blocks /dev/gluster_vg_md0/gluster_lv_data 
  LV              CachePolicy CacheSettings             Chunk  CacheUsedBlocks  CacheDirtyBlocks
  gluster_lv_data cleaner     migration_threshold=32768 <9.41m             2332             2271

It's nice to know that this is a known problem and that it's being worked on.

Maybe it's a good candidate for a Red Hat Solution?

Thanks for the hint with migration_threshold. I didn't know it affects caches.

Best Regards,
Strahil Nikolov

Comment 5 Zdenek Kabelac 2019-01-29 10:26:47 UTC
Hi

If it wouldn't be a big problem:

For now, upstream tries to select the default chunksize as a power of 2; for performance reasons it's likely better to stick with a power-of-2 size even if the chunk itself ends up bigger, i.e. 8M chunks -> 16M chunks.

https://www.redhat.com/archives/lvm-devel/2019-January/msg00053.html

It would probably be nice to have some testing trials on whether this makes sense, or whether we should rather focus on 1MiB or 512KiB boundaries.

Would it be a big problem to perform some tests to see what gives the best results?

i.e. is a chunksize of 9.5MiB better, or 10MiB, or 16MiB?

You could probably easily try this yourself without applying any patch: just set the chunksize accordingly when creating the cache:

-c 9728K
-c 10240K
-c 16384K

Also, what are the queue settings of the underlying caching device in use, i.e.:

grep "" /sys/block/sda/queue/*

Comment 6 Strahil Nikolov 2019-01-30 12:52:36 UTC
Created attachment 1525028 [details]
bonnie++ results with different chunk sizes

Comment 7 Strahil Nikolov 2019-01-30 12:53:42 UTC
I have done some testing with bonnie++ using a SATA III SSD.
The device in use:
[root@ovirt1 ~]# grep "" /sys/block/sda/queue/*
/sys/block/sda/queue/add_random:0
/sys/block/sda/queue/discard_granularity:512
/sys/block/sda/queue/discard_max_bytes:2147450880
/sys/block/sda/queue/discard_zeroes_data:1
/sys/block/sda/queue/hw_sector_size:512
grep: /sys/block/sda/queue/iosched: Is a directory
/sys/block/sda/queue/iostats:1
/sys/block/sda/queue/logical_block_size:512
/sys/block/sda/queue/max_hw_sectors_kb:32767
/sys/block/sda/queue/max_integrity_segments:0
/sys/block/sda/queue/max_sectors_kb:512
/sys/block/sda/queue/max_segments:168
/sys/block/sda/queue/max_segment_size:65536
/sys/block/sda/queue/minimum_io_size:4096
/sys/block/sda/queue/nomerges:0
/sys/block/sda/queue/nr_requests:128
/sys/block/sda/queue/optimal_io_size:0
/sys/block/sda/queue/physical_block_size:4096
/sys/block/sda/queue/read_ahead_kb:128
/sys/block/sda/queue/rotational:0
/sys/block/sda/queue/rq_affinity:1
/sys/block/sda/queue/scheduler:[noop] deadline cfq
/sys/block/sda/queue/unpriv_sgio:0
/sys/block/sda/queue/write_same_max_bytes:0



Note: The SSD was removed from the VG and blkdiscard-ed before each test. Sadly I didn't have enough time, so only a single test run was performed per chunk size.
It seems that the maximum number of chunks for a cache pool is 1,000,000, so with larger devices a larger chunk size will be needed.
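The 1,000,000-chunk ceiling translates directly into a maximum cache size per chunk size; a quick sketch of that arithmetic:

```shell
max_chunks=1000000   # the cache_pool_max_chunks default mentioned below
chunk_kib=512        # 512 KiB chunks
max_cache_gib=$((max_chunks * chunk_kib / 1024 / 1024))
echo "$max_cache_gib"   # 488 - largest cache pool (in GiB) at this chunk size
```

So at 512K chunks the pool tops out just under 512G, which is why bigger devices force bigger chunks unless the limit itself is raised.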

Comment 8 Zdenek Kabelac 2019-01-30 13:59:42 UTC
Hmm, interesting. I'm unsure how consistent bonnie++ is in those results, but it is quite interesting that 2MiB somehow 'breaks' the line.
It also looks like cache chunks above 512K are not really all that useful, as they possibly do not fully use the capabilities of the caching drive.

The limit of 1,000,000 chunks is ATM pretty strongly enforced. It originates from the age before the V2 cache format, when startup and shutdown could take a significant amount of time. The other reason is that a cache maintaining more chunks will eat quite a portion of RAM, though typically, if people can afford a TiB for caching, RAM is almost never an issue either.

So ATM users CAN raise this limit on their own in lvm.conf:  allocation/cache_pool_max_chunks
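A hypothetical lvm.conf override raising that limit might look like this (the value 2000000 is only an illustration, not a recommendation):

```
# /etc/lvm/lvm.conf
allocation {
    # default cap is 1000000; raising it allows more (smaller) chunks per
    # cache pool, at the cost of more kernel metadata held in RAM
    cache_pool_max_chunks = 2000000
}
```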

It needs some examination of which boundaries are more costly than others.
It looks like 16M chunks aren't productive for some workloads.

I don't have good answers to these questions.

ATM, in your case I'd probably recommend trying just a 512G cache with 512K chunks, and seeing whether it actually works better/faster that way than with a bigger cache but bigger chunks.

It could also be that (if you have a lot of RAM) raising the limit to 2 or maybe even 4 million chunks per cache still results in acceptable memory occupancy, and lets you use a bigger cache pool with a 512K chunk size.

It would be nice if you shared some results on usability and performance after a couple of days of use

(since the hotspot cache takes a while to 'optimize' for your device's hot spots)

Comment 9 Strahil Nikolov 2019-01-30 14:20:32 UTC
I forgot to mention what flags were used: 
bonnie++ -d /mnt -s 50G -n 0 -m TEST-chunk_size_k_chunk-writeback -f -b  -u root

In my case the systems will be a hyperconverged oVirt with GlusterFS lab, and the available cache will be around 150G.
As Gluster will be limited by the network (sadly the workstations have only 1 Gbit/s interfaces), I cannot expect writes above ~123 MiB/s, but I have noticed that the reads are quite good.

Do you think that a chunk size of 4M can lead to better reads (generally speaking), keeping in mind that most SSDs report 4096 bytes as optimal?

This one is from my SSD:
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Comment 10 Zdenek Kabelac 2019-01-31 15:46:34 UTC
The chunk size should likely match the best IO size for the SSD and also not be disruptive for your 'origin' LV, i.e. if it is a RAID array the chunk should cover a whole stripe so that no additional read-modify-write operations are involved. But this is tricky to get right, since many drives report rather random values for minimal and optimal IO sizes.
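As an illustration of the stripe-alignment point (hypothetical RAID layout, not taken from this report):

```shell
stripe_unit_kib=128   # per-disk stripe unit of the hypothetical array
data_disks=4          # data-bearing disks (e.g. a 6-disk RAID6)
# a cache chunk should be a multiple of the full stripe to avoid
# read-modify-write on writeback
full_stripe_kib=$((stripe_unit_kib * data_disks))
echo "$full_stripe_kib"   # 512
```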

So it would be good to see a comparative table: if you e.g. raise the max-chunks bar, how much extra memory does it cost, and does the possibly bigger performance pay off? But this is surely a longer-term evaluation (as the hot-spot cache takes its time to tune well).

IMHO it's probably better to go with 'smaller' chunks, as small as 512K or 1M.
From your charts, many workloads seemed likely degraded already at 4M size.

Also, maybe don't use the full size for the cache, so it's actually not that big but still has those 1 million 512K chunks; that can give you the same performance, and you can use the remaining space for something else.

